There is no shortage of opinions about AI in medicine, and most of them talk past each other. Part of the problem is the word itself. "AI" now covers a huge range of different technologies, but most people, doctors included, picture one thing when they hear it: ChatGPT, or something like it. That is where the trouble starts. Medicine does not run on a single kind of thinking. The AI everyone is picturing does.
Most of the time, medicine is probabilistic: a doctor weighs how likely each diagnosis is and keeps updating as the evidence comes in. But parts of it are deterministic: fixed rules that hold the same way every time, whatever the odds. This post is about why a general-purpose language model is the wrong fit for both, and what a better system looks like. It builds on an earlier post on what clinical reasoning actually is.
When people say "AI" in medicine, what do they actually mean?
Almost always, they mean a large language model. Tools like these generate text by predicting the next word, over and over, from patterns in the huge amount of writing they were trained on. It works remarkably well. They produce fluent, confident writing about almost anything, medicine included.
But look at what sits underneath that fluency. It is not medicine. A general-purpose model has no real account of how the body works, how common a disease is, or which symptom should move a diagnosis up the list. What it has is a sense of how words tend to follow other words. It is very good at sounding right, with nothing underneath to tell it whether it is right.
What does it mean that medicine is probabilistic in its reasoning?
Medicine is probabilistic in its reasoning. That means a doctor rarely works towards one fixed answer. They hold several possible diagnoses in mind at once, weigh how likely each one is, and keep shifting those odds as new information arrives.
Take a patient with chest pain. A doctor is thinking about several things at once: a heart problem, a muscle strain, reflux, a clot on the lung. Each new detail moves the order around. Pain that spreads to the jaw lifts the heart problem up the list; pain that flares when you press on the chest wall lifts the muscle strain instead. The formal name for this is Bayesian reasoning, but the idea is simple: start from what is common, then update with every symptom and sign. That is clinical reasoning, and it is hard because every finding has to be weighed against all the others at the same time.
The thing to notice is where those odds come from. They are tied to real medical facts: how common a condition is, how strongly a given sign points towards it or away. And they are shaped by experience, the thousands of cases a clinician has seen before this one. The probability means something because it is anchored to the world, not just to the words used to describe it.
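The chest pain example can be sketched as a small Bayesian update. This is a minimal illustration, not a clinical tool: every number below is invented for the example, and the names PRIORS, LIKELIHOODS, update and ranked are hypothetical.

```python
# Illustrative numbers only: these are NOT real prevalences or likelihoods.
# Starting odds: how common each cause is assumed to be before any findings.
PRIORS = {
    "heart problem": 0.15,
    "muscle strain": 0.40,
    "reflux": 0.35,
    "clot on the lung": 0.10,
}

# Assumed P(finding | diagnosis): how strongly each finding points to each cause.
LIKELIHOODS = {
    "pain spreads to jaw": {
        "heart problem": 0.50,
        "muscle strain": 0.02,
        "reflux": 0.05,
        "clot on the lung": 0.05,
    },
    "pain on pressing chest wall": {
        "heart problem": 0.05,
        "muscle strain": 0.80,
        "reflux": 0.05,
        "clot on the lung": 0.10,
    },
}


def update(beliefs, finding):
    """One Bayes step: posterior is proportional to prior times likelihood."""
    unnormalised = {dx: p * LIKELIHOODS[finding][dx] for dx, p in beliefs.items()}
    total = sum(unnormalised.values())
    return {dx: p / total for dx, p in unnormalised.items()}


def ranked(beliefs):
    """Diagnoses ordered from most to least likely."""
    return sorted(beliefs, key=beliefs.get, reverse=True)


if __name__ == "__main__":
    print(ranked(PRIORS))
    print(ranked(update(PRIORS, "pain spreads to jaw")))
    print(ranked(update(PRIORS, "pain on pressing chest wall")))
```

With these made-up numbers, the muscle strain starts at the top of the list, pain spreading to the jaw moves the heart problem above it, and chest wall tenderness keeps the muscle strain in first place. The point is only that the ranking follows from stated priors and likelihoods, which is what grounding means here.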
What does it mean that medicine is also deterministic?
Deterministic means fixed: the same inputs always give the same result, like a rule or a calculation. Not all of medicine is a matter of odds. Some of it is simply fixed. A patient who has had their appendix out cannot have appendicitis, however well the symptoms seem to fit. A drug someone is allergic to is ruled out, every time. These rules do not bend, whoever is on shift and however the case is phrased.
These are not probabilities, and they are not meant to be. You do not want a likelihood here. You want certainty. A system that can only ever say "more likely" or "less likely" is no use for the parts of medicine that turn on "always" and "never". Good reasoning still has to obey the rules, and the two jobs need different tools.
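The difference from a probability can be sketched as a hard filter: a rule that removes an option outright instead of making it less likely. The code below is an assumption-laden illustration. The Patient class, the REQUIRES_ORGAN table and the drug names are all hypothetical, and real exclusion rules are far richer than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patient:
    """Hypothetical record holding only the facts the fixed rules need."""
    removed_organs: frozenset = frozenset()
    allergies: frozenset = frozenset()


# Hypothetical table: diagnoses that are impossible once an organ is gone.
REQUIRES_ORGAN = {"appendicitis": "appendix"}


def allowed_diagnoses(candidates, patient):
    """Drop any diagnosis whose required organ has been removed. No odds involved."""
    return [
        dx for dx in candidates
        if REQUIRES_ORGAN.get(dx) not in patient.removed_organs
    ]


def allowed_drugs(candidates, patient):
    """Drop any drug the patient is recorded as allergic to, every time."""
    return [drug for drug in candidates if drug not in patient.allergies]


if __name__ == "__main__":
    patient = Patient(
        removed_organs=frozenset({"appendix"}),
        allergies=frozenset({"drug_a"}),
    )
    print(allowed_diagnoses(["appendicitis", "reflux"], patient))
    print(allowed_drugs(["drug_a", "drug_b"], patient))
```

The same inputs give the same result on every call, however the case is phrased, which is the property the "always" and "never" parts of medicine depend on.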
Why is an ungrounded LLM flawed on both counts?
Put the two together and the problem with a general-purpose model is clear. The problem is not that it is probabilistic. Good clinical reasoning is probabilistic too. The problem is that its probability is ungrounded: it is not tied to any real model of how the body works or how illness behaves. Ungrounded is the word the whole argument turns on. The model is weighing which words are likely to come next, not which diagnoses are likely to be true. So it can produce a differential that reads like careful reasoning but is built on nothing more than the shape of medical language. This is the same reason an earlier post argued that a general chatbot is the wrong thing to lean on while you are still learning.
Take the reasoning first. The model's rankings are not real clinical likelihoods, so they can be confidently wrong, and they can change if you ask the same question twice. The research shows it. One JAMA Network Open study changed the answer options on medical questions in ways that should not matter to real reasoning, and leading models lost up to a third of their accuracy: a sign they were matching patterns, not reasoning through the medicine. A second study tested models on full clinical-reasoning tasks rather than multiple-choice exams, and found they do not hold up well enough to be relied on. A randomised trial found that giving doctors a language model did not measurably improve their diagnostic reasoning over the usual resources.
The fixed rules are worse, because the model has none. Tell it the appendix is already out, and it might drop appendicitis or it might not, and it might handle the same line differently next time. Anyone who has used one of these tools for real work knows the feeling: you spell out a rule, the tool follows it for a while, then quietly forgets it a few messages later. It cannot give you the "always" and "never" those rules depend on. Guaranteeing anything is not what a next-word predictor does.
What does a grounded, hybrid system look like?
The fix is not a bigger language model. It is to stop asking one tool to do every job. Each part of the problem needs the kind of system that suits it, and that is the direction serious clinical-AI work is taking.
A grounded system has three parts. First, a structured base of medical knowledge: conditions, their symptoms, how common they are, and how strongly each finding points one way or another. Second, reasoning that runs over that knowledge, the same weighing a doctor does but tied to real likelihoods instead of word patterns, with the fixed rules sitting alongside it for the things that must never bend. Third, the language model, doing the one thing it is genuinely good at: holding a natural conversation and explaining the reasoning back in plain English. So if you are weighing up an AI tool for medicine, that is what to look for: ask where the reasoning actually happens, whether it is grounded in real medical knowledge, and whether the fixed rules are guaranteed rather than left to the model's discretion. The knowledge does the reasoning, the rules guarantee the actions, and the model does the talking. That is the framework Gestalt is built on.
And this is not only our view. AIPatient, published in Communications Medicine in 2025, pairs a structured knowledge base with a language model for exactly this reason. Work in NEJM AI has shown that grounding a model in a trusted source of knowledge beats the model on its own. And a recent New England Journal of Medicine review treats these tools as something to be supervised, not handed the keys.
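The three-part split can be sketched in a few lines. This is a toy under stated assumptions, not a description of how Gestalt or any of the cited systems is implemented: the KNOWLEDGE table and its numbers are invented, and explain is a plain template standing in for the language model.

```python
# Part 1: structured knowledge. Hypothetical entries with made-up numbers.
KNOWLEDGE = {
    "appendicitis": {
        "prior": 0.2,
        "needs_organ": "appendix",
        "findings": {"right-sided abdominal pain": 0.8},
    },
    "gastroenteritis": {
        "prior": 0.8,
        "findings": {"right-sided abdominal pain": 0.1},
    },
}


def reason(findings, removed_organs):
    """Part 2: weigh diagnoses over the knowledge base, with fixed rules applied first."""
    scores = {}
    for dx, entry in KNOWLEDGE.items():
        # Fixed rule: a removed organ excludes the diagnosis outright.
        if entry.get("needs_organ") in removed_organs:
            continue
        score = entry["prior"]
        for finding in findings:
            # Findings missing from the table get a small assumed likelihood.
            score *= entry["findings"].get(finding, 0.01)
        scores[dx] = score
    total = sum(scores.values())
    return {dx: s / total for dx, s in scores.items()} if total else {}


def explain(posterior):
    """Part 3: stand-in for the language model. It only words what it is handed."""
    if not posterior:
        return "No candidate diagnoses remain."
    ordered = sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)
    return "Most likely first: " + "; ".join(f"{dx} ({p:.0%})" for dx, p in ordered)


if __name__ == "__main__":
    findings = ["right-sided abdominal pain"]
    print(explain(reason(findings, removed_organs=set())))
    print(explain(reason(findings, removed_organs={"appendix"})))
```

The design choice worth noticing is the direction of flow: the explanation layer receives a finished ranking and never alters it, so the exclusion of appendicitis after an appendicectomy is guaranteed by the code path, not by how the question happens to be phrased.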
Can't you just fix this with a better prompt?
The usual objection is that a good prompt closes the gap: tell the model to think like an experienced clinician and it will. Take that seriously for a moment. The model could read everything ever written about clinical reasoning and work out that the right move is to weigh the diagnoses by probability. But working out the right move is not the same as being able to make it. A prompt can ask the model to reason like a clinician; it cannot hand it the grounded likelihoods a clinician reasons with. So what tends to come back is a confident, tidy differential that reads as if it were grounded, whether or not it is. A prompt can ask for clinical reasoning. It cannot supply what clinical reasoning runs on.
None of this is to talk the models down. The people who built them have been genuinely surprised by how far they have come. A few years ago, few expected software to hold a real conversation or pull the sense out of a mass of text the way it now can, and much of it arrived faster than the researchers predicted. But that progress has been in language. In some fields, talking well is most of the job. Medicine is not one of them: it needs facts underneath, and rules that hold. That is why the grounding has to come from somewhere other than the words alone.
This is what people are really asking when they ask whether AI is safe in healthcare. Not whether it is clever; it plainly is. Whether it can be trusted. Medicine runs on trust, and trust is earned.
What does this mean for learning medicine?
Everything above holds wherever a general-purpose model is used in medicine. But education is where it bites hardest, for a reason specific to learning: the student is still working out how to get to the answer. They do not yet have the reasoning to judge what the model hands them, so they cannot tell when it has made something up, overweighted a rare diagnosis, or quietly dropped a rule it was told to follow. An experienced doctor catches those things. A student takes them on board.
That does harm twice over. In the moment, it makes the tool far less useful than it looks, because the person least able to spot the mistakes is the one leaning on it. Over the longer term it does the opposite of what learning is for. A wrong fact, or a piece of reasoning that was never really reasoning, gets absorbed and built on, and works its way into everything the student does next.
An education is not meant to fill someone with answers; it is meant to build the reasoning that lets them weigh an answer in the first place. That is what a student has to take inside themselves: how to think probabilistically, holding several possibilities and weighing them as the evidence comes in, while also learning the deterministic parts, the rules that simply hold. A grounded system can teach that, because the reasoning it shows is real and a student can follow it.
Get that architecture right and AI becomes one of the most powerful tools medical education has ever had. Reach for the chatbot instead, because it sounds like a doctor, and you teach the next generation to lean on the one part that was never built to carry that weight on its own.