Why medical students shouldn't rely on ChatGPT to get through medical school

ChatGPT can help medical students quickly access facts and test ideas. But clinical reasoning is built through supervised practice: learning the structure, forming differentials, choosing the next question, and working through uncertainty.

"ChatGPT could tell you the sky is red and then also tell you that your opinion about something that's incredibly controversial is correct. It's a kind of people pleaser. And when you're learning, having a people pleaser around you isn't necessarily great - you do need a little bit of constructive criticism."

Sixth-year medical student, University of Otago.

A senior clinician can take one look at a patient and land on the right diagnosis in seconds. Medical educators call that ability clinical gestalt. A medical student typing a clinical vignette into ChatGPT and watching a confident, fluent answer come back is a different kind of glance, one that can look the same on the surface but is produced by completely different machinery underneath.

The answers can look alike. The route to each of them is not the same, and the route is the part that matters when you are still learning.

This post is about that distinction. The previous post in this series argued that structured clinical reasoning is the part of medical education hardest to teach traditionally. This post turns to the other half of the same problem: why a medical student should not rely on a general-purpose chatbot to get through medical school.

What is clinical gestalt?

Clinical gestalt is the hyper-compressed pattern recognition senior clinicians develop after thousands of patient encounters: the ability to integrate signals into a confident, structured assessment in seconds, grounded in real clinical experience.

In the diagnostic reasoning literature, Pat Croskerry and others describe clinical reasoning through a dual-process model. Type 1 thinking is fast, intuitive, and pattern-based. Type 2 thinking is slower, analytical, and explicit. Experienced clinicians move between the two constantly: the fast impression, then the deliberate check; the pattern match, then the question of what would change the picture.

That matters because clinical gestalt can look like a shortcut from the outside. Sometimes it is. But at its best, it is not reasoning bypassing structure. It is structure that has been practised so many times it starts to operate below conscious awareness.

Gestalt™ is built around the structured route: simulated consultations, supervised reasoning practice, repeated until the structure becomes more fluent. The endpoint is clinical gestalt, which is what the platform is named after. The route is the work the platform takes students through to get there.

If reasoning remains explicit, it can be supervised, taught, and improved. If it cannot, it becomes a black box, even to the person relying on it. That matters for what comes next, because a general-purpose LLM is a different kind of black box.

What is the LLM equivalent, and why is it structurally different?

A general-purpose LLM is a different kind of glance: fluent, plausible, and built from language rather than clinical experience.

Models like ChatGPT, Claude, Gemini, and DeepSeek are trained to predict the next word across vast bodies of text. Any medical education or clinical tool built purely on that approach inherits the same fundamental issue: no amount of clever prompting, persona design, or wrapper engineering changes what the underlying model is doing.

A senior clinician shaped through supervised clinical practice has a reasoning route underneath the output that can, in principle, be made explicit: where the pattern recognition came from, what cases it generalised from, and what red flags would override it.

A general-purpose LLM has no clinical route underneath the output at all. The output is the route. There is no structured medical model that produced it. A senior clinician compresses experience caring for actual patients. A general LLM compresses language about that experience. The two outputs can look strikingly similar. They are produced by completely different machinery.

That distinction matters when a reasoning model shows its working. What it shows is still generated reasoning text: language about reasoning, not a verified path through medical knowledge.

Where ChatGPT, Claude and Gemini perform well, and where they don't

General-purpose LLMs perform well for low-stakes recall. They can also be useful as cognitive sparring partners: a way to test an idea, generate alternatives, or ask what you might be missing. But for learning a structured process like clinical reasoning, where the cost of a confident omission is real, they are the wrong tool.

Putting aside the obvious concerns about a student feeding real patient data into a general LLM, these models are good at recall-without-context tasks: looking up causes of chest pain, explaining a murmur, or summarising a case write-up. AI now beats humans on many knowledge-recall tasks, and rote recall is no longer a scarce or differentiating skill. The scarce thing is consultation practice and reasoning in context. That is the part a general LLM does not supervise well.

Medical students already use these tools at scale. Shi and colleagues, in a survey of 428 medical students published in JMIR Human Factors earlier this year, found that over 90% routinely use two or more AI platforms, and over 60% use three or more. A Canadian national survey published around the same time put the figure at 96.5% of medical students using at least one large language model. The question is not whether AI is in medical education. It is what that AI is grounded in.

What that AI is grounded in matters because using a general-purpose LLM as a substitute for supervised clinical reasoning practice carries real risk. The Stanford-Harvard NOHARM benchmark, published this year by the ARISE network, evaluated 31 large language models on 100 real primary-care-to-specialist consultation cases, with 12,747 expert annotations across ten specialties. Severe-harm potential, meaning the kind of error that could plausibly hurt a patient, was found in up to 22.2% of LLM clinical recommendations.

The striking finding is the shape of those errors: 76.6% were errors of omission. The patient who was not asked about the red-flag symptom. The differential that quietly excluded the dangerous one. The investigation that was never recommended because the model did not think to surface it.

A model can be confidently, fluently, plausibly wrong by leaving something out. A student who is still building their own clinical model does not yet have the experience to catch when the AI is confidently wrong by omission. That is exactly the gap supervised clinical practice is meant to close.

What goes wrong when you try to use the wrong tool for the job?

The risk is not mainly that ChatGPT gives a wrong answer. It is that the student stops doing the reasoning.

Clinical reasoning is not built by receiving a fluent answer. It is built by working through uncertainty: deciding what matters, forming a differential, choosing the next question, testing assumptions, and being corrected when the reasoning route is incomplete.

That is the part general-purpose LLMs are weakest at. A JAMA Network Open study evaluating 21 large language models across 29 standardised clinical vignettes found that differential diagnosis was consistently the weakest part of the clinical workflow. Models performed better on final diagnosis and management than on the earlier reasoning steps that get a clinician there.

That distinction matters for medical students. If a tool is strongest at producing the endpoint, but weaker at supervising the route, it can give the student the thing they most want while bypassing the thing they most need to practise.

A 2025 paper in Advances in Physiology Education described the student-side mechanism directly: the student takes the AI shortcut, never builds a differential, and misses the reasoning repetitions. Turney and colleagues gave the broader problem a name: upskilling inhibition.

This is not an argument for using less AI in medical education. It is an argument for using the right AI: one designed to make the student do the reasoning, not one designed to produce a fluent answer they can paste into a study-group chat.

What is Gestalt™, and what does it do differently?

Gestalt™ is a clinical reasoning practice platform for medical students, built around an anchored simulated consultation on top of structured medical knowledge.

The LLM does the language work - the part it does best - but the difference is what the system draws on, what the interaction is shaped by, and what the student is asked to practise.

Capability	Gestalt	ChatGPT, Claude, Gemini, DeepSeek
Designed for clinical reasoning practice	Yes. Built around simulated consultations.	No. Designed for general dialogue.
Student role	Works through the consultation and reasoning route.	Usually receives or co-produces a fluent answer.
Anchored in a curated clinical knowledge model	Yes. Guided by the Gestalt knowledge graph.	No. General-purpose models are trained for broad language tasks.
Reproducible clinical priorities across sessions	Yes. Same case, same clinical priorities.	No. Variable; prompt- and conversation-dependent.
Feedback guided by structured medical knowledge	Yes.	No. Not designed to be.
Built for the local clinical context	Yes. Aligned to local clinical resources.	Not specifically aligned to any clinical context.
Governance posture	Purpose-built for medical education.	Not designed for clinical safety.

Won't a better model, or better prompting, fix this?

A better model may produce a more accurate answer. That does not make it a better educational tool.

For clinical decision support, the answer is the product. For medical education, it cannot be. The product is the reasoning route: forming a differential, choosing what to ask next, testing assumptions, and learning to notice what is missing.

That is why model accuracy alone cannot solve the problem. If a system gives the student the answer without making them build the route, it can be impressive and still the wrong tool educationally.

The students who come out of medical school best prepared to practise alongside AI from day one will not be the ones who used generic AI models to get to answers fastest. They will be the ones who built the strongest internal clinical model, and who therefore know what good looks like when the AI in front of them is confidently wrong.

That is the endpoint Gestalt™ is named for, and built around.