What to look for in an AI OSCE platform: the architecture under the hood

Search for an AI OSCE platform and every product sounds the same. What separates them is architecture, not features: whether the clinical content is grounded in a structured knowledge layer a clinician can stand behind, or improvised by a general model. Here is how to tell which is which.

Image: a car with its hood up, exposing the engine and the working parts normally hidden beneath a smooth exterior. Caption: What sets AI OSCE platforms apart is not the interface students see, but the layers underneath it.

AI OSCE platforms are a young, fast-moving idea. The products that have appeared so far tend to describe themselves in much the same language: somewhere to practise your consultations, prepare for your OSCEs, and get instant feedback, some with a friendly avatar on the front. The feature lists are near-identical, which is no help at all when what you really need to know is which one to trust with a student's preparation.

The interest behind them is real. The supervised practice that builds consultation skills has never been harder to provide at the scale now being asked of it, as the first post in this series set out, and well-built AI is one of the few things that can meet that demand. Researchers have been studying language models for teaching and assessing clinical skills since soon after the technology arrived, with systematic reviews in Medical Education and studies of language models in OSCE-style skills practice in JMIR Medical Education. The direction of travel is clear: this will become part of how the next generation of doctors learns.

As that happens, the bar rises. It is no longer enough for one of these tools to hold a plausible conversation; it has to be built well enough to be trusted with the job, and how it is built matters far more than how it looks. The same underlying model, configured differently, can produce clinical content of very different quality, as JMIR Medical Education found in a comparison of model configurations.

So the question worth asking is not which platform has the longest feature list, but how each one is built underneath. A platform's architecture is what sits below the chat interface, and it sets the ceiling on what the tool can be trusted to do: a checklist wrapped around a general model can score how a consultation was conducted, while a platform built on a structured layer of clinical knowledge can judge whether the clinical thinking inside it was sound. That distinction, between a tool that makes up its medicine as it goes and one grounded in knowledge a clinician can stand behind, is what separates the serious platforms from the rest. Almost everything else worth comparing follows from it.

What are the main types of AI OSCE tool?

Students have long prepared for consultations by sharing notes and worked examples, and by practising on each other. Practising on a fellow student has obvious limits: a classmate already knows how the case turns out and can only play what they themselves understand. Students notice the difference, rating practice with a trained simulated patient as more realistic than role-play with a peer, as BioMed Research International reported, and unlike a peer a simulated patient can be standardised and reproduced from one student to the next. Static station libraries, banks of pre-written scenarios and model answers, added breadth but still could not respond to what you actually said.

Then language models arrived, and students and researchers began asking a chatbot to play a patient. It talks back, which makes it a reasonable way to rehearse the rhythm of a consultation, and students embraced it quickly. Today's medical students are AI natives, and surveys already show most of them using several AI tools across their studies, as reported in JMIR Human Factors. That fluency cuts both ways: because they know these tools well, they also know the limits, and few trust a raw chatbot as fit for clinical practice, a case made in full in a companion piece on why students should not lean on ChatGPT to get through medical school. A raw chatbot improvises its clinical details from plausible language, so they drift and are sometimes wrong. That is structural rather than a matter of prompting: across standardised clinical vignettes, current models are weakest at the reasoning steps of a case, not the final answer, as JAMA Network Open reported.

An obvious early step was to layer a communication rubric over a general model, a checklist for things like signposting, building rapport, and safety-netting. It is a thin addition: handling the flow of a conversation is something general models already do, so the checklist mostly formalises what the model was doing anyway. What it does not add is any grounding in clinical knowledge, so there is still no way to tell whether the clinical substance of the consultation was right.

The next step is different in kind. Rather than prompt a general model and hope, it uses a structured layer of clinical knowledge to constrain the model: a map of how findings point to conditions and how conditions connect to investigations and management, which the model can draw on but cannot override. The simulated patient stays clinically consistent from one encounter to the next, and the feedback afterwards reviews the consultation across multiple competencies. This is the direction clinical-grade AI is heading, in the clinic and in the lecture theatre alike. The goals differ: clinical AI is built to reach the correct answer as quickly as possible, while educational AI is built to teach the reasoning that gets you there. But the standard should not. Medical education deserves the same rigour we would expect of any tool used at the bedside.

What does a knowledge-grounded platform look like under the hood?

When people say a platform is grounded in clinical knowledge, it helps to be concrete about what that means, because the phrase is easy to claim and much harder to build. A platform like this has three distinct layers underneath it, and the separation between them is the whole point.

At the bottom is a knowledge layer: a structured representation of clinical content, where findings connect to the conditions they point to, and conditions connect to the investigations and management they call for. This is closer to a map of how clinical reasoning fits together than to a pile of documents, and it is the part a general model does not have. Building and maintaining it is most of the work, which is why most providers avoid it.
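To make the idea of a map concrete, here is a minimal Python sketch of what a knowledge layer could look like as a data structure. Everything in it is an assumption made for illustration: the class names, the fields, and the toy clinical content are invented here, have not been clinically reviewed, and do not describe how any particular platform is built.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Condition:
    """One node in a toy knowledge layer (illustrative content only)."""
    name: str
    findings: set[str] = field(default_factory=set)   # findings that point to it
    investigations: list[str] = field(default_factory=list)
    management: list[str] = field(default_factory=list)


class KnowledgeLayer:
    """Hypothetical map: findings -> conditions -> investigations and management."""

    def __init__(self, conditions: list[Condition]) -> None:
        self._conditions = {c.name: c for c in conditions}

    def conditions_for(self, observed: set[str]) -> list[tuple[str, int]]:
        """Rank conditions by how many of the observed findings point to them."""
        scored = [
            (c.name, len(c.findings & observed))
            for c in self._conditions.values()
        ]
        ranked = [(name, score) for name, score in scored if score > 0]
        # Highest overlap first; ties broken alphabetically for stable output.
        return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))

    def workup_for(self, condition: str) -> dict[str, list[str]]:
        """Look up what a condition calls for; the caller cannot alter the map."""
        c = self._conditions[condition]
        return {
            "investigations": list(c.investigations),
            "management": list(c.management),
        }


# Toy content, invented for this example and NOT clinically verified.
layer = KnowledgeLayer([
    Condition("asthma", {"wheeze", "night cough", "breathlessness"},
              ["peak flow", "spirometry"], ["inhaled bronchodilator"]),
    Condition("heart failure", {"breathlessness", "ankle swelling", "orthopnoea"},
              ["BNP", "echocardiogram"], ["diuretic"]),
])

print(layer.conditions_for({"breathlessness", "ankle swelling"}))
print(layer.workup_for("heart failure"))
```

The point of the sketch is the shape, not the content: the links between findings, conditions, and next steps are explicit data that a clinician could read and correct line by line, which is not true of text generated afresh each time.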

Above that sits an orchestration layer, which decides how the knowledge is used in a given encounter: what a simulated patient should disclose and when, which findings are there to be uncovered if the student thinks to look, and what a sound consultation on this case would have covered. This is what keeps the patient consistent from one encounter to the next, and what gives the feedback something stable to mark against.
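A rough sketch of that orchestration role, again in Python and again hypothetical throughout: the Fact and CaseOrchestrator names, the topics, and the patient's lines are invented for illustration. In a real system a language model would presumably map the student's free text to a topic and phrase the reply; the sketch skips that step and shows only the part that stays fixed.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """A fixed case fact and the question topic that unlocks it."""
    topic: str                 # what the student has to ask about
    answer: str                # what the patient then says
    volunteered: bool = False  # offered in the opening statement?


class CaseOrchestrator:
    """Hypothetical orchestration for one case: fixed facts, gated disclosure."""

    def __init__(self, facts: list[Fact]) -> None:
        self._facts = {f.topic: f for f in facts}
        self.covered: set[str] = set()

    def opening(self) -> list[str]:
        """What the patient says unprompted at the start."""
        return [f.answer for f in self._facts.values() if f.volunteered]

    def ask(self, topic: str) -> str:
        """Disclose a fact only when the student asks about its topic."""
        fact = self._facts.get(topic)
        if fact is None:
            # Nothing in the case about this: deflect instead of inventing.
            return "I'm not sure, nobody has mentioned that to me."
        self.covered.add(topic)
        return fact.answer

    def missed(self) -> list[str]:
        """Topics a sound consultation would have covered but this one did not."""
        return sorted(
            topic for topic, fact in self._facts.items()
            if not fact.volunteered and topic not in self.covered
        )


# Toy case, invented for illustration.
case = CaseOrchestrator([
    Fact("presenting complaint", "I've been short of breath for a week.",
         volunteered=True),
    Fact("smoking", "About twenty a day."),
    Fact("family history", "My dad had heart trouble."),
])

print(case.opening())
print(case.ask("smoking"))
print(case.missed())
```

Because the facts are fixed and disclosure is gated by what the student asks, two students running the same case meet the same patient, and the list of missed topics gives the feedback a stable reference.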

On top is the interaction layer, the part the student actually sees: the conversation, the voice, the interface. It is the most visible layer and the least decisive. A general model can produce a fluent interaction layer on its own, which is exactly why a fluent conversation tells you so little about what is holding it up.

The order is hard to reverse. A platform that starts at the interaction layer, a chatbot with a clinical prompt, cannot grow a knowledge layer by adding more prompts, whereas a platform built knowledge-first can always put a better conversation on top. That difference in where a platform starts, more than any single feature, is what really sets these tools apart.

How can you tell which is which?

The simplest thing is to ask. A tool that is essentially a wrapper around a general model will rarely claim to be more than that, and the teams who have done the deeper work are usually glad to explain how it is built.

The table below sets out what to compare once you start asking.

What to look at	Static library	General model + prompt	Rubric + model	Knowledge-grounded
How patients are made	Pre-written, fixed	Improvised each run	Improvised each run	From a knowledge layer
What feedback uses	A fixed answer	Whatever the model says	A comms checklist	Multiple competencies
Clinically verified content	Only as written	No	No	Yes
Governance	Depends on author	None inherent	Limited to rubric	Built for oversight
Cross-mode tracking	Limited	None	Limited	Supported
High-stakes use	Limited	Low	Not designed for it	Designed for it

What can a knowledge-grounded platform do that the others cannot?

One place the difference shows is in the feedback. A structured knowledge layer gives a platform something firmer to measure a consultation against than the surface of how it was conducted, so the feedback can speak to the clinical reasoning behind a consultation, not only the manner of it. That matters because reasoning is what this kind of practice improves most: virtual patient simulations produce their strongest effects on clinical reasoning rather than communication, as reported in the Journal of Medical Internet Research and BMC Medical Education. Reasoning is also the harder thing to assess well without that structure underneath.
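As a hedged illustration of what marking against a knowledge layer might involve, the Python sketch below compares what a student covered with what a case is expected to cover, one competency at a time. The EXPECTED table, the competency names, and the scoring rule (a simple fraction covered) are all assumptions made for this example, and the clinical content is a toy that has not been clinically verified.

```python
from __future__ import annotations

# Hypothetical expected content for one case, as a knowledge layer might supply it.
EXPECTED: dict[str, set[str]] = {
    "history": {"onset", "smoking", "orthopnoea"},
    "investigations": {"BNP", "echocardiogram"},
    "differentials": {"heart failure", "asthma"},
}


def reasoning_feedback(done: dict[str, set[str]]) -> dict[str, dict]:
    """Per competency: fraction of expected items covered, and what was missed."""
    report = {}
    for competency, expected in EXPECTED.items():
        covered = expected & done.get(competency, set())
        report[competency] = {
            "score": round(len(covered) / len(expected), 2),
            "missed": sorted(expected - covered),
        }
    return report


# What one (invented) student actually did in the consultation.
student = {
    "history": {"onset", "smoking"},
    "investigations": {"BNP", "echocardiogram"},
    "differentials": {"asthma"},
}

feedback = reasoning_feedback(student)
for competency, result in feedback.items():
    print(competency, result)
```

A communication checklist has no equivalent of the expected sets here, which is why it can comment on how the consultation was run but not on whether the right differential was considered.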

Then there is who can stand behind the content. Someone practising on their own can still get something out of a general chatbot. An institution is held to a higher bar: it has to be able to say what its teaching tools contain, where that content came from, and who is accountable for it. A structured knowledge layer can be reviewed, corrected, and signed off by the clinicians who carry that responsibility, where an improvising model leaves them nothing fixed to review. A student practising privately can overlook that; a medical school putting a tool in front of a whole cohort cannot.

None of this follows from a better prompt, which is also why a general model cannot be trusted to run a safe consultation on its own, however well written the prompt.

Isn't this just a marketing frame?

There is an obvious objection here: a company that has built its product around a clinical knowledge layer is hardly neutral when it argues that the knowledge layer is what matters. Fair enough, so do not take it on trust, including from us.

The point holds regardless of who makes it, because it is not really a matter of opinion. Either a platform constrains its clinical content to a structured layer of knowledge that clinicians can review and correct, or it generates that content from a general model each time. That is a fact about how the system is built, and you can check it by asking. The architecture is either there or it is not.

What should you ask before choosing an AI OSCE platform?

A few questions will quickly show you which kind of tool you are really looking at.

Where does the clinical content come from? If the patient and the feedback are generated by the model as it goes, the clinical substance is only as good as the model on the day. If there is a structured clinical knowledge base underneath, ask what that knowledge is based on.
Can a clinician review and correct what the platform teaches? Structured knowledge can be inspected, edited, and signed off. A model improvising each encounter leaves a reviewer nothing fixed to check.
Does the same case stay consistent if you run it more than once? A platform grounded in a structured base of clinical knowledge keeps it stable from one run to the next. A wrapper around a general model has nothing holding the detail in place, so it tends to drift or contradict itself, which a couple of runs will quickly expose.
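The consistency question is the easiest of the three to test for yourself. One way to do it, sketched in Python below, is to run the same case a few times, note the key facts the patient gives each time, and compare them. The find_drift function and the sample runs are hypothetical, and the sketch assumes the facts have already been pulled out of each transcript by hand.

```python
from __future__ import annotations

from collections import defaultdict


def find_drift(runs: list[dict[str, str]]) -> dict[str, set[str]]:
    """Return each case fact that took more than one value across runs."""
    seen: dict[str, set[str]] = defaultdict(set)
    for run in runs:
        for fact, value in run.items():
            # Normalise case and whitespace so "Yes" and "yes " count as the same.
            seen[fact].add(value.strip().lower())
    return {fact: values for fact, values in seen.items() if len(values) > 1}


# Facts noted from three runs of the same case (invented for illustration).
runs = [
    {"age": "54", "smoker": "yes", "symptom duration": "1 week"},
    {"age": "54", "smoker": "no", "symptom duration": "1 week"},
    {"age": "61", "smoker": "yes", "symptom duration": "1 week"},
]

drift = find_drift(runs)
for fact, values in sorted(drift.items()):
    print(fact, "varied across runs:", sorted(values))
```

An empty result across several runs does not prove a platform is knowledge-grounded, but a patient whose age or smoking history changes between runs is a strong sign that nothing is holding the detail in place.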

None of these is a trick question, and a provider that has done the real work will happily talk through all three. If the answers keep returning to the feature list rather than what sits underneath, that is worth noticing.