Comparing AI to live practitioners: what have we learned so far?
“You know how sometimes when you make a copy of a copy it’s not quite as sharp as… well, the original…” Doug Kinney (played by Michael Keaton), Multiplicity, 1996
By Rachael Doherty (CCH)
Comparison of live vs automated acute care
The Institute for the Advancement of Homeopathy’s (HOHM Foundation) research team has spent the last two years working to better understand how artificial intelligence (AI) will affect the homeopathy community and the practice of this unique healing art. These efforts are now coming to fruition in an emerging series of peer-reviewed papers that aim to shed light on this evolving phenomenon. Our central research question is whether automated remedy finders are able to reliably replace live practitioners.
So far we have investigated one purpose-built, non-large language model (LLM) homeopathic remedy finder (Homeopathic HouseCall) and four general-purpose LLM-driven AI chatbots (ChatGPT, Grok, Claude, and DeepSeek), comparing each of them to a dataset of 100 live acute cases from the Homeopathy Help Network, a low-cost online homeopathy clinic operated in partnership with the Academy for Homeopathy Education.
Our research focus to date has been limited to the initial intake interview, comparing only the first remedy recommendation from the live practitioner to the automated remedy recommendations. While we have not yet analyzed how live clients responded to the first remedy given, we do know that of the 100 cases analyzed, 81.6% were either “resolved” or “much better” by the end of treatment (Doherty et al. 2025).
Because our investigations to date have been based on practitioners’ intake notes, we have not yet begun to estimate how well automated tools interface directly with consumers. Consumers would presumably find it easier to answer the directed questions of a purpose-built remedy finder; unless they are very familiar with homeopathic principles, we must assume they would be less likely to enter complete symptom information (modalities, sensations, etiology, and concomitants) into an unstructured LLM-driven AI chatbot.
Remedy finders vs LLMs – different levels of sophistication with similar results
During our investigation of the purpose-built (non-LLM) remedy finder, we answered the automated, complaint-tailored questions for each case as best we could based on the information in the initial intake notes. We recognized this as an inherent weakness in the research design because many answers (63%) were either “no” or “not applicable” (Doherty et al. 2025). This was due in part to the questions themselves, but we also answered this way whenever the case notes did not contain the relevant information. This likely decreased the overall remedy match rate, but it helped us to understand the sheer scope of the task before us.
For our investigation of the LLM-driven AI chatbots (publication pending), there was no such limitation because we were able to simply paste the (anonymized) intake notes into the chatbots and ask for remedy recommendations, justifications, and the sources used. In cases for which there was not a top remedy match, we asked the AI chatbot why the practitioner-recommended remedy was not chosen.
Despite the varying degrees of sophistication in the automated tools and the structural limitations inherent in the non-LLM remedy finder, we found that both the purpose-built (non-LLM) remedy finder and the four LLM-driven AI chatbots had similar top remedy match rates with live practitioners, ranging from 17% to 24%.
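For readers wondering what a “top remedy match rate” means in practice, the minimal sketch below shows one way such a figure can be tallied. The data, field names, and remedy choices are hypothetical and included only to illustrate the bookkeeping; this is not the analysis code used in our studies.

```python
# Minimal sketch (hypothetical data): tallying a top remedy match rate
# between a tool's first recommendation and the practitioner's first remedy.

cases = [
    {"practitioner": "Pulsatilla", "tool_top": "Pulsatilla"},
    {"practitioner": "Ipecacuanha", "tool_top": "Phosphorus"},
    {"practitioner": "Bryonia", "tool_top": "Bryonia"},
    {"practitioner": "Arnica", "tool_top": "Rhus toxicodendron"},
]

def top_match_rate(records):
    """Share of cases where the tool's top remedy equals the practitioner's first remedy."""
    matches = sum(1 for r in records if r["tool_top"] == r["practitioner"])
    return matches / len(records)

print(f"Top remedy match rate: {top_match_rate(cases):.0%}")  # 50% for this toy data
```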
LLMs – inconsistent and flip-floppy
However, despite being provided with identical input, the AI chatbots were not consistent in their remedy recommendations. The inconsistency appeared not only across platforms but also, at times, within a single platform, which could return different recommendations for identical queries submitted only a few minutes apart.
For these reasons alone, and even allowing for the inherent design limitations of our studies, we are confident that automated tools are not yet at the point where they can replace a live practitioner, even for acute cases. But there are additional reasons for this conclusion that go beyond low remedy match rates and inconsistent output.
One AI chatbot in particular (Claude) consistently flip-flopped when asked why its top recommendation was at odds with the practitioner’s. We pushed back on Claude’s top recommendation 75 times when it did not match the practitioner’s, asking why the practitioner-recommended remedy was not chosen. Claude held its ground with its original recommendation in only 10 cases; in the remaining 65 cases, it either revised its recommendation (21 cases) or flipped outright to the practitioner-recommended remedy (44 cases).
There were also times when all four AI chatbots reached consensus on a top remedy recommendation, yet that consensus agreed with the live practitioner in only 6% of cases. In the 10% of cases (10 cases) in which the four-chatbot consensus was at odds with the practitioner’s recommendation, seven were rated by the practitioner as “resolved” by the end of the live-managed case; of the remaining three, two were “much better” and one was “somewhat better.”
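To make the consensus bookkeeping above concrete, here is a toy tally on made-up data, assuming one top recommendation per chatbot per case; the remedy names and structure are illustrative only and do not reproduce our study data.

```python
# Toy tally (hypothetical data): how often all four chatbots agree on a top
# remedy, and how often that consensus matches the live practitioner.

cases = [
    {"practitioner": "Arnica", "chatbot_tops": ["Arnica", "Arnica", "Arnica", "Arnica"]},
    {"practitioner": "Gelsemium", "chatbot_tops": ["Bryonia", "Bryonia", "Bryonia", "Bryonia"]},
    {"practitioner": "Nux vomica", "chatbot_tops": ["Nux vomica", "Sulphur", "Ignatia", "Nux vomica"]},
]

consensus = [c for c in cases if len(set(c["chatbot_tops"])) == 1]           # all four agree
agree = [c for c in consensus if c["chatbot_tops"][0] == c["practitioner"]]  # ...and match the practitioner

print(f"Four-way consensus in {len(consensus)} of {len(cases)} cases; "
      f"{len(agree)} of those matched the practitioner.")
```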
Dodgy sourcing and flawed logic
We also found that AI chatbots made some recommendations based on dodgy sources. In one of the 100 cases investigated, DeepSeek recommended a remedy other than the practitioner’s choice (Ipecacuanha), based in part on Ipecacuanha’s purported absence from the rubric for bloody expectoration in James Tyler Kent’s Repertory of the Homeopathic Materia Medica. When we asked for sources for that (incorrect) assertion, DeepSeek provided several. When we tried to track them down, we found not only dead internet links but also an incorrect version of the rubric online. In real life, the client’s symptoms resolved after taking only Ipecacuanha for four days (symptom onset was 11 days before the initial intake).
We found that LLM-driven AI chatbots sometimes used flawed logic to justify recommendations. They misrepresented the action of some remedies (e.g., the kinds of nasal discharge that Pulsatilla can produce, or Lycopodium’s relationship with warm drinks) and the way that fifty-millesimal (“Q” or “LM”) potencies work. They failed to usefully apply acute/chronic remedy relationships, and they excluded remedies from consideration after categorizing them as “constitutional” and therefore inappropriate for acute complaints. We also found posology recommendations at odds with homeopathic philosophy, at times unnecessarily recommending the use of multiple remedies at the same time.
In one case, a concept coined by ChatGPT itself – “remedy saturation” – was applied to justify a choice at odds with the live practitioner’s. In real life, the “saturated” remedy was Pulsatilla, which the client had been taking chronically in an LM3 potency. Although ChatGPT ruled out Pulsatilla altogether on the grounds of “remedy saturation,” the practitioner switched to Pulsatilla 30c, and the case, which had lingered for nine days prior to consultation, immediately began moving toward resolution.
What’s next for AI and homeopathy?
So what do these findings mean as the homeopathy profession stands at the brink of the age of AI?
Validating AI algorithms – that is, refining an automated model based on real-world feedback on its recommendations – is one potential way to make remedy selection algorithms more accurate. The purpose-built (non-LLM) remedy finder we investigated, which was designed by homeopaths, provides a mechanism for this by asking the user to voluntarily report which remedies were used and whether they helped to alleviate symptoms. Because this reporting is voluntary, however, it will take time to accumulate sufficient data, and the data will likely be uneven across complaints as it accumulates. For example, in our 100-case dataset, we queried only 22 of the remedy finder’s 74 possible acute complaints, sometimes with only one case in a given complaint category (Doherty et al. 2025). The LLM-driven AI chatbots, on the other hand, appear to be “trained” on a range of sources collected from all over the internet, not necessarily on cured clinical cases. Although this is in line with what would be expected from a non-purpose-built LLM, it does little to create a robust tool grounded in real-world data.
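As a rough illustration of what such a feedback loop involves, the sketch below accumulates voluntary outcome reports per complaint-and-remedy pair. The structure, labels, and example entries are assumptions made for illustration; they do not describe how Homeopathic HouseCall actually stores or uses its feedback. Even a toy version makes the unevenness problem visible: pairs with only a handful of reports cannot validate anything on their own.

```python
# Illustrative sketch only (hypothetical structure): accumulating voluntary
# outcome feedback per (complaint, remedy) pair for later validation.
from collections import defaultdict

feedback = defaultdict(lambda: {"helped": 0, "did_not_help": 0})

def record_outcome(complaint, remedy, helped):
    """Record one voluntary report of whether a remedy helped a given complaint."""
    feedback[(complaint, remedy)]["helped" if helped else "did_not_help"] += 1

def helped_rate(complaint, remedy):
    """Share of reports saying the remedy helped, or None if there are no reports yet."""
    counts = feedback[(complaint, remedy)]
    total = counts["helped"] + counts["did_not_help"]
    return counts["helped"] / total if total else None

# A few hypothetical reports for a single complaint category
record_outcome("acute cough", "Ipecacuanha", helped=True)
record_outcome("acute cough", "Ipecacuanha", helped=True)
record_outcome("acute cough", "Phosphorus", helped=False)

print(helped_rate("acute cough", "Ipecacuanha"))  # 1.0 on this toy data
print(helped_rate("acute cough", "Bryonia"))      # None: no reports yet
```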
The creation of such feedback loops is more likely to be found in purpose-built LLM platforms (which we have yet to investigate) than in general-purpose LLM platforms, but the exercise creates potential logistical, ethical, and even philosophical dilemmas. For a complaint to be validated, sufficient (private) health information will be needed to train the model. But how many cured cases are needed to validate an algorithm for a given complaint? What privacy protections are in place for platforms that accumulate health information, who will have access to it, and how will it be used? And how accurate is the health information being fed into the models in the first place?
The biggest LLM-driven AI chatbots have a poor track record on privacy protection (https://news.stanford.edu/stories/2025/10/ai-chatbot-privacy-concerns-risks-research), and it is unclear how they would be able to accumulate enough cases to validate any model except, perhaps, for a limited set of acute complaints. All of these questions sit on top of the underlying philosophical quandary of how to structure an algorithm that looks at health in the holistic way homeopathy requires, with a potentially infinite array of symptom presentations and a variety of ways to identify a supportive remedy.
The importance of solid research
In the literature review for our second peer-reviewed paper, we encountered an unexpected but, in hindsight, unsurprising problem fully in keeping with the AI zeitgeist. Three of the journal articles we found in the initial search that purported to describe the AI landscape in homeopathy were, upon closer inspection, fake. Not just a little fake. Really fake. Fake studies, fake references, and fake findings, sometimes citing real authors and real journals but with fake articles. This phenomenon is not limited to the field of homeopathy (https://www.nature.com/articles/d41586-025-03341-9), but the fact that it exists in our profession means that we must proceed with the caution, skepticism, and rigor that Samuel Hahnemann himself would have brought to bear when faced with the same situation.
So what is the difference between a live practitioner and an authoritative-sounding AI chatbot if the top remedy recommendations are not consistent? It is a well-worn trope that if you put five homeopaths in a room to look at the same case, you will get five different remedy recommendations. This is essentially what we encountered when we set automated remedy recommendations against those of live practitioners: one set of symptoms could yield as many as five top remedy recommendations (or six, if you count the non-LLM remedy finder), not to mention the secondary recommendations, which were even more numerous (sometimes as many as 20 potential remedies for a single complaint).
The proof of the pudding will have to be in the eating. In order to judge whether a remedy recommendation is correct, the response of the client must be observed and, whenever possible, recorded. Although voluntary user-initiated feedback would be ideal, practitioner-based feedback may be more realistic. However, it is not yet clear whether this algorithm-improvement mission will be embraced by enough users and practitioners to be effective.
Regardless of whether practitioner-recorded results are ultimately used to train AI algorithms, gathering reliable information about case outcomes will nevertheless be critical as homeopaths continue the effort to help this unique healing art find its rightful place in modern medicine. The more data we collect that points to beneficial clinical outcomes, the closer we get to that goal. The HOHM Foundation’s research office (https://advancehomeopathy.org/research/) is dedicated to this effort, and we welcome input and participation from the homeopathy community as we proceed.
Reference: Doherty, R., Pracjek, P., Luketic, C. D., Straiges, D., & Gray, A. C. (2025). The Application of Artificial Intelligence in Acute Prescribing in Homeopathy: A Comparative Retrospective Study. Healthcare, 13(15), 1923. https://doi.org/10.3390/healthcare13151923
Author: Rachael Doherty (CCH) is a research consultant at the Institute for the Advancement of Homeopathy (HOHM Foundation), an acute clinic supervisor at the Homeopathy Help Network, and the owner of Nomad Homeopathy LLC. She is a 2023 graduate of the Academy of Homeopathy Education and an Associate Member of the North American Society of Homeopaths, and she holds a Bachelor of Arts in Russian from The Ohio State University and a Master of Arts in Law and Diplomacy from the Fletcher School at Tufts University. She is also a former U.S. naval officer and a retired U.S. Foreign Service Officer.