2025-05-04
Review of: Heinz et al. 2025. Randomized Trial of a Generative AI Chatbot for Mental Health Treatment. NEJM AI.
NEJM AI just published an 8-week RCT of Therabot, an app‑based conversational agent that delivers brief CBT, ACT, and DBT exercises.
The authors start with GPT‑3‑class, decoder‑only transformers hosted on AWS (Falcon-7B and LLaMA-2-70B), but they don't dwell on model design. Their secret sauce is the data curation: they fine‑tune on a hand‑curated corpus of therapist–patient transcripts (>100 k human‑hours) using QLoRA for efficiency, then freeze the weights. Architecturally, it's basically standard stuff.
The paper reports Cohen’s d = 0.845–0.903 for MDD. Looking closer:
If you instead divide the raw 3.5‑point PHQ‑9 gap by the pooled SD (~6.1), you get a classic d ≈ 0.57. That is a respectable, but not huge, effect, and in line with past internet‑CBT trials that use wait‑list controls.
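The arithmetic is simple enough to sketch. The 3.5-point gap and ~6.1 pooled SD are the figures quoted above; the pooled-SD helper is just the standard two-group formula, included for readers who want to redo the check from group-level stats:

```python
import math

def pooled_sd(n1: int, s1: float, n2: int, s2: float) -> float:
    """Standard pooled standard deviation for two independent groups."""
    return math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

def cohens_d(mean_diff: float, sd: float) -> float:
    """Classic Cohen's d: raw mean difference over pooled SD."""
    return mean_diff / sd

# Numbers quoted in this post: 3.5-point PHQ-9 gap, pooled SD ~6.1
d = cohens_d(3.5, 6.1)
print(f"classic d ≈ {d:.2f}")  # prints: classic d ≈ 0.57
```

Contrast that 0.57 with the 0.845–0.903 the paper reports from its ordinal model.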
Open‑weight viability. The prototype ran on open Llama‑2 weights, not a proprietary GPT—proof that careful data curation + low‑rank fine‑tuning can yield clinically persuasive behaviour.
Safety is a product spec. Guard‑rails, a crisis classifier, and human monitoring were baked into the architecture before a single patient was enrolled.
QLoRA at clinical scale. The team reports no GPU blow‑ups fine‑tuning 70B parameters thanks to quantised adapters—useful pattern for other health‑AI builders.
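A back-of-envelope memory budget shows why quantised adapters make 70B fine-tuning tractable. The rank, the four-projection target set, and the square-projection simplification below are illustrative assumptions, not details from the paper:

```python
# Rough QLoRA memory budget for a 70B model (illustrative assumptions:
# rank-16 adapters on 4 attention projections, square d x d projections,
# Llama-2-70B-like dims: hidden 8192, 80 layers; GQA details ignored).
N_PARAMS = 70e9

def gb(n_bytes: float) -> float:
    return n_bytes / 1e9

fp16_gb = gb(N_PARAMS * 2)    # full fp16 base weights: 2 bytes/param
nf4_gb = gb(N_PARAMS * 0.5)   # 4-bit NF4 quantised base: 0.5 bytes/param

r, d, n_proj, n_layers = 16, 8192, 4, 80
adapter_params = 2 * r * d * n_proj * n_layers  # A and B matrices per proj
adapter_gb = gb(adapter_params * 2)             # adapters kept in 16-bit

print(f"base fp16: {fp16_gb:.0f} GB, base nf4: {nf4_gb:.0f} GB, "
      f"adapters: {adapter_params / 1e6:.0f}M params ≈ {adapter_gb:.2f} GB")
# prints: base fp16: 140 GB, base nf4: 35 GB, adapters: 84M params ≈ 0.17 GB
```

The point: quantisation cuts the frozen base from ~140 GB to ~35 GB, and the trainable adapters are a rounding error on top, which is why the fine-tune fits on commodity clusters.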
Wait‑list ≠ placebo. Participants know they’re getting nothing, inflating expectancy effects. An active digital control would almost certainly shrink d to 0.2 – 0.4.
Short follow‑up. Four‑week post‑treatment data capture the biggest divergence; longer horizons are needed to test durability.
Precision limits. Each diagnostic stratum has only ~70 participants per arm; the standard large‑sample formula puts 95 % CIs on d at roughly ±0.3, so replication could easily land lower.
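A quick sanity check, assuming equal arms of n ≈ 70 and the usual large-sample approximation for the sampling variance of d (the paper may have computed its intervals differently):

```python
import math

def d_ci_halfwidth(d: float, n1: int, n2: int, z: float = 1.96) -> float:
    """95% CI half-width for Cohen's d, large-sample approximation:
    Var(d) ≈ (n1 + n2)/(n1 * n2) + d^2 / (2 * (n1 + n2))."""
    var = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))
    return z * math.sqrt(var)

# d values in the range discussed above, ~70 per arm per stratum
for d in (0.57, 0.87):
    print(f"d = {d}: 95% CI ≈ ±{d_ci_halfwidth(d, 70, 70):.2f}")
```

With these inputs the half-widths come out in the low 0.3s, wide enough that a replication in the 0.3–0.5 range would be unsurprising.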
Ordinal‑to‑d conversion. Useful when raw means are unavailable, but readers should translate back to more intuitive metrics (e.g., PHQ‑9 points, remission rates).
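For instance, multiplying the paper's reported d range back by the pooled SD quoted earlier (~6.1) implies a gap of over 5 PHQ‑9 points, versus the 3.5 actually observed, which is exactly why the conversion deserves scrutiny:

```python
# Translate the reported effect sizes back into PHQ-9 points,
# using the ~6.1 pooled SD quoted earlier in this post.
POOLED_SD = 6.1

for d in (0.845, 0.903):
    implied_points = d * POOLED_SD
    print(f"d = {d} implies a {implied_points:.1f}-point PHQ-9 gap")
# prints implied gaps of ~5.2 and ~5.5 points, vs the 3.5 observed
```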
Therabot shows that an expert‑tuned, open‑weight LLM can deliver a moderate clinical benefit over no treatment in just four weeks, which is a big symbolic win for generative AI in healthcare. The statistics, however, are best read as an upper bound: swap in an active comparator or run the trial for six months and the effect will likely attenuate.
For practitioners the lesson is clear: quality data + safety engineering > fancy architecture, while for analysts the study is a reminder that how you compute effect size can be almost as important as the effect itself.