AI Therapy: What the First RCT of a Generative‑AI Therapist Really Tells Us

2025-05-04

Review of: Heinz et al. 2025. Randomized Trial of a Generative AI Chatbot for Mental Health Treatment. NEJM AI. link

NEJM AI just published an 8‑week RCT (four weeks of treatment, four of follow‑up) of Therabot, an app‑based conversational agent that delivers brief CBT, ACT, and DBT exercises.

1. Trial Design, in brief

  • Design: 210 U.S. adults with major‑depressive (MDD), anxiety (GAD) or eating‑disorder risk (CHR‑FED) symptoms were randomised 1:1 to four weeks of Therabot or a wait‑list (no‑treatment) control.
  • Outcomes: Standard depression, anxiety, eating-disorder risk measures (PHQ‑9, GAD‑7 or EDE‑Q scores at Week 4 and Week 8).
  • Engagement: 95 % of assigned users talked to the bot; median 260 messages (~6 h total use).

2. Under the hood

The base models are GPT‑3‑class, decoder‑only transformers hosted on AWS (Falcon‑7B and LLaMA‑2‑70B); the authors don't dwell on model design. The secret sauce is data curation: they fine‑tune on a hand‑curated corpus of therapist–patient transcripts (>100 k human‑hours), using QLoRA for efficiency, then freeze the weights. Architecturally, basically standard stuff.
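To see why quantised adapters make a 70B fine‑tune tractable, here is a back‑of‑envelope memory sketch. These figures are my own illustrative assumptions (bytes per parameter, adapter fraction), not numbers reported by the paper:

```python
# Rough memory math for fine-tuning a 70B-parameter model.
# All constants below are illustrative assumptions, not from the paper.

def full_finetune_gb(n_params: float) -> float:
    """fp16 weights + fp32 Adam states + fp32 master copy ~ 16 bytes/param."""
    return n_params * 16 / 1e9

def qlora_gb(n_params: float, trainable_frac: float = 0.001) -> float:
    """4-bit frozen base (~0.5 byte/param) plus fp16 weights and optimizer
    state (~16 bytes each) for a tiny set of LoRA adapter parameters."""
    base = n_params * 0.5 / 1e9
    adapters = n_params * trainable_frac * 16 / 1e9
    return base + adapters

N = 70e9
print(f"full fine-tune: ~{full_finetune_gb(N):,.0f} GB")  # ~1,120 GB
print(f"QLoRA:          ~{qlora_gb(N):,.0f} GB")          # ~36 GB
```

The two-orders-of-magnitude gap on optimizer state is the whole trick: the frozen 4‑bit base fits on a handful of GPUs, and only the low‑rank adapters carry training overhead.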

3. Effect sizes

The paper reports Cohen’s d = 0.845–0.903 for MDD. Looking closer:

  1. They fit an ordinal‑logistic mixed model to each outcome. (Appropriate for skewed symptom scales).
  2. Converted the log‑odds ratio to an "unbounded" d with d = ln(OR)·√3/π. (Adds ~40 % to the number compared with a classic pooled‑SD d.)
  3. Compared against no treatment after only four weeks. (Maximises between‑group spread.)

If you instead divide the raw 3.5‑point PHQ‑9 gap by the pooled SD (~6.1) you get a classic d ≈ 0.57. That is still respectable, but not huge, and in line with past internet‑CBT trials that used wait‑list controls.
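Both effect-size computations above fit in a few lines. The PHQ‑9 gap and pooled SD come from the text; the odds ratio is a hypothetical value I back‑solved for illustration, since the post does not quote the paper's OR directly:

```python
import math

def d_from_or(odds_ratio: float) -> float:
    """Ordinal-logistic conversion used by the authors: d = ln(OR) * sqrt(3) / pi."""
    return math.log(odds_ratio) * math.sqrt(3) / math.pi

def classic_d(mean_gap: float, pooled_sd: float) -> float:
    """Classic pooled-SD Cohen's d."""
    return mean_gap / pooled_sd

# Numbers from the text: 3.5-point PHQ-9 gap, pooled SD ~ 6.1.
print(round(classic_d(3.5, 6.1), 2))   # -> 0.57

# Hypothetical OR ~ 4.63: the value that would reproduce d ~ 0.845
# under the ordinal-to-d conversion (assumed, not reported).
print(round(d_from_or(4.63), 3))       # -> 0.845
```

Running the two side by side makes the inflation concrete: the same trial yields d ≈ 0.845 on the log‑odds scale but only ≈ 0.57 on the familiar pooled‑SD scale.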

4. What Matters for LLM Folks

  • Open‑weight viability. The prototype ran on open Llama‑2 weights, not a proprietary GPT: evidence that careful data curation plus low‑rank fine‑tuning can yield clinically persuasive behaviour.

  • Safety is a product spec. Guard‑rails, a crisis classifier, and human monitoring were baked into the architecture before a single patient was enrolled.

  • QLoRA at clinical scale. The team reports no GPU blow‑ups fine‑tuning 70B parameters thanks to quantised adapters—useful pattern for other health‑AI builders.

5. What Matters for Statisticians

  • Wait‑list ≠ placebo. Participants know they’re getting nothing, inflating expectancy effects. An active digital control would almost certainly shrink d to 0.2–0.4.

  • Short follow‑up. Four‑week post‑treatment data capture the biggest divergence; longer horizons are needed to test durability.

  • Precision limits. Each diagnostic stratum has ~70 per arm; 95 % CIs on d span roughly ±0.3, so replication could easily land lower.

  • Ordinal‑to‑d conversion. Useful when raw means are unavailable, but readers should translate back to more intuitive metrics (e.g., PHQ‑9 points, remission rates).
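The precision point is easy to check with the standard large‑sample standard error for Cohen's d. The ~70‑per‑arm figure is from the post; treating the strata as simple two‑arm comparisons is my simplification, not the paper's mixed‑model analysis:

```python
import math

def se_cohens_d(d: float, n1: int, n2: int) -> float:
    """Large-sample SE of Cohen's d (Hedges-Olkin approximation)."""
    return math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))

d = 0.58                      # the classic pooled-SD estimate from Section 3
se = se_cohens_d(d, 70, 70)
half_width = 1.96 * se
print(f"SE = {se:.2f}, 95% CI = {d} +/- {half_width:.2f}")
```

With ~70 per arm the half‑width lands in the low 0.3s, so the interval comfortably reaches down toward d ≈ 0.2: replication at a smaller effect would be entirely unsurprising.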

6. Bottom Line

  • Therabot shows that an expert‑tuned, open‑weight LLM can deliver a moderate clinical benefit over no treatment in just four weeks, which is a big symbolic win for generative AI in healthcare. The statistics, however, are best read as an upper bound: swap in an active comparator or run the trial for six months and the effect will likely attenuate.

  • For practitioners the lesson is clear: quality data + safety engineering > fancy architecture, while for analysts the study is a reminder that how you compute effect size can be almost as important as the effect itself.