AI Therapy: What the First RCT of a Generative‑AI Therapist Really Tells Us

2025-05-04

Review of: Heinz et al. 2025. Randomized Trial of a Generative AI Chatbot for Mental Health Treatment. NEJM AI. link

NEJM AI just published an 8‑week RCT (four weeks of treatment, four of follow‑up) of Therabot, an app‑based conversational agent that delivers brief CBT, ACT, and DBT exercises.

1. Trial Design, in brief

  • Design: 210 U.S. adults with major‑depressive (MDD), anxiety (GAD) or eating‑disorder risk (CHR‑FED) symptoms were randomised 1:1 to four weeks of Therabot or a wait‑list (no‑treatment) control.
  • Outcomes: Standard depression, anxiety, eating-disorder risk measures (PHQ‑9, GAD‑7 or EDE‑Q scores at Week 4 and Week 8).
  • Engagement: 95 % of assigned users talked to the bot; median 260 messages (~6 h total use).

2. Under the hood

The base models are GPT‑3‑class, decoder‑only transformers hosted on AWS (Falcon‑7B and LLaMA‑2‑70B); the authors don't dwell on model design. The secret sauce is data curation: they fine‑tune on a hand‑curated corpus of therapist–patient transcripts (>100 k human‑hours), using QLoRA for efficiency, then freeze the weights. Architecturally, basically standard stuff.
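To see why quantised adapters make a 70B fine‑tune tractable, here is a back‑of‑envelope memory sketch. These figures are my own illustrative assumptions (bytes per parameter, adapter fraction), not numbers reported by the paper:

```python
# Rough memory math for fine-tuning a 70B-parameter model.
# All constants below are illustrative assumptions, not from the paper.

def full_finetune_gb(n_params: float) -> float:
    """fp16 weights + fp32 Adam states + fp32 master copy ~ 16 bytes/param."""
    return n_params * 16 / 1e9

def qlora_gb(n_params: float, trainable_frac: float = 0.001) -> float:
    """4-bit frozen base (~0.5 byte/param) plus fp16 weights and optimizer
    state (~16 bytes each) for a tiny set of LoRA adapter parameters."""
    base = n_params * 0.5 / 1e9
    adapters = n_params * trainable_frac * 16 / 1e9
    return base + adapters

N = 70e9
print(f"full fine-tune: ~{full_finetune_gb(N):,.0f} GB")  # ~1,120 GB
print(f"QLoRA:          ~{qlora_gb(N):,.0f} GB")          # ~36 GB
```

The two-orders-of-magnitude gap on optimizer state is the whole trick: the frozen 4‑bit base fits on a handful of GPUs, and only the low‑rank adapters carry training overhead.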

3. Effect sizes

The paper reports Cohen’s d = 0.845–0.903 for MDD. Looking closer:

  1. They fit an ordinal‑logistic mixed model to each outcome. (Appropriate for skewed symptom scales).
  2. Converted the log‑odds ratio to an "unbounded" d with d = ln(OR)·√3/π. (Adds ~40 % to the number compared with a classic pooled‑SD d.)
  3. Compared against no treatment after only four weeks. (Maximises between‑group spread.)

If you instead divide the raw 3.5‑point PHQ‑9 gap by the pooled SD (~6.1) you get a classic d ≈ 0.57. That is still respectable, but not huge, and in line with past internet‑CBT trials that used wait‑list controls.
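Both effect-size computations above fit in a few lines. The PHQ‑9 gap and pooled SD come from the text; the odds ratio is a hypothetical value I back‑solved for illustration, since the post does not quote the paper's OR directly:

```python
import math

def d_from_or(odds_ratio: float) -> float:
    """Ordinal-logistic conversion used by the authors: d = ln(OR) * sqrt(3) / pi."""
    return math.log(odds_ratio) * math.sqrt(3) / math.pi

def classic_d(mean_gap: float, pooled_sd: float) -> float:
    """Classic pooled-SD Cohen's d."""
    return mean_gap / pooled_sd

# Numbers from the text: 3.5-point PHQ-9 gap, pooled SD ~ 6.1.
print(round(classic_d(3.5, 6.1), 2))   # -> 0.57

# Hypothetical OR ~ 4.63: the value that would reproduce d ~ 0.845
# under the ordinal-to-d conversion (assumed, not reported).
print(round(d_from_or(4.63), 3))       # -> 0.845
```

Running the two side by side makes the inflation concrete: the same trial yields d ≈ 0.845 on the log‑odds scale but only ≈ 0.57 on the familiar pooled‑SD scale.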

4. What Matters for LLM Folks

  • Open‑weight viability. The prototype ran on open Llama‑2 weights, not a proprietary GPT: evidence that careful data curation plus low‑rank fine‑tuning can yield clinically persuasive behaviour.

  • Safety is a product spec. Guard‑rails, a crisis classifier, and human monitoring were baked into the architecture before a single patient was enrolled.

  • QLoRA at clinical scale. The team reports no GPU blow‑ups fine‑tuning 70B parameters thanks to quantised adapters—useful pattern for other health‑AI builders.

5. What Matters for Statisticians

  • Wait‑list ≠ placebo. Participants know they’re getting nothing, inflating expectancy effects. An active digital control would almost certainly shrink d to 0.2–0.4.

  • Short follow‑up. Four‑week post‑treatment data capture the biggest divergence; longer horizons are needed to test durability.

  • Precision limits. Each diagnostic stratum has ~70 per arm; 95 % CIs on d span roughly ±0.3, so replication could easily land lower.

  • Ordinal‑to‑d conversion. Useful when raw means are unavailable, but readers should translate back to more intuitive metrics (e.g., PHQ‑9 points, remission rates).
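The precision point is easy to check with the standard large‑sample standard error for Cohen's d. The ~70‑per‑arm figure is from the post; treating the strata as simple two‑arm comparisons is my simplification, not the paper's mixed‑model analysis:

```python
import math

def se_cohens_d(d: float, n1: int, n2: int) -> float:
    """Large-sample SE of Cohen's d (Hedges-Olkin approximation)."""
    return math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))

d = 0.58                      # the classic pooled-SD estimate from Section 3
se = se_cohens_d(d, 70, 70)
half_width = 1.96 * se
print(f"SE = {se:.2f}, 95% CI = {d} +/- {half_width:.2f}")
```

With ~70 per arm the half‑width lands in the low 0.3s, so the interval comfortably reaches down toward d ≈ 0.2: replication at a smaller effect would be entirely unsurprising.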

6. Bottom Line

  • Therabot shows that an expert‑tuned, open‑weight LLM can deliver a moderate clinical benefit over no treatment in just four weeks, which is a big symbolic win for generative AI in healthcare. The statistics, however, are best read as an upper bound: swap in an active comparator or run the trial for six months and the effect will likely attenuate.

  • For practitioners the lesson is clear: quality data + safety engineering > fancy architecture, while for analysts the study is a reminder that how you compute effect size can be almost as important as the effect itself.