I tried to beat XGBoost with a decision tree and a prompt. On fraud data, that’s a suicide mission.

Yandex Research disagrees. Dudarov & Prokhorenkova built a method there called Talking Trees that loops an LLM through a decision tree: audit, find one flaw, write the fix, repeat. On small UCI datasets, it works.

I ran it on two fraud datasets. It does not.


Method Glossary

| Method | Explanation |
| --- | --- |
| Talking Trees | Dudarov & Prokhorenkova’s agent loop that uses an LLM to inspect and rewrite a decision tree iteratively |
| DecisionTreeClassifier | scikit-learn’s greedy CART implementation; we cap it at max_depth=5 with balanced class weights |
| XGBoost | A gradient boosting library that ensembles weak trees with regularization; the industry default for tabular fraud detection |
| CatBoost | Yandex’s gradient booster that handles categorical features natively; our second strong baseline |
| Kimi K2.6 | Moonshot AI’s large language model, accessed via OpenRouter, that acts as the tree-refinement agent |
| GPT-5.5 | OpenAI’s large language model that acts as an alternative agent to test reliability |

What Talking Trees Does

Talking Trees is an agent loop where an LLM interactively refines a DecisionTreeClassifier. The steps are:

  1. Train an initial tree on the training set.
  2. Prompt the LLM to inspect the fitted tree, find one specific improvement, and return Python code that implements it.
  3. Execute the code, evaluate on validation data, and keep the best tree seen so far.
  4. Repeat for a fixed number of steps.
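The four steps above can be sketched as a keep-the-best loop. This is a minimal illustration, not the authors' code: `refine`, `scripted_proposer`, and `toy_score` are hypothetical names, the "model" is a plain integer standing in for a fitted tree, and proposing from the best model so far (rather than the latest one) is an assumption. The `propose` callable stands in for the LLM call plus execution of the returned code.

```python
from typing import Any, Callable, Iterable, Tuple


def refine(initial_model: Any,
           propose: Callable[[Any], Any],
           evaluate: Callable[[Any], float],
           max_steps: int = 15) -> Tuple[Any, float, int]:
    """Run a fixed number of refinement steps and keep the best model seen.

    propose(model) returns a candidate or raises (invalid code, timeout).
    evaluate(model) returns a validation score where higher is better.
    Returns (best_model, best_score, number_of_failed_steps).
    """
    best_model, best_score = initial_model, evaluate(initial_model)
    failures = 0
    for _ in range(max_steps):
        try:
            candidate = propose(best_model)   # assumption: refine the best so far
            score = evaluate(candidate)
        except Exception:
            failures += 1                     # a failed step changes nothing
            continue
        if score > best_score:
            best_model, best_score = candidate, score
    return best_model, best_score, failures


def scripted_proposer(candidates: Iterable[Any]) -> Callable[[Any], Any]:
    """Hypothetical stand-in for the LLM: replays a fixed list of candidates.

    A None entry simulates a step where the generated code was invalid.
    """
    it = iter(candidates)

    def propose(_model: Any) -> Any:
        candidate = next(it)
        if candidate is None:
            raise ValueError("invalid generated code")
        return candidate

    return propose


def toy_score(model: int) -> int:
    """Toy validation metric: the closer the integer is to 7, the better."""
    return -abs(model - 7)


best, score, failures = refine(0, scripted_proposer([3, None, 9, 6, 7]),
                               toy_score, max_steps=5)
print(best, score, failures)
```

If every step fails, the loop returns the initial model unchanged, which is the behaviour a fallback to the starting tree needs.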

The paper reports strong results on UCI benchmarks — often matching or beating gradient boosters on small datasets. The authors’ intuition is that LLM reasoning about feature interactions and split quality can discover trees that greedy CART misses.

The question is whether that intuition survives fraud data.


What I Changed

I started from the official Talking Trees repository [1] and made the following modifications:

| Change | Reason |
| --- | --- |
| Added Kimi K2.6 support | The original code supports OpenAI and local models; I added a simpleaichat backend for Kimi K2.6 via OpenRouter |
| Fixed categorical handling | My fraud-detection and ieee-cis datasets have mixed types; I added .astype(str) before tree inspection to prevent pd.Categorical serialization errors |
| Removed simple-parsing dependency | It conflicted with my environment; I replaced it with direct argparse |
| Added fallback mechanism | If the LLM produces invalid code or times out, I fall back to the initial sklearn tree and log the failure |
| Metrics logging | I added per-run JSON output with train/val/test AUC, AP, Recall@1%FPR, tree depth, node count, and wall time |

All other hyperparameters match the paper: max_depth=5, class_weight='balanced', max 15 agent steps, ROC AUC as the optimization metric.
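Of the logged metrics, Recall@1%FPR is the least standard, so here is one way it could be computed. This is a dependency-free sketch under my own definition (the highest recall reachable at a score threshold whose false positive rate does not exceed 1%), not necessarily the exact code in the fork; `recall_at_fpr` is a hypothetical name. In practice the same number can be read off `sklearn.metrics.roc_curve`.

```python
from itertools import groupby


def recall_at_fpr(y_true, scores, max_fpr=0.01):
    """Highest recall at any threshold whose false positive rate <= max_fpr.

    y_true holds 0/1 labels (1 = fraud); scores are higher for more suspicious rows.
    """
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")

    # Walk down the ranking from the most suspicious row.
    ranked = sorted(zip(scores, y_true), key=lambda p: p[0], reverse=True)
    tp = fp = 0
    best = 0.0
    for _, group in groupby(ranked, key=lambda p: p[0]):
        # Rows with tied scores fall on the same side of any threshold.
        for _, y in group:
            if y == 1:
                tp += 1
            else:
                fp += 1
        if fp / n_neg > max_fpr:
            break
        best = tp / n_pos
    return best


# Toy data: 4 frauds, 250 legitimate rows, so 1% FPR allows 2 false positives.
labels = [1, 1, 1, 1] + [0] * 250
scores = [0.99, 0.95, 0.40, 0.10] + [0.97, 0.96, 0.90] + [k / 1000 for k in range(247)]
print(recall_at_fpr(labels, scores))
```

On the toy data, two of the four frauds rank above the third false positive, so the metric is 0.5.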


Setup

I benchmarked on two fraud datasets, three random seeds, two LLMs, and three baselines.

Kimi K2.6 is Moonshot AI’s model, accessed via OpenRouter. GPT-5.5 is OpenAI’s model. Both serve as the agent in the Talking Trees loop.

The baselines are sklearn’s DecisionTreeClassifier, XGBoost, and CatBoost. CatBoost is Yandex’s own gradient booster that handles categorical features natively.

The data:

  • ieee-cis — Kaggle fraud data, subsampled to N=1k/5k
  • fraud-detection — Amazon FDB at N=500/1k/2k

I used seeds 42, 43, and 44, with a 60/20/20 stratified split. I subsampled to keep API costs manageable — a full-scale run would have cost hundreds of dollars per seed.
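For readers who want to reproduce the split, a 60/20/20 stratified split can be sketched with the standard library alone. `stratified_split` is a hypothetical helper, and the 5% fraud rate in the example is an assumption for illustration; the usual route is two calls to scikit-learn's `train_test_split` with `stratify=y`.

```python
import random
from collections import defaultdict


def stratified_split(labels, seed, fractions=(0.6, 0.2, 0.2)):
    """Return (train_idx, val_idx, test_idx) that preserve each class's share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train, val, test = [], [], []
    for idx in by_class.values():
        rng.shuffle(idx)                      # seeded, so the split is reproducible
        n_train = round(fractions[0] * len(idx))
        n_val = round(fractions[1] * len(idx))
        train += idx[:n_train]
        val += idx[n_train:n_train + n_val]
        test += idx[n_train + n_val:]         # remainder goes to test
    return train, val, test


# Assumed toy labels: 1,000 rows with a 5% positive rate.
labels = [1] * 50 + [0] * 950
for seed in (42, 43, 44):
    tr, va, te = stratified_split(labels, seed)
    print(seed, len(tr), len(va), len(te), sum(labels[i] for i in va))
```

With 1,000 rows this leaves 200 validation rows, only 10 of them positive under the assumed rate, which is worth keeping in mind when reading validation AUC.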


Results

Accuracy

I averaged test AUC across three random seeds for each configuration. The results:

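| Model | Test AUC | vs Sklearn d5 | vs XGBoost | Fallback Rate | Time |
| --- | --- | --- | --- | --- | --- |
| Kimi K2.6 | 0.676 ± 0.058 | +0.041 | −0.094 | 40% (6/15) | 184s |
| GPT-5.5 | 0.657 ± 0.061 | +0.035 | −0.113 | 7% (1/15) | 80s |
| Sklearn d5 | 0.634 ± 0.148 | | −0.136 | 0% | <0.1s |
| XGBoost | 0.770 ± 0.088 | +0.136 | | 0% | <1s |
| CatBoost | 0.774 ± 0.077 | +0.140 | +0.004 | 0% | <1s |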
[Figure: Talking Trees comparison]

The left panel shows direct same-config comparisons. Talking Trees edges out sklearn d5 but is crushed by gradient boosters. The gap is not marginal: it is roughly 0.09 to 0.11 AUC, which in fraud detection can be the difference between a usable model and an unusable one.

The right panel compares the two LLMs. Kimi achieves higher peak accuracy but fails on 40% of runs, falling back to the initial tree. GPT-5.5 is more reliable — only 7% fallback — but slightly less accurate on average.

Generalization Gap

[Figure: Train vs Test AUC]

The diagonal on the plot marks perfect generalization. Points below it are overfitting. Talking Trees clusters far below the diagonal.

I computed the train-test gap by subtracting test AUC from train AUC for each run, then averaging:

  • Kimi: train-test gap = 0.187
  • GPT-5.5: train-test gap = 0.231
  • sklearn d5: train-test gap ≈ 0.05 — typical for depth-5 trees
  • XGBoost: train-test gap ≈ 0.08

The LLM fine-tunes splits to the training data. It creates thresholds that separate the training set beautifully but do not generalize. This makes sense — the LLM is optimizing validation AUC at each step, but the tree structure it produces memorizes training patterns rather than learning robust decision boundaries.

Cost Analysis

I estimated cost by counting tokens per API call at roughly $0.03–0.05 per call and 5–15 steps per run:

| Model | Cost/run | Δ vs sklearn | Cost per +0.01 AUC | Δ vs XGB |
| --- | --- | --- | --- | --- |
| Kimi | ~$5 | +0.041 | $1.22 | −0.094 |
| GPT-5.5 | ~$3 | +0.035 | $0.86 | −0.113 |
| XGBoost | $0 | +0.136 | $0 | |

The economics are brutal. Talking Trees costs dollars per run for a modest improvement over sklearn. XGBoost costs nothing and delivers 0.09 to 0.11 AUC more than Talking Trees.

I got the cost-per-0.01-AUC figure by dividing the total run cost by the AUC improvement over sklearn d5. Even if you only care about beating sklearn d5, that ratio is $0.86–1.22. For a production pipeline scoring millions of transactions, that is unaffordable. The fallback rate means two in five Kimi runs are wasted entirely.
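The arithmetic behind that ratio, using the approximate per-run costs and AUC deltas from the table, looks like this. `cost_per_auc_point` is a hypothetical helper name.

```python
def cost_per_auc_point(cost_per_run, delta_auc, unit=0.01):
    """Dollars spent per `unit` of AUC gained over the baseline."""
    if delta_auc <= 0:
        raise ValueError("no improvement over the baseline, ratio is undefined")
    return cost_per_run / (delta_auc / unit)


# Kimi: about $5 per run for +0.041 AUC over sklearn d5.
print(round(cost_per_auc_point(5.0, 0.041), 2))  # 1.22
# GPT-5.5: about $3 per run for +0.035 AUC over sklearn d5.
print(round(cost_per_auc_point(3.0, 0.035), 2))  # 0.86
```

The ratio ignores the fallback runs, so the effective cost per useful improvement is higher than these figures suggest.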


Why Fraud Data Breaks the Method

The paper reports strong results on UCI datasets. Fraud data breaks the method for four reasons.

High dimensionality. The ieee-cis dataset has 455 features after minimal preprocessing — far more than the UCI benchmarks. The LLM cannot reason about interactions in a 455-dimensional space from a text serialization of the tree.

Adversarial signal. Fraud patterns are deliberately hidden. Legitimate transactions look like fraud and vice versa. The LLM’s common sense about what makes a good split is actively misleading — it suggests splits that would separate obvious classes, but fraud is not obvious.

Overfitting to validation. The LLM optimizes val AUC at each step. On small val sets (as few as 100–200 rows in the smallest subsampled configs) this signal is noisy. The LLM chases spurious correlations and the tree memorizes them.
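To put a rough number on that noise, the sketch below computes the standard deviation of AUC for a scorer with no signal at all, using the Mann-Whitney null variance (n_pos + n_neg + 1) / (12 · n_pos · n_neg), which assumes no tied scores. The row counts are 20% of the N=500 to 5k subsamples; the 5% fraud rate is an assumption, since the real rate depends on the dataset, and `null_auc_std` is a hypothetical name.

```python
import math


def null_auc_std(n_rows, fraud_rate):
    """Std of AUC for a random scorer on a validation set of n_rows (no ties)."""
    n_pos = max(1, round(n_rows * fraud_rate))
    n_neg = n_rows - n_pos
    return math.sqrt((n_pos + n_neg + 1) / (12 * n_pos * n_neg))


ASSUMED_FRAUD_RATE = 0.05  # illustrative assumption, not measured from the data

# Validation sizes implied by a 20% split of N = 500, 1k, 2k, 5k.
for n_val in (100, 200, 400, 1000):
    print(n_val, round(null_auc_std(n_val, ASSUMED_FRAUD_RATE), 3))
```

Under that assumed rate the noise floor is about 0.13 AUC at 100 validation rows and about 0.04 at 1,000, so a step-to-step change of a few hundredths in val AUC is hard to tell apart from chance on the smallest configs.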

No regularization mechanism. Unlike gradient boosters — which use shrinkage, subsampling, and early stopping — Talking Trees has no way to penalize complexity. The tree grows to fit whatever the LLM suggests. The paper caps depth at 5, but even depth-5 trees can overfit badly when splits are chosen adversarially.


Practical Verdict

| Approach | Use Case | Verdict |
| --- | --- | --- |
| Talking Trees for fraud | Production scoring | ❌ No: too slow, too unreliable, too expensive |
| Talking Trees for fraud | Research / prototyping | ⚠️ Maybe: useful as a baseline or teaching tool |
| Talking Trees for UCI benchmarks | Paper replication | ✅ Yes: the authors’ results hold on their chosen data |
| XGBoost / CatBoost | Any fraud task | ✅ Start here: faster, cheaper, more accurate |

Raw Data

Analysis scripts and my fork of the original code are not published, but they are available on request.


Conclusion

I spent $5 to generate a single decision tree that XGBoost beat by roughly 0.09 AUC in under a second. Talking Trees looks like a breakthrough on paper, and on clean, small datasets, it actually works. It uses LLM reasoning to find splits that greedy algorithms miss, and the research results are legitimate.

But fraud detection is not a UCI benchmark. Between the adversarial signals and the extreme class imbalance, the core assumption of the method collapses. The LLM does not reason through the complexity — it just overfits and falls back.

The lesson here is not that LLMs cannot handle tabular ML. It is that LLM-guided methods are dangerously dataset-dependent. If your data is pristine, give it a shot.

If you are fighting fraud, stick with XGBoost.


My implementation is a fork of the official Talking Trees repository [1] with modifications for Kimi K2.6 and my fraud datasets. Experiments ran on airig2. GPU time and API credits funded by Maxime Guerreiro.