Replicating Talking Trees: LLMs for fraud detection

I tried to beat XGBoost with a decision tree and a prompt. On fraud data, that’s a suicide mission.

Yandex Research disagrees. Dudarov & Prokhorenkova built a method there called Talking Trees that loops an LLM through a decision tree: audit, find one flaw, write the fix, repeat. On small UCI datasets, it works.

I ran it on two fraud datasets. It does not.

Method Glossary

Method	Explanation
Talking Trees	Dudarov & Prokhorenkova’s agent loop that uses an LLM to inspect and rewrite a decision tree iteratively
`DecisionTreeClassifier`	scikit-learn’s greedy CART implementation; I cap it at `max_depth=5` with balanced class weights
XGBoost	A gradient boosting library that ensembles weak trees with regularization — the industry default for tabular fraud detection
CatBoost	Yandex’s gradient booster that handles categorical features natively; my second strong baseline
Kimi K2.6	Moonshot AI’s large language model, accessed via OpenRouter, that acts as the tree-refinement agent
GPT-5.5	OpenAI’s large language model that acts as an alternative agent to test reliability

What Talking Trees Does

Talking Trees is an agent loop where an LLM interactively refines a DecisionTreeClassifier. The steps are:

1. Train an initial tree on the training set.
2. Prompt the LLM to inspect the fitted tree, find one specific improvement, and return Python code that implements it.
3. Execute the code, evaluate on validation data, and keep the best tree seen so far.
4. Repeat for a fixed number of steps.
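
To make the control flow concrete, here is a minimal sketch of a keep-the-best refinement loop. It is not the authors' implementation: `refine`, `propose`, and `score` are hypothetical names, `propose` stands in for the LLM call plus code execution, `score` stands in for validation ROC AUC, and I am assuming each step starts from the best model seen so far.

```python
import random
from typing import Callable, Optional, Tuple, TypeVar

Model = TypeVar("Model")


def refine(
    initial: Model,
    propose: Callable[[Model], Optional[Model]],
    score: Callable[[Model], float],
    n_steps: int = 15,
) -> Tuple[Model, float]:
    """Run a fixed number of propose/evaluate steps and keep the best model.

    `propose` returns a modified candidate, or None when the step fails
    (invalid code, timeout, and so on). `score` is higher-is-better.
    """
    best, best_score = initial, score(initial)
    for _ in range(n_steps):
        candidate = propose(best)
        if candidate is None:
            continue  # failed step: keep the current best and move on
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


# Toy stand-ins: the "model" is a single threshold and the "LLM" nudges it.
rng = random.Random(42)


def toy_propose(threshold: float) -> Optional[float]:
    if rng.random() < 0.2:  # simulate an occasional failed step
        return None
    return threshold + rng.uniform(-0.1, 0.1)


def toy_score(threshold: float) -> float:
    return -abs(threshold - 0.3)  # best possible threshold is 0.3


best, best_score = refine(0.9, toy_propose, toy_score)
print(best, best_score)
```

Because the loop only ever accepts a candidate that scores higher on validation data, the final score can never be worse than the initial tree's on that same data. It says nothing about test performance.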

The paper reports strong results on UCI benchmarks — often matching or beating gradient boosters on small datasets. The authors’ intuition is that LLM reasoning about feature interactions and split quality can discover trees that greedy CART misses.

The question is whether that intuition survives fraud data.

What I Changed

I started from the official Talking Trees repository¹ and made the following modifications:

Change	Reason
Added Kimi K2.6 support	The original code supports OpenAI and local models; I added a `simpleaichat` backend for Kimi K2.6 via OpenRouter
Fixed categorical handling	My `fraud-detection` and `ieee-cis` datasets have mixed types; I added `.astype(str)` before tree inspection to prevent `pd.Categorical` serialization errors
Removed `simple-parsing` dependency	It conflicted with my environment; I replaced it with direct `argparse`
Added fallback mechanism	If the LLM produces invalid code or times out, I fall back to the initial sklearn tree and log the failure
Metrics logging	I added per-run JSON output with train/val/test AUC, AP, Recall@1%FPR, tree depth, node count, and wall time
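
The fallback row above can be sketched roughly as follows. This is an illustration, not the code in my fork: the function name, the `tree`/`new_tree` contract, and the log fields are all hypothetical, and timeouts are left out because they need a subprocess or similar isolation.

```python
import json
from typing import Any, Dict, Tuple


def apply_patch_or_fallback(code: str, initial_tree: Any) -> Tuple[Any, Dict[str, Any]]:
    """Run LLM-written code; fall back to the initial tree on any error.

    Assumed contract (hypothetical): the code reads `tree` from its
    namespace and assigns the modified tree to `new_tree`.
    """
    namespace: Dict[str, Any] = {"tree": initial_tree}
    try:
        exec(code, namespace)  # only acceptable for sandboxed or trusted code
        return namespace["new_tree"], {"fallback": False, "error": None}
    except Exception as exc:  # syntax error, runtime error, or missing `new_tree`
        return initial_tree, {"fallback": True, "error": f"{type(exc).__name__}: {exc}"}


# A plain dict stands in for the fitted tree in this demo.
initial = {"depth": 5}

good_tree, good_log = apply_patch_or_fallback("new_tree = dict(tree, depth=4)", initial)
bad_tree, bad_log = apply_patch_or_fallback("new_tree = tree[", initial)

# One JSON line per attempt makes failures easy to count afterwards.
print(json.dumps(good_log))
print(json.dumps(bad_log))
```

Catching the broad `Exception` is deliberate here: the point of the fallback is that any failure in generated code should cost one step, not the whole run.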

All other hyperparameters match the paper: `max_depth=5`, `class_weight='balanced'`, a maximum of 15 agent steps, and ROC AUC as the optimization metric.

Setup

I benchmarked on two fraud datasets, three random seeds, two LLMs, and three baselines.

Kimi K2.6 is Moonshot AI’s model, accessed via OpenRouter. GPT-5.5 is OpenAI’s model. Both serve as the agent in the Talking Trees loop.

The baselines are sklearn’s DecisionTreeClassifier, XGBoost, and CatBoost. CatBoost is Yandex’s own gradient booster that handles categorical features natively.

The data:

ieee-cis — Kaggle fraud data, subsampled to N=1k/5k
fraud-detection — Amazon FDB at N=500/1k/2k

I used seeds 42, 43, and 44, with a 60/20/20 stratified split. I subsampled to keep API costs manageable — a full-scale run would have cost hundreds of dollars per seed.
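
For reference, a 60/20/20 stratified split and the depth-5 baseline can be sketched with scikit-learn as below. The two-stage `train_test_split` call is one common way to get three stratified parts and is an assumption about the mechanics, not a copy of my scripts. The data is synthetic stand-in data, not either fraud dataset, and passing the seed as the tree's `random_state` is also an assumption.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def split_60_20_20(X, y, seed):
    """Stratified 60/20/20 train/val/test split for one seed."""
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.4, stratify=y, random_state=seed
    )
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=seed
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)


# Synthetic stand-in data: 1,000 rows, 8 features, a noisy minority class.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = (X[:, 0] + rng.normal(scale=2.0, size=1000) > 2.8).astype(int)

sizes = []
for seed in (42, 43, 44):
    (X_tr, y_tr), (X_va, y_va), (X_te, y_te) = split_60_20_20(X, y, seed)
    tree = DecisionTreeClassifier(
        max_depth=5, class_weight="balanced", random_state=seed
    )
    tree.fit(X_tr, y_tr)
    val_auc = roc_auc_score(y_va, tree.predict_proba(X_va)[:, 1])
    sizes.append((len(y_tr), len(y_va), len(y_te)))
    print(seed, sizes[-1], round(val_auc, 3))
```

At N=1k this leaves only 200 validation rows, which matters later when the agent optimizes against them.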

Results

Accuracy

I averaged test AUC over all 15 runs per model (five dataset configurations × three random seeds). The results:

Model	Test AUC	vs Sklearn d5	vs XGBoost	Fallback Rate	Time
Kimi K2.6	0.676 ± 0.058	+0.041	−0.094	40% (6/15)	184s
GPT-5.5	0.657 ± 0.061	+0.035	−0.113	7% (1/15)	80s
Sklearn d5	0.634 ± 0.148	—	−0.136	0%	<0.1s
XGBoost	0.770 ± 0.088	+0.136	—	0%	<1s
CatBoost	0.774 ± 0.077	+0.140	+0.004	0%	<1s

The left panel shows direct same-config comparisons. Talking Trees edges out sklearn d5 but is crushed by gradient boosters. The gap is not marginal: it is roughly 0.09–0.11 AUC, which in fraud detection is the difference between a usable model and a marginal one.

The right panel compares the two LLMs. Kimi achieves higher peak accuracy but fails on 40% of runs, falling back to the initial tree. GPT-5.5 is more reliable — only 7% fallback — but slightly less accurate on average.

Generalization Gap

The diagonal on the plot marks perfect generalization. Points below it are overfitting. Talking Trees clusters far below the diagonal.

I computed the train-test gap by subtracting test AUC from train AUC for each run, then averaging:

Kimi: train-test gap = 0.187
GPT-5.5: train-test gap = 0.231
sklearn d5: train-test gap ≈ 0.05 — typical for depth-5 trees
XGBoost: train-test gap ≈ 0.08
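
The gap computation itself is a one-liner over the per-run records. The sketch below assumes each record carries `train_auc` and `test_auc` keys (hypothetical field names), and the numbers are made up for illustration, not taken from the benchmark.

```python
from statistics import mean


def mean_train_test_gap(runs):
    """Average of (train AUC - test AUC) over a list of run records."""
    return mean(r["train_auc"] - r["test_auc"] for r in runs)


# Made-up records for illustration only.
runs = [
    {"train_auc": 0.90, "test_auc": 0.70},
    {"train_auc": 0.85, "test_auc": 0.65},
    {"train_auc": 0.80, "test_auc": 0.70},
]

gap = mean_train_test_gap(runs)
print(round(gap, 3))
```

A positive gap means the model scores better on data it was fitted to than on held-out data, so larger values indicate more overfitting.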

The LLM fine-tunes splits to the training data. It creates thresholds that separate the training set beautifully but do not generalize. This makes sense — the LLM is optimizing validation AUC at each step, but the tree structure it produces memorizes training patterns rather than learning robust decision boundaries.

Cost Analysis

I estimated cost by counting tokens per API call at roughly $0.03–0.05 per call and 5–15 steps per run:

Model	Cost/run	Δ vs sklearn	Cost per +0.01 AUC	Δ vs XGB
Kimi	~$5	+0.041	$1.22	−0.094
GPT-5.5	~$3	+0.035	$0.86	−0.113
XGBoost	$0	+0.136	$0	—

The economics are brutal. Talking Trees costs dollars per run for a modest improvement over sklearn. XGBoost costs nothing and delivers roughly 0.09–0.11 AUC more than Talking Trees.

I got the cost-per-0.01-AUC figure by dividing the total run cost by the AUC improvement over sklearn d5. Even if you only care about beating sklearn d5, that ratio is $0.86–1.22. For a production pipeline scoring millions of transactions, that is unaffordable. The fallback rate means two in five Kimi runs are wasted entirely.
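
The arithmetic behind that ratio, using the approximate per-run costs and AUC gains from the table (the function name is just for illustration):

```python
def cost_per_auc_point(cost_usd: float, auc_gain: float) -> float:
    """Dollars spent per +0.01 AUC gained over the baseline."""
    if auc_gain <= 0:
        raise ValueError("no AUC gain to spread the cost over")
    return cost_usd / (auc_gain / 0.01)


kimi = cost_per_auc_point(5.0, 0.041)  # ~$5 per run, +0.041 over sklearn d5
gpt = cost_per_auc_point(3.0, 0.035)   # ~$3 per run, +0.035 over sklearn d5

print(round(kimi, 2), round(gpt, 2))  # 1.22 0.86
```

The ratio is undefined when there is no gain, which is why the function refuses a zero or negative `auc_gain` rather than returning infinity or a negative price.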

Why Fraud Data Breaks the Method

The paper reports strong results on UCI datasets. Fraud data breaks the method for four reasons.

High dimensionality. The ieee-cis dataset has 455 features after minimal preprocessing — far more than the UCI benchmarks. The LLM cannot reason about interactions in a 455-dimensional space from a text serialization of the tree.

Adversarial signal. Fraud patterns are deliberately hidden. Legitimate transactions look like fraud and vice versa. The LLM’s common sense about what makes a good split is actively misleading — it suggests splits that would separate obvious classes, but fraud is not obvious.

Overfitting to validation. The LLM optimizes val AUC at each step. On small val sets (as few as 100–200 rows for my smallest subsampled configs, given the 20% validation split) this is noisy. The LLM chases spurious correlations and the tree memorizes them.

No regularization mechanism. Unlike gradient boosters, which use shrinkage, subsampling, and early stopping, Talking Trees has no way to penalize complexity beyond the depth cap. Within that cap, the tree fits whatever the LLM suggests. The paper caps depth at 5, but even depth-5 trees can overfit badly when every split is tuned against a small, noisy validation set.
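
To get a feel for the validation-noise point above, here is a small simulation of how much AUC moves by chance alone. It scores rows with pure random numbers, so the true AUC is 0.5, and measures the spread across repeated draws. The 5% fraud rate is a hypothetical value chosen for the demo, not the rate in either dataset.

```python
import random
from statistics import pstdev


def auc(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) identity; assumes no tied scores."""
    order = sorted(range(len(scores)), key=scores.__getitem__)
    rank_sum = sum(rank for rank, i in enumerate(order, start=1) if labels[i] == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def auc_spread(n_rows, fraud_rate, n_trials, rng):
    """Standard deviation of AUC for a scorer that carries no signal at all."""
    n_pos = max(1, round(n_rows * fraud_rate))
    labels = [1] * n_pos + [0] * (n_rows - n_pos)
    aucs = [auc(labels, [rng.random() for _ in labels]) for _ in range(n_trials)]
    return pstdev(aucs)


rng = random.Random(0)
spread_small = auc_spread(100, 0.05, 200, rng)    # 100-row validation set
spread_large = auc_spread(1000, 0.05, 200, rng)   # 1,000-row validation set

print(round(spread_small, 3), round(spread_large, 3))
```

If the chance spread on a 100-row validation set is of the same order as the improvements the agent is chasing, a "better" tree at a given step may just be a luckier one.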

Practical Verdict

Approach	Use Case	Verdict
Talking Trees for fraud	Production scoring	❌ No — too slow, too unreliable, too expensive
Talking Trees for fraud	Research / prototyping	⚠️ Maybe — useful as a baseline or teaching tool
Talking Trees for UCI benchmarks	Paper replication	✅ Yes — the authors’ results hold on their chosen data
XGBoost / CatBoost	Any fraud task	✅ Start here — faster, cheaper, more accurate

Raw Data

Talking Trees JSONs (30 LLM + 45 baseline runs)

Analysis scripts and my fork of the original code are available on request; they are not published.

Conclusion

I spent about $5 to generate a single decision tree that XGBoost beat by roughly 0.09 AUC in under a second. Talking Trees looks like a breakthrough on paper, and on clean, small datasets, it actually works. It uses LLM reasoning to find splits that greedy algorithms miss, and the research results are legitimate.

But fraud detection is not a UCI benchmark. Between the adversarial signals and the extreme class imbalance, the core assumption of the method collapses. The LLM does not reason through the complexity — it just overfits and falls back.

The lesson here is not that LLMs cannot handle tabular ML. It is that LLM-guided methods are dangerously dataset-dependent. If your data is pristine, give it a shot.

If you are fighting fraud, stick with XGBoost.

My implementation is a fork of the official Talking Trees repository¹ with modifications for Kimi K2.6 and my fraud datasets. Experiments ran on airig². GPU time and API credits were funded by Maxime Guerreiro.

¹ Official Talking Trees repository
² AMD 9900X + RTX 5090

This post was researched and drafted with AI assistance. I picked the ideas, reviewed every claim, and verified all benchmarks.
