Replicating Talking Trees: LLMs for fraud detection

Machine: airig (AMD 9900X, RTX 5090 with 32 GB VRAM, 64 GB RAM, Debian)
Paper: Dudarov & Prokhorenkova, Talking Trees: Conversational Decision Tree Learning, 2025. arXiv:2509.21465
Original code: github.com/yandex-research/TalkingTrees
Our code: (available on request — not published)
Related: See our companion post on soft distillation from tabular LLMs into gradient boosters for a broader benchmark of 52 methods across 5 datasets.

What Is Talking Trees?

Dudarov & Prokhorenkova (Yandex Research) propose a method where an LLM interactively refines a decision tree. The agent loop is:

Train an initial DecisionTreeClassifier on the training set.
Prompt the LLM to inspect the fitted tree, find one specific improvement, and return Python code that implements it.
Execute the code, evaluate on validation data, and keep the best tree seen so far.
Repeat for a fixed number of steps.
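
The loop above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `refine`, `propose`, and `score` are hypothetical names, `propose` stands in for the prompt-and-execute step, and proposing from the best candidate so far (rather than the most recent one) is an assumption.

```python
from typing import Callable, Optional, Tuple, TypeVar

Model = TypeVar("Model")


def refine(
    initial: Model,
    propose: Callable[[Model], Optional[Model]],
    score: Callable[[Model], float],
    n_steps: int = 15,
) -> Tuple[Model, float]:
    """Keep-best refinement loop.

    initial -- starting model (a fitted decision tree in the post)
    propose -- hypothetical stand-in for "prompt the LLM and run its code";
               returns a new candidate, or None if the step failed
    score   -- validation metric, higher is better (ROC AUC in the post)
    """
    best, best_score = initial, score(initial)
    for _ in range(n_steps):
        candidate = propose(best)  # assumption: always refine the best so far
        if candidate is None:      # invalid code, timeout, and so on
            continue
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


# Toy stand-ins: a "model" is an integer and the score peaks at 7.
toy_best, toy_score = refine(
    initial=5,
    propose=lambda m: m + 1,
    score=lambda m: -abs(m - 7),
    n_steps=5,
)
print(toy_best, toy_score)
```

In the toy run the loop climbs from 5 to 7 and then rejects every later proposal, because none of them improves the score.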

The paper reports strong results on UCI benchmarks — often matching or beating gradient boosters on small datasets. The intuition is that the LLM’s reasoning about feature interactions and split quality can discover trees that greedy CART misses.

We wanted to know: does this hold up on fraud data?

What We Changed

We started from the official Talking Trees repository and made the following modifications:

Change	Reason
Added Kimi K2.6 support	The original code supports OpenAI and local models; we added a `simpleaichat` backend for Kimi K2.6 via OpenRouter
Fixed categorical handling	Our `fraud-detection` and `ieee-cis` datasets have mixed types; added `.astype(str)` before tree inspection to prevent `pd.Categorical` serialization errors
Removed `simple-parsing` dependency	Conflicted with our environment; replaced with direct `argparse`
Added fallback mechanism	If the LLM produces invalid code or times out, we fall back to the initial sklearn tree and log the failure
Metrics logging	Added per-run JSON output with train/val/test AUC, AP, Recall@1%FPR, tree depth, node count, and wall time
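
The fallback row in the table can be sketched as follows. This is an illustration under assumptions, not our actual patch: `run_with_fallback` and the `improve` entry point are hypothetical names, the "tree" is any Python object, and timeouts and sandboxing are left out entirely.

```python
import json
import logging
from typing import Any, Dict

logger = logging.getLogger("talking_trees_fallback")


def run_with_fallback(llm_code: str, initial_tree: Any,
                      entry_point: str = "improve") -> Dict[str, Any]:
    """Execute LLM-written code; return the initial tree if anything fails."""
    namespace: Dict[str, Any] = {}
    try:
        # NOTE: exec runs arbitrary code with no isolation; real use needs a sandbox.
        exec(llm_code, namespace)
        new_tree = namespace[entry_point](initial_tree)
        if new_tree is None:
            raise ValueError("LLM code returned no tree")
        return {"tree": new_tree, "fallback": False, "error": None}
    except Exception as exc:  # syntax errors, missing entry point, runtime errors
        logger.warning("falling back to initial tree: %r", exc)
        return {"tree": initial_tree, "fallback": True, "error": repr(exc)}


good = run_with_fallback("def improve(tree):\n    return tree + ['extra split']", ["root"])
bad = run_with_fallback("def improve(tree):\n    return tree.missing()", ["root"])

# Per-run record in the spirit of the JSON logging row (field names are made up).
print(json.dumps({"fallback": bad["fallback"], "error": bad["error"]}))
```

Catching a broad `Exception` is deliberate here: any failure in generated code should cost one step, not the whole run.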

All other hyperparameters match the paper: max_depth=5, class_weight='balanced', max 15 agent steps, ROC AUC as the optimization metric.
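
For readers unfamiliar with the two headline metrics, here is a dependency-free sketch. It assumes the common definitions (ROC AUC as the probability that a positive outscores a negative, and Recall@1%FPR as the best recall among thresholds with a false-positive rate of at most 1%); the function names are ours, and a real pipeline would more likely use a library implementation.

```python
from typing import Sequence


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def recall_at_fpr(y_true: Sequence[int], y_score: Sequence[float],
                  max_fpr: float = 0.01) -> float:
    """Best recall over thresholds whose false-positive rate is <= max_fpr."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")
    best = 0.0
    for thr in set(y_score):  # quadratic, fine for a sketch
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= thr and y == 0)
        if fp / n_neg <= max_fpr:
            best = max(best, tp / n_pos)
    return best


labels = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.8, 0.7, 0.9]
print(roc_auc(labels, scores), recall_at_fpr(labels, scores))
```

On this toy input one legitimate row scores above one fraud row, so AUC is 7/8 and only half of the fraud is caught before the first false positive.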

Setup

2 datasets: ieee-cis (Kaggle fraud, subsampled to N=1k/5k) and fraud-detection (Amazon FDB, N=500/1k/2k)
3 random seeds: 42, 43, 44
2 LLMs: Kimi K2.6 and GPT-5.5
Baselines: sklearn DecisionTreeClassifier(max_depth=5), XGBoost, CatBoost (from our V4 benchmark)
Split: 60% train / 20% val / 20% test, stratified
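
A stratified 60/20/20 split can be sketched with the standard library alone. This is not our pipeline code: `stratified_split` is a hypothetical helper, and the 5% fraud rate in the example is an assumption chosen for illustration. In practice, scikit-learn's `train_test_split` with its `stratify` argument, applied twice, is the usual choice.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(labels: Sequence[int], seed: int = 42,
                     fractions: Tuple[float, float, float] = (0.6, 0.2, 0.2)
                     ) -> Dict[str, List[int]]:
    """Return row indices for train/val/test, preserving the class ratio in each part."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    parts: Dict[str, List[int]] = {"train": [], "val": [], "test": []}
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        parts["train"] += indices[:n_train]
        parts["val"] += indices[n_train:n_train + n_val]
        parts["test"] += indices[n_train + n_val:]  # remainder goes to test
    return parts


# Assumed example: N=1000 rows with a 5% positive rate.
example_labels = [1] * 50 + [0] * 950
parts = stratified_split(example_labels, seed=42)
val_positives = sum(example_labels[i] for i in parts["val"])
print({name: len(rows) for name, rows in parts.items()}, val_positives)
```

With these assumed numbers the validation set holds 200 rows but only 10 positives, which is worth keeping in mind when reading validation AUC.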

Results

Accuracy

Model	Test AUC	vs Sklearn d5	vs XGBoost	Fallback Rate	Time
Kimi K2.6	0.676 ± 0.058	+0.041	−0.094	40% (6/15)	184s
GPT-5.5	0.657 ± 0.061	+0.035	−0.113	7% (1/15)	80s
Sklearn d5	0.634 ± 0.148	—	−0.136	0%	<0.1s
XGBoost	0.770 ± 0.088	+0.136	—	0%	<1s
CatBoost	0.774 ± 0.077	+0.140	+0.004	0%	<1s

Left panel: Direct same-config comparisons. Talking Trees (purple) edges out sklearn d5 (gray) but trails the gradient boosters (blue/orange) by a wide margin. The gap is not marginal: it is roughly 0.09 to 0.11 AUC, which in fraud detection can separate a usable model from one that is not worth deploying.

Right panel: LLM comparison. Kimi achieves higher peak accuracy but fails on 40% of runs (fallback to initial tree). GPT-5.5 is more reliable (7% fallback) but slightly less accurate on average.

Generalization Gap

In the train-versus-test AUC plot, the diagonal marks perfect generalization and points below it indicate overfitting. Talking Trees clusters far below the diagonal:

Kimi: train-test gap = 0.187
GPT-5.5: train-test gap = 0.231
sklearn d5: train-test gap ≈ 0.05 (typical for depth-5 trees)
XGBoost: train-test gap ≈ 0.08

The LLM fine-tunes splits to the training data: it picks thresholds that separate the training set cleanly but do not generalize. Selecting on validation AUC at each step does not prevent this, because every candidate tree is still shaped by training patterns and the validation sets are small, so the kept tree memorizes rather than learning robust decision boundaries.
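
As a back-of-envelope check, the gaps above can be combined with the mean test AUCs from the accuracy table to get the training AUC each method would need. This assumes each gap is mean train AUC minus mean test AUC, which is our reading rather than something the logs confirm, and the sklearn and XGBoost gaps are approximate to begin with.

```python
# (mean test AUC, train-test gap) as quoted in this post
reported = {
    "Kimi K2.6": (0.676, 0.187),
    "GPT-5.5": (0.657, 0.231),
    "sklearn d5": (0.634, 0.05),
    "XGBoost": (0.770, 0.08),
}

# Assumption: gap = mean train AUC - mean test AUC, so train = test + gap.
implied_train = {name: test + gap for name, (test, gap) in reported.items()}

for name, train_auc in implied_train.items():
    print(f"{name:<11} implied train AUC ~ {train_auc:.3f}")
```

Under that reading, both LLM-refined trees fit the training set at least as well as XGBoost does while scoring about 0.1 lower on test.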

Cost Analysis

At ~$0.03–0.05 per API call and 5–15 steps per run:

Model	Cost/run	Δ vs sklearn	Cost per +0.01 AUC	Δ vs XGB
Kimi	~$5	+0.041	$1.22	−0.094
GPT-5.5	~$3	+0.035	$0.86	−0.113
XGBoost	$0	+0.136	$0	—
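
The "cost per +0.01 AUC" column is simply the cost per run divided by the number of 0.01-AUC steps gained over sklearn. A short sketch that reproduces the table from its own inputs (`cost_per_point` is our name for the helper):

```python
def cost_per_point(cost_per_run: float, delta_auc: float) -> float:
    """Dollars spent per +0.01 AUC gained over the sklearn baseline."""
    return cost_per_run / (delta_auc / 0.01)


# (approximate cost per run in dollars, AUC gain over sklearn d5), from the table
runs = {"Kimi": (5.0, 0.041), "GPT-5.5": (3.0, 0.035)}

for name, (cost, delta) in runs.items():
    print(f"{name}: ${cost_per_point(cost, delta):.2f} per +0.01 AUC")
```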

The economics are brutal. Talking Trees costs dollars per run for a +0.04 AUC improvement over sklearn. XGBoost costs nothing and delivers 0.09 to 0.11 AUC more than Talking Trees.

Even if you only care about beating sklearn d5 (not XGBoost), the cost per +0.01 AUC is $0.86–1.22. That cost is paid per training run rather than per scored transaction, but a production pipeline that retrains regularly pays it every time, and the fallback rate means 2 in 5 Kimi runs are wasted entirely.

Why Does It Fail on Fraud Data?

The paper reports strong results on UCI datasets. We see four reasons fraud data breaks the method:

High dimensionality: ieee-cis has 455 features after minimal preprocessing, far more than the few dozen features typical of the UCI benchmarks. The LLM cannot reason about interactions in a 455-dimensional space from a text serialization of the tree.
Adversarial signal: Fraud patterns are deliberately hidden. Legitimate transactions look like fraud and vice versa. The LLM’s “common sense” about what makes a good split is actively misleading — it suggests splits that would separate obvious classes, but fraud is not obvious.
Overfitting to validation: The LLM optimizes val AUC at each step. On small val sets (as few as 100–200 rows for our smallest subsampled configs), this is noisy. The LLM chases spurious correlations and the tree memorizes them.
No regularization mechanism: Unlike gradient boosters (shrinkage, subsampling, early stopping), Talking Trees has no way to penalize complexity. The tree grows to fit whatever the LLM suggests. The paper caps depth at 5, but even depth-5 trees can overfit badly when their splits are tuned to noise.
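
The validation-noise point can be illustrated with a small simulation. It is a deliberately extreme sketch, not a model of our runs: every candidate is a pure-noise scorer, the validation set has 100 rows, and the 5% fraud rate is an assumption. It shows how picking the best of 15 candidates on a tiny validation set inflates validation AUC even when no candidate has any signal.

```python
import random
from statistics import mean


def auc(pos, neg):
    """Fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def noise_auc(rng, n_pos, n_neg):
    """AUC of a scorer with no signal: scores are independent of the label."""
    pos = [rng.random() for _ in range(n_pos)]
    neg = [rng.random() for _ in range(n_neg)]
    return auc(pos, neg)


def best_of_k(rng, k=15, n_pos=5, n_neg=95):
    """Select the candidate with the highest val AUC; return its (val, test) AUC."""
    # For a pure-noise scorer, test AUC is drawn independently of val AUC.
    candidates = [(noise_auc(rng, n_pos, n_neg), noise_auc(rng, n_pos, n_neg))
                  for _ in range(k)]
    return max(candidates, key=lambda c: c[0])


rng = random.Random(0)
trials = [best_of_k(rng) for _ in range(200)]
mean_val = mean(v for v, _ in trials)
mean_test = mean(t for _, t in trials)
print(f"selected val AUC ~ {mean_val:.2f}, same candidates on test ~ {mean_test:.2f}")
```

The selected validation AUC sits well above 0.5 while test AUC stays near chance, which is the same direction of bias as the gap discussed above, only in its purest form.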

Practical Verdict

Approach	Use Case	Verdict
Talking Trees for fraud	Production scoring	❌ No — too slow, too unreliable, too expensive
Talking Trees for fraud	Research / prototyping	⚠️ Maybe — useful as a baseline or teaching tool
Talking Trees for UCI benchmarks	Paper replication	✅ Yes — the authors’ results hold on their chosen data
XGBoost / CatBoost	Any fraud task	✅ Start here — faster, cheaper, more accurate

Raw Data

Talking Trees JSONs (30 LLM + 45 baseline runs)

Analysis scripts and our fork of the original code are available on request — not published.

Conclusion

Talking Trees is a clever idea and an impressive research artifact. On small clean datasets, an LLM can absolutely improve a decision tree by reasoning about splits that greedy algorithms miss. The paper’s results are real on the benchmarks it uses.

But fraud detection is not a UCI benchmark. The high dimensionality, adversarial signal, and extreme class imbalance break the method’s core assumption: that an LLM’s reasoning about feature splits transfers to complex real-world data. It does not. The LLM overfits. It falls back. It costs $5 to produce a tree that XGBoost beats by 10pp in under a second.

The right takeaway is not “LLMs can’t do tabular ML.” It is “LLM-guided methods are dataset-dependent in ways that are hard to predict without running the experiment.” On clean data, try it. On fraud data, use XGBoost.

Our implementation is a fork of the official Talking Trees repository with modifications for Kimi K2.6 and our fraud datasets. Experiments run on airig (AMD 9900X + RTX 5090). GPU time and API credits funded by Maxime Guerreiro.

This post was researched and drafted with AI assistance. I picked the ideas, reviewed every claim, and verified all benchmarks.
