Machine: airig — AMD 9900X, RTX 5090 (24 GB), 64 GB RAM, Debian
Paper: Dudarov & Prokhorenkova, Talking Trees: Conversational Decision Tree Learning, 2025. arXiv:2509.21465
Original code: github.com/yandex-research/TalkingTrees
Our code: (available on request — not published)
Related: See our companion post on soft distillation from tabular LLMs into gradient boosters for a broader benchmark of 52 methods across 5 datasets.
What Is Talking Trees?
Dudarov & Prokhorenkova (Yandex Research) propose a method where an LLM interactively refines a decision tree. The agent loop is:
- Train an initial DecisionTreeClassifier on the training set.
- Prompt the LLM to inspect the fitted tree, find one specific improvement, and return Python code that implements it.
- Execute the code, evaluate on validation data, and keep the best tree seen so far.
- Repeat for a fixed number of steps.
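The loop above can be sketched in a few lines. This is a minimal illustration, not the official repository's API: `refine`, `propose`, and `evaluate` are hypothetical names, `propose` stands in for the LLM call plus execution of the returned code, and feeding the best tree (rather than the most recent one) back into the next step is an assumption.

```python
from typing import Any, Callable, Tuple


def refine(
    initial_model: Any,
    propose: Callable[[Any], Any],
    evaluate: Callable[[Any], float],
    n_steps: int = 15,
) -> Tuple[Any, float, int]:
    """Keep-the-best refinement loop (hypothetical sketch).

    propose(model)  -> candidate model; may raise if the generated code is invalid.
    evaluate(model) -> validation score, higher is better (e.g. ROC AUC).
    """
    best_model = initial_model
    best_score = evaluate(initial_model)
    failures = 0
    for _ in range(n_steps):
        try:
            candidate = propose(best_model)
            score = evaluate(candidate)
        except Exception:
            # Invalid or crashing candidate: count it and keep the current best.
            failures += 1
            continue
        if score > best_score:
            best_model, best_score = candidate, score
    return best_model, best_score, failures
```

If every step fails, the function returns the initial model unchanged, so the starting tree is always the worst case on validation data.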
The paper reports strong results on UCI benchmarks — often matching or beating gradient boosters on small datasets. The intuition is that the LLM’s reasoning about feature interactions and split quality can discover trees that greedy CART misses.
We wanted to know: does this hold up on fraud data?
What We Changed
We started from the official Talking Trees repository and made the following modifications:
| Change | Reason |
|---|---|
| Added Kimi K2.6 support | The original code supports OpenAI and local models; we added a simpleaichat backend for Kimi K2.6 via OpenRouter |
| Fixed categorical handling | Our fraud-detection and ieee-cis datasets have mixed types; added .astype(str) before tree inspection to prevent pd.Categorical serialization errors |
| Removed simple-parsing dependency | Conflicted with our environment; replaced with direct argparse |
| Added fallback mechanism | If the LLM produces invalid code or times out, we fall back to the initial sklearn tree and log the failure |
| Metrics logging | Added per-run JSON output with train/val/test AUC, AP, Recall@1%FPR, tree depth, node count, and wall time |
All other hyperparameters match the paper: max_depth=5, class_weight='balanced', max 15 agent steps, ROC AUC as the optimization metric.
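Of the logged metrics, Recall@1%FPR is the least standard, so here is one way it could be computed. This is a sketch under a conservative definition (flag only scores strictly above a threshold that lets through at most 1% of negatives); the function name `recall_at_fpr` is hypothetical and the definition used in our logging code may differ in how ties are handled.

```python
import math
from typing import Sequence


def recall_at_fpr(y_true: Sequence[int], scores: Sequence[float], max_fpr: float = 0.01) -> float:
    """Recall over positives when at most max_fpr of negatives are flagged."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = sorted((s for y, s in zip(y_true, scores) if y == 0), reverse=True)
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Number of false positives the budget allows (small epsilon guards float error).
    allowed_fp = int(max_fpr * len(neg) + 1e-9)
    # Flag scores strictly above the (allowed_fp + 1)-th highest negative score,
    # so at most allowed_fp negatives are ever flagged, even with ties.
    threshold = neg[allowed_fp] if allowed_fp < len(neg) else -math.inf
    return sum(s > threshold for s in pos) / len(pos)
```

On small test sets the false-positive budget is tiny (1% of 90 negatives allows zero false positives), so this metric is very noisy in the smallest configurations.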
Setup
- 2 datasets: ieee-cis (Kaggle fraud, subsampled to N=1k/5k) and fraud-detection (Amazon FDB, N=500/1k/2k)
- 3 random seeds: 42, 43, 44
- 2 LLMs: Kimi K2.6 and GPT-5.5
- Baselines: sklearn DecisionTreeClassifier(max_depth=5), XGBoost, CatBoost (from our V4 benchmark)
- Split: 60% train / 20% val / 20% test, stratified
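To make the split concrete, here is a dependency-free sketch of a stratified 60/20/20 split over row indices. The helper name `stratified_split` and the 10% fraud rate in the example are assumptions for illustration; in practice the same effect is usually obtained with two calls to scikit-learn's `train_test_split` using its `stratify` argument.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[Hashable],
    fractions: Tuple[float, float, float] = (0.6, 0.2, 0.2),
    seed: int = 42,
) -> Tuple[List[int], List[int], List[int]]:
    """Return (train, val, test) index lists with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to test
    return train, val, test


# Example: N=500 with a hypothetical 10% fraud rate.
labels = [1] * 50 + [0] * 450
train_idx, val_idx, test_idx = stratified_split(labels, seed=42)
print(len(train_idx), len(val_idx), len(test_idx))
```

At N=500 the validation set has only 100 rows, and under the assumed 10% rate only 10 of them are fraud cases, which matters for how noisy validation AUC is.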
Results
Accuracy
| Model | Test AUC | vs Sklearn d5 | vs XGBoost | Fallback Rate | Time |
|---|---|---|---|---|---|
| Kimi K2.6 | 0.676 ± 0.058 | +0.041 | −0.094 | 40% (6/15) | 184s |
| GPT-5.5 | 0.657 ± 0.061 | +0.035 | −0.113 | 7% (1/15) | 80s |
| Sklearn d5 | 0.634 ± 0.148 | — | −0.136 | 0% | <0.1s |
| XGBoost | 0.770 ± 0.088 | +0.136 | — | 0% | <1s |
| CatBoost | 0.774 ± 0.077 | +0.140 | +0.004 | 0% | <1s |
Left panel: Direct same-config comparisons. Talking Trees (purple) edges out sklearn d5 (gray) but trails the gradient boosters (blue/orange) by a wide margin. The gap is not marginal: it is roughly 10pp AUC (0.09 to 0.11 in the table above), which in fraud detection can be the difference between a usable model and one that is not worth deploying.
Right panel: LLM comparison. Kimi achieves higher peak accuracy but fails on 40% of runs (fallback to initial tree). GPT-5.5 is more reliable (7% fallback) but slightly less accurate on average.
Generalization Gap
The diagonal is perfect generalization. Points below it are overfitting. Talking Trees clusters far below the diagonal:
- Kimi: train-test gap = 0.187
- GPT-5.5: train-test gap = 0.231
- sklearn d5: train-test gap ≈ 0.05 (typical for depth-5 trees)
- XGBoost: train-test gap ≈ 0.08
The LLM fine-tunes splits to the training data. It creates thresholds that separate the training set beautifully but do not generalize. Selecting on validation AUC at each step does not prevent this: the tree structure the LLM produces still memorizes training patterns rather than learning robust decision boundaries.
Cost Analysis
At ~$0.03–0.05 per API call and 5–15 steps per run:
| Model | Cost/run | Δ vs sklearn | Cost per +0.01 AUC | Δ vs XGB |
|---|---|---|---|---|
| Kimi | ~$5 | +0.041 | $1.22 | −0.094 |
| GPT-5.5 | ~$3 | +0.035 | $0.86 | −0.113 |
| XGBoost | $0 | +0.136 | $0 | — |
The economics are brutal. Talking Trees costs dollars per run for a +0.04 AUC improvement over sklearn. XGBoost costs nothing and delivers 0.09 to 0.11 AUC more than Talking Trees.
Even if you only care about beating sklearn d5 (not XGBoost), the cost per +0.01 AUC is $0.86–1.22, and it is paid again on every training run. For a production pipeline scoring millions of transactions, that is hard to justify, and the fallback rate means 2 in 5 Kimi runs are wasted entirely.
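The "Cost per +0.01 AUC" column is simple arithmetic on the table above: cost per run divided by the AUC gain expressed in hundredths. A small sketch, with the hypothetical helper name `cost_per_auc_point`:

```python
def cost_per_auc_point(cost_per_run_usd: float, delta_auc: float, point: float = 0.01) -> float:
    """Dollars spent per `point` of AUC gained over the baseline."""
    if delta_auc <= 0:
        raise ValueError("delta_auc must be positive to define a cost per gain")
    return cost_per_run_usd / (delta_auc / point)


# Values taken from the cost table: (approximate cost per run, AUC gain vs sklearn d5).
runs = {"Kimi": (5.0, 0.041), "GPT-5.5": (3.0, 0.035)}
for name, (cost, delta) in runs.items():
    print(f"{name}: ${cost_per_auc_point(cost, delta):.2f} per +0.01 AUC")
# Kimi: $1.22 per +0.01 AUC
# GPT-5.5: $0.86 per +0.01 AUC
```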
Why Does It Fail on Fraud Data?
The paper reports strong results on UCI datasets. We see four reasons fraud data breaks the method:
- High dimensionality: ieee-cis has 455 features after minimal preprocessing, far more than the UCI benchmarks, which have a few dozen features. The LLM cannot reason about interactions in a 455-dimensional space from a text serialization of the tree.
- Adversarial signal: Fraud patterns are deliberately hidden. Legitimate transactions look like fraud and vice versa. The LLM’s “common sense” about what makes a good split is actively misleading: it suggests splits that would separate obvious classes, but fraud is not obvious.
- Overfitting to validation: The LLM optimizes val AUC at each step. On small val sets (as few as 100–200 rows in our smaller subsampled configs), this is noisy. The LLM chases spurious correlations and the tree memorizes them.
- No regularization mechanism: Unlike gradient boosters (shrinkage, subsampling, early stopping), Talking Trees has no way to penalize complexity. The tree grows to fit whatever the LLM suggests. The paper caps depth at 5, but even depth-5 trees can overfit badly when the splits themselves are fit to noise.
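The validation-noise point can be illustrated with a small simulation. This is not our experiment: it scores a 100-row validation set with pure random noise, repeats that 15 times (one per agent step), and keeps the best AUC. The 10% fraud rate and the names `auc` and `best_of_k_val_auc` are assumptions for illustration.

```python
import random
from typing import List, Sequence, Tuple


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_of_k_val_auc(
    n_val: int = 100, fraud_rate: float = 0.1, k: int = 15, trials: int = 100, seed: int = 42
) -> Tuple[float, float]:
    """Average val AUC of one random scorer vs the best of k random scorers."""
    rng = random.Random(seed)
    n_pos = max(1, round(n_val * fraud_rate))
    y: List[int] = [1] * n_pos + [0] * (n_val - n_pos)
    single, best = [], []
    for _ in range(trials):
        # Each candidate assigns uninformative random scores to the val rows.
        aucs = [auc(y, [rng.random() for _ in y]) for _ in range(k)]
        single.append(aucs[0])
        best.append(max(aucs))
    return sum(single) / trials, sum(best) / trials


mean_single, mean_best = best_of_k_val_auc()
print(f"one random candidate: {mean_single:.3f}, best of 15: {mean_best:.3f}")
```

A single random scorer averages about 0.5, while the best of 15 lands well above it even though none of the candidates carries signal. Keeping the best validation score across steps introduces the same kind of optimism.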
Practical Verdict
| Approach | Use Case | Verdict |
|---|---|---|
| Talking Trees for fraud | Production scoring | ❌ No — too slow, too unreliable, too expensive |
| Talking Trees for fraud | Research / prototyping | ⚠️ Maybe — useful as a baseline or teaching tool |
| Talking Trees for UCI benchmarks | Paper replication | ✅ Yes — the authors’ results hold on their chosen data |
| XGBoost / CatBoost | Any fraud task | ✅ Start here — faster, cheaper, more accurate |
Raw Data
Analysis scripts and our fork of the original code are available on request — not published.
Conclusion
Talking Trees is a clever idea and an impressive research artifact. On small clean datasets, an LLM can absolutely improve a decision tree by reasoning about splits that greedy algorithms miss. The paper’s results are real, on their benchmarks.
But fraud detection is not a UCI benchmark. The high dimensionality, adversarial signal, and extreme class imbalance break the method’s core assumption: that an LLM’s reasoning about feature splits transfers to complex real-world data. It does not. The LLM overfits. It falls back. It costs roughly $3–5 to produce a tree that XGBoost beats by about 10pp in under a second.
The right takeaway is not “LLMs can’t do tabular ML.” It is “LLM-guided methods are dataset-dependent in ways that are hard to predict without running the experiment.” On clean data, try it. On fraud data, use XGBoost.
Our implementation is a fork of the official Talking Trees repository with modifications for Kimi K2.6 and our fraud datasets. Experiments run on airig (AMD 9900X + RTX 5090). GPU time and API credits funded by Maxime Guerreiro.

