I tortured this algorithm on a desktop GPU, and the bottlenecks were not where I expected. Every number you see below came from airig, an AMD 9900X machine with an RTX 5090 (32 GB), 64 GB RAM, and a stock Debian install.

The method is Dudarov & Prokhorenkova’s 2025 paper, Talking Trees: Conversational Decision Tree Learning (arXiv:2509.21465). Their reference implementation lives at github.com/yandex-research/TalkingTrees.

My own code and benchmarking harness are available on request, though they aren’t published yet. If you want the wider battlefield, check our companion post on soft distillation from tabular LLMs into gradient boosters, where we ran 52 methods across 5 datasets.

The real question now is whether these conversational trees can hold their own when we throw them into that same five-dataset gauntlet against all 52 methods.


What Is Talking Trees?

I used to assume decision trees were a solved problem—until I watched a large language model rewrite one branch at a time.

Dudarov & Prokhorenkova from Yandex Research built an agent loop that does exactly this. Their method interactively refines a DecisionTreeClassifier by handing the fitted tree to an LLM and demanding one specific improvement.

I start by training an initial DecisionTreeClassifier on the training set. Then I prompt the LLM to inspect the fitted tree, find one specific improvement, and return Python code that implements it.

I execute that code, evaluate on validation data, and keep the best tree seen so far. I repeat this for a fixed number of steps.
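
Here is a minimal sketch of that loop, not the authors’ implementation. The names `refine_tree`, `propose_code`, and `evaluate` are hypothetical: `propose_code` stands in for the LLM call and is assumed to return Python source that reads a variable `tree` and assigns the edited tree to `new_tree`, while `evaluate` stands in for validation ROC AUC. Whether each step builds on the latest edit or on the best tree so far is also an assumption here.

```python
from typing import Any, Callable, Tuple


def refine_tree(
    initial_tree: Any,
    propose_code: Callable[[Any], str],
    evaluate: Callable[[Any], float],
    n_steps: int = 15,
) -> Tuple[Any, float]:
    """Keep the best-scoring tree seen over a fixed number of LLM edit steps."""
    best_tree, best_score = initial_tree, evaluate(initial_tree)
    current = initial_tree
    for _ in range(n_steps):
        # Hypothetical LLM call: returns Python source that reads `tree`
        # and assigns the edited tree to `new_tree`.
        code = propose_code(current)
        namespace = {"tree": current}
        try:
            exec(code, namespace)  # run the proposed edit (sandbox this in practice)
            candidate = namespace["new_tree"]
            score = evaluate(candidate)
        except Exception:
            continue  # a broken proposal is skipped, not fatal
        current = candidate  # assumption: the next step builds on the latest edit
        if score > best_score:
            best_tree, best_score = candidate, score
    return best_tree, best_score


# Toy stand-ins so the loop runs without an LLM: the "tree" is a single
# threshold, each "edit" bumps it by one, and the score peaks at 3.
best, score = refine_tree(
    initial_tree=0,
    propose_code=lambda tree: "new_tree = tree + 1",
    evaluate=lambda tree: -abs(tree - 3),
)
print(best, score)  # 3 0
```

The loop keeps going after the score starts dropping, which is why tracking the best tree separately matters: the final edit is rarely the one you want to keep.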

The paper reports strong results on UCI benchmarks—often matching or beating gradient boosters on small datasets. The intuition is that the LLM’s reasoning about feature interactions and split quality can discover trees that greedy CART misses.

But I wanted to know: does this hold up on fraud data?


What We Changed

I cloned the official Talking Trees repository expecting a drop-in solution for our fraud pipeline. Reality hit fast: their code assumes a pristine Python environment and a compliant OpenAI endpoint, while our world runs mixed-type tabular data and Kimi K2.6 via OpenRouter.

The gap wasn’t architectural. It was mechanical. I spent an afternoon swapping out argument parsers, hardening the tree serializer against pandas categoricals, and wiring a dead-man’s switch so that an LLM timeout doesn’t silently nuke the model.

These aren’t glamorous changes, but they are the difference between a research demo and a system that survives contact with production logs. If you’re trying to reproduce this on your own stack, the upstream repo is still the right starting point.

The real question is whether your infrastructure is different enough that you’ll need the same five patches, or whether you’ll discover a sixth one the hard way.

| Change | Reason |
| --- | --- |
| Added Kimi K2.6 support | The original code supports OpenAI and local models; we added simpleaichat backend for Kimi K2.6 via OpenRouter |
| Fixed categorical handling | Our fraud-detection and ieee-cis datasets have mixed types; added .astype(str) before tree inspection to prevent pd.Categorical serialization errors |
| Removed simple-parsing dependency | Conflicted with our environment; replaced with direct argparse |
| Added fallback mechanism | If the LLM produces invalid code or times out, we fall back to the initial sklearn tree and log the failure |
| Metrics logging | Added per-run JSON output with train/val/test AUC, AP, Recall@1%FPR, tree depth, node count, and wall time |
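
The fallback and logging rows can be sketched roughly as follows. This is an illustration under assumptions, not the code from our fork: `run_with_fallback`, `refine`, and the record field names are hypothetical, and the wrapper only catches a timeout that the LLM client raises as an exception. It does not enforce one itself.

```python
import json
import os
import tempfile
import time
from typing import Any, Callable, Dict, Tuple


def run_with_fallback(
    initial_tree: Any,
    refine: Callable[[Any], Any],
    log_path: str,
) -> Tuple[Any, Dict[str, Any]]:
    """Run the LLM refinement; on any failure, return the initial tree and log why."""
    start = time.perf_counter()
    try:
        tree, error = refine(initial_tree), None
    except Exception as exc:  # invalid generated code, TimeoutError, API errors, ...
        tree, error = initial_tree, repr(exc)
    record = {
        "fell_back": error is not None,
        "error": error,
        "wall_time_s": round(time.perf_counter() - start, 3),
    }
    # One JSON object per line; metric fields (AUC, AP, ...) would be added here.
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return tree, record


def flaky_refine(tree: Any) -> Any:
    """Stand-in for an LLM refinement that never answers."""
    raise TimeoutError("LLM did not answer")


log_file = os.path.join(tempfile.mkdtemp(), "runs.jsonl")
tree, record = run_with_fallback("initial sklearn tree", flaky_refine, log_file)
print(tree, record["fell_back"])  # initial sklearn tree True
```

Logging the failure next to the result is what makes a fallback rate measurable afterwards, instead of a silent drop back to the baseline.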

I didn’t touch a single hyperparameter. Every setting matches the paper exactly: max_depth=5, class_weight='balanced', a maximum of 15 agent steps, and ROC AUC as the optimization metric. If the reproduction diverges, blame the code, not the config.


Setup

I wanted to see if LLMs could actually outrank real tree-based baselines on fraud data. I pitted Kimi K2.6 and GPT-5.5 against the same lineup from our V4 benchmark: sklearn’s DecisionTreeClassifier(max_depth=5), plus XGBoost and CatBoost.

I tested on two datasets: ieee-cis from Kaggle, subsampled to 1,000 and 5,000 rows, and Amazon’s FDB fraud-detection set at 500, 1,000, and 2,000 rows. I ran each configuration across random seeds 42, 43, and 44.

Every trial used a 60/20/20 train/val/test split with stratification. If the LLMs can’t beat models that train in seconds, we need to ask what exactly we’re paying the API bill for.
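
For readers who want to reproduce the split, here is a dependency-free sketch of a stratified 60/20/20 partition per seed. It is a stand-in rather than the harness code (in practice sklearn’s `train_test_split` with `stratify` does the same job), and `stratified_split` plus the 5% fraud rate in the example are assumptions for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[int],
    seed: int,
    fractions: Tuple[float, float, float] = (0.6, 0.2, 0.2),
) -> Tuple[List[int], List[int], List[int]]:
    """Return train/val/test row indices, preserving the class ratio in each part."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train: List[int] = []
    val: List[int] = []
    test: List[int] = []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)  # shuffle within the class, then slice 60/20/20
        n_train = round(fractions[0] * len(idxs))
        n_val = round(fractions[1] * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]
    return train, val, test


labels = [1] * 50 + [0] * 950  # hypothetical 1,000-row sample with 5% fraud
for seed in (42, 43, 44):
    train, val, test = stratified_split(labels, seed)
    n_val_fraud = sum(labels[i] for i in val)
    print(seed, len(train), len(val), len(test), n_val_fraud)
    # each seed prints sizes 600 200 200 with 10 fraud rows in val
```

Note how few positives land in validation at this size: the LLM’s step-by-step selection ends up steering on a handful of fraud rows.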


Results

Accuracy

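| Model | Test AUC | vs Sklearn d5 | vs XGBoost | Fallback Rate | Time |
| --- | --- | --- | --- | --- | --- |
| Kimi K2.6 | 0.676 ± 0.058 | +0.041 | −0.094 | 40% (6/15) | 184s |
| GPT-5.5 | 0.657 ± 0.061 | +0.035 | −0.113 | 7% (1/15) | 80s |
| Sklearn d5 | 0.634 ± 0.148 | | −0.136 | 0% | <0.1s |
| XGBoost | 0.770 ± 0.088 | +0.136 | | 0% | <1s |
| CatBoost | 0.774 ± 0.077 | +0.140 | +0.004 | 0% | <1s |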

[Figure: Talking Trees comparison]

Left panel: Direct same-config comparisons. Talking Trees (purple) edges out sklearn d5 (gray) but is crushed by gradient boosters (blue/orange). The gap is not marginal — it is ~10pp AUC, which in fraud detection is the difference between a usable model and a random guesser.

Right panel: LLM comparison. Kimi achieves higher peak accuracy but fails on 40% of runs (fallback to initial tree). GPT-5.5 is more reliable (7% fallback) but slightly less accurate on average.

Generalization Gap

[Figure: Train vs Test AUC]

I stared at the gap numbers and realized something was backwards. The diagonal on a generalization plot is exactly where you want to live. Fall below it and you’re overfitting.

Talking Trees didn’t just dip below the diagonal—it cratered.

Kimi left a train-test gap of 0.187. GPT-5.5 was even worse at 0.231.

By comparison, sklearn’s depth-5 tree shows a gap of roughly 0.05, which is typical for depth-5 trees. XGBoost lands near 0.08.

The LLM is fine-tuning splits to the training data. It finds thresholds that slice the training set perfectly, but those thresholds fall apart on unseen data.

That behavior makes sense once you look at the optimization target. The LLM is chasing validation AUC at every step, yet the tree structure it spits out memorizes training patterns instead of learning robust decision boundaries.

If the LLM is explicitly optimizing AUC and still building brittle thresholds, what is it seeing in the training set that gradient-boosted ensembles manage to ignore?

Cost Analysis

I did the math on my last agent experiment and almost closed my laptop. You are paying ~$0.03–0.05 per API call, and each run needs 5–15 steps to finish.

Scale that across a full benchmark suite and you are suddenly spending real money on every single run.

At what point does your API bill eclipse the gain you are chasing?

| Model | Cost/run | Δ vs sklearn | Cost per +0.01 AUC | Δ vs XGB |
| --- | --- | --- | --- | --- |
| Kimi | ~$5 | +0.041 | $1.22 | −0.094 |
| GPT-5.5 | ~$3 | +0.035 | $0.86 | −0.113 |
| XGBoost | $0 | +0.136 | $0 | |
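
The cost-per-gain column is plain arithmetic: cost per run divided by the AUC gain over sklearn, expressed in hundredths. A small sketch using the table’s figures (the function name `cost_per_hundredth_auc` is mine):

```python
def cost_per_hundredth_auc(cost_per_run: float, auc_gain: float) -> float:
    """Dollars spent per +0.01 AUC gained over the baseline."""
    return cost_per_run / (auc_gain / 0.01)


# (approximate cost per run in dollars, AUC gain over sklearn d5)
runs = {"Kimi": (5.0, 0.041), "GPT-5.5": (3.0, 0.035)}
for name, (cost, gain) in runs.items():
    print(f"{name}: ${cost_per_hundredth_auc(cost, gain):.2f} per +0.01 AUC")
# Kimi: $1.22 per +0.01 AUC
# GPT-5.5: $0.86 per +0.01 AUC
```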

I ran the numbers on Talking Trees, and the math is genuinely offensive. You’re paying dollars per run for a measly +0.04 AUC bump over sklearn. Meanwhile, XGBoost sits there costing nothing and beats Talking Trees by 0.09–0.11 AUC.

Even if your only goal is edging out sklearn d5, you’re burning $0.86–1.22 for every hundredth of an AUC point. Repeat that for every retrain in a production pipeline and you have an unaffordable burn rate.

Then there’s the fallback rate. 2 in 5 Kimi runs are wasted entirely. You eat the cost for runs that deliver nothing.

How many retraining runs will it take before that fractional lift costs more than the value it generates?


Why Does It Fail on Fraud Data?

I loved the paper’s UCI numbers until I ported the method to our fraud pipeline. Then everything fell apart. Four structural mismatches explain why benchmark success collapses the moment you swap clean tabular data for adversarial fraud signals.

First, the feature space explodes. ieee-cis carries 455 features after minimal preprocessing, far more than the few dozen features typical of UCI benchmarks. An LLM simply cannot reason about interactions in a 455-dimensional space from a text serialization of the tree.

Second, fraud is adversarial by design. Legitimate transactions look like fraud and vice versa because attackers deliberately hide their tracks. The LLM’s “common sense” about what makes a good split is actively misleading — it suggests splits that would separate obvious classes, but fraud is not obvious.

Third, the LLM optimizes val AUC at each step, yet our smallest subsampled configs leave it with validation sets of only 100–200 rows (20% of 500 or 1,000). That noise is lethal. The LLM chases spurious correlations, and the tree memorizes them.

Finally, Talking Trees has no way to penalize complexity. Unlike gradient boosters, which rely on shrinkage, subsampling, and early stopping, this method offers no guardrails. The tree grows to fit whatever the LLM suggests, and while the paper caps depth at 5, even depth-5 trees can overfit badly when splits are chosen adversarially.

Until someone shrinks that 455-dimensional space or injects real regularization into the loop, Talking Trees will keep memorizing noise on fraud data. Is there a way to constrain the LLM’s split suggestions without retraining the model itself?
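
The validation-noise point is easy to check with a toy simulation. The sketch below is not our benchmark code: it scores synthetic rows with one fixed, unchanging scorer and measures how much AUC moves from one validation draw to the next. The 10% positive rate, the Gaussian scores, and the names `roc_auc` and `auc_spread` are all assumptions for illustration.

```python
import random
import statistics
from typing import List


def roc_auc(pos_scores: List[float], neg_scores: List[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


def auc_spread(
    n_rows: int,
    positive_rate: float = 0.1,  # assumed rate, not the datasets' real fraud rate
    n_draws: int = 100,
    seed: int = 42,
) -> float:
    """Standard deviation of AUC across repeated validation sets of one size."""
    rng = random.Random(seed)
    n_pos = max(1, round(n_rows * positive_rate))
    n_neg = n_rows - n_pos
    aucs = []
    for _ in range(n_draws):
        # Same scorer every draw: positives score one standard deviation higher.
        pos = [rng.gauss(1.0, 1.0) for _ in range(n_pos)]
        neg = [rng.gauss(0.0, 1.0) for _ in range(n_neg)]
        aucs.append(roc_auc(pos, neg))
    return statistics.stdev(aucs)


for n_rows in (100, 200, 400):
    print(n_rows, round(auc_spread(n_rows), 3))
```

Nothing about the scorer changes between draws, so all of the spread is sampling noise. If that spread at 100 rows is comparable to the gain a single split edit can produce, an agent that accepts edits on validation AUC has no way to tell a real improvement from a lucky draw.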


Practical Verdict

| Approach | Use Case | Verdict |
| --- | --- | --- |
| Talking Trees for fraud | Production scoring | ❌ No: too slow, too unreliable, too expensive |
| Talking Trees for fraud | Research / prototyping | ⚠️ Maybe: useful as a baseline or teaching tool |
| Talking Trees for UCI benchmarks | Paper replication | ✅ Yes: the authors’ results hold on their chosen data |
| XGBoost / CatBoost | Any fraud task | ✅ Start here: faster, cheaper, more accurate |

Raw Data

You want the raw data? I’ve dumped the full set into Talking Trees JSONs (30 LLM + 45 baseline runs). Download it and dig in.

I’m not publishing the analysis scripts or our fork. If you need them, just ask—I’ll send them over directly.

What would you change if you had the full toolchain in front of you right now?


Conclusion

I wanted Talking Trees to work on our fraud pipeline. The paper promises a simple trick: hand an LLM your features, let it reason about splits that greedy algorithms miss, and watch accuracy climb. On those tidy UCI benchmarks, the results genuinely hold up.

But fraud detection is not a UCI benchmark. Out here you face high dimensionality, adversarial signal, and extreme class imbalance. Those three conditions break the core assumption that an LLM’s reasoning about feature splits transfers to complex real-world data.

What actually happens? The LLM overfits. It falls back.

It costs you $5 to produce a tree that XGBoost beats by 10pp in under a second.

The right takeaway is not that LLMs can’t do tabular ML. It is that LLM-guided methods are dataset-dependent in ways that are hard to predict without running the experiment.

On clean data, try it. On fraud data, use XGBoost.

Until someone finds a cheap pre-check for transfer, every tabular team will keep burning $5 just to discover what XGBoost already knew in under a second.


I didn’t build this from scratch—that would be a waste of a perfectly good research codebase. Our implementation forks the official Talking Trees repository, but I modified it to play nicely with Kimi K2.6 and our internal fraud datasets.

I ran everything on airig, a box with an AMD 9900X and an RTX 5090. GPU time and API credits came courtesy of Maxime Guerreiro.

I’m already wondering which will choke first—larger fraud datasets or the RTX 5090’s memory bandwidth—because that threshold is where this setup gets interesting.