Machine: airig — AMD Ryzen 9 9900X, NVIDIA RTX 5090 FE, 64 GB RAM, Debian trixie
Software: Python 3.13.5, torch 2.12+cu130, tabpfn 8.0.3, tabicl 2.1.1, xgboost 3.0.3, catboost 1.2.8, lightgbm 4.6.0
Code & Data: v4_raw_results.zip (22 JSON files, 1,188 individual run records)
The Pitch and The Problem
Tabular foundation models promise a seductive value proposition: no hyperparameter tuning, no feature engineering, competitive accuracy out of the box. For data scientists drowning in spreadsheets, this sounds like a dream.
But fraud detection is not a spreadsheet problem. It is an adversarial, high-dimensional, heavily-engineered domain where Vesta’s anonymized feature interactions, device fingerprints, and transaction timestamps create a signal landscape that looks nothing like the clean UCI datasets these models were trained on.
So we ran an experiment. Not on iris or wine quality — on ieee-cis, the Kaggle fraud competition with 590k rows, 455 features, and a 3.5% fraud rate. And on fraud-detection from the Amazon FDB suite. And on four other datasets spanning the easy-to-hard spectrum.
We tested 52 method variants across 22 configurations with 3 random seeds each. The question was simple: when the data gets hard, do foundation models still win?
They do not.
What We Tested
| Category | Methods |
|---|---|
| Foundation models | TabPFN, TabICL, TabICL-FT |
| Gradient boosters | XGBoost, CatBoost, LightGBM (default + fast variants) |
| Soft distillation | Train GBMs on PFN/ICL probability outputs as soft labels |
| Teacher-as-feature | Append PFN/ICL probabilities as extra input features |
| Stacking | LogisticMeta (5-base), XGBMeta (5-base) |
| Ensembles | CV-tuned weighted averages, PFN+ICL fixed-α ensemble |
| Neural nets | MLP on raw features, MLP on teacher-augmented features |
The experiment design:
- ieee-cis: 1k / 2k / 5k / 10k / 20k training rows
- fraud-detection: 500 / 1k / 2k / full training rows
- fake-job, click-small: 1k / 2k / full
- internet-ads: PCA 50/100/200 and SelectK 50/100/200
All metrics: ROC AUC, Average Precision, Recall@1%FPR, Recall@5%FPR, fit time, predict time.
Preprocessing. No domain-specific engineering was added. Numerical columns were median-imputed and z-score standardized. Categoricals were cast to string, filled with "missing", and factorized with a train+test union mapping. Timestamps and card identifiers were kept as-is. This is intentionally minimal — we wanted to test model families, not feature engineering pipelines. GBMs used their library defaults: 200 trees, depth 6, learning rate 0.1, with scale_pos_weight or class_weight="balanced" for imbalance. No hyperparameter search was performed for any method.
One note on fairness. TabPFN and TabICL are zero-shot: no training hyperparameters, no grid search. Our GBMs also use fixed defaults — but those defaults were chosen by XGBoost/CatBoost/LightGBM authors over years of tuning. If you ran a Bayesian optimization sweep on the GBMs, the gap would likely widen further. We did not, to keep the comparison conservative.
Method Glossary
We use short suffixes everywhere to keep tables readable. Here is what each one means in plain English.
| Short name | Full meaning | What it actually does |
|---|---|---|
| TabPFN | Prior-Fitted Network | Transformer pretrained on synthetic tabular tasks; zero-shot inference on your data with no hyperparameter search. |
| TabICL | Tabular In-Context Learning | Similar to TabPFN, but uses in-context meta-learning instead of prior-fitting. |
| TabICL-FT | TabICL Fine-Tuned | TabICL with a small amount of end-to-end fine-tuning on the target dataset. |
| XGB | XGBoost (default) | Gradient booster with 200 trees, max_depth=6, learning_rate=0.1. Uses scale_pos_weight for imbalance. |
| XGB-fast | Shallow XGBoost | A speed-constrained variant: max_depth=3, n_estimators=50. Sacrifices capacity for a 3–7× fit-time speedup. |
| XGB-soft | XGBoost with soft distillation | Trained on soft labels — the teacher’s predicted probability $y_{\text{soft}} = P_{\text{teacher}}(y=1 \mid x)$ — instead of hard ${0,1}$ labels. The student learns from the teacher’s confidence, not just its decisions. |
| CatBoost / CB | CatBoost (default) | Gradient booster with ordered boosting and native categorical handling. |
| CB-fast | Shallow CatBoost | Constrained to depth=4, iterations=100. |
| CB-soft | CatBoost with soft distillation | Same idea as XGB-soft, but CatBoost’s classifier rejects continuous targets, so we train a CatBoostRegressor with RMSE loss and clip predictions to $[0,1]$. |
| LightGBM / LGB | LightGBM (default) | Leaf-wise gradient booster with 200 trees, num_leaves=31. |
| LGB-fast | Shallow LightGBM | num_leaves=7, n_estimators=50. |
| LGB-soft | LightGBM with soft distillation | LGBMRegressor with RMSE on soft targets, same workaround as CB-soft. |
| Teacher-as-feature | Append teacher probs to $X$ | Concatenate the teacher’s predicted probability as an extra column: $X’ = [X ,|, P_{\text{teacher}}(x)]$, then train the GBM on $X’$. This is feature augmentation, not distillation. |
| LogisticMeta_5 | Logistic Regression Stacking | 5 base models (TabPFN, TabICL, XGB, CB, LGB) generate out-of-fold predictions; a logistic regression is trained on those 5 predictions as features. |
| XGBMeta_5 | XGBoost Stacking | Same as LogisticMeta_5, but XGBoost is the meta-learner instead of logistic regression. |
| WeightedAvg | CV-tuned weighted ensemble | A grid search on a small validation split finds weights $w_i$ such that $\hat{y} = \sum_i w_i \hat{y}_i$. The weights change per seed, so the method name is non-reproducible. |
| PFN+ICL ensemble | Fixed-weight blend | Simple convex combination: $\hat{y} = \alpha \cdot \hat{y}{\text{PFN}} + (1-\alpha) \cdot \hat{y}{\text{ICL}}$ with $\alpha=0.8$. |
1. The Hard Truth: ieee-cis
On the real fraud dataset, TabPFN and TabICL collapse. Not by a little — by a lot.
Figure 1: ROC AUC on ieee-cis as training size increases. Error bars show standard deviation across 3 seeds. TabPFN flatlines around 0.65–0.71; gradient boosters rise steadily to 0.87–0.88.
(See the Method Glossary for what each suffix means.)
| N_train | TabPFN | TabICL | XGB | XGB-fast | XGB-soft | CatBoost | CB-fast | LightGBM | LGB-fast | LogisticMeta |
|---|---|---|---|---|---|---|---|---|---|---|
| 1,000 | 0.6528 | 0.5809 | 0.8026 | 0.7984 | 0.7760 | 0.7877 | 0.7815 | 0.7842 | 0.7911 | 0.7583 |
| 2,000 | 0.6755 | 0.6152 | 0.8226 | 0.8218 | 0.8091 | 0.8125 | 0.8103 | 0.8034 | 0.8130 | 0.7991 |
| 5,000 | 0.6473 | 0.6535 | 0.8382 | 0.8425 | 0.8438 | 0.8255 | 0.8360 | 0.8271 | 0.8429 | 0.8295 |
| 10,000 | 0.7067 | 0.7298 | 0.8537 | 0.8598 | 0.8568 | 0.8449 | 0.8515 | 0.8394 | 0.8606 | 0.8556 |
| 20,000 | 0.6924 | 0.7122 | 0.8685 | 0.8702 | 0.8672 | 0.8696 | 0.8677 | 0.8629 | 0.8723 | 0.8760 |
Three patterns jump out:
First, TabPFN does not scale. Its best AUC is 0.7067 at N=10k, then it regresses to 0.6924 at N=20k. Whatever signal it extracts from 10k rows, more data confuses it. TabICL improves monotonically but caps out at 0.7298 — still 12pp behind the best GBM.
Second, the shallow fast models keep up. XGB-fast — the deliberately shallow XGBoost variant (max_depth=3, 50 trees) — is within 0.5pp of XGB-default at every scale. It is not a sacrifice.
Third, soft distillation wins exactly once. XGB-soft — XGBoost trained on TabPFN’s soft probability estimates $y_{\text{soft}} = P_{\text{PFN}}(y=1 \mid x)$ instead of hard ${0,1}$ labels — takes first place at N=5k (0.8438 vs 0.8382 raw XGB). This is the medium-data sweet spot where the GBM benefits from the foundation model’s smoothed pseudo-labels. At N=20k, the gap vanishes — with enough real data, the GBM does not need a teacher.
Why does TabPFN collapse? We can only speculate, but three architectural constraints line up with the failure pattern. First, TabPFN was pretrained on small UCI-style datasets (typically <10k rows, <50 features); ieee-cis has 455 features after minimal preprocessing, far outside the training distribution. Second, the model uses a fixed attention window and positional encodings that may not generalize to the sparse, high-cardinality categorical structure of fraud data (card IDs, device types, address hashes). Third, TabPFN does not handle temporal non-stationarity — Vesta’s TransactionDT is a relative timestamp with strong seasonality that the model has no mechanism to exploit. TabICL improves monotonically because its in-context learning mechanism is less constrained by pretraining scale, though it still caps out well below GBMs.
2. How Big Is “Big”? Effect Sizes
AUC differences are abstract. The raw gap on ieee-cis N=20k is +0.18 AUC (LogisticMeta 0.8760 vs TabPFN 0.6924). To put this in statistical terms, we quantify the gap with Cohen’s $d$, the pooled-standard-deviation-normalized mean difference:
$$ d = \frac{\bar{x}_1 - \bar{x}_2}{s}, \qquad s = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} $$
Here $\bar{x}_1$ and $\bar{x}_2$ are the mean AUCs of the two methods, and $s$ is the pooled standard deviation across the $n=3$ seeds. Below is $d$ for every method vs TabPFN:
Figure 2: Cohen's d vs TabPFN on ieee-cis N=20k for every method. Positive = better than TabPFN. d > 0.8 is "large"; d > 2.0 is "very large". Every GBM-based method sits above +20.
| Method | ΔAUC vs TabPFN | Cohen’s d | Interpretation |
|---|---|---|---|
| LogisticMeta_5 | +0.1836 | +24.52 | Nearly 25 pooled std better |
| XGB_default | +0.1761 | +23.69 | |
| XGB-fast | +0.1778 | +23.51 | |
| LGB-fast | +0.1799 | +22.72 | |
| CatBoost | +0.1772 | +23.20 | |
| XGB-soft | +0.1748 | +21.42 | |
| MLP-raw | +0.1363 | +14.27 | |
| TabICL | +0.0198 | +1.15 | Marginal improvement |
| PFN+ICL ensemble | +0.0096 | +1.14 | |
| MLP-teacher-feature | −0.1937 | −5.48 | Worse than baseline |
| CB-teacher-feature | −0.3495 | −5.22 | |
| XGB-teacher-feature | −0.4552 | −23.68 | Catastrophic failure |
A note on interpreting these numbers. $d$ above +20 looks absurd because it is calculated from $n=3$ runs. With only three samples, the pooled standard deviation is tiny (~0.01), so even modest AUC gaps inflate to extreme $d$ values. The important takeaway is not “$d = 24.52$” — it is that the AUC gap itself (+0.18) is enormous in fraud-detection terms. A +0.18 AUC improvement is the difference between a model that barely beats random guessing and one that strongly separates classes. Every gradient booster — even a deliberately constrained one — outperforms TabPFN by a margin so large it would be statistically significant even with a single sample.
3. The Speed-Accuracy Inversion
The conventional ML tradeoff is “faster = worse.” On fraud data, that wisdom is inverted — but note that the speedup numbers here refer to fit time, not inference throughput. You fit once and predict millions, so predict latency matters more in production. Our predict-time measurements had inconsistent instrumentation across methods, so we focus on fit time as a conservative lower bound.
Figure 3: Speed-accuracy scatter plot across all 22 configs, aggregated by method family. Lower-left is better (faster fit, higher AUC). The foundation models occupy the worst quadrant: slower *and* less accurate.
| Method | AUC (N=20k) | Fit Time | Speedup vs TabPFN |
|---|---|---|---|
| TabPFN | 0.6924 | 0.62s | 1.0× |
| TabICL | 0.7122 | 2.15s | 0.3× (slower) |
| XGB-default | 0.8685 | 0.85s | 0.7× |
| XGB-fast | 0.8702 | 0.17s | 3.7× faster |
| LogisticMeta | 0.8760 | 0.14s | 4.6× faster |
XGB-fast is simultaneously 3.7× faster to fit and 18pp more accurate. The Pareto frontier contains no foundation model. When your toolkit is gradient boosters, there is no tradeoff — there is strict dominance.
4. The Easy Truth: fraud-detection
Not every dataset is ieee-cis. The fraud-detection benchmark (28 features, ~10% fraud) is a much gentler landscape.
Figure 4: ROC AUC on fraud-detection. All methods cluster within 0.04 AUC. The dataset is too small and too easy to separate families.
(See the Method Glossary for suffix definitions.)
| N_train | TabPFN | TabICL | XGB | XGB-fast | XGB-soft | CatBoost | CB-fast | LightGBM | LGB-fast | LogisticMeta |
|---|---|---|---|---|---|---|---|---|---|---|
| 500 | 0.7421 | 0.7525 | 0.7189 | 0.7253 | 0.7226 | 0.7120 | 0.7072 | 0.7084 | 0.7374 | 0.7512 |
| 1,000 | 0.7580 | 0.7598 | 0.7132 | 0.7428 | 0.7495 | 0.6994 | 0.7271 | 0.7124 | 0.7357 | 0.7554 |
| 2,000 | 0.7580 | 0.7698 | 0.7321 | 0.7621 | 0.7646 | 0.7221 | 0.7601 | 0.7241 | 0.7566 | 0.7652 |
| full | 0.7735 | 0.7724 | 0.7234 | 0.7671 | 0.7820 | 0.7325 | 0.7682 | 0.7210 | 0.7707 | 0.7756 |
At N=500, TabICL actually wins. Foundation models have a genuine advantage when data is extremely scarce — they bring inductive bias that shallow GBMs cannot construct from 500 rows. But by N=2k the gap vanishes, and at full size XGB-soft edges ahead by less than a percentage point.
The lesson: Foundation models are useful for data-starved fraud problems. For anything with >2k rows and decent features, a tuned gradient booster matches or beats them.
5. Teacher-as-Feature: A Catastrophic Failure (Sometimes)
One of our hypotheses was that appending PFN/ICL probabilities to the raw feature matrix would act as a powerful engineered feature. On the hardest scale (ieee-cis N=20k), we were wrong. But the failure is not universal — it depends on teacher quality.
(Reminder: “+ teacher” means concatenating the teacher’s predicted probability as an extra column: $X’ = [X ,|, P_{\text{teacher}}(x)]$. See the Glossary for details.)
Figure 5: Teacher-as-feature performance at three training sizes on ieee-cis. Baselines in purple/teal; teacher-augmented variants in brown. At N=1k and N=20k the gap is severe; at N=5k the teacher is strong enough that the feature is merely mediocre.
| Scale | XGB baseline | XGB + teacher | CB baseline | CB + teacher |
|---|---|---|---|---|
| N=1k | 0.8026 | 0.4844 | 0.7877 | 0.4630 |
| N=5k | 0.8382 | 0.2946 | 0.8255 | 0.3966 |
| N=20k | 0.8685 | 0.2372 | 0.8696 | 0.3429 |
At N=1k and N=20k, the collapse is catastrophic. The teacher probabilities are biased, low-quality features when the teacher itself is weak (TabPFN 0.65–0.69 AUC). When concatenated to the raw feature matrix, they are highly correlated with the label — so the GBM greedily splits on them — but they generalize poorly because they encode the teacher’s specific errors. The student overfits to the teacher’s mistakes.
At N=5k, however, the story changes. The supplementary results (see Section 8) show CB_teacher_as_feature reaching 0.8431 AUC — only 0.005 behind XGB-default. Why? Because the teacher is stronger at this scale (TabPFN 0.8539, TabICL 0.8593). When the teacher is decent, its predictions are less poisonous as features.
6. The Metrics That Actually Matter
In production fraud systems, you do not optimize AUC. You optimize recall at a fixed false-positive budget — typically 1% or 5% FPR. At these operating points the differences are even more stark than AUC suggests.
Figure 6: Recall @ 1% FPR vs ROC AUC across all 1,188 runs. The relationship is sublinear (r ≈ 0.82). A +0.05 AUC gain does not guarantee a +0.05 recall gain. LogisticMeta punches above its AUC weight.
On ieee-cis N=20k:
| Method | ROC AUC | Recall @ 1% FPR | Fraud recovered at 1% FPR |
|---|---|---|---|
| TabPFN | 0.6924 | 0.125 | 12.5% |
| TabICL | 0.7122 | 0.132 | 13.2% |
| XGB-fast | 0.8702 | 0.363 | 36.3% |
| LogisticMeta | 0.8760 | 0.439 | 43.9% |
LogisticMeta improves recall from 12.5% to 43.9% at the same 1% false-positive budget — a 21.4 percentage point lift. Framed as a ratio, that is 3.5× as much fraud caught at that operating point. But the absolute picture matters more: moving from catching 1 in 8 fraudsters to catching nearly 1 in 2 is the difference between a production system that is usable and one that is not.
7. Global Ranking
Mean AUC across all 22 configs (including internet-ads PCA variants where everything hits 0.95+):
Figure 7: Mean ROC AUC across all 22 configurations. Error bars show standard deviation. The top 6 methods are all gradient boosters or ensembles thereof. TabPFN and TabICL sit in the bottom half once hard fraud data is mixed in.
| Rank | Method Family | Mean AUC | Std |
|---|---|---|---|
| 1 | XGB_soft_distill | 0.8081 | 0.114 |
| 2 | Stacking (LogisticMeta) | 0.8066 | 0.122 |
| 3 | LGB_fast | 0.8008 | 0.124 |
| 4 | XGB_fast | 0.7984 | 0.127 |
| 5 | CB_fast | 0.7977 | 0.125 |
| 6 | CatBoost | 0.7887 | 0.132 |
| 7 | LightGBM | 0.7868 | 0.133 |
| 8 | XGB (default + meta) | 0.7774 | 0.141 |
| 9 | TabPFN | 0.7701 | 0.132 |
| 10 | TabICL | 0.7681 | 0.138 |
The pattern is unambiguous: gradient boosters occupy the top tier. Foundation models are competitive only when the dataset is easy (internet-ads, click-small) or tiny (fraud-detection N=500). Throw in a hard fraud dataset and they drop to the bottom half.
A note on the missing contender. You may notice WeightedAvg is absent from both the ranking table and the dominance chart. It wins the most configs outright (10/22) because CV-tuned weights overfit to the tiny validation set — the “best” weights change names between seeds, producing a different method label every run. We exclude it because it is non-reproducible: you cannot deploy a model whose weights depend on a single random split.
Figure 8: Number of configurations (out of 22) where each method family achieves the highest AUC. LGB_fast and XGB_default each win 3 configs when excluding the overfitting WeightedAvg family.
8. What We Actually Learned From The Supplementary Experiment
After locking the initial sweep, we realized we had no apples-to-apples comparison for soft distillation across booster families. We patched the benchmark to add CB_soft_distill_pfn, CB_soft_distill_icl, LGB_soft_distill_pfn, and LGB_soft_distill_icl.
This was harder than expected. Both CatBoostClassifier and LGBMClassifier crash on continuous soft labels with opaque errors about “Target with classes must contain only 2 unique values” and “Unknown label type: continuous.” The fix is to switch to regressors (CatBoostRegressor, LGBMRegressor with RMSE loss) and manually clip predictions back to $[0, 1]$. In practice, CB-soft and LGB-soft minimize
$$ \mathcal{L} = \frac{1}{N}\sum_{i=1}^{N} \bigl(P_{\text{teacher}}(y_i=1 \mid x_i) - \hat{y}_i^{\text{student}}\bigr)^2 $$
with $\hat{y}_i$ clipped to $[0,1]$ at inference. Soft distillation is therefore not a generic “train any classifier on pseudo-labels” technique — the API ergonomics vary wildly by library.
Results from the corrected sweep are rolling in now. On ieee-cis N=5k (the medium-scale sweet spot where soft distillation previously won):
| Method | AUC | Δ vs XGB_default | Δ vs XGB_soft_pfn |
|---|---|---|---|
| LogisticMeta_5 | 0.8535 | +0.0153 | +0.0097 |
| CB_soft_distill_icl | 0.8451 | +0.0069 | +0.0013 |
| CB_soft_distill_pfn | 0.8441 | +0.0059 | +0.0003 |
| XGB_soft_distill_pfn | 0.8438 | +0.0056 | — |
| LGB_fast | 0.8429 | +0.0047 | −0.0009 |
| XGB_fast | 0.8425 | +0.0043 | −0.0013 |
| LGB_soft_distill_pfn | 0.8297 | −0.0085 | −0.0141 |
| LGB_soft_distill_icl | 0.8250 | −0.0132 | −0.0188 |
Observations:
CatBoost soft distillation works. CB_soft_distill_icl edges out XGB_soft_distill_pfn by 0.0013 AUC — basically a tie. CatBoost’s ordered boosting seems to handle soft targets about as well as XGB’s gradient boosting.
LightGBM soft distillation is weak at N=5k. Both LGB soft variants underperform raw XGB_default on this single data point. Whether this is a general property of leaf-wise trees on noisy continuous targets, or merely a hyperparameter sensitivity (our LGB regressor used max_depth=6, same as the classifier), awaits the full sweep. We suspect leaf-wise growth may be less stable on soft targets than XGB/CB’s level-wise approach, but this is a hypothesis, not a conclusion.
LogisticMeta still wins. Even with the new contenders, stacking five base models beats every single-model distillation approach. The ensemble effect dominates the distillation effect.
Teacher-as-feature is scale-dependent. At N=5k, CB_teacher_as_feature achieves 0.8431 AUC — not a catastrophe. Why? Because the teacher itself is stronger at this scale (TabPFN 0.8539, TabICL 0.8593). When the teacher is decent, its predictions are less poisonous as features. At N=20k, where TabPFN collapses to 0.6924, the same technique drops to 0.3429.
The full supplementary sweep (ieee-cis 1k/20k, fraud-detection 500/2k/full) is still running. We will update this table when complete.
9. Conclusions & Recommendations
The central finding is not that soft distillation is magic. It is that foundation models fail catastrophically on high-dimensional, heavily-engineered fraud data, and fast gradient boosters are the correct tool. Soft distillation is a useful but non-essential refinement — the real win comes from choosing the right model family for the data distribution.
If you are building a fraud detection pipeline today, the evidence says: start with a shallow XGBoost, add a CatBoost for diversity, stack them with logistic regression if you have the latency budget, and skip the transformer unless your dataset is tiny or your features are pristine.
10. Raw Data & Reproducibility
- All 22 JSON result files: v4_raw_results.zip
- Benchmark script:
fraud_benchmark_v4.py(airig:~/tabpfn-playground/) - Analysis scripts:
v4_generate_plots_v2.py,v4_deep_analysis.py— available on request - Hardware: All times are wall-clock on RTX 5090. CPU-only TabPFN/TabICL fits will be slower.
- Random seeds: 42, 43, 44. All reported means ± std are across these three seeds.







