Machine: airig — AMD Ryzen 9 9900X, NVIDIA RTX 5090 FE, 64 GB RAM, Debian trixie
Software: Python 3.13.5, torch 2.12+cu130, tabpfn 8.0.3, tabicl 2.1.1, xgboost 3.0.3, catboost 1.2.8, lightgbm 4.6.0
Code & Data: v4_raw_results.zip (22 JSON files, 1,188 individual run records)


The Pitch and The Problem

Tabular foundation models promise a seductive value proposition: no hyperparameter tuning, no feature engineering, competitive accuracy out of the box. For data scientists drowning in spreadsheets, this sounds like a dream.

But fraud detection is not a spreadsheet problem. It is an adversarial, high-dimensional, heavily-engineered domain where Vesta’s anonymized feature interactions, device fingerprints, and transaction timestamps create a signal landscape that looks nothing like the clean UCI datasets these models were trained on.

So we ran an experiment. Not on iris or wine quality — on ieee-cis, the Kaggle fraud competition with 590k rows, 455 features, and a 3.5% fraud rate. And on fraud-detection from the Amazon FDB suite. And on four other datasets spanning the easy-to-hard spectrum.

We tested 52 method variants across 22 configurations with 3 random seeds each. The question was simple: when the data gets hard, do foundation models still win?

They do not.


What We Tested

CategoryMethods
Foundation modelsTabPFN, TabICL, TabICL-FT
Gradient boostersXGBoost, CatBoost, LightGBM (default + fast variants)
Soft distillationTrain GBMs on PFN/ICL probability outputs as soft labels
Teacher-as-featureAppend PFN/ICL probabilities as extra input features
StackingLogisticMeta (5-base), XGBMeta (5-base)
EnsemblesCV-tuned weighted averages, PFN+ICL fixed-α ensemble
Neural netsMLP on raw features, MLP on teacher-augmented features

The experiment design:

  • ieee-cis: 1k / 2k / 5k / 10k / 20k training rows
  • fraud-detection: 500 / 1k / 2k / full training rows
  • fake-job, click-small: 1k / 2k / full
  • internet-ads: PCA 50/100/200 and SelectK 50/100/200

All metrics: ROC AUC, Average Precision, Recall@1%FPR, Recall@5%FPR, fit time, predict time.

Preprocessing. No domain-specific engineering was added. Numerical columns were median-imputed and z-score standardized. Categoricals were cast to string, filled with "missing", and factorized with a train+test union mapping. Timestamps and card identifiers were kept as-is. This is intentionally minimal — we wanted to test model families, not feature engineering pipelines. GBMs used their library defaults: 200 trees, depth 6, learning rate 0.1, with scale_pos_weight or class_weight="balanced" for imbalance. No hyperparameter search was performed for any method.

One note on fairness. TabPFN and TabICL are zero-shot: no training hyperparameters, no grid search. Our GBMs also use fixed defaults — but those defaults were chosen by XGBoost/CatBoost/LightGBM authors over years of tuning. If you ran a Bayesian optimization sweep on the GBMs, the gap would likely widen further. We did not, to keep the comparison conservative.


Method Glossary

We use short suffixes everywhere to keep tables readable. Here is what each one means in plain English.

Short nameFull meaningWhat it actually does
TabPFNPrior-Fitted NetworkTransformer pretrained on synthetic tabular tasks; zero-shot inference on your data with no hyperparameter search.
TabICLTabular In-Context LearningSimilar to TabPFN, but uses in-context meta-learning instead of prior-fitting.
TabICL-FTTabICL Fine-TunedTabICL with a small amount of end-to-end fine-tuning on the target dataset.
XGBXGBoost (default)Gradient booster with 200 trees, max_depth=6, learning_rate=0.1. Uses scale_pos_weight for imbalance.
XGB-fastShallow XGBoostA speed-constrained variant: max_depth=3, n_estimators=50. Sacrifices capacity for a 3–7× fit-time speedup.
XGB-softXGBoost with soft distillationTrained on soft labels — the teacher’s predicted probability $y_{\text{soft}} = P_{\text{teacher}}(y=1 \mid x)$ — instead of hard ${0,1}$ labels. The student learns from the teacher’s confidence, not just its decisions.
CatBoost / CBCatBoost (default)Gradient booster with ordered boosting and native categorical handling.
CB-fastShallow CatBoostConstrained to depth=4, iterations=100.
CB-softCatBoost with soft distillationSame idea as XGB-soft, but CatBoost’s classifier rejects continuous targets, so we train a CatBoostRegressor with RMSE loss and clip predictions to $[0,1]$.
LightGBM / LGBLightGBM (default)Leaf-wise gradient booster with 200 trees, num_leaves=31.
LGB-fastShallow LightGBMnum_leaves=7, n_estimators=50.
LGB-softLightGBM with soft distillationLGBMRegressor with RMSE on soft targets, same workaround as CB-soft.
Teacher-as-featureAppend teacher probs to $X$Concatenate the teacher’s predicted probability as an extra column: $X’ = [X ,|, P_{\text{teacher}}(x)]$, then train the GBM on $X’$. This is feature augmentation, not distillation.
LogisticMeta_5Logistic Regression Stacking5 base models (TabPFN, TabICL, XGB, CB, LGB) generate out-of-fold predictions; a logistic regression is trained on those 5 predictions as features.
XGBMeta_5XGBoost StackingSame as LogisticMeta_5, but XGBoost is the meta-learner instead of logistic regression.
WeightedAvgCV-tuned weighted ensembleA grid search on a small validation split finds weights $w_i$ such that $\hat{y} = \sum_i w_i \hat{y}_i$. The weights change per seed, so the method name is non-reproducible.
PFN+ICL ensembleFixed-weight blendSimple convex combination: $\hat{y} = \alpha \cdot \hat{y}{\text{PFN}} + (1-\alpha) \cdot \hat{y}{\text{ICL}}$ with $\alpha=0.8$.

1. The Hard Truth: ieee-cis

On the real fraud dataset, TabPFN and TabICL collapse. Not by a little — by a lot.

ROC AUC vs training size on ieee-cis ROC AUC vs training size on ieee-cis

Figure 1: ROC AUC on ieee-cis as training size increases. Error bars show standard deviation across 3 seeds. TabPFN flatlines around 0.65–0.71; gradient boosters rise steadily to 0.87–0.88.

(See the Method Glossary for what each suffix means.)

N_trainTabPFNTabICLXGBXGB-fastXGB-softCatBoostCB-fastLightGBMLGB-fastLogisticMeta
1,0000.65280.58090.80260.79840.77600.78770.78150.78420.79110.7583
2,0000.67550.61520.82260.82180.80910.81250.81030.80340.81300.7991
5,0000.64730.65350.83820.84250.84380.82550.83600.82710.84290.8295
10,0000.70670.72980.85370.85980.85680.84490.85150.83940.86060.8556
20,0000.69240.71220.86850.87020.86720.86960.86770.86290.87230.8760

Three patterns jump out:

First, TabPFN does not scale. Its best AUC is 0.7067 at N=10k, then it regresses to 0.6924 at N=20k. Whatever signal it extracts from 10k rows, more data confuses it. TabICL improves monotonically but caps out at 0.7298 — still 12pp behind the best GBM.

Second, the shallow fast models keep up. XGB-fast — the deliberately shallow XGBoost variant (max_depth=3, 50 trees) — is within 0.5pp of XGB-default at every scale. It is not a sacrifice.

Third, soft distillation wins exactly once. XGB-soft — XGBoost trained on TabPFN’s soft probability estimates $y_{\text{soft}} = P_{\text{PFN}}(y=1 \mid x)$ instead of hard ${0,1}$ labels — takes first place at N=5k (0.8438 vs 0.8382 raw XGB). This is the medium-data sweet spot where the GBM benefits from the foundation model’s smoothed pseudo-labels. At N=20k, the gap vanishes — with enough real data, the GBM does not need a teacher.

Why does TabPFN collapse? We can only speculate, but three architectural constraints line up with the failure pattern. First, TabPFN was pretrained on small UCI-style datasets (typically <10k rows, <50 features); ieee-cis has 455 features after minimal preprocessing, far outside the training distribution. Second, the model uses a fixed attention window and positional encodings that may not generalize to the sparse, high-cardinality categorical structure of fraud data (card IDs, device types, address hashes). Third, TabPFN does not handle temporal non-stationarity — Vesta’s TransactionDT is a relative timestamp with strong seasonality that the model has no mechanism to exploit. TabICL improves monotonically because its in-context learning mechanism is less constrained by pretraining scale, though it still caps out well below GBMs.


2. How Big Is “Big”? Effect Sizes

AUC differences are abstract. The raw gap on ieee-cis N=20k is +0.18 AUC (LogisticMeta 0.8760 vs TabPFN 0.6924). To put this in statistical terms, we quantify the gap with Cohen’s $d$, the pooled-standard-deviation-normalized mean difference:

$$ d = \frac{\bar{x}_1 - \bar{x}_2}{s}, \qquad s = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} $$

Here $\bar{x}_1$ and $\bar{x}_2$ are the mean AUCs of the two methods, and $s$ is the pooled standard deviation across the $n=3$ seeds. Below is $d$ for every method vs TabPFN:

Effect size waterfall for ieee-cis N=20k Effect size waterfall for ieee-cis N=20k

Figure 2: Cohen's d vs TabPFN on ieee-cis N=20k for every method. Positive = better than TabPFN. d > 0.8 is "large"; d > 2.0 is "very large". Every GBM-based method sits above +20.

MethodΔAUC vs TabPFNCohen’s dInterpretation
LogisticMeta_5+0.1836+24.52Nearly 25 pooled std better
XGB_default+0.1761+23.69
XGB-fast+0.1778+23.51
LGB-fast+0.1799+22.72
CatBoost+0.1772+23.20
XGB-soft+0.1748+21.42
MLP-raw+0.1363+14.27
TabICL+0.0198+1.15Marginal improvement
PFN+ICL ensemble+0.0096+1.14
MLP-teacher-feature−0.1937−5.48Worse than baseline
CB-teacher-feature−0.3495−5.22
XGB-teacher-feature−0.4552−23.68Catastrophic failure

A note on interpreting these numbers. $d$ above +20 looks absurd because it is calculated from $n=3$ runs. With only three samples, the pooled standard deviation is tiny (~0.01), so even modest AUC gaps inflate to extreme $d$ values. The important takeaway is not “$d = 24.52$” — it is that the AUC gap itself (+0.18) is enormous in fraud-detection terms. A +0.18 AUC improvement is the difference between a model that barely beats random guessing and one that strongly separates classes. Every gradient booster — even a deliberately constrained one — outperforms TabPFN by a margin so large it would be statistically significant even with a single sample.


3. The Speed-Accuracy Inversion

The conventional ML tradeoff is “faster = worse.” On fraud data, that wisdom is inverted — but note that the speedup numbers here refer to fit time, not inference throughput. You fit once and predict millions, so predict latency matters more in production. Our predict-time measurements had inconsistent instrumentation across methods, so we focus on fit time as a conservative lower bound.

Speed-accuracy scatter across all configs Speed-accuracy scatter across all configs

Figure 3: Speed-accuracy scatter plot across all 22 configs, aggregated by method family. Lower-left is better (faster fit, higher AUC). The foundation models occupy the worst quadrant: slower *and* less accurate.

MethodAUC (N=20k)Fit TimeSpeedup vs TabPFN
TabPFN0.69240.62s1.0×
TabICL0.71222.15s0.3× (slower)
XGB-default0.86850.85s0.7×
XGB-fast0.87020.17s3.7× faster
LogisticMeta0.87600.14s4.6× faster

XGB-fast is simultaneously 3.7× faster to fit and 18pp more accurate. The Pareto frontier contains no foundation model. When your toolkit is gradient boosters, there is no tradeoff — there is strict dominance.


4. The Easy Truth: fraud-detection

Not every dataset is ieee-cis. The fraud-detection benchmark (28 features, ~10% fraud) is a much gentler landscape.

ROC AUC on fraud-detection ROC AUC on fraud-detection

Figure 4: ROC AUC on fraud-detection. All methods cluster within 0.04 AUC. The dataset is too small and too easy to separate families.

(See the Method Glossary for suffix definitions.)

N_trainTabPFNTabICLXGBXGB-fastXGB-softCatBoostCB-fastLightGBMLGB-fastLogisticMeta
5000.74210.75250.71890.72530.72260.71200.70720.70840.73740.7512
1,0000.75800.75980.71320.74280.74950.69940.72710.71240.73570.7554
2,0000.75800.76980.73210.76210.76460.72210.76010.72410.75660.7652
full0.77350.77240.72340.76710.78200.73250.76820.72100.77070.7756

At N=500, TabICL actually wins. Foundation models have a genuine advantage when data is extremely scarce — they bring inductive bias that shallow GBMs cannot construct from 500 rows. But by N=2k the gap vanishes, and at full size XGB-soft edges ahead by less than a percentage point.

The lesson: Foundation models are useful for data-starved fraud problems. For anything with >2k rows and decent features, a tuned gradient booster matches or beats them.


5. Teacher-as-Feature: A Catastrophic Failure (Sometimes)

One of our hypotheses was that appending PFN/ICL probabilities to the raw feature matrix would act as a powerful engineered feature. On the hardest scale (ieee-cis N=20k), we were wrong. But the failure is not universal — it depends on teacher quality.

(Reminder: “+ teacher” means concatenating the teacher’s predicted probability as an extra column: $X’ = [X ,|, P_{\text{teacher}}(x)]$. See the Glossary for details.)

Teacher-as-feature breakdown across scales Teacher-as-feature breakdown across scales

Figure 5: Teacher-as-feature performance at three training sizes on ieee-cis. Baselines in purple/teal; teacher-augmented variants in brown. At N=1k and N=20k the gap is severe; at N=5k the teacher is strong enough that the feature is merely mediocre.

ScaleXGB baselineXGB + teacherCB baselineCB + teacher
N=1k0.80260.48440.78770.4630
N=5k0.83820.29460.82550.3966
N=20k0.86850.23720.86960.3429

At N=1k and N=20k, the collapse is catastrophic. The teacher probabilities are biased, low-quality features when the teacher itself is weak (TabPFN 0.65–0.69 AUC). When concatenated to the raw feature matrix, they are highly correlated with the label — so the GBM greedily splits on them — but they generalize poorly because they encode the teacher’s specific errors. The student overfits to the teacher’s mistakes.

At N=5k, however, the story changes. The supplementary results (see Section 8) show CB_teacher_as_feature reaching 0.8431 AUC — only 0.005 behind XGB-default. Why? Because the teacher is stronger at this scale (TabPFN 0.8539, TabICL 0.8593). When the teacher is decent, its predictions are less poisonous as features.

Lesson: Teacher-as-feature is only safe when the teacher is already good. On hard fraud data where TabPFN/TabICL underperform, appending their predictions as features is destructive. A 0.69-AUC teacher poisons a 0.87-AUC student.

6. The Metrics That Actually Matter

In production fraud systems, you do not optimize AUC. You optimize recall at a fixed false-positive budget — typically 1% or 5% FPR. At these operating points the differences are even more stark than AUC suggests.

Recall@1%FPR vs ROC AUC across all runs Recall@1%FPR vs ROC AUC across all runs

Figure 6: Recall @ 1% FPR vs ROC AUC across all 1,188 runs. The relationship is sublinear (r ≈ 0.82). A +0.05 AUC gain does not guarantee a +0.05 recall gain. LogisticMeta punches above its AUC weight.

On ieee-cis N=20k:

MethodROC AUCRecall @ 1% FPRFraud recovered at 1% FPR
TabPFN0.69240.12512.5%
TabICL0.71220.13213.2%
XGB-fast0.87020.36336.3%
LogisticMeta0.87600.43943.9%

LogisticMeta improves recall from 12.5% to 43.9% at the same 1% false-positive budget — a 21.4 percentage point lift. Framed as a ratio, that is 3.5× as much fraud caught at that operating point. But the absolute picture matters more: moving from catching 1 in 8 fraudsters to catching nearly 1 in 2 is the difference between a production system that is usable and one that is not.


7. Global Ranking

Mean AUC across all 22 configs (including internet-ads PCA variants where everything hits 0.95+):

Global method ranking by mean AUC Global method ranking by mean AUC

Figure 7: Mean ROC AUC across all 22 configurations. Error bars show standard deviation. The top 6 methods are all gradient boosters or ensembles thereof. TabPFN and TabICL sit in the bottom half once hard fraud data is mixed in.

RankMethod FamilyMean AUCStd
1XGB_soft_distill0.80810.114
2Stacking (LogisticMeta)0.80660.122
3LGB_fast0.80080.124
4XGB_fast0.79840.127
5CB_fast0.79770.125
6CatBoost0.78870.132
7LightGBM0.78680.133
8XGB (default + meta)0.77740.141
9TabPFN0.77010.132
10TabICL0.76810.138

The pattern is unambiguous: gradient boosters occupy the top tier. Foundation models are competitive only when the dataset is easy (internet-ads, click-small) or tiny (fraud-detection N=500). Throw in a hard fraud dataset and they drop to the bottom half.

A note on the missing contender. You may notice WeightedAvg is absent from both the ranking table and the dominance chart. It wins the most configs outright (10/22) because CV-tuned weights overfit to the tiny validation set — the “best” weights change names between seeds, producing a different method label every run. We exclude it because it is non-reproducible: you cannot deploy a model whose weights depend on a single random split.

Method dominance count Method dominance count

Figure 8: Number of configurations (out of 22) where each method family achieves the highest AUC. LGB_fast and XGB_default each win 3 configs when excluding the overfitting WeightedAvg family.


8. What We Actually Learned From The Supplementary Experiment

After locking the initial sweep, we realized we had no apples-to-apples comparison for soft distillation across booster families. We patched the benchmark to add CB_soft_distill_pfn, CB_soft_distill_icl, LGB_soft_distill_pfn, and LGB_soft_distill_icl.

This was harder than expected. Both CatBoostClassifier and LGBMClassifier crash on continuous soft labels with opaque errors about “Target with classes must contain only 2 unique values” and “Unknown label type: continuous.” The fix is to switch to regressors (CatBoostRegressor, LGBMRegressor with RMSE loss) and manually clip predictions back to $[0, 1]$. In practice, CB-soft and LGB-soft minimize

$$ \mathcal{L} = \frac{1}{N}\sum_{i=1}^{N} \bigl(P_{\text{teacher}}(y_i=1 \mid x_i) - \hat{y}_i^{\text{student}}\bigr)^2 $$

with $\hat{y}_i$ clipped to $[0,1]$ at inference. Soft distillation is therefore not a generic “train any classifier on pseudo-labels” technique — the API ergonomics vary wildly by library.

Results from the corrected sweep are rolling in now. On ieee-cis N=5k (the medium-scale sweet spot where soft distillation previously won):

MethodAUCΔ vs XGB_defaultΔ vs XGB_soft_pfn
LogisticMeta_50.8535+0.0153+0.0097
CB_soft_distill_icl0.8451+0.0069+0.0013
CB_soft_distill_pfn0.8441+0.0059+0.0003
XGB_soft_distill_pfn0.8438+0.0056
LGB_fast0.8429+0.0047−0.0009
XGB_fast0.8425+0.0043−0.0013
LGB_soft_distill_pfn0.8297−0.0085−0.0141
LGB_soft_distill_icl0.8250−0.0132−0.0188

Observations:

  1. CatBoost soft distillation works. CB_soft_distill_icl edges out XGB_soft_distill_pfn by 0.0013 AUC — basically a tie. CatBoost’s ordered boosting seems to handle soft targets about as well as XGB’s gradient boosting.

  2. LightGBM soft distillation is weak at N=5k. Both LGB soft variants underperform raw XGB_default on this single data point. Whether this is a general property of leaf-wise trees on noisy continuous targets, or merely a hyperparameter sensitivity (our LGB regressor used max_depth=6, same as the classifier), awaits the full sweep. We suspect leaf-wise growth may be less stable on soft targets than XGB/CB’s level-wise approach, but this is a hypothesis, not a conclusion.

  3. LogisticMeta still wins. Even with the new contenders, stacking five base models beats every single-model distillation approach. The ensemble effect dominates the distillation effect.

  4. Teacher-as-feature is scale-dependent. At N=5k, CB_teacher_as_feature achieves 0.8431 AUC — not a catastrophe. Why? Because the teacher itself is stronger at this scale (TabPFN 0.8539, TabICL 0.8593). When the teacher is decent, its predictions are less poisonous as features. At N=20k, where TabPFN collapses to 0.6924, the same technique drops to 0.3429.

The full supplementary sweep (ieee-cis 1k/20k, fraud-detection 500/2k/full) is still running. We will update this table when complete.


9. Conclusions & Recommendations

Max accuracy on hard fraud: Use LogisticMeta_5 or XGB_default. Evidence: 0.876 AUC, 0.439 R@1% on ieee-cis 20k.
Best speed/accuracy tradeoff: Use XGB-fast (depth 3, n_estimators 50). Evidence: 0.870 AUC, 0.17s fit = 3.7× faster than TabPFN with higher accuracy.
Soft distillation worth it? Only at medium data (N ~ 5k). At scale, raw data beats soft labels. Evidence: XGB-soft wins at N=5k, ties at N=20k.
Teacher-as-feature? Never on hard data where the teacher is weak. Evidence: 0.23 AUC on ieee-cis vs 0.87 baseline.
TabPFN/TabICL for fraud? Use only if N < 1k and inference cost is irrelevant. Evidence: Wins at N=500 on fraud-detection; collapses above 2k.

The central finding is not that soft distillation is magic. It is that foundation models fail catastrophically on high-dimensional, heavily-engineered fraud data, and fast gradient boosters are the correct tool. Soft distillation is a useful but non-essential refinement — the real win comes from choosing the right model family for the data distribution.

If you are building a fraud detection pipeline today, the evidence says: start with a shallow XGBoost, add a CatBoost for diversity, stack them with logistic regression if you have the latency budget, and skip the transformer unless your dataset is tiny or your features are pristine.


10. Raw Data & Reproducibility

  • All 22 JSON result files: v4_raw_results.zip
  • Benchmark script: fraud_benchmark_v4.py (airig: ~/tabpfn-playground/)
  • Analysis scripts: v4_generate_plots_v2.py, v4_deep_analysis.py — available on request
  • Hardware: All times are wall-clock on RTX 5090. CPU-only TabPFN/TabICL fits will be slower.
  • Random seeds: 42, 43, 44. All reported means ± std are across these three seeds.