Machine: airig — AMD Ryzen 9 9900X, NVIDIA RTX 5090 FE, 64 GB RAM, Debian trixie
Software: Python 3.13.5, torch 2.12+cu130, tabpfn 8.0.3, tabicl 2.1.1, xgboost 3.0.3, catboost 1.2.8, lightgbm 4.6.0
Code & Data: v4_raw_results.zip — 22 JSON files, 1,188 individual run records


The Pitch and The Problem

Tabular foundation models sell a tempting promise. You get competitive accuracy with zero hyperparameter tuning and zero feature engineering. That sounds great if you stare at spreadsheets all day.

Fraud detection is not a spreadsheet problem. It is adversarial, high-dimensional, and heavily engineered. Vesta’s anonymized feature interactions, device fingerprints, and transaction timestamps create a signal landscape that looks nothing like clean UCI datasets.

I ran an experiment on real fraud data. The main dataset is ieee-cis, a Kaggle competition with 590k rows, 455 features, and a 3.5% fraud rate. I also tested fraud-detection from the Amazon FDB suite and four other datasets.

I tested 52 method variants across 22 configurations. Each configuration used 3 random seeds. The question: when data gets hard, do foundation models still win?

They do not.


What I Tested

I tested foundation models, gradient boosters, and several hybrid techniques. TabPFN is a transformer pretrained on synthetic tabular tasks. TabICL uses in-context meta-learning. TabICL-FT adds end-to-end fine-tuning.

I tested three gradient boosting libraries. XGBoost, CatBoost, and LightGBM each ran with default hyperparameters. I also tested fast variants with shallower trees and fewer estimators.

Soft distillation trains a gradient booster on soft probability outputs from a teacher model. Teacher-as-feature appends those probabilities as extra input columns. Both techniques aim to transfer knowledge from a foundation model to a booster.

I tested two stacking ensembles. LogisticMeta_5 trains logistic regression on out-of-fold predictions from five base models. XGBMeta_5 replaces the meta-learner with XGBoost.

I also tested CV-tuned weighted averages and a fixed-weight PFN+ICL blend. Two MLP baselines rounded out the suite: one on raw features and one on teacher-augmented features.

| Category | Methods |
|---|---|
| Foundation models | TabPFN, TabICL, TabICL-FT |
| Gradient boosters | XGBoost, CatBoost, LightGBM (default + fast variants) |
| Soft distillation | Train GBMs on PFN/ICL probability outputs as soft labels |
| Teacher-as-feature | Append PFN/ICL probabilities as extra input features |
| Stacking | LogisticMeta (5-base), XGBMeta (5-base) |
| Ensembles | CV-tuned weighted averages, PFN+ICL fixed-α ensemble |
| Neural nets | MLP on raw features, MLP on teacher-augmented features |

The experiment design:

  • ieee-cis: 1k / 2k / 5k / 10k / 20k training rows
  • fraud-detection: 500 / 1k / 2k / full training rows
  • fake-job, click-small: 1k / 2k / full
  • internet-ads: PCA 50/100/200 and SelectK 50/100/200

I measured ROC AUC, Average Precision, Recall@1%FPR, Recall@5%FPR, fit time, and predict time.

Preprocessing. I applied minimal preprocessing intentionally. I median-imputed numerical columns and z-score standardized them. I cast categoricals to string, filled missing values with "missing", and factorized them with a train-plus-test union mapping. I kept timestamps and card identifiers as-is.
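The preprocessing steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the benchmark's actual code: the function names are hypothetical, `None` stands in for a missing value, and fitting the imputation and scaling statistics on the training rows only is an assumption the text does not state.

```python
from statistics import mean, median, pstdev

def fit_numeric(train_col):
    """Learn median, mean and std from one training column (None = missing)."""
    observed = [v for v in train_col if v is not None]
    med = median(observed)
    filled = [med if v is None else v for v in train_col]
    sigma = pstdev(filled) or 1.0  # guard against constant columns
    return med, mean(filled), sigma

def apply_numeric(col, med, mu, sigma):
    """Median-impute, then z-score standardize with the fitted statistics."""
    return [((med if v is None else v) - mu) / sigma for v in col]

def factorize_union(train_col, test_col, fill="missing"):
    """Cast to str, fill gaps, and map categories to ints over train plus test."""
    tr = [fill if v is None else str(v) for v in train_col]
    te = [fill if v is None else str(v) for v in test_col]
    mapping = {cat: i for i, cat in enumerate(sorted(set(tr) | set(te)))}
    return [mapping[v] for v in tr], [mapping[v] for v in te], mapping

# Toy columns, made up for illustration.
amount_train = [10.0, None, 30.0, 20.0]
med, mu, sigma = fit_numeric(amount_train)
amount_z = apply_numeric(amount_train, med, mu, sigma)

card_train_codes, card_test_codes, card_map = factorize_union(
    ["visa", None, "amex"], ["discover", "visa"]
)
```

The union mapping guarantees that a category seen only in the test rows still receives a code, at the cost of letting the encoder see test-set category values.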

I wanted to test model families, not feature engineering pipelines. GBMs used one fixed configuration: 200 trees, depth 6, learning rate 0.1, with scale_pos_weight or class_weight="balanced" for imbalance. These are conventional untuned settings rather than each library's literal shipped defaults. XGBoost, for example, ships with a learning rate of 0.3. I performed no hyperparameter search for any method.

One note on fairness. TabPFN and TabICL are zero-shot. They need no training hyperparameters and no grid search. My GBMs also use fixed settings with no search. If you ran a Bayesian optimization sweep on the GBMs, the gap would likely widen further. I skipped that to keep the comparison conservative.
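For concreteness, the fixed settings can be written down as plain parameter dictionaries. This is a sketch under assumptions: the text does not say which library received scale_pos_weight and which received class_weight="balanced", so the assignment below is a guess, and the 20k-row class counts are derived from the quoted 3.5% fraud rate rather than read from the data. The XGB-fast values (depth 3, 50 trees) are the ones given in the ieee-cis section.

```python
def scale_pos_weight(n_negative, n_positive):
    """Negatives per positive, the usual value for XGBoost's scale_pos_weight."""
    return n_negative / n_positive

# Settings stated in the text: 200 trees, depth 6, learning rate 0.1.
XGB_DEFAULT = {"n_estimators": 200, "max_depth": 6, "learning_rate": 0.1}

# XGB-fast: depth 3 and 50 trees. Keeping the learning rate is an assumption.
XGB_FAST = {**XGB_DEFAULT, "n_estimators": 50, "max_depth": 3}

# Assumed imbalance handling per library (not confirmed by the text).
LGB_DEFAULT = {"n_estimators": 200, "max_depth": 6, "learning_rate": 0.1,
               "class_weight": "balanced"}
CB_DEFAULT = {"iterations": 200, "depth": 6, "learning_rate": 0.1,
              "auto_class_weights": "Balanced"}

# Hypothetical 20k-row subsample at the 3.5% fraud rate quoted for ieee-cis.
n_pos = round(20_000 * 0.035)
n_neg = 20_000 - n_pos
XGB_WEIGHTED = {**XGB_DEFAULT, "scale_pos_weight": scale_pos_weight(n_neg, n_pos)}
```

At a 3.5% positive rate the weight comes out near 27.6, which means each fraud row counts roughly as much as 28 legitimate rows in the loss.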


Method Glossary

| Short name | What it does |
|---|---|
| TabPFN | Transformer pretrained on synthetic tabular tasks; produces zero-shot predictions without hyperparameter search. |
| TabICL | In-context meta-learning model that adapts to new tabular tasks by reading labeled examples. |
| XGBoost | Gradient boosting library that builds trees level-wise with regularization. |
| CatBoost | Gradient boosting library with ordered boosting and native categorical handling. |
| LightGBM | Leaf-wise gradient boosting library optimized for speed and large datasets. |
| Soft distillation | Trains a student model on a teacher’s predicted probabilities instead of hard labels. |
| Teacher-as-feature | Concatenates teacher predicted probabilities to the original feature matrix. |
| LogisticMeta_5 | Stacking ensemble that feeds five base models’ predictions into logistic regression. |

1. The Hard Truth: ieee-cis

On the real fraud dataset, TabPFN and TabICL collapse. The gap is large.

I subsampled the ieee-cis training set to 1k, 2k, 5k, 10k, and 20k rows. I trained each method on each subsample with 3 random seeds and computed ROC AUC on the held-out test set.

I use short suffixes throughout. XGB-fast, CB-fast, and LGB-fast use fewer trees and shallower depth. XGB-soft trains on TabPFN soft labels. LogisticMeta_5 stacks five base models.

[Figure: ROC AUC vs training size on ieee-cis]

Figure 1: ROC AUC on ieee-cis as training size increases. Error bars show standard deviation across 3 seeds. TabPFN flatlines around 0.65–0.71. Gradient boosters rise steadily to 0.86–0.87.

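| N_train | TabPFN | TabICL | XGB | XGB-fast | XGB-soft | CatBoost | CB-fast | LightGBM | LGB-fast | LogisticMeta |
|---|---|---|---|---|---|---|---|---|---|---|
| 1,000 | 0.6528 | 0.5809 | 0.8026 | 0.7984 | 0.7760 | 0.7877 | 0.7815 | 0.7842 | 0.7911 | 0.7583 |
| 2,000 | 0.6755 | 0.6152 | 0.8226 | 0.8218 | 0.8091 | 0.8125 | 0.8103 | 0.8034 | 0.8130 | 0.7991 |
| 5,000 | 0.6473 | 0.6535 | 0.8382 | 0.8425 | 0.8438 | 0.8255 | 0.8360 | 0.8271 | 0.8429 | 0.8295 |
| 10,000 | 0.7067 | 0.7298 | 0.8537 | 0.8598 | 0.8568 | 0.8449 | 0.8515 | 0.8394 | 0.8606 | 0.8556 |
| 20,000 | 0.6924 | 0.7122 | 0.8685 | 0.8702 | 0.8672 | 0.8696 | 0.8677 | 0.8629 | 0.8723 | 0.8760 |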

Three patterns jump out.

First, TabPFN does not scale. I measured its best AUC at 0.7067 with N=10k. Then it regressed to 0.6924 at N=20k. Whatever signal it extracts from 10k rows, more data confuses it. TabICL climbs through N=10k, where it peaks at 0.7298, then slips to 0.7122 at N=20k. Even at its peak it is 13pp behind the best GBM at that scale.

Second, the shallow fast models keep up. XGB-fast uses max_depth=3 and only 50 trees. It stays within about 0.6pp of XGB-default at every scale, and from N=5k upward it is slightly ahead. That is not a sacrifice.

Third, soft distillation wins exactly once. XGB-soft is XGBoost trained on TabPFN’s soft probability estimates instead of hard binary labels. It takes first place at N=5k with 0.8438 AUC against 0.8382 for raw XGB. This is the medium-data sweet spot. The GBM benefits from the foundation model’s smoothed pseudo-labels. At N=20k, the gap vanishes. With enough real data, the GBM does not need a teacher.

Why does TabPFN collapse? I can only speculate. Three architectural constraints line up with the failure pattern.

TabPFN was pretrained on synthetic tasks shaped like small UCI-style datasets. Those typically have fewer than 10k rows and fewer than 50 features. ieee-cis has 455 features after minimal preprocessing. That is far outside the training distribution.

The model uses a fixed attention window and positional encodings. These may not generalize to the sparse, high-cardinality categorical structure of fraud data. Card IDs, device types, and address hashes are not friendly to this architecture.

TabPFN also does not handle temporal non-stationarity. Vesta’s TransactionDT is a relative timestamp with strong seasonality. The model has no mechanism to exploit that signal.

TabICL keeps improving through N=10k, possibly because its in-context learning mechanism is less constrained by pretraining scale. It still caps out well below GBMs.


2. How Big Is “Big”? Effect Sizes

AUC differences are abstract. The raw gap on ieee-cis N=20k is +0.18 AUC. I computed Cohen’s d to normalize the mean difference by the pooled standard deviation. I used the formula with pooled variance across the 3 seeds.

$$ d = \frac{\bar{x}_1 - \bar{x}_2}{s}, \qquad s = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} $$

I calculated the mean AUC for each method across the 3 seeds. Then I plugged those means into Cohen’s d formula with the pooled standard deviation. The table below shows d for every method versus TabPFN.
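The formula above translates directly into a few lines of standard-library Python. The per-seed AUC values below are hypothetical placeholders chosen to be round numbers, not the benchmark's actual run records. They are only there to show how a small pooled standard deviation inflates d.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d with the pooled sample standard deviation of two groups."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

# Hypothetical per-seed AUCs for three seeds (illustration only).
gbm_auc = [0.86, 0.87, 0.88]
pfn_auc = [0.68, 0.69, 0.70]

d = cohens_d(gbm_auc, pfn_auc)  # mean gap 0.18, pooled std 0.01
```

With a mean gap of 0.18 and a pooled standard deviation of 0.01, d lands at 18. That is the same order of magnitude as the values in the table, and it shows that the size of d here is driven mostly by the small denominator.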

[Figure: Effect size waterfall for ieee-cis N=20k]

Figure 2: Cohen's d versus TabPFN on ieee-cis N=20k for every method. Positive values mean better than TabPFN. A d above 0.8 is "large". A d above 2.0 is "very large". Every GBM trained on raw features or soft labels sits above +20.

| Method | ΔAUC vs TabPFN | Cohen’s d | Interpretation |
|---|---|---|---|
| LogisticMeta_5 | +0.1836 | +24.52 | Nearly 25 pooled std better |
| XGB_default | +0.1761 | +23.69 | |
| XGB-fast | +0.1778 | +23.51 | |
| LGB-fast | +0.1799 | +22.72 | |
| CatBoost | +0.1772 | +23.20 | |
| XGB-soft | +0.1748 | +21.42 | |
| MLP-raw | +0.1363 | +14.27 | |
| TabICL | +0.0198 | +1.15 | Marginal improvement |
| PFN+ICL ensemble | +0.0096 | +1.14 | |
| MLP-teacher-feature | −0.1937 | −5.48 | Worse than baseline |
| CB-teacher-feature | −0.3495 | −5.22 | |
| XGB-teacher-feature | −0.4552 | −23.68 | Catastrophic failure |

A note on interpreting these numbers. A d above +20 looks absurd. The cause is the denominator. Across the 3 seeds the pooled standard deviation is tiny, roughly 0.01, and three runs give only a rough estimate of it. Even modest AUC gaps inflate to extreme d values. The important takeaway is not the exact d value. The AUC gap itself of +0.18 is enormous in fraud-detection terms. That improvement is the difference between a model that barely beats random guessing and one that strongly separates classes. Every gradient booster trained on raw features outperforms TabPFN by a margin far larger than the seed-to-seed noise. This holds even for deliberately constrained models like XGB-fast.


3. The Speed-Accuracy Inversion

The conventional ML tradeoff says faster means worse. On fraud data, that wisdom inverts. The speedup numbers here refer to fit time, not inference throughput. You fit once and predict millions. Predict latency matters more in production. My predict-time measurements had inconsistent instrumentation across methods. I focus on fit time as a conservative lower bound.

[Figure: Speed-accuracy scatter across all configs]

Figure 3: Speed-accuracy scatter plot across all 22 configs, aggregated by method family. Lower-left is better. That means faster fit and higher AUC. The foundation models occupy the worst quadrant. They are slower and less accurate.

| Method | AUC (N=20k) | Fit Time | Speedup vs TabPFN |
|---|---|---|---|
| TabPFN | 0.6924 | 0.62s | 1.0× |
| TabICL | 0.7122 | 2.15s | 0.3× (slower) |
| XGB-default | 0.8685 | 0.85s | 0.7× |
| XGB-fast | 0.8702 | 0.17s | 3.7× faster |
| LogisticMeta | 0.8760 | 0.14s | 4.6× faster |

XGB-fast is simultaneously 3.7× faster to fit and 18pp more accurate. The Pareto frontier contains no foundation model. When your toolkit is gradient boosters, there is no tradeoff. There is strict dominance.
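The dominance claim can be checked mechanically. The sketch below copies the fit times and AUCs from the N=20k table above and keeps only the methods that no other method beats on both axes. The helper names are hypothetical and the code is not part of the benchmark script.

```python
# (fit seconds, ROC AUC) on ieee-cis N=20k, copied from the table above.
results = {
    "TabPFN": (0.62, 0.6924),
    "TabICL": (2.15, 0.7122),
    "XGB-default": (0.85, 0.8685),
    "XGB-fast": (0.17, 0.8702),
    "LogisticMeta": (0.14, 0.8760),
}

def dominates(a, b):
    """a dominates b if it is no slower and no less accurate, and strictly better on one axis."""
    return a[0] <= b[0] and a[1] >= b[1] and (a[0] < b[0] or a[1] > b[1])

def pareto_frontier(points):
    """Names of the methods that no other method dominates."""
    return sorted(
        name for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    )

frontier = pareto_frontier(results)
```

On these five rows the frontier is the single point LogisticMeta, which has both the shortest fit time and the highest AUC in the table. XGB-fast dominates every other row, including both foundation models.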


4. The Easy Truth: fraud-detection

Not every dataset is ieee-cis. The fraud-detection benchmark has 28 features and roughly 10% fraud. It is a much gentler landscape.

[Figure: ROC AUC on fraud-detection]

Figure 4: ROC AUC on fraud-detection. At every training size, all methods cluster within about 0.06 AUC. The dataset is too small and too easy. You cannot distinguish method families here.

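| N_train | TabPFN | TabICL | XGB | XGB-fast | XGB-soft | CatBoost | CB-fast | LightGBM | LGB-fast | LogisticMeta |
|---|---|---|---|---|---|---|---|---|---|---|
| 500 | 0.7421 | 0.7525 | 0.7189 | 0.7253 | 0.7226 | 0.7120 | 0.7072 | 0.7084 | 0.7374 | 0.7512 |
| 1,000 | 0.7580 | 0.7598 | 0.7132 | 0.7428 | 0.7495 | 0.6994 | 0.7271 | 0.7124 | 0.7357 | 0.7554 |
| 2,000 | 0.7580 | 0.7698 | 0.7321 | 0.7621 | 0.7646 | 0.7221 | 0.7601 | 0.7241 | 0.7566 | 0.7652 |
| full | 0.7735 | 0.7724 | 0.7234 | 0.7671 | 0.7820 | 0.7325 | 0.7682 | 0.7210 | 0.7707 | 0.7756 |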

At N=500, TabICL actually wins. Foundation models carry inductive bias that shallow GBMs cannot construct from 500 rows. By N=2k the gap vanishes. At full size XGB-soft edges ahead by less than a percentage point.

The lesson is simple. Foundation models help only for data-starved fraud problems. For anything above 2k rows with decent features, an untuned gradient booster matches or beats them.


5. Teacher-as-Feature: A Catastrophic Failure (Sometimes)

I hypothesized that appending PFN/ICL probabilities to the raw feature matrix would act as a powerful engineered feature. On the hardest scale, ieee-cis N=20k, I was wrong. The failure is not universal. It depends on teacher quality.

[Figure: Teacher-as-feature breakdown across scales]

Figure 5: Teacher-as-feature performance at three training sizes on ieee-cis. Baselines appear in purple and teal. Teacher-augmented variants appear in brown. At N=1k and N=20k the gap is severe. At N=5k the teacher is strong enough that the feature is merely mediocre.

| Scale | XGB baseline | XGB + teacher | CB baseline | CB + teacher |
|---|---|---|---|---|
| N=1k | 0.8026 | 0.4844 | 0.7877 | 0.4630 |
| N=5k | 0.8382 | 0.2946 | 0.8255 | 0.3966 |
| N=20k | 0.8685 | 0.2372 | 0.8696 | 0.3429 |

At N=1k and N=20k, the collapse is catastrophic. The teacher probabilities are biased, low-quality features when the teacher itself is weak. TabPFN scores 0.65 to 0.69 AUC in this range. When concatenated to the raw feature matrix, these probabilities are highly correlated with the label in training. The GBM greedily splits on them. They generalize poorly because they encode the teacher’s specific errors. The student overfits to the teacher’s mistakes.

At N=5k, the supplementary sweep tells a different story from the table above. CB_teacher_as_feature reaches 0.8431 AUC there, against 0.3966 in the main sweep. That is 0.005 ahead of XGB-default at 0.8382. The teacher itself is stronger in that run. TabPFN scores 0.8539 and TabICL scores 0.8593, versus 0.6473 and 0.6535 in the main sweep. When the teacher is decent, its predictions are less poisonous as features.

Lesson: Teacher-as-feature is only safe when the teacher is already good. On hard fraud data where TabPFN and TabICL underperform, appending their predictions as features is destructive. A 0.69-AUC teacher poisons a 0.87-AUC student.

6. The Metrics That Actually Matter

In production fraud systems, you do not optimize AUC. You optimize recall at a fixed false-positive budget. That typically means 1% or 5% FPR. At these operating points the differences are even more stark than AUC suggests.

[Figure: Recall@1%FPR vs ROC AUC across all runs]

Figure 6: Recall at 1% FPR versus ROC AUC across all 1,188 runs. The relationship is sublinear with r ≈ 0.82. A gain of 0.05 AUC does not guarantee a gain of 0.05 recall. LogisticMeta punches above its AUC weight.

On ieee-cis N=20k:

| Method | ROC AUC | Recall @ 1% FPR | Fraud recovered at 1% FPR |
|---|---|---|---|
| TabPFN | 0.6924 | 0.125 | 12.5% |
| TabICL | 0.7122 | 0.132 | 13.2% |
| XGB-fast | 0.8702 | 0.363 | 36.3% |
| LogisticMeta | 0.8760 | 0.439 | 43.9% |

LogisticMeta improves recall from 12.5% to 43.9% at the same 1% false-positive budget. That is a 31.4 percentage point lift. Framed as a ratio, that is 3.5× as much fraud caught at that operating point. The absolute picture matters more. Moving from catching 1 in 8 fraudsters to catching nearly 1 in 2 is the difference between a usable production system and a broken one.
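Recall at a fixed false-positive budget is easy to compute by hand, which helps when checking a library's output. The sketch below is one plausible definition, the highest recall reachable while flagging at most the given share of legitimate rows. It is not necessarily the exact routine used in the benchmark, and the toy labels and scores are made up for illustration.

```python
def recall_at_fpr(y_true, scores, max_fpr=0.01):
    """Highest recall achievable while flagging at most max_fpr of the negatives."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Sweep the decision threshold from the highest score downward.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    best = 0.0
    i = 0
    while i < len(order):
        # Consume every row sharing this score so ties act as one threshold.
        current = scores[order[i]]
        while i < len(order) and scores[order[i]] == current:
            if y_true[order[i]] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        if fp / n_neg > max_fpr:
            break
        best = tp / n_pos
    return best

# Toy data: 4 fraud rows and 100 legitimate rows (illustration only).
y_true = [1, 1, 1, 1] + [0] * 100
scores = [0.9, 0.8, 0.3, 0.2] + [0.85, 0.5] + [0.1] * 98

r1 = recall_at_fpr(y_true, scores, max_fpr=0.01)
r5 = recall_at_fpr(y_true, scores, max_fpr=0.05)
```

In the toy data, one legitimate row outranks the second fraud row, so a 1% budget (one false positive out of 100) recovers two of the four fraud rows. Loosening the budget to 5% recovers all four.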


7. Global Ranking

I averaged AUC across all 22 configs. This includes internet-ads PCA variants where everything hits 0.95 and above.

[Figure: Global method ranking by mean AUC]

Figure 7: Mean ROC AUC across all 22 configurations. Error bars show standard deviation. The top 6 methods are all gradient boosters or ensembles of them. TabPFN and TabICL sit in the bottom half once hard fraud data is mixed in.

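| Rank | Method Family | Mean AUC | Std |
|---|---|---|---|
| 1 | XGB_soft_distill | 0.8081 | 0.114 |
| 2 | Stacking (LogisticMeta) | 0.8066 | 0.122 |
| 3 | LGB_fast | 0.8008 | 0.124 |
| 4 | XGB_fast | 0.7984 | 0.127 |
| 5 | CB_fast | 0.7977 | 0.125 |
| 6 | CatBoost | 0.7887 | 0.132 |
| 7 | LightGBM | 0.7868 | 0.133 |
| 8 | XGB (default + meta) | 0.7774 | 0.141 |
| 9 | TabPFN | 0.7701 | 0.132 |
| 10 | TabICL | 0.7681 | 0.138 |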

The pattern is unambiguous. Gradient boosters occupy the top tier. Foundation models are competitive only when the dataset is easy. That means internet-ads and click-small. They also win when the dataset is tiny, like fraud-detection at N=500. Throw in a hard fraud dataset and they drop to the bottom half.

A note on the missing contender. You may notice WeightedAvg is absent from both the ranking table and the dominance chart. On paper it wins the most configs, taking 10/22. But its CV-tuned weights overfit to the tiny validation set. The best weights change between seeds, which produces a different method label every run. I exclude it because it is not reproducible. You cannot deploy a model whose weights depend on a single random split.

[Figure: Method dominance count]

Figure 8: Number of configurations out of 22 where each method family achieves the highest AUC. LGB_fast and XGB_default each win 3 configs when excluding the overfitting WeightedAvg family.
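The two summaries in this section, mean AUC per method and wins per config, can be computed from a flat list of records as sketched below. The record layout (`config`, `method`, `roc_auc` keys, one seed-averaged row per config and method) is an assumption for illustration. The actual JSON schema in v4_raw_results.zip is not described here, and the sample values are made up.

```python
from collections import Counter, defaultdict
from statistics import mean

def mean_auc_ranking(records):
    """Methods sorted by mean AUC across configs, best first."""
    by_method = defaultdict(list)
    for r in records:
        by_method[r["method"]].append(r["roc_auc"])
    return sorted(((m, mean(v)) for m, v in by_method.items()),
                  key=lambda row: row[1], reverse=True)

def dominance_count(records):
    """Number of configs in which each method has the highest AUC."""
    best = {}
    for r in records:
        cfg = r["config"]
        if cfg not in best or r["roc_auc"] > best[cfg]["roc_auc"]:
            best[cfg] = r
    return Counter(r["method"] for r in best.values())

# Hypothetical seed-averaged records (illustration only).
records = [
    {"config": "hard-20k", "method": "XGB_fast", "roc_auc": 0.87},
    {"config": "hard-20k", "method": "TabPFN", "roc_auc": 0.69},
    {"config": "easy-500", "method": "XGB_fast", "roc_auc": 0.76},
    {"config": "easy-500", "method": "TabPFN", "roc_auc": 0.77},
]

ranking = mean_auc_ranking(records)
wins = dominance_count(records)
```

The toy records show why the two views can disagree: each method wins one config, yet the mean ranking clearly separates them because one loss is much larger than the other.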


8. What I Actually Learned From The Supplementary Experiment

After locking the initial sweep, I realized I had no apples-to-apples comparison for soft distillation across booster families. I patched the benchmark to add CB_soft_distill_pfn, CB_soft_distill_icl, LGB_soft_distill_pfn, and LGB_soft_distill_icl.

This was harder than expected. Both CatBoostClassifier and LGBMClassifier crash on continuous soft labels. They throw opaque errors about target classes needing only 2 unique values. Another error says unknown label type continuous. The fix is to switch to regressors. I used CatBoostRegressor and LGBMRegressor with RMSE loss. Then I manually clip predictions back to the [0, 1] interval. In practice, CB-soft and LGB-soft minimize

$$ \mathcal{L} = \frac{1}{N}\sum_{i=1}^{N} \bigl(P_{\text{teacher}}(y_i=1 \mid x_i) - \hat{y}_i^{\text{student}}\bigr)^2 $$

with $\hat{y}_i^{\text{student}}$ clipped to $[0,1]$ at inference. Soft distillation is therefore not a generic technique. You cannot simply train any classifier on pseudo-labels. The API ergonomics vary wildly by library.
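The regress-then-clip workaround can be expressed as a thin wrapper around any object with `fit` and `predict` methods. The sketch below is a minimal illustration, not the benchmark's code. `SoftStudent` and `LineRegressor` are hypothetical names, and the pure-Python least-squares regressor stands in for CatBoostRegressor or LGBMRegressor so that the example runs without extra dependencies.

```python
class SoftStudent:
    """Train a regressor on teacher probabilities and clip its outputs to [0, 1]."""

    def __init__(self, regressor):
        self.regressor = regressor

    def fit(self, X, teacher_proba):
        # The target is the teacher's P(y=1 | x), not the hard 0/1 label.
        self.regressor.fit(X, teacher_proba)
        return self

    def predict_proba_positive(self, X):
        # A squared-error regressor can overshoot, so clip at inference.
        return [min(1.0, max(0.0, float(p))) for p in self.regressor.predict(X)]


class LineRegressor:
    """Stand-in student: ordinary least squares on the first feature only."""

    def fit(self, X, y):
        xs = [row[0] for row in X]
        mx, my = sum(xs) / len(xs), sum(y) / len(y)
        sxx = sum((x - mx) ** 2 for x in xs)
        self.slope = sum((x - mx) * (t - my) for x, t in zip(xs, y)) / sxx
        self.intercept = my - self.slope * mx
        return self

    def predict(self, X):
        return [self.intercept + self.slope * row[0] for row in X]


# Hypothetical teacher probabilities for four training rows.
X_train = [[0.0], [1.0], [2.0], [3.0]]
p_teacher = [0.1, 0.3, 0.5, 0.7]

student = SoftStudent(LineRegressor()).fit(X_train, p_teacher)
preds = student.predict_proba_positive([[-2.0], [1.5], [10.0]])
```

The first and last predictions fall outside the unit interval before clipping, which is exactly the case the clip step exists for. Swapping in a real boosted regressor should only require passing a different object to `SoftStudent`, assuming its `predict` returns one value per row.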

Results from the corrected sweep are rolling in now. I focus on ieee-cis N=5k. This scale produced the only outright win for soft distillation in the main results. The table below shows XGB_soft_distill_pfn and the new CatBoost and LightGBM soft variants.

| Method | AUC | Δ vs XGB_default | Δ vs XGB_soft_pfn |
|---|---|---|---|
| LogisticMeta_5 | 0.8535 | +0.0153 | +0.0097 |
| CB_soft_distill_icl | 0.8451 | +0.0069 | +0.0013 |
| CB_soft_distill_pfn | 0.8441 | +0.0059 | +0.0003 |
| XGB_soft_distill_pfn | 0.8438 | +0.0056 | |
| LGB_fast | 0.8429 | +0.0047 | −0.0009 |
| XGB_fast | 0.8425 | +0.0043 | −0.0013 |
| LGB_soft_distill_pfn | 0.8297 | −0.0085 | −0.0141 |
| LGB_soft_distill_icl | 0.8250 | −0.0132 | −0.0188 |

Here is what I see so far.

CatBoost soft distillation works. CB_soft_distill_icl edges out XGB_soft_distill_pfn by 0.0013 AUC. That is basically a tie. CatBoost’s ordered boosting seems to handle soft targets about as well as XGB’s gradient boosting.

LightGBM soft distillation is weak at N=5k. Both LGB soft variants underperform raw XGB_default on this single data point. I do not know if this is a general property of leaf-wise trees on noisy continuous targets. It could also be hyperparameter sensitivity. My LGB regressor used max_depth=6, same as the classifier. The full sweep will tell. I suspect leaf-wise growth may be less stable on soft targets than the level-wise approach in XGB and CatBoost. This is a hypothesis, not a conclusion.

LogisticMeta still wins. Even with the new contenders, stacking five base models beats every single-model distillation approach. The ensemble effect dominates the distillation effect.

Teacher-as-feature is scale-dependent. At N=5k, CB_teacher_as_feature achieves 0.8431 AUC. That is not a catastrophe. The teacher itself is stronger at this scale. TabPFN scores 0.8539 and TabICL scores 0.8593. When the teacher is decent, its predictions are less poisonous as features. At N=20k, where TabPFN collapses to 0.6924, the same technique drops to 0.3429.

The full supplementary sweep is still running. That covers ieee-cis 1k and 20k, plus fraud-detection at 500, 2k, and full. I will update this table when complete.


9. Conclusions & Recommendations

Max accuracy on hard fraud: Use LogisticMeta_5 or XGB_default. Evidence: 0.876 AUC, 0.439 R@1% on ieee-cis 20k.
Best speed/accuracy tradeoff: Use XGB-fast with depth 3 and 50 estimators. Evidence: 0.870 AUC, 0.17s fit. That is 3.7× faster than TabPFN with higher accuracy.
Soft distillation worth it? Only at medium data near N=5k. At scale, raw data beats soft labels. Evidence: XGB-soft wins at N=5k, ties at N=20k.
Teacher-as-feature? Never on hard data where the teacher is weak. Evidence: 0.24 AUC on ieee-cis versus 0.87 baseline.
TabPFN/TabICL for fraud? Use only if N < 1k and inference cost is irrelevant. Evidence: Wins at N=500 on fraud-detection; collapses above 2k.

The central finding is not that soft distillation is magic. It is that foundation models fail catastrophically on high-dimensional, heavily engineered fraud data. Fast gradient boosters are the correct tool. Soft distillation is a useful but non-essential refinement. The real win comes from choosing the right model family for the data distribution.

If you are building a fraud detection pipeline today, the evidence is clear. Start with a shallow XGBoost. Add a CatBoost for diversity. Stack them with logistic regression if you have the latency budget. Skip the transformer unless your dataset is tiny or your features are pristine.


10. Raw Data & Reproducibility

  • All 22 JSON result files: v4_raw_results.zip
  • Benchmark script: fraud_benchmark_v4.py (airig: ~/tabpfn-playground/)
  • Analysis scripts: v4_generate_plots_v2.py, v4_deep_analysis.py — available on request
  • Hardware: All times are wall-clock on RTX 5090. CPU-only TabPFN/TabICL fits will be slower.
  • Random seeds: 42, 43, 44. All reported means ± std are across these three seeds.