Machine: airig — AMD Ryzen 9 9900X, NVIDIA RTX 5090 FE, 64 GB RAM
Software: Python 3.13.5, torch 2.12+cu130, tabpfn 8.0.3, tabicl 2.1.1, xgboost 3.2.0, catboost 1.2.7, lightgbm 4.6.0
Raw data: fi_credit_g.json fi_telco_churn.json fi_bank_marketing.json fi_Credit_Card_Fraud_Classification.json fi_default_of_credit_card_clients.json distill_creditg.json distill_telco.json distill_bank.json distill_cc_fraud.json distill_default.json
Scripts: feature_importance_cmp.py distill_engineered.py

The puzzle

TabPFN3 and TabICL are Bayesian prior-based foundation models for tabular classification. XGBoost is an ensemble of gradient-boosted decision trees. They are architecturally unrelated: the foundation models use cross-attention over in-context learning examples, while XGBoost uses axis-parallel splits on single features. You would expect them to extract different signal from the same data.

I stacked them with a logistic meta-learner on five classification datasets. The result was underwhelming: on four datasets, the ensemble matched or barely beat the best standalone model (+0.1 to +0.5 pp). But on cc-fraud—a credit-card fraud dataset with 0.17% positive rate—the ensemble was dramatically more robust. TabPFN3 alone dropped to 96.99% AUC on one run; the meta-learner never fell below 98.34%.

Why does stacking work on fraud but fail everywhere else?

Hypothesis

If XGBoost and TabPFN attend to different features, they make uncorrelated errors. The meta-learner can hedge—when one model is wrong, the other is often right. If they attend to the same features, they make correlated errors. The meta-learner just averages two copies of the same mistake.

Feature importance disagreement should predict stacking value.
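
The hedging argument can be illustrated with a small simulation. This is a sketch under a simplifying assumption that is not measured in this study: each model's error is zero-mean Gaussian noise with unit variance, and the two errors have correlation r. Averaging them leaves a variance of (1 + r)/2, so highly correlated errors barely shrink.

```python
import numpy as np

def averaged_error_variance(r: float, n: int = 200_000, seed: int = 0) -> float:
    """Empirical variance of the mean of two unit-variance errors with correlation r."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, r], [r, 1.0]])
    # Each row holds one simulated error for "model A" and one for "model B".
    errors = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)
    return float(errors.mean(axis=1).var())

for r in (0.0, 0.5, 0.95):
    simulated = averaged_error_variance(r)
    theory = (1 + r) / 2  # Var((e1 + e2) / 2) for unit-variance errors
    print(f"r={r:.2f}  simulated={simulated:.3f}  theory={theory:.3f}")
```

A logistic meta-learner is more flexible than a plain average, so this is only the intuition, not a model of the stack used here.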

Method

Models

  • XGBoost: 200 trees, depth 6, gain-based feature importance from feature_importances_
  • TabPFN3: 8 estimators, CUDA, permutation importance on held-out test set
  • TabICL: 8 estimators, CUDA, permutation importance on held-out test set

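All three are trained on identical train/val/test splits. Permutation importance is computed with sklearn.inspection.permutation_importance (n_repeats = 10 on small datasets, n_repeats = 3–5 on large datasets) and measures the drop in ROC-AUC when a single feature is shuffled. Note that XGBoost's gain-based scores are on a different scale from permutation scores, so comparisons between models should rely on ranks rather than raw magnitudes.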
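
A minimal sketch of the permutation-importance step follows. The synthetic data and the LogisticRegression stand-in are assumptions made so the example runs anywhere; in the experiments the fitted estimator would be TabPFN3 or TabICL, whose constructors are not shown here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 8 features, 3 of them informative.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Stand-in model (assumption); any fitted classifier with predict_proba works.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Shuffle one feature at a time on the held-out set and record the ROC-AUC drop.
result = permutation_importance(
    model, X_test, y_test, scoring="roc_auc", n_repeats=10, random_state=0
)
importance = result.importances_mean   # one score per feature
ranking = np.argsort(-importance)      # feature indices, most important first
print(ranking, importance.round(3))
```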

Datasets

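| Dataset        | Rows   | Features | Pos. rate |
|----------------|--------|----------|-----------|
| credit-g       | 1,000  | 20       | 30.0%     |
| telco-churn    | 7,043  | 19       | 26.5%     |
| bank-marketing | 45,211 | 16       | 11.3%     |
| cc-fraud       | 28,480 | 30       | 0.17%     |
| default-credit | 30,000 | 23       | 22.1%     |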

Metrics

  • Spearman rank correlation between each pair of importance vectors. ρ = 1 means identical ranking; ρ = 0 means no monotonic relationship between the rankings; ρ < 0 means the rankings tend to be inverted.
  • Top-k overlap: how many of the top-k features are shared between two models.
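
Both metrics fit in a few lines. The sketch below uses scipy.stats.spearmanr; the helper names and the toy importance vectors are illustrative assumptions, not values from the experiments.

```python
import numpy as np
from scipy.stats import spearmanr

def spearman_rho(imp_a, imp_b) -> float:
    """Spearman rank correlation between two importance vectors."""
    rho, _ = spearmanr(imp_a, imp_b)
    return float(rho)

def top_k_overlap(imp_a, imp_b, k: int = 5) -> int:
    """Number of features that appear in the top-k of both importance vectors."""
    top_a = set(np.argsort(-np.asarray(imp_a))[:k].tolist())
    top_b = set(np.argsort(-np.asarray(imp_b))[:k].tolist())
    return len(top_a & top_b)

# Toy importance vectors for six features (hypothetical numbers).
imp_tree = [0.50, 0.20, 0.10, 0.08, 0.07, 0.05]
imp_transformer = [0.40, 0.05, 0.20, 0.15, 0.02, 0.18]

print(spearman_rho(imp_tree, imp_transformer))        # about 0.43
print(top_k_overlap(imp_tree, imp_transformer, k=3))  # 2
```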

Results

Spearman rank correlations

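[Figure: Correlation heatmap]

| Dataset        | XGB vs TabPFN | XGB vs TabICL | TabPFN vs TabICL |
|----------------|---------------|---------------|------------------|
| credit-g       | 0.668*        | 0.756         | 0.917            |
| telco-churn    | 0.791*        | 0.914         | 0.911            |
| bank-marketing | 0.947         | 0.915*        | 0.968            |
| cc-fraud       | 0.244*        | 0.338         | 0.849            |
| default-credit | 0.832*        | 0.886         | 0.903            |

* = pair with the largest disagreement (lowest ρ) in each row.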

The outlier is unmistakable. On every dataset except cc-fraud, XGBoost and TabPFN/TabICL show moderate-to-strong agreement (ρ = 0.67–0.95). On cc-fraud, the XGBoost correlations drop to 0.24 (vs TabPFN) and 0.34 (vs TabICL), the weakest agreement in the study. TabPFN and TabICL still agree strongly with each other (ρ = 0.85), but XGBoost ranks the features very differently.

Top-5 feature overlap

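| Dataset        | XGB ∩ TabPFN | XGB ∩ TabICL | TabPFN ∩ TabICL |
|----------------|--------------|--------------|-----------------|
| credit-g       | 5/5          | 5/5          | 5/5             |
| telco-churn    | 3/5          | 4/5          | 4/5             |
| bank-marketing | 5/5          | 4/5          | 4/5             |
| cc-fraud       | 2/5          | 1/5          | 4/5             |
| default-credit | 3/5          | 3/5          | 4/5             |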

On cc-fraud, XGBoost shares only 2 of its top-5 features with TabPFN, and only 1 with TabICL.

[Figure: Disagreement bar chart]

What features does each model see?

On cc-fraud, all three models agree that v14 is the single most important feature. After that, they diverge:

XGBoost relies on a narrow set: v14 dominates, then v22, v24, v5, features that support sharp axis-parallel splits. The importance magnitudes are tiny (0.001–0.006), suggesting that beyond its few dominant features XGBoost spreads its attention thinly across many weak signals.

TabPFN and TabICL extract signal from a much broader set: v14, v16, v10, v3, v12, v11, v4, v7, v9. Their importance magnitudes are an order of magnitude larger (0.01–0.10), suggesting that each of these features carries a stronger signal for the transformers. The two scales are not strictly comparable, though: XGBoost's scores are gain-based, while these are permutation-based AUC drops, so the rankings are the safer evidence.

[Figure: cc-fraud top features by model]

Does disagreement predict stacking value?

The link between feature disagreement (1 − ρ_XGB,PFN) and stacking robustness rests mainly on one dataset. On the four datasets where XGBoost and TabPFN largely agree (ρ = 0.67–0.95), the LogisticMeta ensemble's worst-case AUC lands within −0.50 to +0.07 pp of the best single model. On cc-fraud (ρ = 0.24), the ensemble keeps the worst-case AUC at 98.34% versus TabPFN's 96.99%, a 1.35 pp safety margin.

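| Dataset        | XGB-PFN ρ | Disagreement | Meta min AUC | Best single min AUC | Safety margin |
|----------------|-----------|--------------|--------------|---------------------|---------------|
| bank-marketing | 0.947     | 0.053        | 93.61%       | 93.65% (TabPFN)     | −0.04 pp      |
| telco-churn    | 0.791     | 0.209        | 84.99%       | 85.15% (TabICL)     | −0.16 pp      |
| credit-g       | 0.668     | 0.332        | 78.93%       | 79.43% (TabICL FT)  | −0.50 pp      |
| default-credit | 0.832     | 0.168        | 77.37%       | 77.30% (TabPFN)     | +0.07 pp      |
| cc-fraud       | 0.244     | 0.756        | 98.34%       | 96.99% (TabPFN)     | +1.35 pp      |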

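The pattern: low disagreement → no stacking benefit; high disagreement → real robustness gains. It is driven by cc-fraud, though. Among the other four datasets, more disagreement does not bring a larger margin (credit-g has the second-highest disagreement and the most negative margin).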
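
The disagreement and safety-margin columns can be recomputed from the table, together with a rank correlation between them. This is a small sketch using scipy.stats.spearmanr; the numbers are copied from the table above, and with only five datasets any coefficient is fragile and should be read next to the per-dataset values.

```python
from scipy.stats import spearmanr

# (dataset, XGB-PFN rho, meta min AUC %, best single min AUC %) from the table above.
rows = [
    ("bank-marketing", 0.947, 93.61, 93.65),
    ("telco-churn",    0.791, 84.99, 85.15),
    ("credit-g",       0.668, 78.93, 79.43),
    ("default-credit", 0.832, 77.37, 77.30),
    ("cc-fraud",       0.244, 98.34, 96.99),
]

disagreement = [round(1 - rho, 3) for _, rho, _, _ in rows]
margin = [round(meta - single, 2) for _, _, meta, single in rows]

for (name, _, _, _), d, m in zip(rows, disagreement, margin):
    print(f"{name:15s} disagreement={d:.3f}  safety margin={m:+.2f} pp")

# Rank correlation between disagreement and safety margin over the five datasets.
rank_corr, _ = spearmanr(disagreement, margin)
print(f"Spearman(disagreement, margin) = {rank_corr:.2f}")
```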

Why do models disagree on fraud but agree elsewhere?

Two hypotheses:

  1. Fraud patterns are inherently high-dimensional manifolds. Fraudulent transactions may not be separable by axis-parallel splits on single features. XGBoost, limited to max_depth=6, cannot capture the joint interactions that a 22-layer transformer can. It compensates by finding the few features that support the cleanest splits (v14, v22, v24). TabPFN, with its cross-attention over all features and in-context examples, naturally captures the manifold structure.

  2. Extreme class imbalance amplifies structural differences. With 0.17% positives, a gradient-boosted tree must be extremely conservative—most splits will optimize for the dominant negative class. The transformer’s Bayesian marginalization over priors may be less sensitive to this imbalance, allowing it to attend to subtle cues that XGBoost prunes away.

The cc-fraud dataset has 30 features, more than the other datasets (16–23). More features may provide more opportunities for architectural differences to express themselves.

Can we close the gap?

If TabPFN sees features that XGBoost misses, a natural question is: can we give XGBoost the same view? I tested three strategies: deeper trees, better boosting algorithms, and engineered interaction features derived from TabPFN’s own top-importance features.

Methods

For each dataset, I:

  1. Trained TabPFN3 and computed permutation-importance rankings
  2. Extracted the top-10 TabPFN features
  3. Engineered pairwise products, ratios, and squared terms from those top-10 features
  4. Trained four cheap models on (a) raw features only, and (b) raw + engineered features:
    • XGBoost d6 and XGBoost d12: gradient boosting, different depths
    • CatBoost: ordered boosting with native categorical handling
    • LightGBM: histogram-based gradient boosting
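
Step 3 can be sketched as follows. The exact set of ratios and the zero-denominator guard are assumptions; the source only says "pairwise products, ratios, and squared terms". Under this particular choice (one ratio per unordered pair), ten top features yield 10 squares + 45 products + 45 ratios = 100 new columns.

```python
from itertools import combinations

import numpy as np

def engineer_interactions(X, top_idx, eps: float = 1e-9):
    """Return squares, pairwise products and pairwise ratios of the columns in top_idx."""
    X = np.asarray(X, dtype=float)
    cols, names = [], []
    for i in top_idx:
        cols.append(X[:, i] ** 2)
        names.append(f"f{i}^2")
    for i, j in combinations(top_idx, 2):
        cols.append(X[:, i] * X[:, j])
        names.append(f"f{i}*f{j}")
        # Assumed guard: replace near-zero denominators so the ratio stays finite.
        denom = np.where(np.abs(X[:, j]) < eps, eps, X[:, j])
        cols.append(X[:, i] / denom)
        names.append(f"f{i}/f{j}")
    return np.column_stack(cols), names

# Toy usage: 5 rows, 12 raw features, "top-10" taken to be the first ten columns.
X_raw = np.arange(60, dtype=float).reshape(5, 12) + 1.0
X_eng, eng_names = engineer_interactions(X_raw, top_idx=list(range(10)))
X_aug = np.hstack([X_raw, X_eng])  # raw + engineered, as in variant (b)
print(X_eng.shape, X_aug.shape)
```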

Results

[Figure: Closing the gap]

| Dataset        | TabPFN  | XGB d6 | XGB d12 | CatBoost | LightGBM | Best + engineered     |
|----------------|---------|--------|---------|----------|----------|-----------------------|
| credit-g       | 78.00%  | 76.46% | 78.79%* | 75.86%   | 74.81%   | 78.51% (CatBoost+E)   |
| telco-churn    | 85.12%* | 81.68% | 81.62%  | 82.32%*  | 82.04%   | 82.12% (CatBoost+E)   |
| bank-marketing | 93.72%* | 92.03% | 91.87%  | 92.38%*  | 91.81%   | 92.20% (CatBoost+E)   |
| cc-fraud       | 99.92%* | 98.87% | 98.87%  | 99.57%   | 93.61%   | 99.65%* (CatBoost+E)  |
| default-credit | 78.00%* | 76.09% | 75.18%  | 77.61%*  | 76.31%   | 77.29% (CatBoost+E)   |

* = best cheap model per dataset (raw or engineered). TabPFN is also marked when no cheap model matches it.

What works

CatBoost is usually the best cheap model. On four of five datasets, CatBoost raw outperforms XGBoost d6, sometimes by a large margin (cc-fraud: +0.70 pp, default-credit: +1.52 pp); credit-g is the exception (75.86% vs 76.46%). CatBoost’s ordered boosting and native categorical handling seem to extract more signal than XGBoost’s standard gradient boosting, especially on imbalanced or mixed-type data.

On small data, cheap models can win. credit-g has only 1,000 rows. Here, XGBoost d12 (78.79%) actually beats TabPFN (78.00%), and CatBoost with engineered features edges past it as well (78.51%). With limited training data, the extra inductive bias from deeper trees or hand-engineered interactions provides an edge that the foundation model’s Bayesian marginalization does not.

On imbalanced fraud data, CatBoost gets close. TabPFN achieves 99.92% on cc-fraud—near-perfect. CatBoost raw reaches 99.57%, and with engineered features climbs to 99.65%. The gap is small (0.27 pp) but real. For production systems where inference latency matters, a 0.27 pp AUC sacrifice for orders-of-magnitude faster prediction could be the right tradeoff.

What doesn’t work

Deeper XGBoost is not the answer. XGBoost d12 is identical to d6 on cc-fraud (98.87%), worse on default-credit (75.18% vs 76.09%), marginally worse on telco-churn and bank-marketing, and better only on credit-g (78.79% vs 76.46%), the smallest dataset. On cc-fraud, where the disagreement is largest, the gap between trees and transformers looks architectural rather than a capacity issue. Adding depth does not make an axis-parallel split learner into a manifold learner.

Engineered features rarely help on large datasets. On bank-marketing (45,211 rows) and telco-churn (7,043 rows), adding 75+ interaction features from TabPFN’s top-10 does not improve AUC: the best engineered model trails CatBoost raw on both. On credit-g (1,000 rows), it helps substantially (CatBoost: 75.86% → 78.51%), because the small training set does not provide enough statistical evidence for the booster to discover interactions on its own. The lesson: distilling foundation-model feature priorities into cheap models only works when the cheap model lacks data to discover those priorities itself.

LightGBM is not competitive here. It underperforms XGBoost d6 on 3 of 5 datasets (credit-g, bank-marketing, cc-fraud) and crashes catastrophically on cc-fraud when given engineered features (93.61% → 79.31%). LightGBM’s histogram-based splits may be too aggressive for the extreme class imbalance in the fraud dataset.

The new hierarchy

After all experiments, the ranking is clear:

  1. TabPFN/TabICL: Best AUC, especially on large and imbalanced data. Slow inference.
  2. CatBoost: Best cheap alternative. Closer to TabPFN than XGBoost d6 on four of five datasets. Moderate speed.
  3. XGBoost: Decent, but architecturally limited on high-dimensional manifolds. Fast.
  4. LightGBM: Unreliable in this benchmark. Sometimes fast, sometimes broken.

If you need TabPFN-level accuracy on a latency budget, CatBoost raw is your starting point, not XGBoost. And if your dataset is small (under ~2,000 rows), XGBoost d12 or CatBoost with engineered interactions from a foundation model can actually beat the foundation model itself.

How much data do you need?

The most important question about foundation models is not “can they beat trees?” but “at what N?” On small datasets, the inductive bias of a pre-trained transformer should dominate. On large datasets, a gradient-boosted tree with enough examples should catch up.

I trained all four models on subsamples of three datasets: credit-g (small), bank-marketing (large), and cc-fraud (large, imbalanced).

[Figure: Size sweep]

| Dataset        | N      | PFN    | ICL    | XGB     | CB     | PFN−XGB  |
|----------------|--------|--------|--------|---------|--------|----------|
| credit-g       | 100    | 73.87% | 70.65% | 69.38%  | 71.68% | +4.5 pp  |
| credit-g       | 500    | 78.20% | 78.56% | 73.14%  | 73.74% | +5.1 pp  |
| bank-marketing | 100    | 84.23% | 84.08% | 80.41%  | 78.90% | +3.8 pp  |
| bank-marketing | 2,000  | 92.39% | 92.28% | 89.59%  | 90.37% | +2.8 pp  |
| bank-marketing | 20,000 | 93.99% | 93.64% | 92.81%  | 92.98% | +1.2 pp  |
| cc-fraud       | 500    | 99.90% | 99.89% | 50.00%* | 99.32% | +49.9 pp |
| cc-fraud       | 2,000  | 99.93% | 99.91% | 99.76%  | 99.86% | +0.2 pp  |
| cc-fraud       | 10,000 | 99.92% | 99.86% | 98.87%  | 99.57% | +1.0 pp  |

* XGBoost fails at N=500 on cc-fraud: stratified subsampling yields only negatives in training, so the model predicts the majority class for every sample. This is not a bug; it is a consequence of extreme imbalance and insufficient data.
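
The arithmetic behind that failure is short. The sketch below uses the 0.17% positive rate from the dataset table; the helper names are illustrative, and the class check is an assumed safeguard rather than something the experiment scripts are known to contain.

```python
def expected_positives(n_train: int, pos_rate: float) -> float:
    """Expected number of positive rows in a stratified subsample of size n_train."""
    return n_train * pos_rate

def has_both_classes(y) -> bool:
    """True if the label vector contains at least two distinct classes."""
    return len(set(y)) >= 2

POS_RATE = 0.0017  # cc-fraud positive rate from the dataset table

for n in (500, 2_000, 10_000):
    print(f"N={n:>6}: expected positives = {expected_positives(n, POS_RATE):.2f}")

# With fewer than one expected positive, a subsample can easily contain none.
y_sub = [0] * 500
if not has_both_classes(y_sub):
    print("Only one class present: a fitted model can only rank at chance (AUC 50%).")
```

At N=500 the expected count is 0.85 positives, at N=2,000 it is 3.4, and at N=10,000 it is 17.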

What the sweep reveals

On less extreme class ratios, the gap shrinks with more examples. On bank-marketing (11.3% positives), TabPFN leads XGBoost by 3.8 pp at N=100 but only 1.2 pp at N=20,000. The foundation model’s advantage is concentrated at small N, exactly where inductive bias matters most.

On imbalanced data, the gap is nonlinear and catastrophic. At N=500 on cc-fraud, XGBoost is useless (50% AUC) while TabPFN is near-perfect. At N=2,000, XGBoost suddenly works (99.76%) and the gap collapses to 0.2 pp. At N=10,000, the gap widens again to 1.0 pp. The relationship between N and performance on imbalanced data is not monotonic—there is a phase transition where the tree model acquires enough minority-class examples to learn meaningful splits.

On truly small data, TabPFN beats trees. At N=100 on bank-marketing, CatBoost trails TabPFN by 5.3 pp and XGBoost trails it by 3.8 pp. At N=100 on credit-g, the CatBoost gap is 2.2 pp, although TabICL (70.65%) falls behind CatBoost (71.68%) there. If you have only a few hundred labeled examples or fewer, a tabular foundation model is the safest starting point.

The decision rule

| Training set size   | Recommendation |
|---------------------|----------------|
| N < 500             | Use TabPFN or TabICL. Trees are not competitive. |
| 500 ≤ N < 10,000    | Foundation models still lead, but CatBoost may close within 1 pp. Worth benchmarking both. |
| N ≥ 10,000          | The gap is 1 pp or less. Use CatBoost (or XGBoost) unless every fraction of a point matters. |
| Imbalanced (>1:100) | Foundation models are dramatically more robust at small N. Do not use trees below N=2,000 without careful class balancing. |
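
The table can be encoded as a small helper. This is one possible reading, not a rule the experiments validate beyond the tested sizes: treating ">1:100" as a minority share below 1%, and checking the imbalance row first, are both assumptions.

```python
from typing import Optional

def recommend(n_train: int, minority_share: Optional[float] = None) -> str:
    """Map training-set size (and optional minority-class share) to a recommendation."""
    # Assumption: ">1:100" is read as a minority share below 1%.
    imbalanced = minority_share is not None and minority_share < 0.01
    # Assumption: the imbalance row takes precedence over the size rows.
    if imbalanced and n_train < 2_000:
        return "foundation model (trees need careful class balancing at this size)"
    if n_train < 500:
        return "foundation model"
    if n_train < 10_000:
        return "benchmark foundation model and CatBoost"
    return "CatBoost or XGBoost"

print(recommend(300))
print(recommend(5_000))
print(recommend(1_500, minority_share=0.0017))
print(recommend(30_000, minority_share=0.221))
```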

Implications

Do not stack models blindly. If two models rank features identically, their errors are correlated and the ensemble adds nothing but latency. Compute feature-importance correlation before building a meta-learner. If ρ > 0.8, stacking is probably not worth the operational cost.

Look for feature disagreement as a signal. When you find a dataset where tree-based and attention-based models disagree strongly on feature importance, that is your stacking opportunity. The disagreement is a proxy for uncorrelated error modes.

The single most important feature is not the whole story. Every model in this study agreed on the #1 feature on every dataset. But the #2–#10 features are where the architectural differences live. Stacking value comes from the tail, not the head.

Can you beat the ensemble’s own mean?

TabPFN3 and TabICL both use an internal ensemble of estimators: TabPFN3 defaults to 8, and TabICL defaults to 8 estimators with feature shuffles, class permutations, and normalization method variations. A natural question is whether the naive mean across these estimators leaves signal on the table. If the 8 estimators make uncorrelated errors, a learned meta-learner should outperform the mean.

I extracted the raw individual predictions from both models and tested five combination strategies:

  1. Mean probabilities — the default predict_proba() behavior (baseline)
  2. Logistic regression on logits — train a logistic model on the 8 raw class-1 logits
  3. Logistic regression on probabilities — train on the 8 positive-class probabilities
  4. Best single estimator — pick the one with highest validation AUC, use it solo
  5. Variance-weighted mean — weight each estimator by inverse variance of its predictions on validation

Protocol: 3-way splits (train for fitting the foundation model, validation for training the meta-learner, test for final evaluation). Five random seeds per dataset.
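
Three of the five strategies (mean, best single estimator, inverse-variance weighting) can be sketched on a matrix of per-estimator probabilities. The synthetic data generator is a stand-in for the real per-estimator outputs and is purely an assumption; the AUC helper uses the standard rank-sum identity.

```python
import numpy as np
from scipy.stats import rankdata

def auc(y_true, score) -> float:
    """ROC-AUC via the rank-sum identity; tied scores receive average ranks."""
    y_true = np.asarray(y_true)
    ranks = rankdata(score)
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    return float((ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))

def combine_mean(P):
    """Strategy 1: plain average over estimators. P has shape (n_samples, n_estimators)."""
    return P.mean(axis=1)

def combine_best_single(P_val, y_val, P_test):
    """Strategy 4: keep the estimator with the highest validation AUC."""
    val_auc = [auc(y_val, P_val[:, k]) for k in range(P_val.shape[1])]
    return P_test[:, int(np.argmax(val_auc))]

def combine_inverse_variance(P_val, P_test):
    """Strategy 5: weight each estimator by the inverse variance of its validation predictions."""
    weights = 1.0 / (P_val.var(axis=0) + 1e-12)
    return P_test @ (weights / weights.sum())

rng = np.random.default_rng(0)

def fake_estimators(n, n_estimators=8):
    """Synthetic stand-in (assumption): a shared signal plus small per-estimator noise."""
    y = rng.integers(0, 2, size=n)
    shared = 1.5 * (2 * y - 1) + rng.normal(size=n)
    logits = shared[:, None] + 0.1 * rng.normal(size=(n, n_estimators))
    return 1.0 / (1.0 + np.exp(-logits)), y

P_val, y_val = fake_estimators(400)
P_test, y_test = fake_estimators(400)
for name, scores in [
    ("mean", combine_mean(P_test)),
    ("best single", combine_best_single(P_val, y_val, P_test)),
    ("inverse variance", combine_inverse_variance(P_val, P_test)),
]:
    print(f"{name:17s} test AUC = {auc(y_test, scores):.4f}")
```

The two logistic-regression strategies would fit a meta-model on the validation matrix in the same way and are omitted to keep the sketch short.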

Results

No strategy reliably beats the mean on either model.

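| Strategy           | PFN wins | PFN mean Δ | ICL wins | ICL mean Δ |
|--------------------|----------|------------|----------|------------|
| Mean (baseline)    | 3/25     | 0.00 pp    | 4/25     | 0.00 pp    |
| Logistic on logits | 0/25     | −0.28 pp   | 5/25     | −0.26 pp   |
| Logistic on probs  | 5/25     | −0.02 pp   | 4/25     | +0.02 pp   |
| Best single        | 7/25     | −0.13 pp   | 7/25     | +0.04 pp   |
| Variance weighted  | 2/25     | +0.00 pp   | 5/25     | +0.00 pp   |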

The best single-seed win was +0.83 pp (TabICL logistic on logits, telco-churn seed 43). The worst single-seed loss was −2.4 pp (TabPFN logistic on logits, cc-fraud seed 44). Across 5 seeds × 5 datasets (25 runs per model), the mean delta for every strategy hovers within ±0.3 pp of zero. The mean is already near-optimal.

Why meta-learning fails: the estimators are clones

I computed the pairwise error correlation across all 8 estimators for both models:

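| Dataset        | PFN error corr | ICL error corr | PFN AUC spread | ICL AUC spread |
|----------------|----------------|----------------|----------------|----------------|
| credit-g       | 0.989          | 0.996          | 0.020          | 0.014          |
| telco-churn    | 0.997          | 0.997          | 0.003          | 0.006          |
| default-credit | 0.998          | 0.998          | 0.004          | 0.004          |
| bank-marketing | 0.989          | 0.958          | 0.002          | 0.039          |
| cc-fraud       | 0.979          | 0.991          | 0.007          | 0.004          |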

All estimator pairs in both models have error correlation > 0.9. They make the same errors on the same samples. The 8-estimator ensemble is not a random forest of independent hypotheses; it is 8 near-identical draws from the same approximate posterior. The developers presumably knew this, which would explain why they simply average.

TabICL shows slightly more diversity on bank-marketing (correlation as low as 0.928 on seed 46, AUC spread up to 0.094), thanks to its feature/class shuffling. But even there, the correlation remains > 0.9, and the mean is still hard to beat.
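
One way to compute a pairwise error correlation is sketched below. The definition is an assumption, since the text does not spell it out: "error" is taken to be the signed residual (predicted probability minus label) of each estimator, correlated across samples. The synthetic matrix stands in for real per-estimator outputs.

```python
import numpy as np

def pairwise_error_correlation(P, y):
    """Mean and minimum Pearson correlation between estimator residuals.

    P: (n_samples, n_estimators) positive-class probabilities; y: 0/1 labels.
    """
    errors = np.asarray(P, dtype=float) - np.asarray(y, dtype=float)[:, None]
    corr = np.corrcoef(errors, rowvar=False)         # (n_estimators, n_estimators)
    upper = corr[np.triu_indices_from(corr, k=1)]    # each unordered pair once
    return float(upper.mean()), float(upper.min())

# Synthetic stand-in (assumption): a shared prediction plus small per-estimator noise.
rng = np.random.default_rng(0)
n, n_estimators = 1000, 8
y = rng.integers(0, 2, size=n)
shared = 0.5 + 0.3 * (2 * y - 1) + 0.2 * rng.normal(size=n)
P = np.clip(shared[:, None] + 0.02 * rng.normal(size=(n, n_estimators)), 0.0, 1.0)

mean_corr, min_corr = pairwise_error_correlation(P, y)
print(f"mean pairwise error correlation = {mean_corr:.3f}, minimum = {min_corr:.3f}")
```

Other definitions, such as correlating binary misclassification indicators, are equally plausible and would give different numbers.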

What the ensemble is good for: uncertainty quantification

While ensemble disagreement does not help with AUC, it is a genuine uncertainty signal. I computed the standard deviation across the 8 estimator probabilities per sample and binned samples by uncertainty:

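| Dataset        | Low-uncertainty accuracy | High-uncertainty accuracy | Drop     |
|----------------|--------------------------|---------------------------|----------|
| credit-g       | 92.5%                    | 63.5%                     | −29.0 pp |
| telco-churn    | 97.9%                    | 69.0%                     | −28.9 pp |
| default-credit | 91.3%                    | 74.3%                     | −17.0 pp |
| bank-marketing | 99.9%                    | 70.7%                     | −29.2 pp |
| cc-fraud       | 100.0%                   | 99.7%                     | −0.3 pp  |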

On every non-fraud dataset, the 20% most uncertain samples have accuracy 17–29 pp lower than the 20% most certain. On cc-fraud this signal is weaker: nearly every sample is classified correctly, so there is little accuracy left to lose.

Practical takeaway: Use predict_raw_logits() to extract per-sample ensemble stddev. High stddev does not mean “one estimator is right and another is wrong”; the estimators agree too closely for that. It is better read as the whole ensemble being unsure about that sample. I interpret this as closer to aleatoric (data-inherent) than epistemic (model-doesn’t-know) uncertainty, though ensemble spread is more commonly treated as an epistemic proxy, so treat the label as tentative. Use it for selective classification: reject predictions on uncertain samples, or route them to human review.
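
The binning behind the table can be sketched as follows. The function takes a ready-made matrix of per-estimator probabilities; how that matrix is obtained from predict_raw_logits() is not shown, because its exact signature is not given here. The 0.5 decision threshold and the toy numbers are assumptions.

```python
import numpy as np

def accuracy_by_uncertainty(P, y, frac: float = 0.2):
    """Accuracy on the least and most uncertain `frac` of samples.

    P: (n_samples, n_estimators) positive-class probabilities; y: 0/1 labels.
    Uncertainty is the standard deviation across estimators for each sample.
    """
    P = np.asarray(P, dtype=float)
    y = np.asarray(y)
    std = P.std(axis=1)
    pred = (P.mean(axis=1) >= 0.5).astype(int)   # assumed 0.5 threshold
    correct = pred == y
    order = np.argsort(std)                      # most certain first
    k = max(1, int(frac * len(y)))
    return float(correct[order[:k]].mean()), float(correct[order[-k:]].mean())

# Toy example with two estimators (hypothetical numbers).
P_toy = [[0.90, 0.90], [0.10, 0.10], [0.40, 0.80], [0.60, 0.20], [0.95, 0.95]]
y_toy = [1, 0, 0, 1, 1]
low_acc, high_acc = accuracy_by_uncertainty(P_toy, y_toy, frac=0.4)
print(low_acc, high_acc)  # 1.0 0.0
```

For selective classification, the same per-sample standard deviation can be thresholded to decide which predictions go to human review.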

The comparison

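|                              | TabPFN (8 estimators) | TabICL (8 estimators) |
|------------------------------|-----------------------|-----------------------|
| Mean error correlation       | 0.991                 | 0.988                 |
| Typical AUC spread (max−min) | 0.007                 | 0.013                 |
| Best meta-learner win        | +0.26 pp              | +0.83 pp              |
| Practical diversity          | No                    | Barely on some seeds  |
| Uncertainty signal quality   | Strong                | Strong                |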

Both ensembles are homogeneous. TabICL’s feature/class augmentation creates marginally more diversity, but not enough for meta-learning to win consistently. The simple mean remains near-optimal for both.

Limitations

  • Permutation importance is approximate. It measures marginal feature importance, not joint interactions. Two models could attend to the same interaction via different features and appear to disagree.
  • Five datasets. The relationship between disagreement and stacking value needs more data points to be statistically rigorous.
  • Fixed XGBoost depth. A deeper XGBoost (max_depth=12) was tested and does not close the gap.
  • Fixed boosting family. CatBoost outperforms LightGBM on all five datasets and XGBoost d6 on four of five in this benchmark.
  • Engineered features help only on small data. Below ~2,000 rows, feature interactions from TabPFN’s top features improve cheap models. Above that, they add noise.
  • Binary classification only. The pattern may not hold for multi-class or regression tasks.

Reproduction

ssh airig
cd ~
source tabpfn-hybrid/bin/activate

# Feature importance comparison
python3 feature_importance_cmp.py --dataset credit-g --seed 42 --out fi_credit_g.json

# Close-the-gap experiment
python3 distill_engineered.py --dataset credit-g --seed 42 --out distill_creditg.json

feature_importance_cmp.py trains all three models, computes permutation importance, and outputs Spearman correlations and top-k overlaps. No test-set leakage: importance is computed on the held-out test set after training on the train set.

Bottom line

Stacking foundation models with gradient-boosted trees is not universally useful. It is useful precisely when the models look at different features. On cc-fraud, XGBoost and TabPFN rank features very differently (they share only 2 of their top-5 features), so their errors are less correlated and the ensemble provides real robustness. On every other dataset I tested, they look at largely the same features, so stacking is just an expensive way to average two copies of the same prediction.

If you want to close the accuracy gap without stacking: try CatBoost before XGBoost. CatBoost raw is closer to TabPFN than XGBoost d6 on four of five datasets, especially on imbalanced data. Deeper trees help only on the smallest dataset (credit-g). Engineered interactions from TabPFN’s top features only help on small datasets (under ~2,000 rows), where the cheap model lacks enough data to discover those interactions itself.

Before you stack, check which features your models are actually using. And before you default to XGBoost, try CatBoost.