I ran these benchmarks on airig, a box that pairs an AMD Ryzen 9 9900X with an NVIDIA RTX 5090 FE and 64 GB of RAM. It’s probably overkill. I don’t care.

The software stack is frozen at Python 3.13.5, torch 2.12+cu130, tabicl 2.1.1 (with transformers), tabpfn 8.0.3, and scikit-learn 1.7. I pinned every dependency because reproducibility shouldn’t be a guessing game.

If you want to verify these numbers yourself, the reproduction script is tabicl_finetune_benchmark.py. I’m genuinely curious whether your setup hits the same wall or finds a completely different one.

The question

A pretty 2D demo can make any model look like a genius. TabICL ships with FinetunedTabICLClassifier, a built-in pipeline that fine-tunes the pretrained model with cross-entropy loss on raw logits. The official tutorial shows off a crisp disc-shaped decision boundary that the zero-shot model completely misses.

I immediately get suspicious when I see synthetic data that clean. Real tables have missing values, skewed features, and label noise. The gap between a toy visualization and production tabular data is usually massive.

So I ran the experiment myself. I benchmarked zero-shot TabICL against fine-tuned TabICL on five real-world classification datasets, using 5 random seeds apiece. A held-out validation set handled early stopping; every hyperparameter was fixed up front.

The real test is whether that crisp decision boundary survives contact with actual missing values, skewed features, and noisy labels.

Method

Fine-tuning setup

from tabicl import FinetunedTabICLClassifier

ft = FinetunedTabICLClassifier(
    epochs=30,
    learning_rate=1e-5,
    n_estimators_finetune=2,
    n_estimators_validation=2,
    n_estimators_inference=4,
    early_stopping=True,
    patience=10,
    eval_metric="roc_auc",
    device="cuda",
    random_state=seed,
    verbose=False,
)
ft.fit(X_train, y_train, X_val=X_val, y_val=y_val)

I kept the estimator budget intentionally uneven. Only 2 estimators handle backpropagation during fine-tuning, which is enough to update the model without torching your memory budget. Validation ROC-AUC for early stopping also uses 2 estimators, while inference still rolls with 4 estimators to match the n_estimators=4 of the zero-shot baseline.

Early stopping is on with a patience of 10 epochs—if validation ROC-AUC does not improve for 10 epochs, the run stops immediately. I locked the learning rate at 1e-5, which is conservative by design, because the fastest way to ruin a pre-trained ensemble is to blast it with gradients and trigger catastrophic forgetting.
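
To make the patience rule concrete, here is a minimal sketch of the stopping logic as I understand it. It is not TabICL’s actual implementation; the function name `early_stop` and the validation curve are made up for illustration.

```python
def early_stop(val_scores, patience=10):
    """Return (best_epoch, stop_epoch) for a higher-is-better metric.

    Epochs are 1-indexed. Training stops once `patience` consecutive
    epochs pass without beating the best score seen so far.
    """
    best_score = float("-inf")
    best_epoch = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_scores)


# Hypothetical validation ROC-AUC curve: peaks at epoch 3, then drifts down.
history = [0.780, 0.785, 0.790] + [0.789 - 0.001 * i for i in range(27)]
best_epoch, stop_epoch = early_stop(history, patience=10)
print(best_epoch, stop_epoch)
```

With this curve the best epoch is 3 and the run stops at epoch 13, so a 30-epoch budget can end well before the 30th epoch.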

The open question is whether you can get away with an even smaller fine-tuning budget once you freeze the early layers, or whether 2 estimators is already the floor below which the gradient signal gets too noisy to be useful.

Zero-shot baseline

from tabicl import TabICLClassifier

zs = TabICLClassifier(device="cuda", random_state=seed, n_estimators=4, verbose=False)
zs.fit(X_train, y_train)

Data protocol

Nothing ruins a paper faster than a test set that leaked into training. I split every dataset three ways: train to fit the model, validation to nudge early stopping and pick the best estimator, and test for the final reality check. I stratified each split so class balance stayed locked across subsets.

The FinetunedTabICLClassifier does not accept string columns, so I applied OrdinalEncoder to object/category dtypes before fine-tuning. Zero-shot TabICL handles categorical encoding internally.
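
The encoding step can be sketched without any dependencies. This is not the code from my benchmark (that used scikit-learn’s OrdinalEncoder); the helper names and the toy column names are hypothetical. The point it illustrates is that the category-to-integer map is learned on the training rows only, and categories never seen in training get a sentinel code.

```python
def fit_ordinal_maps(rows, categorical_columns):
    """Learn one {category: integer code} map per column from training rows."""
    maps = {}
    for col in categorical_columns:
        categories = sorted({row[col] for row in rows})
        maps[col] = {cat: code for code, cat in enumerate(categories)}
    return maps


def apply_ordinal_maps(rows, maps, unknown_code=-1):
    """Replace string categories with codes; unseen categories get unknown_code."""
    encoded = []
    for row in rows:
        new_row = dict(row)
        for col, mapping in maps.items():
            new_row[col] = mapping.get(row[col], unknown_code)
        encoded.append(new_row)
    return encoded


# Toy rows with made-up column names, only to show the mechanics.
train_rows = [
    {"contract": "monthly", "tenure": 3},
    {"contract": "two-year", "tenure": 40},
    {"contract": "one-year", "tenure": 14},
]
test_rows = [
    {"contract": "monthly", "tenure": 7},
    {"contract": "weekly", "tenure": 1},  # category never seen in training
]

maps = fit_ordinal_maps(train_rows, ["contract"])
encoded_test = apply_ordinal_maps(test_rows, maps)
print(maps["contract"])
print(encoded_test)
```

In scikit-learn, OrdinalEncoder with handle_unknown="use_encoded_value" and an unknown_value gives the same behavior for unseen categories.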

I didn’t trust a single lucky split. I ran every dataset through 5 random seeds—42, 43, 44, 45, and 46—to see if the results held up or fell apart.
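
For readers who want to see what a stratified three-way split per seed looks like, here is a dependency-free sketch. The 60/20/20 proportions are my assumption for the example, not a figure from the benchmark script, and the function name is hypothetical. In practice two calls to scikit-learn’s train_test_split with stratify=y do the same job.

```python
import random
from collections import defaultdict


def stratified_three_way_split(labels, seed, val_frac=0.2, test_frac=0.2):
    """Return (train_idx, val_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx, test_idx = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # the seed decides which rows land in which subset
        n_test = round(len(indices) * test_frac)
        n_val = round(len(indices) * val_frac)
        test_idx += indices[:n_test]
        val_idx += indices[n_test:n_test + n_val]
        train_idx += indices[n_test + n_val:]
    return train_idx, val_idx, test_idx


# Toy labels with a 30% positive rate, the same rate as credit-g.
labels = [1] * 300 + [0] * 700
for seed in (42, 43, 44, 45, 46):
    train_idx, val_idx, test_idx = stratified_three_way_split(labels, seed)
    rates = [sum(labels[i] for i in part) / len(part)
             for part in (train_idx, val_idx, test_idx)]
    print(seed, len(train_idx), len(val_idx), len(test_idx), rates)
```

Every seed yields the same subset sizes and the same positive rate in each subset; only the membership changes, which is exactly the source of variance the seeds are meant to expose.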

If your favorite benchmark only reports one seed, how do you know it isn’t just noise dressed up as signal?

Datasets

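| Dataset | OpenML ID | Rows | Features | Pos. rate |
|---|---|---|---|---|
| credit-g | 31 | 1,000 | 20 | 30.0% |
| telco-churn | 42178 | 7,043 | 19 | 26.5% |
| default-credit | 42477 | 30,000 | 23 | 22.1% |
| bank-marketing | 1461 | 45,211 | 16 | 11.3% |
| cc-fraud | 46455 | 28,480 | 30 | 0.17% |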

Metrics

I’ve watched too many fraud models look brilliant on paper and then drown the ops team in false alarms. The culprit is almost always the wrong metric. I stopped looking at raw accuracy years ago because it flatters you on imbalanced data while your model misses the actual fraud.

I use ROC-AUC as my primary metric because it doesn’t flinch at class imbalance. It tells me whether my model can separate signal from noise regardless of how rare the positive cases are.

But separation isn’t the whole story. Average Precision (AP) is what I watch when I want to know how clean my alert queue will look in production, because it directly reflects the quality of those ranked fraud alerts.

Log-loss keeps me honest about calibration. A model can score well on ranking metrics while being embarrassingly wrong about actual probabilities, and log-loss catches that drift immediately.
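
The difference between ranking and calibration is easier to see with numbers. Below is a small pure-Python sketch of the three metrics on a toy example; the labels and probabilities are invented, and the benchmark itself relied on library implementations rather than these functions.

```python
import math


def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of precision@k taken at the rank k of each positive (assumes no tied scores)."""
    ranked = sorted(zip(scores, y_true), reverse=True)
    hits, total = 0, 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k
    return total / hits


def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= math.log(p) if y == 1 else math.log(1 - p)
    return total / len(y_true)


y = [0, 0, 1, 0, 1, 1]
calibrated = [0.10, 0.20, 0.30, 0.40, 0.60, 0.90]
overconfident = [0.01, 0.02, 0.03, 0.97, 0.98, 0.99]  # same ordering, extreme probabilities

for name, probs in (("calibrated", calibrated), ("overconfident", overconfident)):
    print(name, roc_auc(y, probs), average_precision(y, probs), log_loss(y, probs))
```

Both score vectors rank the rows identically, so ROC-AUC and AP are the same for the two; only log-loss penalizes the overconfident one, because it assigns 0.97 to a negative and 0.03 to a positive.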

So the next time someone waves around a stellar accuracy score on a fraud dataset, ask them what their AP looks like. That single number will tell you whether their alerts are worth investigating or just noise.

Results: the headline

| Dataset | ZS mean AUC | FT mean AUC | Mean Δ | Best Δ | Worst Δ | Interpretation |
|---|---|---|---|---|---|---|
| credit-g | 78.88% | 78.90% | +0.02 pp | +0.71 pp | −1.92 pp | Coin flip |
| telco-churn | 84.54% | 84.86% | +0.32 pp | +0.59 pp | +0.16 pp | Consistent but tiny |
| default-credit | 78.77% | 78.80% | +0.03 pp | +0.09 pp | −0.01 pp | Flat |
| bank-marketing | 93.91% | 94.12% | +0.21 pp | +0.32 pp | +0.06 pp | Marginal |
| cc-fraud | 99.39% | 99.39% | +0.00 pp | +0.00 pp | +0.00 pp | Saturated |

I sliced the results by dataset hoping to find at least one reliable champion. No dataset delivers a consistent improvement above 0.6 pp.

Your best single-seed outcome is credit-g at +0.71 pp. But another seed on that same dataset loses 1.92 pp.

When identical data can swing from +0.71 pp to −1.92 pp just by changing the seed, the signal is indistinguishable from noise. If the improvement evaporates on rerun, what are you actually shipping?
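
One way to put a number on "indistinguishable from noise" is to compare the mean delta with its seed-to-seed spread. The sketch below uses only the per-seed deltas reported in the per-dataset tables of this post; the dictionary layout and the choice of the sample standard deviation as the spread measure are mine.

```python
from statistics import mean, stdev

# Per-seed AUC deltas (fine-tuned minus zero-shot, in pp), seeds 42 to 46.
deltas = {
    "credit-g": [0.71, 0.39, -1.92, 0.46, 0.45],
    "telco-churn": [0.59, 0.25, 0.21, 0.16, 0.42],
    "default-credit": [0.05, -0.01, 0.00, 0.09, 0.01],
    "bank-marketing": [0.06, 0.22, 0.23, 0.22, 0.32],
    "cc-fraud": [0.00, 0.00, 0.00, 0.00, 0.00],
}

summary = {}
for name, d in deltas.items():
    summary[name] = {
        "mean": mean(d),
        "sd": stdev(d),  # sample standard deviation across the 5 seeds
        "best": max(d),
        "worst": min(d),
        "all_positive": all(x > 0 for x in d),
    }
    s = summary[name]
    print(f"{name:15s} mean={s['mean']:+.2f} sd={s['sd']:.2f} "
          f"best={s['best']:+.2f} worst={s['worst']:+.2f}")
```

On credit-g the spread comes out at roughly 1.1 pp against a mean of about +0.02 pp, so the average gain is far smaller than the variation between seeds.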

Per-dataset analysis

credit-g: a coin flip

| Seed | ZS AUC | FT AUC | Δ |
|---|---|---|---|
| 42 | 81.58% | 82.30% | +0.71 pp |
| 43 | 81.39% | 81.79% | +0.39 pp |
| 44 | 74.07% | 72.15% | −1.92 pp |
| 45 | 77.43% | 77.89% | +0.46 pp |
| 46 | 79.92% | 80.37% | +0.45 pp |

You would expect fine-tuning to dominate on a dataset with only 1,000 rows, but credit-g flips that assumption on its head. At this scale, the zero-shot model is already so well-matched to the data distribution that fine-tuning can easily tip from helpful to harmful.

Whether it helps or hurts depends on random initialization and the specific train/val split. I watched the same pipeline swing between gains and losses purely because of the seed.

That −1.92 pp seed is the giveaway: the fine-tuned model most likely overfit the validation set, and early stopping with patience=10 was not enough to prevent it. If patience=10 still fails on a dataset this small, what guardrail actually will?

telco-churn: the clearest consistent win

| Seed | ZS AUC | FT AUC | Δ |
|---|---|---|---|
| 42 | 83.47% | 84.06% | +0.59 pp |
| 43 | 86.04% | 86.29% | +0.25 pp |
| 44 | 86.23% | 86.43% | +0.21 pp |
| 45 | 83.66% | 83.82% | +0.16 pp |
| 46 | 83.30% | 83.71% | +0.42 pp |

Every single one of the 5 seeds improved. That kind of sweep is the first thing that made me do a double-take.

This is the dataset where fine-tuning is most clearly beneficial. The gain is still small, 0.16–0.59 pp, but consistency beats a random walk. (bank-marketing also improves on every seed, with a smaller mean gain.)
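
A five-for-five sweep is suggestive, but it helps to check how much evidence five seeds can carry at all. Here is a minimal sketch of a two-sided exact sign test on the per-seed deltas. Treating the seeds as independent trials is an assumption (they all resample the same dataset), and the function name is hypothetical.

```python
from math import comb


def sign_test_p_value(deltas):
    """Two-sided exact sign test: is the share of positive deltas different from 50%?"""
    nonzero = [d for d in deltas if d != 0]  # ties carry no sign information
    n = len(nonzero)
    k = sum(d > 0 for d in nonzero)
    extreme = max(k, n - k)
    tail = sum(comb(n, i) for i in range(extreme, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


telco = [0.59, 0.25, 0.21, 0.16, 0.42]
credit_g = [0.71, 0.39, -1.92, 0.46, 0.45]
print(sign_test_p_value(telco))
print(sign_test_p_value(credit_g))
```

With 5 seeds the smallest p-value this test can produce is 0.0625, which is what telco-churn gets. So the sweep is the strongest result five seeds allow, and still short of the usual 0.05 threshold; more seeds would be needed to call it settled.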

Why telco-churn? I suspect the dataset has clear nonlinear interactions among tenure, contract type, and monthly charges that the pretrained prior does not fully capture. If so, fine-tuning plausibly nudges the attention weights to attend more strongly to those specific interactions, though I have not verified this.

If a few interacting variables are enough to make fine-tuning worth it, what other datasets are sitting in our pipeline with hidden feature collisions we never bothered to map?

default-credit: flat

| Seed | ZS AUC | FT AUC | Δ |
|---|---|---|---|
| 42 | 78.37% | 78.43% | +0.05 pp |
| 43 | 78.14% | 78.13% | −0.01 pp |
| 44 | 78.70% | 78.70% | 0.00 pp |
| 45 | 79.36% | 79.45% | +0.09 pp |
| 46 | 79.28% | 79.29% | +0.01 pp |

I kicked off a fine-tuning run on default-credit and the validation metrics just sat there. At 30,000 rows, this dataset should be meaty enough to teach the model something new.

But the zero-shot prior is already dialed in. In-context inference already appears to extract the usable signal from the training rows, so fine-tuning has little left to add.

Before you queue up your next training job on a mid-sized tabular dataset, ask yourself: are you actually adding knowledge the model lacks, or just burning compute to prove it was already right?

bank-marketing: marginal

| Seed | ZS AUC | FT AUC | Δ |
|---|---|---|---|
| 42 | 93.92% | 93.98% | +0.06 pp |
| 43 | 93.75% | 93.97% | +0.22 pp |
| 44 | 94.13% | 94.37% | +0.23 pp |
| 45 | 93.85% | 94.07% | +0.22 pp |
| 46 | 93.91% | 94.23% | +0.32 pp |

All 5 seeds improved the score, but the absolute gain was a measly 0.06–0.32 pp. That’s not a breakthrough; that’s a rounding error.

Consider the baseline we’re up against. With 45,211 rows and an 11.3% positive rate, the zero-shot model already hits 93.9% AUC. When you’re starting that close to the ceiling, there simply isn’t much headroom left.

So what do you do when the dataset is this saturated? Maybe the real leverage isn’t in squeezing the model harder; it’s in questioning whether we’re even optimizing the right bottleneck.

cc-fraud: saturated

| Seed | ZS AUC | FT AUC | Δ |
|---|---|---|---|
| 42 | 99.98% | 99.98% | 0.00 pp |
| 43 | 99.80% | 99.80% | 0.00 pp |
| 44 | 97.97% | 97.97% | 0.00 pp |
| 45 | 99.26% | 99.26% | 0.00 pp |
| 46 | 99.92% | 99.92% | 0.00 pp |

I burned GPU hours fine-tuning a model that had already solved the problem. Zero-shot TabICL is near-perfect on this dataset.

The fraud signal is overwhelming. Even at a 0.17% positive rate, the pattern is close to trivially separable, and the pretrained prior has nothing left to gain from extra training.

Then there is seed 44. Zero-shot drops to 97.97%, and fine-tuning flatlines at exactly the same number.

That is not convergence. The drop comes from the split, the same seed-dependent variance that showed up on credit-g, and the zero-shot and fine-tuned scores move in lockstep. The extra training does not improve the model; it merely matches a baseline that zero-shot already set.

If the pretrained prior already owns the problem, what is your next fine-tuning run actually going to teach it?

Is fine-tuning worth the GPU time?

| Dataset | Mean AUC gain | GPU time per seed (approx.) | Gain per minute |
|---|---|---|---|
| credit-g | +0.02 pp | ~2 min | +0.01 pp/min |
| telco-churn | +0.32 pp | ~8 min | +0.04 pp/min |
| default-credit | +0.03 pp | ~12 min | +0.002 pp/min |
| bank-marketing | +0.21 pp | ~15 min | +0.014 pp/min |
| cc-fraud | +0.00 pp | ~12 min | 0 pp/min |

I burned ~60 minutes of RTX 5090 time on this study. For what? A mean AUC improvement of about 0.12 pp when averaged across all datasets, nearly all of it from telco-churn and bank-marketing.

Thirty epochs of fine-tuning takes 2–15 minutes on that card, depending on dataset size. Do the math: that is a lot of silicon for a result that does not move the needle.
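
The math is short enough to write down. This sketch recomputes the gain-per-minute column from the mean gains and approximate GPU times in the table above; the variable names are mine and the timings are the same rough estimates.

```python
# (mean AUC gain in pp, approximate GPU minutes per seed), copied from the table above.
costs = {
    "credit-g": (0.02, 2),
    "telco-churn": (0.32, 8),
    "default-credit": (0.03, 12),
    "bank-marketing": (0.21, 15),
    "cc-fraud": (0.00, 12),
}

gain_per_minute = {name: gain / minutes for name, (gain, minutes) in costs.items()}
minutes_per_seed = sum(minutes for _, minutes in costs.values())
mean_gain = sum(gain for gain, _ in costs.values()) / len(costs)

for name, rate in sorted(gain_per_minute.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:15s} {rate:+.4f} pp/min")
print(f"one pass over all five datasets: ~{minutes_per_seed} min per seed")
print(f"mean gain across datasets: {mean_gain:+.3f} pp")
```

One pass over all five datasets adds up to about 49 GPU minutes per seed for a mean gain of about 0.12 pp, and telco-churn is the only dataset that reaches 0.04 pp per minute.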

So when is fine-tuning TabICL actually worth the trouble? Only in three narrow situations.

First, you have a specific dataset where zero-shot underperforms, and you have a validation set to verify the improvement.

Second, you have abundant GPU time and can afford to run fine-tuning as a free option—if it helps, use it; if not, fall back to zero-shot.

Third, you need the uncertainty signal from the per-epoch validation AUC history to diagnose whether your data is well-suited to the pretrained prior.
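
The "free option" in the second situation boils down to a simple decision rule. Here is a minimal sketch; the function name, the 0.1 pp threshold, and the example AUCs are all hypothetical. One caveat: if the same validation set also drove early stopping, its fine-tuned AUC is optimistic, so a separate check set or a stricter margin is safer.

```python
def pick_model(zs_val_auc, ft_val_auc, min_gain_pp=0.1):
    """Keep the fine-tuned model only if it beats zero-shot by a margin on validation.

    AUCs are fractions in [0, 1]; min_gain_pp is the required gain in percentage points.
    Returns the chosen model's label and the observed gain in pp.
    """
    gain_pp = (ft_val_auc - zs_val_auc) * 100
    choice = "fine-tuned" if gain_pp >= min_gain_pp else "zero-shot"
    return choice, gain_pp


# Hypothetical validation AUCs for three runs.
for zs_auc, ft_auc in [(0.8454, 0.8486), (0.7877, 0.7880), (0.7407, 0.7215)]:
    choice, gain = pick_model(zs_auc, ft_auc)
    print(f"zs={zs_auc:.4f} ft={ft_auc:.4f} gain={gain:+.2f} pp -> {choice}")
```

The margin is the design choice that matters: without one, a gain well inside seed noise would still switch you to the fine-tuned model.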

This is not a universal upgrade. Do not fine-tune blindly.

Before you launch your next run, ask yourself whether you have concrete evidence that zero-shot failed on this dataset. If the answer is no, you are not improving a model; you are just burning cycles.

Why the tutorial’s dramatic improvement doesn’t replicate

The official finetune_classifier.py tutorial opens with a gorgeous trap. It pits zero-shot TabICL against a synthetic 2D dataset—a sine-wave boundary wrapped around a circular island of positive class—and the zero-shot model coughs up a crude vertical split that misses the island entirely. Fine-tuning learns the curved boundary, the visual improvement is striking, and you start to believe.

Then you load a real CSV and the illusion shatters.

That synthetic world has exactly 2 features, which is absurd next to the 16–30 dimensions you actually face. It hides a sharp, localized nonlinear structure that the pretrained prior has absolutely no reason to expect. It offers exactly 80 training samples, smaller than any real dataset in our benchmark.

Real tabular data rarely behaves like this. You usually get smoother decision boundaries and a pretrained prior that has already digested thousands of similar datasets. The model starts in the right ballpark; fine-tuning is just a nudge.

On the tutorial’s toy problem, the pretrained model starts in the wrong universe. That is why fine-tuning looks like a superhero—it has nowhere to go but up. In production, the gap is rarely that wide, and the honest question is whether your nudge justifies the compute.

Comparison to TabPFN3

You should not trust a baseline trained on different data, so I ran TabPFN3 on these exact same splits. The deltas in the table below are slim: none of them exceeds a single percentage point. If in-context fine-tuning matches a dedicated TabPFN3 run this closely across every dataset, what exactly are those extra training runs buying you?

| Dataset | TabPFN3 mean AUC | TabICL FT mean AUC | PFN − FT |
|---|---|---|---|
| credit-g | 78.64% | 78.90% | −0.26 pp |
| telco-churn | 85.59% | 84.86% | +0.73 pp |
| default-credit | 78.78% | 78.80% | −0.02 pp |
| bank-marketing | 94.25% | 94.12% | +0.13 pp |
| cc-fraud | 99.89% | 99.39% | +0.50 pp |

I ran the benchmarks expecting fine-tuned TabICL to at least trade blows with TabPFN3.

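It lost on three of the five datasets: telco-churn (+0.73 pp for TabPFN3), cc-fraud (+0.50 pp), and bank-marketing (+0.13 pp). It came out ahead only on credit-g (by 0.26 pp) and was effectively tied on default-credit (0.02 pp).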

All the hyperparameter sweeps in the world won’t save you if you picked the wrong architecture.

How many of your own pipelines are just expensive tuning jobs stacked on a model that was never the right fit?

Limitations

I stopped at 30 epochs, and that might have been premature. A 60-epoch run is already in progress, and I will update these results if longer training changes the pattern.

I also locked myself into a single learning rate. Only lr=1e-5 made it onto the bench, so I cannot tell you whether a higher rate would have converged faster, or whether a lower rate would have prevented the seed-44 credit-g degradation.

The fine-tuning setup was equally spartan. I kept n_estimators_finetune at 2 because it is the memory-saving default, but throwing more estimators at the fine-tuning phase might smooth out the instability I saw.

I skipped data augmentation and feature engineering entirely. The tutorial’s synthetic dataset is tailored to the problem structure, so it gets away without either. On real-world data, however, engineered features during fine-tuning could easily matter more than anything I tuned above.

I hard-coded the random seeds, and the credit-g results still bounced around with high variance. Each seed changes both the data split and the fine-tuning randomness, so I cannot say which of the two drives the swings. If I want to know what the outcome distribution actually looks like, I need to stop guessing and start sweeping across more seeds.

Reproduction

ssh airig
cd ~/tabpfn-playground
source .venv313/bin/activate

# Zero-shot TabICL (X_train, y_train, X_test are placeholders: load your own split first)
python3 -c "
from tabicl import TabICLClassifier
clf = TabICLClassifier(device='cuda', n_estimators=4)
clf.fit(X_train, y_train)
print(clf.predict_proba(X_test)[:, 1])
"

# Fine-tuned TabICL (X_train, y_train, X_val, y_val, X_test are placeholders: load your own split first)
python3 -c "
from tabicl import FinetunedTabICLClassifier
ft = FinetunedTabICLClassifier(
    epochs=30, learning_rate=1e-5,
    n_estimators_finetune=2, n_estimators_validation=2, n_estimators_inference=4,
    early_stopping=True, patience=10, eval_metric='roc_auc',
    device='cuda', random_state=42,
)
ft.fit(X_train, y_train, X_val=X_val, y_val=y_val)
print(ft.predict_proba(X_test)[:, 1])
"

I wasted ten minutes on an ImportError before I realized FinetunedTabICLClassifier needs the transformers package. Run pip install transformers before you import it.

Do not feed it raw string categories, either. You must ordinal-encode those columns before fitting (I used OrdinalEncoder); the fine-tuning pipeline does not handle them automatically.

The full benchmark script is at tabicl_finetune_benchmark.py. Give it a spin — just encode those strings first, or you will debug tensor shapes instead of comparing scores.

Bottom line

Thirty epochs of gradient descent rarely move TabICL’s AUC by more than a fraction of a percentage point on real tabular data. The pretrained prior is already that strong.

The dataset where fine-tuning pays off most consistently is telco-churn, lifting AUC by 0.16 to 0.59 pp across all seeds; bank-marketing also improves on every seed, by 0.06 to 0.32 pp. Everywhere else, the gain is inconsistent or nonexistent.

Before you fire up the GPU, run this checklist. If zero-shot is already near-perfect, don’t bother. If you lack a validation set to verify the improvement, don’t bother. If you can’t afford the GPU time for a “maybe” gain, don’t bother.

If you actually need more accuracy from TabICL, the pragmatic path is not fine-tuning. Use more estimators, improve your preprocessing, or choose a model whose license fits your deployment constraints.

If the pretrained prior is this stubborn, the real question isn’t whether to fine-tune TabICL. It’s whether we should be spending our GPU budget on better preprocessing and ensemble design instead.