Machine: airig — AMD Ryzen 9 9900X, NVIDIA RTX 5090 FE, 64 GB RAM
Software: Python 3.13.5, torch 2.12+cu130, tabicl 2.1.1 (with transformers), tabpfn 8.0.3, scikit-learn 1.7
Script: tabicl_finetune_benchmark.py

The question

TabICL ships with FinetunedTabICLClassifier, a built-in fine-tuning pipeline that adapts the pretrained model to a target dataset with cross-entropy loss on raw logits. The official tutorial shows dramatic visual improvement on a synthetic 2D dataset: the zero-shot model misses a localized disc-shaped decision boundary, while fine-tuning learns it.

Does that translate to real tabular data?

I ran zero-shot TabICL vs fine-tuned TabICL on five real-world classification datasets, 5 random seeds each, with a held-out validation set for early stopping and model selection.

Method

Fine-tuning setup

from tabicl import FinetunedTabICLClassifier

ft = FinetunedTabICLClassifier(
    epochs=30,
    learning_rate=1e-5,
    n_estimators_finetune=2,
    n_estimators_validation=2,
    n_estimators_inference=4,
    early_stopping=True,
    patience=10,
    eval_metric="roc_auc",
    device="cuda",
    random_state=seed,
    verbose=False,
)
ft.fit(X_train, y_train, X_val=X_val, y_val=y_val)

Key parameters:

  • n_estimators_finetune=2: only 2 estimators are used (and backpropagated through) during fine-tuning, which saves memory
  • n_estimators_validation=2: 2 estimators compute the validation metric for early stopping
  • n_estimators_inference=4: 4 estimators at predict time (matches n_estimators=4 in the zero-shot baseline)
  • early_stopping=True, patience=10: stop if validation ROC-AUC does not improve for 10 epochs
  • learning_rate=1e-5: conservative LR to avoid catastrophic forgetting
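To make the patience rule concrete, here is a minimal sketch of patience-based early stopping on a higher-is-better metric. It is a generic illustration, not TabICL's internal implementation; the function name and the validation curve are hypothetical.

```python
def early_stop_epoch(val_scores, patience=10):
    """Return (best_epoch, stop_epoch) for a higher-is-better metric.

    Epochs are 0-indexed. Training stops once `patience` consecutive
    epochs pass without beating the best score seen so far.
    """
    best_score = float("-inf")
    best_epoch = -1
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    # Ran out of epochs before patience was exhausted.
    return best_epoch, len(val_scores) - 1


# Hypothetical validation AUC curve: improves for 3 epochs, then plateaus.
history = [0.840, 0.845, 0.848] + [0.847] * 27
best_epoch, stop_epoch = early_stop_epoch(history, patience=10)
print(best_epoch, stop_epoch)
```

With patience=10 and a 30-epoch budget, a run that peaks early still spends 10 more epochs before stopping, which is part of why the GPU cost stays noticeable even when fine-tuning does not help.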

Zero-shot baseline

from tabicl import TabICLClassifier

zs = TabICLClassifier(device="cuda", random_state=seed, n_estimators=4, verbose=False)
zs.fit(X_train, y_train)

Data protocol

  • 3-way split: train (fit model), val (drive early stopping / pick best single estimator), test (final evaluation)
  • Stratified splits to preserve class balance across subsets
  • Categorical encoding: FinetunedTabICLClassifier does not accept string columns; OrdinalEncoder was applied to object/category dtypes before fine-tuning. Zero-shot TabICL handles categorical encoding internally.
  • 5 random seeds per dataset (42, 43, 44, 45, 46)
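The protocol above can be sketched roughly as follows. The split fractions (60/20/20) are an assumption, since the post does not state them, and the toy DataFrame stands in for a real dataset. The helper names are hypothetical; only train_test_split and OrdinalEncoder are real scikit-learn APIs.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder


def three_way_split(X, y, seed, val_frac=0.2, test_frac=0.2):
    """Stratified train/val/test split. The 60/20/20 fractions are assumed."""
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_frac, stratify=y, random_state=seed
    )
    # Rescale so val_frac is a fraction of the full dataset, not of the remainder.
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=val_frac / (1.0 - test_frac),
        stratify=y_rest, random_state=seed,
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


def ordinal_encode(X_train, *others):
    """Ordinal-encode object/category columns, fitting on the training split only."""
    cat_cols = X_train.select_dtypes(include=["object", "category"]).columns.tolist()
    if not cat_cols:
        return (X_train, *others)
    # Categories unseen during fit are mapped to -1 instead of raising.
    enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
    enc.fit(X_train[cat_cols])
    encoded = []
    for frame in (X_train, *others):
        codes = enc.transform(frame[cat_cols])
        encoded.append(frame.assign(**{c: codes[:, i] for i, c in enumerate(cat_cols)}))
    return tuple(encoded)


# Toy data standing in for a real dataset (roughly 30% positives).
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "amount": rng.normal(size=n),
    "contract": pd.Categorical(rng.choice(["monthly", "yearly"], size=n)),
})
y = pd.Series((rng.random(n) < 0.3).astype(int))

X_train, X_val, X_test, y_train, y_val, y_test = three_way_split(X, y, seed=42)
X_train, X_val, X_test = ordinal_encode(X_train, X_val, X_test)
print(len(X_train), len(X_val), len(X_test))
print(y_train.mean(), y_val.mean(), y_test.mean())
```

Fitting the encoder on the training split only keeps validation and test categories from leaking into the preprocessing step.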

Datasets

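| Dataset | OpenML ID | Rows | Features | Pos. rate |
|---|---|---|---|---|
| credit-g | 31 | 1,000 | 20 | 30.0% |
| telco-churn | 42178 | 7,043 | 19 | 26.5% |
| default-credit | 42477 | 30,000 | 23 | 22.1% |
| bank-marketing | 1461 | 45,211 | 16 | 11.3% |
| cc-fraud | 46455 | 28,480 | 30 | 0.17% |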

Metrics

  • ROC-AUC — primary metric (imbalance-agnostic)
  • Average Precision (AP) — directly reflects alert-queue quality for fraud
  • Log-loss — calibration-sensitive
  • Accuracy — interpretable but misleading on imbalanced data
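As a sketch of how these four metrics can be computed with scikit-learn, and of why accuracy misleads under imbalance, the example below scores a deliberately useless constant predictor on a toy 1%-positive problem. The `evaluate` helper and the toy data are illustrative assumptions, not the benchmark script's code.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    log_loss,
    roc_auc_score,
)


def evaluate(y_true, proba_pos, threshold=0.5):
    """Compute the four metrics from positive-class probabilities."""
    y_pred = (proba_pos >= threshold).astype(int)
    return {
        "roc_auc": roc_auc_score(y_true, proba_pos),
        "avg_precision": average_precision_score(y_true, proba_pos),
        "log_loss": log_loss(y_true, proba_pos, labels=[0, 1]),
        "accuracy": accuracy_score(y_true, y_pred),
    }


# Toy imbalanced problem: 10 positives out of 1,000 rows.
y_true = np.array([1] * 10 + [0] * 990)
# A "model" that outputs the base rate for every row, i.e. no ranking ability.
constant_scores = np.full(y_true.shape, 0.01)

metrics = evaluate(y_true, constant_scores)
for name, value in metrics.items():
    print(f"{name:14s} {value:.4f}")
```

The constant predictor reaches 99% accuracy while its ROC-AUC sits at chance and its average precision equals the positive rate, which is why accuracy is reported here only as a secondary metric.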

Results: the headline

| Dataset | ZS mean AUC | FT mean AUC | Mean Δ | Best Δ | Worst Δ | Interpretation |
|---|---|---|---|---|---|---|
| credit-g | 78.88% | 78.90% | +0.02 pp | +0.71 pp | −1.92 pp | Coin flip |
| telco-churn | 84.54% | 84.86% | +0.32 pp | +0.59 pp | +0.16 pp | Consistent but tiny |
| default-credit | 78.77% | 78.80% | +0.03 pp | +0.09 pp | −0.01 pp | Flat |
| bank-marketing | 93.91% | 94.12% | +0.21 pp | +0.32 pp | +0.06 pp | Marginal |
| cc-fraud | 99.39% | 99.39% | +0.00 pp | +0.00 pp | +0.00 pp | Saturated |

No dataset shows a consistent improvement above 0.6 pp. The best single-seed gain is +0.71 pp on credit-g — but another seed on the same dataset loses 1.92 pp.
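The headline columns are simple aggregates of the per-seed deltas. The sketch below recomputes them from the rounded deltas listed in the per-dataset tables in this post, and adds a count of how many seeds improved. Because the inputs are already rounded to two decimals, recomputed means can differ from the reported ones by about 0.01 pp.

```python
from statistics import mean

# Per-seed deltas (FT AUC minus ZS AUC, in percentage points), seeds 42-46,
# copied from the per-dataset tables in this post.
deltas_pp = {
    "credit-g":       [0.71, 0.39, -1.92, 0.46, 0.45],
    "telco-churn":    [0.59, 0.25, 0.21, 0.16, 0.42],
    "default-credit": [0.05, -0.01, 0.00, 0.09, 0.01],
    "bank-marketing": [0.06, 0.22, 0.23, 0.22, 0.32],
    "cc-fraud":       [0.00, 0.00, 0.00, 0.00, 0.00],
}


def summarize(deltas):
    """Mean, best, and worst delta, plus the number of seeds that improved."""
    return {
        "mean": mean(deltas),
        "best": max(deltas),
        "worst": min(deltas),
        "n_improved": sum(d > 0 for d in deltas),
    }


summary = {name: summarize(d) for name, d in deltas_pp.items()}
for name, s in summary.items():
    print(
        f"{name:15s} mean={s['mean']:+.2f} best={s['best']:+.2f} "
        f"worst={s['worst']:+.2f} improved={s['n_improved']}/{len(deltas_pp[name])}"
    )
```

The improved-seed count is a useful companion to the mean: a dataset can have a near-zero mean because every seed is flat, or because gains and losses cancel.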

Per-dataset analysis

credit-g: a coin flip

| Seed | ZS AUC | FT AUC | Δ |
|---|---|---|---|
| 42 | 81.58% | 82.30% | +0.71 pp |
| 43 | 81.39% | 81.79% | +0.39 pp |
| 44 | 74.07% | 72.15% | −1.92 pp |
| 45 | 77.43% | 77.89% | +0.46 pp |
| 46 | 79.92% | 80.37% | +0.45 pp |

At 1,000 rows, credit-g is small enough that per-seed AUC is noisy: zero-shot alone ranges from 74.07% to 81.58% across seeds. Fine-tuning can help or hurt depending on the seed and the specific train/val split. The −1.92 pp seed is consistent with the fine-tuned model overfitting a small validation set; early stopping with patience=10 was not enough to prevent this.

telco-churn: the clearest win

| Seed | ZS AUC | FT AUC | Δ |
|---|---|---|---|
| 42 | 83.47% | 84.06% | +0.59 pp |
| 43 | 86.04% | 86.29% | +0.25 pp |
| 44 | 86.23% | 86.43% | +0.21 pp |
| 45 | 83.66% | 83.82% | +0.16 pp |
| 46 | 83.30% | 83.71% | +0.42 pp |

All 5 seeds improve, and the mean gain (+0.32 pp) is the largest of the five datasets, which makes telco-churn the clearest case for fine-tuning. (bank-marketing also improves on every seed, but by less.) The gain is still small (0.16–0.59 pp), but the consistency matters. Why telco-churn? Possibly because the dataset has clear nonlinear interactions (tenure, contract type, monthly charges) that the pretrained prior does not fully capture, and fine-tuning nudges the attention weights to attend more strongly to those specific interactions.

default-credit: flat

| Seed | ZS AUC | FT AUC | Δ |
|---|---|---|---|
| 42 | 78.37% | 78.43% | +0.05 pp |
| 43 | 78.14% | 78.13% | −0.01 pp |
| 44 | 78.70% | 78.70% | 0.00 pp |
| 45 | 79.36% | 79.45% | +0.09 pp |
| 46 | 79.28% | 79.29% | +0.01 pp |

At 30,000 rows, default-credit gives the zero-shot model a large in-context training set, and fine-tuning adds nothing measurable: the delta stays within ±0.1 pp on every seed. Whatever signal the training data carries, the pretrained model appears to extract it in context already.

bank-marketing: marginal

| Seed | ZS AUC | FT AUC | Δ |
|---|---|---|---|
| 42 | 93.92% | 93.98% | +0.06 pp |
| 43 | 93.75% | 93.97% | +0.22 pp |
| 44 | 94.13% | 94.37% | +0.23 pp |
| 45 | 93.85% | 94.07% | +0.22 pp |
| 46 | 93.91% | 94.23% | +0.32 pp |

All seeds improve, but the absolute gain is tiny (0.06–0.32 pp). At 45,211 rows with 11.3% positive rate, the zero-shot model is already performing at 93.9% AUC. There is simply not much headroom left.

cc-fraud: saturated

| Seed | ZS AUC | FT AUC | Δ |
|---|---|---|---|
| 42 | 99.98% | 99.98% | 0.00 pp |
| 43 | 99.80% | 99.80% | 0.00 pp |
| 44 | 97.97% | 97.97% | 0.00 pp |
| 45 | 99.26% | 99.26% | 0.00 pp |
| 46 | 99.92% | 99.92% | 0.00 pp |

Zero-shot TabICL is already near-perfect on this dataset. Despite the 0.17% fraud rate, the fraud class is highly separable from these features, and TabICL’s pretrained prior is strong enough that fine-tuning has nothing to improve. Note that on seed 44, zero-shot drops to 97.97% (the seed-dependent variance we observed in earlier experiments), and the fine-tuned model lands on the same value.

Is fine-tuning worth the GPU time?

| Dataset | Mean AUC gain | GPU time per seed (approx.) | Gain per minute |
|---|---|---|---|
| credit-g | +0.02 pp | ~2 min | +0.01 pp/min |
| telco-churn | +0.32 pp | ~8 min | +0.04 pp/min |
| default-credit | +0.03 pp | ~12 min | +0.002 pp/min |
| bank-marketing | +0.21 pp | ~15 min | +0.014 pp/min |
| cc-fraud | +0.00 pp | ~12 min | 0 pp/min |

On an RTX 5090, 30 epochs of fine-tuning takes 2–15 minutes depending on dataset size. The total compute for this study was ~60 minutes of GPU time, for a mean AUC improvement of about +0.12 pp when averaged across all five datasets, nearly all of it from telco-churn and bank-marketing.
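The gain-per-minute column and the cross-dataset average are plain arithmetic on the table above; the sketch below reproduces them from the rounded table values, so the last digit may differ slightly from what unrounded measurements would give.

```python
from statistics import mean

# (mean AUC gain in pp, approximate GPU minutes per seed), from the table above.
cost = {
    "credit-g": (0.02, 2),
    "telco-churn": (0.32, 8),
    "default-credit": (0.03, 12),
    "bank-marketing": (0.21, 15),
    "cc-fraud": (0.00, 12),
}

gain_per_min = {name: gain / minutes for name, (gain, minutes) in cost.items()}
overall_mean_gain = mean(gain for gain, _ in cost.values())

for name, rate in gain_per_min.items():
    print(f"{name:15s} {rate:+.4f} pp/min")
print(f"mean gain across datasets: {overall_mean_gain:+.3f} pp")
```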

The honest conclusion: Fine-tuning TabICL is useful only when:

  1. You have a specific dataset where zero-shot underperforms — and you have a validation set to verify the improvement.
  2. You have abundant GPU time and can afford to run fine-tuning as a “free option” — if it helps, use it; if not, fall back to zero-shot.
  3. You need the per-epoch validation metric history (validation AUC by epoch) as a diagnostic for whether your data is well-suited to the pretrained prior.

It is not a universal upgrade. Do not fine-tune blindly.
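The "free option" in point 2 amounts to a validation gate: keep the fine-tuned model only if it beats zero-shot by some margin, otherwise fall back. Here is a minimal sketch of that gate. The `choose_model` helper and the `min_gain` threshold are hypothetical, and two scikit-learn classifiers stand in for zero-shot and fine-tuned TabICL so the example runs without a GPU. Any fitted estimator with predict_proba would slot in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def choose_model(baseline, candidate, X_val, y_val, min_gain=0.002):
    """Keep `candidate` only if it beats `baseline` on validation AUC by `min_gain`."""
    auc_base = roc_auc_score(y_val, baseline.predict_proba(X_val)[:, 1])
    auc_cand = roc_auc_score(y_val, candidate.predict_proba(X_val)[:, 1])
    chosen = candidate if auc_cand - auc_base >= min_gain else baseline
    return chosen, auc_base, auc_cand


# Stand-in data and models (assumptions for illustration only).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # plays "zero-shot"
candidate = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)  # plays "fine-tuned"

chosen, auc_base, auc_cand = choose_model(baseline, candidate, X_val, y_val)
print(type(chosen).__name__, round(auc_base, 4), round(auc_cand, 4))
```

One caveat: if the same validation set also drove early stopping, this comparison is tilted in favor of the fine-tuned model, so a separate split for the gate is the safer choice when data allows.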

Why the tutorial’s dramatic improvement doesn’t replicate

The official finetune_classifier.py tutorial uses a synthetic 2D dataset with a sine-wave boundary and a circular “island” of positive class. On this data, zero-shot TabICL draws a crude vertical split (missing the island entirely), while fine-tuning learns the curved boundary. The visual improvement is striking.

Real tabular data does not look like this. The synthetic dataset has:

  • Only 2 features (extreme dimensionality mismatch with real data)
  • A sharp, localized, nonlinear structure that the pretrained prior has no reason to know about
  • 80 training samples (tiny — smaller than any real dataset in our benchmark)

The real datasets in this benchmark have 16–30 features and, judging by how well zero-shot already does, structure that the pretrained prior handles well. TabICL was pretrained on a large collection of synthetic tabular tasks, so on data like this the pretrained model is already in the right ballpark; fine-tuning just nudges it. On the tutorial’s synthetic data, the pretrained model is in the wrong universe, so fine-tuning has massive room to improve.
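For readers who want a feel for that kind of data, here is a rough sketch of a 2D dataset with a sine-wave boundary plus a circular positive "island" in the negative region. It is a hypothetical reconstruction: the island position, radius, and sampling range are invented for illustration and are not taken from finetune_classifier.py.

```python
import numpy as np

# Geometry constants invented for illustration (not from the tutorial).
ISLAND_CENTER = np.array([0.0, -2.0])
ISLAND_RADIUS = 0.5


def label_points(X):
    """Positive if above the sine wave, or inside the island below it."""
    above_wave = X[:, 1] > np.sin(X[:, 0])
    in_island = np.linalg.norm(X - ISLAND_CENTER, axis=1) < ISLAND_RADIUS
    return (above_wave | in_island).astype(int)


def make_toy_dataset(n_samples=80, seed=0):
    """Sample points uniformly on [-3, 3] x [-3, 3] and label them."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(low=[-3.0, -3.0], high=[3.0, 3.0], size=(n_samples, 2))
    return X, label_points(X)


X_toy, y_toy = make_toy_dataset()
print(X_toy.shape, y_toy.mean())
```

With only 80 samples, a handful of points at most fall inside the island, so the structure is both sharp and sparsely observed. None of the five benchmark datasets is that small (the smallest has 1,000 rows).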

Comparison to TabPFN3

I also ran TabPFN3 on the same splits for reference:

| Dataset | TabPFN3 mean AUC | TabICL FT mean AUC | PFN − FT |
|---|---|---|---|
| credit-g | 78.64% | 78.90% | −0.26 pp |
| telco-churn | 85.59% | 84.86% | +0.73 pp |
| default-credit | 78.78% | 78.80% | −0.02 pp |
| bank-marketing | 94.25% | 94.12% | +0.13 pp |
| cc-fraud | 99.89% | 99.39% | +0.50 pp |

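TabPFN3 outperforms fine-tuned TabICL on three of the five datasets (telco-churn +0.73 pp, cc-fraud +0.50 pp, bank-marketing +0.13 pp), is effectively tied on default-credit (−0.02 pp), and trails on credit-g (−0.26 pp). On telco-churn, the gap between the two models is more than twice the mean gain from fine-tuning (+0.32 pp). This is a reminder that choosing the right model matters more than fine-tuning the wrong one.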

Limitations

  • 30 epochs may not be enough. A 60-epoch run is currently in progress; results will be updated if longer training changes the pattern.
  • Single learning rate. Only lr=1e-5 was tested. Higher rates might yield faster convergence; lower rates might prevent the seed-44 credit-g degradation.
  • Single fine-tuning configuration. n_estimators_finetune=2 is a memory-saving default. Using more estimators during fine-tuning might improve stability.
  • No feature engineering or data augmentation. The tutorial’s synthetic dataset benefits from the specific structure of the problem; real datasets may benefit more from engineered features during fine-tuning.
  • Only 5 seeds. The high variance on credit-g suggests that fine-tuning is sensitive to the seed and the split; more seeds would clarify the true distribution of outcomes.
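One way to see how little 5 seeds pin down is a rough 95% confidence interval on the mean per-seed delta. The sketch below uses a Student-t interval with 4 degrees of freedom, which assumes the five deltas behave like independent, roughly normal draws. That is a strong assumption at n=5, so treat the intervals as indicative only. The deltas are the rounded values from the per-dataset tables.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% Student-t critical value for 4 degrees of freedom (n = 5 seeds).
T_CRIT_DF4 = 2.776


def mean_ci(deltas, t_crit=T_CRIT_DF4):
    """Return (lower, upper) bounds of a t-based interval on the mean delta."""
    m = mean(deltas)
    half_width = t_crit * stdev(deltas) / sqrt(len(deltas))
    return m - half_width, m + half_width


# Per-seed deltas in percentage points, seeds 42-46.
credit_g = [0.71, 0.39, -1.92, 0.46, 0.45]
telco_churn = [0.59, 0.25, 0.21, 0.16, 0.42]

ci_credit = mean_ci(credit_g)
ci_telco = mean_ci(telco_churn)
print(f"credit-g    95% CI: [{ci_credit[0]:+.2f}, {ci_credit[1]:+.2f}] pp")
print(f"telco-churn 95% CI: [{ci_telco[0]:+.2f}, {ci_telco[1]:+.2f}] pp")
```

Under these assumptions the credit-g interval spans zero by more than a percentage point in each direction, while the telco-churn interval stays above zero, which matches the qualitative reading of the per-seed tables.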

Reproduction

ssh airig
cd ~/tabpfn-playground
source .venv313/bin/activate

# NOTE: X_train, y_train, X_val, y_val, and X_test are placeholders in both
# snippets below. Load and split a dataset first; see
# tabicl_finetune_benchmark.py for the full pipeline.

# Zero-shot TabICL
python3 -c "
from tabicl import TabICLClassifier
clf = TabICLClassifier(device='cuda', n_estimators=4)
clf.fit(X_train, y_train)
print(clf.predict_proba(X_test)[:, 1])  # positive-class probabilities
"

# Fine-tuned TabICL (seed 42 shown; the benchmark uses seeds 42-46)
python3 -c "
from tabicl import FinetunedTabICLClassifier
ft = FinetunedTabICLClassifier(
    epochs=30, learning_rate=1e-5,
    n_estimators_finetune=2, n_estimators_validation=2, n_estimators_inference=4,
    early_stopping=True, patience=10, eval_metric='roc_auc',
    device='cuda', random_state=42,
)
ft.fit(X_train, y_train, X_val=X_val, y_val=y_val)
print(ft.predict_proba(X_test)[:, 1])  # positive-class probabilities
"

Note: FinetunedTabICLClassifier requires the transformers package (pip install transformers). Categorical string columns must be encoded as numbers before fitting (this benchmark used scikit-learn’s OrdinalEncoder); the fine-tuning pipeline does not handle them automatically.

Full benchmark script: tabicl_finetune_benchmark.py

Bottom line

Fine-tuning TabICL is not a free lunch. On real tabular data, the pretrained prior is already strong enough that 30 epochs of gradient descent rarely moves AUC by more than a fraction of a percentage point. Fine-tuning helped on every seed for only two datasets: telco-churn (+0.16 to +0.59 pp) and bank-marketing (+0.06 to +0.32 pp). Everywhere else, the gain is inconsistent or nonexistent.

Before you fine-tune, ask:

  1. Is zero-shot already near-perfect? If yes, don’t bother.
  2. Do you have a validation set to verify the improvement? If no, don’t bother.
  3. Can you afford the GPU time for a “maybe” gain? If no, don’t bother.

If you need more accuracy from TabICL, the pragmatic path is not fine-tuning — it is using more estimators, better preprocessing, or choosing a model whose license fits your deployment constraints.