Fine-tuning TabICL: when 30 epochs of GPU time buys you 0.3 pp

Machine: airig — AMD Ryzen 9 9900X, NVIDIA RTX 5090 FE, 64 GB RAM
Software: Python 3.13.5, torch 2.12+cu130, tabicl 2.1.1 (with transformers), tabpfn 8.0.3, scikit-learn 1.7
Script: tabicl_finetune_benchmark.py

The question

TabICL ships with FinetunedTabICLClassifier, a built-in fine-tuning pipeline that adapts the pretrained model to a target dataset with cross-entropy loss on raw logits. The official tutorial shows dramatic visual improvement on a synthetic 2D dataset: the zero-shot model misses a localized disc-shaped decision boundary, while fine-tuning learns it.

Does that translate to real tabular data?

I ran zero-shot TabICL vs fine-tuned TabICL on five real-world classification datasets, 5 random seeds each, with a held-out validation set used for early stopping. Only one fine-tuning configuration was tested, so the validation set was not used for a hyperparameter search.

Method

Fine-tuning setup

from tabicl import FinetunedTabICLClassifier

ft = FinetunedTabICLClassifier(
    epochs=30,
    learning_rate=1e-5,
    n_estimators_finetune=2,
    n_estimators_validation=2,
    n_estimators_inference=4,
    early_stopping=True,
    patience=10,
    eval_metric="roc_auc",
    device="cuda",
    random_state=seed,
    verbose=False,
)
ft.fit(X_train, y_train, X_val=X_val, y_val=y_val)

Key parameters:

n_estimators_finetune=2 — only 2 estimators are backpropagated during fine-tuning (saves memory)
n_estimators_validation=2 — 2 estimators evaluate the validation metric for early stopping
n_estimators_inference=4 — 4 estimators at predict time (matches the zero-shot default)
early_stopping=True, patience=10 — stop if validation ROC-AUC does not improve for 10 epochs
learning_rate=1e-5 — conservative LR to avoid catastrophic forgetting
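
The early-stopping rule in the list above can be sketched in plain Python. This is a generic patience loop for a higher-is-better metric, not tabicl's internal implementation; the helper name `best_epoch_with_patience` and the validation history are made up for illustration.

```python
def best_epoch_with_patience(val_scores, patience=10):
    """Return (best_epoch, stop_epoch) for a higher-is-better validation metric.

    Generic patience rule: stop once `patience` consecutive epochs fail to beat
    the best score seen so far. Hypothetical helper, not part of tabicl.
    """
    best_score = float("-inf")
    best_epoch = -1
    stop_epoch = len(val_scores) - 1  # default: the full epoch budget was used
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            stop_epoch = epoch
            break
    return best_epoch, stop_epoch


# Made-up validation AUC history: improves for 3 epochs, then plateaus.
history = [0.840, 0.845, 0.848] + [0.847] * 27
best, stopped = best_epoch_with_patience(history, patience=10)
print(best, stopped)
```

Under this rule, a run that peaks at epoch 2 still trains through epoch 12 before it stops, so patience=10 puts a floor on how early a 30-epoch run can end.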

Zero-shot baseline

from tabicl import TabICLClassifier

zs = TabICLClassifier(device="cuda", random_state=seed, n_estimators=4, verbose=False)
zs.fit(X_train, y_train)

Data protocol

3-way split: train (fit model), val (drive early stopping / pick best single estimator), test (final evaluation)
Stratified splits to preserve class balance across subsets
Categorical encoding: FinetunedTabICLClassifier does not accept string columns; OrdinalEncoder was applied to object/category dtypes before fine-tuning. Zero-shot TabICL handles categorical encoding internally.
5 random seeds per dataset (42, 43, 44, 45, 46)
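
A minimal sketch of this protocol with scikit-learn follows. The 60/20/20 ratio, the helper names `three_way_split` and `ordinal_encode`, and the toy DataFrame are assumptions for illustration; the exact split ratios are defined in the benchmark script, not in this post.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder


def three_way_split(X, y, seed, val_size=0.2, test_size=0.2):
    """Stratified train/val/test split. The 60/20/20 default is an assumption."""
    X_tmp, X_test, y_tmp, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    # Rescale so that val_size is a fraction of the full dataset.
    rel_val = val_size / (1.0 - test_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_tmp, y_tmp, test_size=rel_val, stratify=y_tmp, random_state=seed
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


def ordinal_encode(X_train, X_val, X_test):
    """Encode non-numeric columns; fit on train only, unseen categories become -1."""
    cat_cols = list(X_train.select_dtypes(exclude="number").columns)
    if not cat_cols:
        return X_train.copy(), X_val.copy(), X_test.copy()
    enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
    enc.fit(X_train[cat_cols])
    encoded = []
    for part in (X_train, X_val, X_test):
        codes = enc.transform(part[cat_cols])
        encoded.append(part.assign(**dict(zip(cat_cols, codes.T))))
    return tuple(encoded)


# Toy data standing in for a real dataset.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "tenure": rng.integers(0, 72, size=n),
    "contract": rng.choice(["month", "year", "two-year"], size=n),
})
y = rng.integers(0, 2, size=n)

X_train, X_val, X_test, y_train, y_val, y_test = three_way_split(X, y, seed=42)
X_train, X_val, X_test = ordinal_encode(X_train, X_val, X_test)
print(len(X_train), len(X_val), len(X_test))
```

Fitting the encoder on the training split only keeps the validation and test rows out of preprocessing, at the cost of having to choose a code for categories that never appear in training.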

Datasets

Dataset	OpenML ID	Rows	Features	Pos. rate
credit-g	31	1,000	20	30.0%
telco-churn	42178	7,043	19	26.5%
default-credit	42477	30,000	23	22.1%
bank-marketing	1461	45,211	16	11.3%
cc-fraud	46455	28,480	30	0.17%

Metrics

ROC-AUC — primary metric (imbalance-agnostic)
Average Precision (AP) — directly reflects alert-queue quality for fraud
Log-loss — calibration-sensitive
Accuracy — interpretable but misleading on imbalanced data
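
To make the differences between these metrics concrete, here is a small sketch that computes all four with scikit-learn on a made-up imbalanced example. The helper name `evaluate`, the 0.5 threshold, and the toy labels and probabilities are assumptions, not values from the benchmark.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    log_loss,
    roc_auc_score,
)


def evaluate(y_true, proba_pos, threshold=0.5):
    """Compute the four metrics from positive-class probabilities."""
    proba_pos = np.asarray(proba_pos, dtype=float)
    return {
        "roc_auc": roc_auc_score(y_true, proba_pos),
        "average_precision": average_precision_score(y_true, proba_pos),
        "log_loss": log_loss(y_true, proba_pos, labels=[0, 1]),
        "accuracy": accuracy_score(y_true, (proba_pos >= threshold).astype(int)),
    }


# Toy example: 2 positives among 10 rows, one positive ranked below a negative.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
proba = [0.05, 0.10, 0.10, 0.20, 0.20, 0.30, 0.30, 0.45, 0.40, 0.90]
scores = evaluate(y_true, proba)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

In this toy case accuracy is 0.9 even though one of the two positives is missed at the 0.5 threshold, while average precision drops to 5/6 because of the single misranked positive. That is the sense in which accuracy is misleading on imbalanced data.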

Results: the headline

Dataset	ZS mean AUC	FT mean AUC	Mean Δ	Best Δ	Worst Δ	Interpretation
credit-g	78.88%	78.90%	+0.02 pp	+0.71 pp	−1.92 pp	Coin flip
telco-churn	84.54%	84.86%	+0.32 pp	+0.59 pp	+0.16 pp	Consistent but tiny
default-credit	78.77%	78.80%	+0.03 pp	+0.09 pp	−0.01 pp	Flat
bank-marketing	93.91%	94.12%	+0.21 pp	+0.32 pp	+0.06 pp	Marginal
cc-fraud	99.39%	99.39%	+0.00 pp	+0.00 pp	+0.00 pp	Saturated

No dataset shows a consistent improvement above 0.6 pp. The best single-seed gain is +0.71 pp on credit-g — but another seed on the same dataset loses 1.92 pp.

Per-dataset analysis

credit-g: a coin flip

Seed	ZS AUC	FT AUC	Δ
42	81.58%	82.30%	+0.71 pp
43	81.39%	81.79%	+0.39 pp
44	74.07%	72.15%	−1.92 pp
45	77.43%	77.89%	+0.46 pp
46	79.92%	80.37%	+0.45 pp

At 1,000 rows, credit-g is the smallest dataset in the benchmark, and the outcome depends heavily on the seed and the specific train/val/test split: four seeds gain 0.39 to 0.71 pp and one loses 1.92 pp. The seed-44 drop suggests that the checkpoint selected on the validation set did not generalize to the test set; early stopping with patience=10 was not enough to prevent this. Zero-shot AUC itself ranges from 74.07% to 81.58% across seeds, so split-to-split variance is far larger than the fine-tuning effect.
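
The credit-g row of the headline table can be recomputed from the per-seed values above. This is plain arithmetic on the published two-decimal numbers, so a recomputed delta can differ from the table by 0.01 pp where the table was presumably computed before rounding.

```python
from statistics import mean

# Per-seed ROC-AUC (%) for credit-g, copied from the table above.
zs_auc = {42: 81.58, 43: 81.39, 44: 74.07, 45: 77.43, 46: 79.92}
ft_auc = {42: 82.30, 43: 81.79, 44: 72.15, 45: 77.89, 46: 80.37}

# Fine-tuned minus zero-shot, in percentage points.
deltas = {seed: round(ft_auc[seed] - zs_auc[seed], 2) for seed in zs_auc}

summary = {
    "zs_mean": round(mean(zs_auc.values()), 2),
    "ft_mean": round(mean(ft_auc.values()), 2),
    "mean_delta": round(mean(deltas.values()), 2),
    "best": max(deltas.values()),
    "worst": min(deltas.values()),
    "seeds_improved": sum(d > 0 for d in deltas.values()),
}
print(deltas)
print(summary)
```

The mean delta of about +0.02 pp is the average of four small gains and one large loss, which is why the per-seed view is more informative than the mean here.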

telco-churn: the largest consistent win

Seed	ZS AUC	FT AUC	Δ
42	83.47%	84.06%	+0.59 pp
43	86.04%	86.29%	+0.25 pp
44	86.23%	86.43%	+0.21 pp
45	83.66%	83.82%	+0.16 pp
46	83.30%	83.71%	+0.42 pp

All 5 seeds improve. Bank-marketing also gains on every seed, but telco-churn has the largest mean gain (+0.32 pp), which makes it the clearest case for fine-tuning in this benchmark. The gain is still small (0.16–0.59 pp), but the consistency matters. Why telco-churn? Possibly because the dataset has clear nonlinear interactions (tenure, contract type, monthly charges) that the pretrained prior does not fully capture, and fine-tuning nudges the attention weights toward those interactions. This benchmark does not test that explanation.
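
How much evidence is "5 of 5 seeds improve"? One rough way to quantify it is an exact sign test on the per-seed deltas, sketched below with the standard library only. The helper name `sign_test_p_value` is made up, and the test treats seeds as independent draws, which is optimistic because all seeds resample the same dataset.

```python
from math import comb


def sign_test_p_value(deltas, alternative="greater"):
    """Exact sign test on paired differences (zero differences are dropped)."""
    nonzero = [d for d in deltas if d != 0]
    n = len(nonzero)
    k = sum(d > 0 for d in nonzero)
    # P(X >= k) for X ~ Binomial(n, 0.5)
    upper = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    if alternative == "greater":
        return upper
    # Two-sided: double the smaller tail, capped at 1.
    lower = sum(comb(n, i) for i in range(0, k + 1)) / 2**n
    return min(1.0, 2 * min(upper, lower))


# Per-seed deltas (pp) copied from the tables above.
telco_deltas = [0.59, 0.25, 0.21, 0.16, 0.42]
credit_g_deltas = [0.71, 0.39, -1.92, 0.46, 0.45]

p_telco = sign_test_p_value(telco_deltas)
p_telco_two_sided = sign_test_p_value(telco_deltas, alternative="two-sided")
p_credit_g = sign_test_p_value(credit_g_deltas)
print(p_telco, p_telco_two_sided, p_credit_g)
```

With 5 seeds, the smallest one-sided p-value a sign test can produce is 1/32, so "consistent" here means "every seed agrees", not strong statistical evidence.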

default-credit: flat

Seed	ZS AUC	FT AUC	Δ
42	78.37%	78.43%	+0.05 pp
43	78.14%	78.13%	−0.01 pp
44	78.70%	78.70%	0.00 pp
45	79.36%	79.45%	+0.09 pp
46	79.28%	79.29%	+0.01 pp

At 30,000 rows, fine-tuning moves AUC by at most 0.09 pp in either direction. One plausible reading is that the zero-shot model already extracts most of the available signal from the training data it conditions on, which leaves gradient updates with little to add. This benchmark does not test that explanation directly.

bank-marketing: marginal

Seed	ZS AUC	FT AUC	Δ
42	93.92%	93.98%	+0.06 pp
43	93.75%	93.97%	+0.22 pp
44	94.13%	94.37%	+0.23 pp
45	93.85%	94.07%	+0.22 pp
46	93.91%	94.23%	+0.32 pp

All seeds improve, but the absolute gain is tiny (0.06–0.32 pp). At 45,211 rows with 11.3% positive rate, the zero-shot model is already performing at 93.9% AUC. There is simply not much headroom left.

cc-fraud: saturated

Seed	ZS AUC	FT AUC
42	99.98%	99.98%
43	99.80%	99.80%
44	97.97%	97.97%
45	99.26%	99.26%
46	99.92%	99.92%

Zero-shot TabICL is already near-perfect on this dataset, and fine-tuned AUC matches zero-shot to two decimals on every seed. At a 0.17% fraud rate, 28,480 rows contain only about 48 positives, so each test split holds a handful of fraud cases and a single misranked one moves AUC visibly. That likely contributes to the seed-44 drop to 97.97% (the seed-dependent variance we observed in earlier experiments), which fine-tuning leaves unchanged.
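
The seed-to-seed spread on this dataset is easier to interpret with a small simulation. At 0.17% of 28,480 rows there are roughly 48 positives in total, so a test split holds only a few of them. The sketch below uses made-up Gaussian score distributions and an assumed 10 positives per test split (the split ratio is not stated in this post) to show how much ROC-AUC fluctuates when positives are scarce. It is not a model of TabICL's actual scores.

```python
import random
from bisect import bisect_left, bisect_right
from statistics import pstdev


def auc(pos_scores, neg_scores):
    """ROC-AUC as the Mann-Whitney probability P(pos > neg); ties count 0.5."""
    neg_sorted = sorted(neg_scores)
    wins = 0.0
    for s in pos_scores:
        below = bisect_left(neg_sorted, s)
        ties = bisect_right(neg_sorted, s) - below
        wins += below + 0.5 * ties
    return wins / (len(pos_scores) * len(neg_sorted))


def auc_spread(n_pos, n_neg, repeats=100, seed=0):
    """Std. dev. of AUC over repeated draws from fixed, made-up score distributions."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(repeats):
        neg = [rng.gauss(0.0, 1.0) for _ in range(n_neg)]
        pos = [rng.gauss(3.0, 1.0) for _ in range(n_pos)]
        aucs.append(auc(pos, neg))
    return pstdev(aucs)


few = auc_spread(n_pos=10, n_neg=2000)    # assumed: about 10 positives per test split
many = auc_spread(n_pos=500, n_neg=2000)  # same classifier quality, many positives
print(f"AUC std with 10 positives:  {few:.4f}")
print(f"AUC std with 500 positives: {many:.4f}")
```

The underlying score distributions are identical in both runs; only the number of positives changes. A larger spread with 10 positives therefore reflects sampling noise rather than a change in model quality.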

Is fine-tuning worth the GPU time?

Dataset	Mean AUC gain	GPU time per seed (approx.)	Gain per minute
credit-g	+0.02 pp	~2 min	+0.01 pp/min
telco-churn	+0.32 pp	~8 min	+0.04 pp/min
default-credit	+0.03 pp	~12 min	+0.002 pp/min
bank-marketing	+0.21 pp	~15 min	+0.014 pp/min
cc-fraud	+0.00 pp	~12 min	0 pp/min

On an RTX 5090, 30 epochs of fine-tuning takes 2–15 minutes per seed depending on dataset size. At the per-seed times above, one pass over the five datasets is about 49 minutes, or roughly 4 hours of GPU time across all 5 seeds, for a mean AUC gain of about +0.12 pp across datasets.
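
The cost-benefit figures can be recomputed from the table. The times are the approximate per-seed values listed above, so every derived number is rough; the snippet only makes the arithmetic explicit.

```python
# (mean AUC gain in pp, approximate GPU minutes per seed), from the table above.
runs = {
    "credit-g": (0.02, 2),
    "telco-churn": (0.32, 8),
    "default-credit": (0.03, 12),
    "bank-marketing": (0.21, 15),
    "cc-fraud": (0.00, 12),
}
N_SEEDS = 5

gain_per_min = {name: gain / minutes for name, (gain, minutes) in runs.items()}
minutes_per_pass = sum(minutes for _, minutes in runs.values())
total_minutes = minutes_per_pass * N_SEEDS
mean_gain = sum(gain for gain, _ in runs.values()) / len(runs)

for name, value in gain_per_min.items():
    print(f"{name}: {value:+.4f} pp/min")
print(f"one pass over all datasets: ~{minutes_per_pass} min")
print(f"all {N_SEEDS} seeds: ~{total_minutes} min")
print(f"mean gain across datasets: {mean_gain:+.3f} pp")
```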

The honest conclusion is that fine-tuning TabICL is useful only when one of the following holds:

You have a specific dataset where zero-shot underperforms — and you have a validation set to verify the improvement.
You have abundant GPU time and can afford to run fine-tuning as a “free option” — if it helps, use it; if not, fall back to zero-shot.
You need the diagnostic signal from the validation metric history (per-epoch validation AUC) to judge whether your data is well-suited to the pretrained prior.

It is not a universal upgrade. Do not fine-tune blindly.

Why the tutorial’s dramatic improvement doesn’t replicate

The official finetune_classifier.py tutorial uses a synthetic 2D dataset with a sine-wave boundary and a circular “island” of positive class. On this data, zero-shot TabICL draws a crude vertical split (missing the island entirely), while fine-tuning learns the curved boundary. The visual improvement is striking.

Real tabular data does not look like this. The synthetic dataset has:

Only 2 features (extreme dimensionality mismatch with real data)
A sharp, localized, nonlinear structure that the pretrained prior has no reason to know about
80 training samples (tiny — smaller than any real dataset in our benchmark)

The datasets in this benchmark have 16–30 features and at least 1,000 rows, and the near-identical zero-shot and fine-tuned scores suggest that their structure is already well covered by the pretrained prior. The pretrained model is already in the right ballpark; fine-tuning just nudges it. On the tutorial’s synthetic data, the pretrained model is in the wrong universe, so fine-tuning has massive room to improve.
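
To get a feel for how unusual that kind of dataset is, here is a toy generator with a sine-wave boundary plus a circular island of positives. It is a hypothetical stand-in written for this post's description, not the tutorial's code: every constant (ranges, wave offset, disc centre and radius) is made up.

```python
import math
import random


def label(x1, x2):
    """Positive above a sine-wave boundary, or inside a small disc below it."""
    above_wave = x2 > math.sin(2 * x1) + 1.0
    in_island = (x1 - 1.0) ** 2 + (x2 + 1.5) ** 2 < 0.8 ** 2
    return int(above_wave or in_island)


def make_island_dataset(n=80, seed=0):
    """Sample n points uniformly on [-3, 3]^2 and label them with `label`."""
    rng = random.Random(seed)
    X = [(rng.uniform(-3, 3), rng.uniform(-3, 3)) for _ in range(n)]
    y = [label(x1, x2) for x1, x2 in X]
    return X, y


X, y = make_island_dataset(n=80, seed=0)
n_island = sum(
    1 for (x1, x2) in X if (x1 - 1.0) ** 2 + (x2 + 1.5) ** 2 < 0.8 ** 2
)
print(f"{len(X)} points, {sum(y)} positive, {n_island} inside the island")
```

With these made-up constants the disc covers under 6% of the square, so only a few of the 80 points fall inside it. A structure that localized cannot be captured by a single split, which is the situation the tutorial is built to showcase.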

Comparison to TabPFN3

I also fit TabPFN3 on the same splits for reference:

Dataset	TabPFN3 mean AUC	TabICL FT mean AUC	PFN − FT
credit-g	78.64%	78.90%	−0.26 pp
telco-churn	85.59%	84.86%	+0.73 pp
default-credit	78.78%	78.80%	−0.02 pp
bank-marketing	94.25%	94.12%	+0.13 pp
cc-fraud	99.89%	99.39%	+0.50 pp

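TabPFN3 outperforms fine-tuned TabICL on three of the five datasets (telco-churn, bank-marketing, cc-fraud), by as much as +0.73 pp on telco-churn. On credit-g and default-credit, fine-tuned TabICL is ahead by 0.26 pp and 0.02 pp respectively. This is a reminder that choosing the right model can matter more than fine-tuning the wrong one.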

Limitations

30 epochs may not be enough. A 60-epoch run is currently in progress; results will be updated if longer training changes the pattern.
Single learning rate. Only lr=1e-5 was tested. Higher rates might yield faster convergence; lower rates might prevent the seed-44 credit-g degradation.
Single fine-tuning configuration. n_estimators_finetune=2 is a memory-saving default. Using more estimators during fine-tuning might improve stability.
No feature engineering or data augmentation. The tutorial’s gain comes from structure specific to its synthetic problem. Real datasets might benefit more if fine-tuning were combined with engineered features; this was not tested.
Only 5 seeds. The high variance on credit-g suggests that fine-tuning is sensitive to the seed and the split; more seeds would clarify the true distribution of outcomes.

Reproduction

ssh airig
cd ~/tabpfn-playground
source .venv313/bin/activate

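# Zero-shot vs fine-tuned TabICL on credit-g (OpenML ID 31), seed 42.
# Assumption: the 60/20/20 stratified split is illustrative; the exact protocol
# is in tabicl_finetune_benchmark.py.
python3 - <<'EOF'
from sklearn.datasets import fetch_openml
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from tabicl import FinetunedTabICLClassifier, TabICLClassifier

seed = 42
X, y = fetch_openml(data_id=31, as_frame=True, return_X_y=True)
y = LabelEncoder().fit_transform(y)

# Encode non-numeric columns as numbers; the fine-tuning pipeline needs numeric input.
cat_cols = list(X.select_dtypes(exclude="number").columns)
codes = OrdinalEncoder().fit_transform(X[cat_cols])
X = X.assign(**dict(zip(cat_cols, codes.T)))

X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=seed)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=seed)

# Zero-shot TabICL
zs = TabICLClassifier(device="cuda", n_estimators=4, random_state=seed)
zs.fit(X_train, y_train)
print("zero-shot AUC: ", roc_auc_score(y_test, zs.predict_proba(X_test)[:, 1]))

# Fine-tuned TabICL
ft = FinetunedTabICLClassifier(
    epochs=30, learning_rate=1e-5,
    n_estimators_finetune=2, n_estimators_validation=2, n_estimators_inference=4,
    early_stopping=True, patience=10, eval_metric="roc_auc",
    device="cuda", random_state=seed,
)
ft.fit(X_train, y_train, X_val=X_val, y_val=y_val)
print("fine-tuned AUC:", roc_auc_score(y_test, ft.predict_proba(X_test)[:, 1]))
EOF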

Note: FinetunedTabICLClassifier requires the transformers package (pip install transformers). Categorical string columns must be encoded as numbers before fitting (this benchmark used scikit-learn's OrdinalEncoder); the fine-tuning pipeline does not handle them automatically.

Full benchmark script: tabicl_finetune_benchmark.py

Bottom line

Fine-tuning TabICL is not a free lunch. On real tabular data, the pretrained prior is already strong enough that 30 epochs of gradient descent rarely moves AUC by more than a fraction of a percentage point. Fine-tuning helps on every seed for only two datasets: telco-churn (+0.16 to +0.59 pp) and bank-marketing (+0.06 to +0.32 pp). On the other three, the gain is inconsistent or nonexistent.

Before you fine-tune, ask:

Is zero-shot already near-perfect? If yes, don’t bother.
Do you have a validation set to verify the improvement? If no, don’t bother.
Can you afford the GPU time for a “maybe” gain? If no, don’t bother.

If you need more accuracy from TabICL, the pragmatic path is not fine-tuning — it is using more estimators, better preprocessing, or choosing a model whose license fits your deployment constraints.

This post was researched and drafted with AI assistance. I picked the ideas, reviewed every claim, and verified all benchmarks.
