Machine: airig — AMD Ryzen 9 9900X, NVIDIA RTX 5090 FE, 64 GB RAM
Software: Python 3.13.5, torch 2.12+cu130, tabicl 2.1.1 (with transformers), tabpfn 8.0.3, scikit-learn 1.7
Script: tabicl_finetune_benchmark.py
The question
TabICL ships with FinetunedTabICLClassifier, a built-in fine-tuning pipeline that adapts the pretrained model to a target dataset with cross-entropy loss on raw logits. The official tutorial shows dramatic visual improvement on a synthetic 2D dataset: the zero-shot model misses a localized disc-shaped decision boundary, while fine-tuning learns it.
Does that translate to real tabular data?
I ran zero-shot TabICL vs fine-tuned TabICL on five real-world classification datasets, 5 random seeds each, with a held-out validation set for early stopping and hyperparameter selection.
Method
Fine-tuning setup
Key parameters:
- n_estimators_finetune=2: only 2 estimators are backpropagated during fine-tuning (saves memory)
- n_estimators_validation=2: 2 estimators evaluate the validation metric for early stopping
- n_estimators_inference=4: 4 estimators at predict time (matches the zero-shot default)
- early_stopping=True, patience=10: stop if validation ROC-AUC does not improve for 10 epochs
- learning_rate=1e-5: conservative LR to avoid catastrophic forgetting
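To make the early-stopping rule concrete, here is a minimal standalone sketch of patience-based stopping on a validation AUC history. It is not TabICL’s internal implementation: the function name and the example curve are made up for illustration, and only the patience=10 setting comes from the configuration above.

```python
def best_epoch_with_patience(val_auc_history, patience=10):
    """Return (best_epoch, stop_epoch) for a patience-based early-stopping rule.

    Training stops once `patience` consecutive epochs fail to beat the best
    validation score seen so far; the checkpoint from `best_epoch` is kept.
    """
    best_score = float("-inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, score in enumerate(val_auc_history):
        if score > best_score:
            best_score, best_epoch = score, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    # Ran out of epochs before patience was exhausted.
    return best_epoch, len(val_auc_history) - 1


# Made-up validation AUC curve: improves for three epochs, then plateaus.
history = [0.80, 0.81, 0.82] + [0.815] * 12
print(best_epoch_with_patience(history, patience=10))  # (2, 12)
```

With a 30-epoch budget and patience=10, a run whose validation AUC peaks early spends most of its epochs waiting out the patience window, which is one reason the fine-tuned and zero-shot scores can end up nearly identical.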
Zero-shot baseline
Data protocol
- 3-way split: train (fit model), val (drive early stopping / pick best single estimator), test (final evaluation)
- Stratified splits to preserve class balance across subsets
- Categorical encoding: FinetunedTabICLClassifier does not accept string columns; OrdinalEncoder was applied to object/category dtypes before fine-tuning. Zero-shot TabICL handles categorical encoding internally.
- 5 random seeds per dataset (42, 43, 44, 45, 46)
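The split protocol can be sketched in plain Python as follows. The 60/20/20 fractions and the function name are assumptions made for illustration (the split sizes used in the benchmark are not stated here), and the toy labels only mimic credit-g’s 30% positive rate.

```python
import random
from collections import defaultdict


def stratified_three_way_split(labels, val_frac=0.2, test_frac=0.2, seed=42):
    """Split row indices into train/val/test while preserving class proportions.

    The fractions are illustrative assumptions, not the benchmark's settings.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class, then slice
        n_val = round(len(indices) * val_frac)
        n_test = round(len(indices) * test_frac)
        val.extend(indices[:n_val])
        test.extend(indices[n_val:n_val + n_test])
        train.extend(indices[n_val + n_test:])
    return sorted(train), sorted(val), sorted(test)


# Toy labels with a 30% positive rate, similar to credit-g.
labels = [1] * 300 + [0] * 700
for seed in (42, 43, 44, 45, 46):
    train, val, test = stratified_three_way_split(labels, seed=seed)
    pos_rates = [sum(labels[i] for i in part) / len(part) for part in (train, val, test)]
    print(seed, len(train), len(val), len(test), [round(r, 3) for r in pos_rates])
```

Each seed produces a different partition, but every subset keeps the same positive rate, so seed-to-seed differences in AUC reflect which rows land where rather than shifts in class balance.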
Datasets
| Dataset | OpenML ID | Rows | Features | Pos. rate |
|---|---|---|---|---|
| credit-g | 31 | 1,000 | 20 | 30.0% |
| telco-churn | 42178 | 7,043 | 19 | 26.5% |
| default-credit | 42477 | 30,000 | 23 | 22.1% |
| bank-marketing | 1461 | 45,211 | 16 | 11.3% |
| cc-fraud | 46455 | 28,480 | 30 | 0.17% |
Metrics
- ROC-AUC — primary metric (imbalance-agnostic)
- Average Precision (AP) — directly reflects alert-queue quality for fraud
- Log-loss — calibration-sensitive
- Accuracy — interpretable but misleading on imbalanced data
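A small sketch of why accuracy is listed last: using the positive rates from the Datasets table, a model that always predicts the negative class already scores high accuracy on the imbalanced datasets, while its ROC-AUC is exactly 0.5. The roc_auc helper below is a simple pairwise implementation written for this example, not the function used in the benchmark script.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a random positive outranks a random negative.

    Ties count as half. O(n_pos * n_neg), so only suitable for small examples.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Positive rates from the Datasets table.
pos_rate = {"credit-g": 0.300, "telco-churn": 0.265, "default-credit": 0.221,
            "bank-marketing": 0.113, "cc-fraud": 0.0017}

for name, rate in pos_rate.items():
    # Accuracy of a model that always predicts the negative class.
    print(f"{name:15s} always-negative accuracy = {1 - rate:.2%}")

# The same constant model has no ranking ability at all.
y = [1] * 2 + [0] * 998
print(roc_auc(y, [0.0] * len(y)))  # 0.5
```

On cc-fraud the always-negative model reaches 99.83% accuracy without detecting a single fraud case, which is why the analysis below leans on ROC-AUC.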
Results: the headline
| Dataset | ZS mean AUC | FT mean AUC | Mean Δ | Best Δ | Worst Δ | Interpretation |
|---|---|---|---|---|---|---|
| credit-g | 78.88% | 78.90% | +0.02 pp | +0.71 pp | −1.92 pp | Coin flip |
| telco-churn | 84.54% | 84.86% | +0.32 pp | +0.59 pp | +0.16 pp | Consistent but tiny |
| default-credit | 78.77% | 78.80% | +0.03 pp | +0.09 pp | −0.01 pp | Flat |
| bank-marketing | 93.91% | 94.12% | +0.21 pp | +0.32 pp | +0.06 pp | Marginal |
| cc-fraud | 99.39% | 99.39% | +0.00 pp | +0.00 pp | +0.00 pp | Saturated |
No dataset shows a consistent improvement above 0.6 pp. The best single-seed gain is +0.71 pp on credit-g — but another seed on the same dataset loses 1.92 pp.
Per-dataset analysis
credit-g: a coin flip
| Seed | ZS AUC | FT AUC | Δ |
|---|---|---|---|
| 42 | 81.58% | 82.30% | +0.71 pp |
| 43 | 81.39% | 81.79% | +0.39 pp |
| 44 | 74.07% | 72.15% | −1.92 pp |
| 45 | 77.43% | 77.89% | +0.46 pp |
| 46 | 79.92% | 80.37% | +0.45 pp |
At 1,000 rows, credit-g is the smallest dataset in the benchmark, and the zero-shot model already appears well-matched to it. Fine-tuning helps on four seeds and hurts on one, depending on the random seed and the specific train/val split. The −1.92 pp drop on seed 44 suggests the fine-tuned model overfit the validation set: early stopping with patience=10 was not enough to prevent it.
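The headline row for credit-g can be reproduced from the per-seed table with a few lines of arithmetic. This sketch uses only the rounded values printed above, so individual deltas recomputed this way can differ from the table by 0.01 pp (the table was presumably built from unrounded scores).

```python
from statistics import mean

# Per-seed credit-g AUCs (%) copied from the table above: seed -> (zero-shot, fine-tuned).
credit_g = {
    42: (81.58, 82.30),
    43: (81.39, 81.79),
    44: (74.07, 72.15),
    45: (77.43, 77.89),
    46: (79.92, 80.37),
}

deltas = {seed: ft - zs for seed, (zs, ft) in credit_g.items()}
zs_mean = mean(zs for zs, _ in credit_g.values())
ft_mean = mean(ft for _, ft in credit_g.values())

print(f"ZS mean AUC: {zs_mean:.2f}%")
print(f"FT mean AUC: {ft_mean:.2f}%")
print(f"Mean delta:  {ft_mean - zs_mean:+.2f} pp")
print(f"Worst seed:  {min(deltas, key=deltas.get)} ({min(deltas.values()):+.2f} pp)")
print(f"Seeds improved: {sum(d > 0 for d in deltas.values())} of {len(deltas)}")
```

The near-zero mean hides a lopsided picture: four small gains are cancelled almost exactly by one large loss.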
telco-churn: the clearest win
| Seed | ZS AUC | FT AUC | Δ |
|---|---|---|---|
| 42 | 83.47% | 84.06% | +0.59 pp |
| 43 | 86.04% | 86.29% | +0.25 pp |
| 44 | 86.23% | 86.43% | +0.21 pp |
| 45 | 83.66% | 83.82% | +0.16 pp |
| 46 | 83.30% | 83.71% | +0.42 pp |
All 5 seeds improve. This is the dataset where fine-tuning is most clearly beneficial: bank-marketing also improves on every seed, but by smaller margins. The gain is still small (0.16–0.59 pp), but the consistency matters. Why telco-churn? Possibly because the dataset has clear nonlinear interactions (tenure, contract type, monthly charges) that the pretrained prior does not fully capture, and fine-tuning nudges the attention weights toward those interactions.
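One rough way to quantify how much weight five consistent seeds can carry is a one-sided sign test. This is an illustrative side calculation, not part of the benchmark script, and it assumes each seed is an independent trial with no ties.

```python
from math import comb


def sign_test_p_value(n_improved, n_seeds):
    """One-sided sign test: P(at least n_improved wins) if each seed were a fair coin flip."""
    return sum(comb(n_seeds, k) for k in range(n_improved, n_seeds + 1)) / 2 ** n_seeds


print(sign_test_p_value(5, 5))  # 0.03125 (telco-churn: 5 of 5 seeds improve)
print(sign_test_p_value(4, 5))  # 0.1875  (credit-g: 4 of 5 seeds improve)
```

With only five seeds, 5 of 5 is the strongest evidence this test can give, and 4 of 5 is already compatible with chance. The test also ignores the size of each gain, so it says nothing about whether a consistent +0.2 pp is worth the compute.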
default-credit: flat
| Seed | ZS AUC | FT AUC | Δ |
|---|---|---|---|
| 42 | 78.37% | 78.43% | +0.05 pp |
| 43 | 78.14% | 78.13% | −0.01 pp |
| 44 | 78.70% | 78.70% | 0.00 pp |
| 45 | 79.36% | 79.45% | +0.09 pp |
| 46 | 79.28% | 79.29% | +0.01 pp |
At 30,000 rows, default-credit gives the zero-shot model plenty of data to work with, and its AUC barely moves under fine-tuning (−0.01 to +0.09 pp). Fine-tuning has nothing meaningful to add here: whatever extra signal the training data carries already seems to be captured by the pretrained weights.
bank-marketing: marginal
| Seed | ZS AUC | FT AUC | Δ |
|---|---|---|---|
| 42 | 93.92% | 93.98% | +0.06 pp |
| 43 | 93.75% | 93.97% | +0.22 pp |
| 44 | 94.13% | 94.37% | +0.23 pp |
| 45 | 93.85% | 94.07% | +0.22 pp |
| 46 | 93.91% | 94.23% | +0.32 pp |
All seeds improve, but the absolute gain is tiny (0.06–0.32 pp). At 45,211 rows with 11.3% positive rate, the zero-shot model is already performing at 93.9% AUC. There is simply not much headroom left.
cc-fraud: saturated
| Seed | ZS AUC | FT AUC | Δ |
|---|---|---|---|
| 42 | 99.98% | 99.98% | 0.00 pp |
| 43 | 99.80% | 99.80% | 0.00 pp |
| 44 | 97.97% | 97.97% | 0.00 pp |
| 45 | 99.26% | 99.26% | 0.00 pp |
| 46 | 99.92% | 99.92% | 0.00 pp |
Zero-shot TabICL is already near-perfect on this dataset. Despite the 0.17% fraud rate, the two classes are almost perfectly separable by ranking, and TabICL’s pretrained prior is strong enough that fine-tuning has nothing to improve. Note that on seed 44, zero-shot drops to 97.97% (the seed-dependent variance we observed in earlier experiments), and fine-tuning stays at exactly the same value.
Is fine-tuning worth the GPU time?
| Dataset | Mean AUC gain | GPU time per seed (approx.) | Gain per minute |
|---|---|---|---|
| credit-g | +0.02 pp | ~2 min | +0.01 pp/min |
| telco-churn | +0.32 pp | ~8 min | +0.04 pp/min |
| default-credit | +0.03 pp | ~12 min | +0.002 pp/min |
| bank-marketing | +0.21 pp | ~15 min | +0.014 pp/min |
| cc-fraud | +0.00 pp | ~12 min | 0 pp/min |
On an RTX 5090, 30 epochs of fine-tuning take 2–15 minutes per seed depending on dataset size. The total compute for this study was ~60 minutes of GPU time, for a net AUC improvement of roughly +0.12 pp when averaged across the five datasets.
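The gain-per-minute column is plain division, and the same numbers give the cross-dataset average. A short sketch using only the values in the table above:

```python
# Mean AUC gain (pp) and approximate GPU minutes per seed, from the table above.
runs = {
    "credit-g": (0.02, 2),
    "telco-churn": (0.32, 8),
    "default-credit": (0.03, 12),
    "bank-marketing": (0.21, 15),
    "cc-fraud": (0.00, 12),
}

for name, (gain_pp, minutes) in runs.items():
    print(f"{name:15s} {gain_pp / minutes:+.4f} pp/min")

# Unweighted average of the per-dataset mean gains.
avg_gain = sum(gain for gain, _ in runs.values()) / len(runs)
print(f"Unweighted mean gain across datasets: {avg_gain:+.3f} pp")
```

The unweighted mean comes out at about +0.116 pp, and telco-churn and bank-marketing account for nearly all of it.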
The honest conclusion: Fine-tuning TabICL is useful only when:
- You have a specific dataset where zero-shot underperforms — and you have a validation set to verify the improvement.
- You have abundant GPU time and can afford to run fine-tuning as a “free option” — if it helps, use it; if not, fall back to zero-shot.
- You need the diagnostic signal from the validation metric history (per-epoch validation AUC) to judge whether your data is well-suited to the pretrained prior.
It is not a universal upgrade. Do not fine-tune blindly.
Why the tutorial’s dramatic improvement doesn’t replicate
The official finetune_classifier.py tutorial uses a synthetic 2D dataset with a sine-wave boundary and a circular “island” of positive class. On this data, zero-shot TabICL draws a crude vertical split (missing the island entirely), while fine-tuning learns the curved boundary. The visual improvement is striking.
Real tabular data does not look like this. The synthetic dataset has:
- Only 2 features (extreme dimensionality mismatch with real data)
- A sharp, localized, nonlinear structure that the pretrained prior has no reason to know about
- 80 training samples (tiny — smaller than any real dataset in our benchmark)
The real datasets in this benchmark have 16–30 features and comparatively smooth decision boundaries, the kind of structure the pretrained prior was built to cover. The pretrained model is already in the right ballpark; fine-tuning just nudges it. On the tutorial’s synthetic data, the pretrained model is in the wrong universe, so fine-tuning has massive room to improve.
Comparison to TabPFN3
I also trained TabPFN3 on the same splits for reference:
| Dataset | TabPFN3 mean AUC | TabICL FT mean AUC | PFN − FT |
|---|---|---|---|
| credit-g | 78.64% | 78.90% | −0.26 pp |
| telco-churn | 85.59% | 84.86% | +0.73 pp |
| default-credit | 78.78% | 78.80% | −0.02 pp |
| bank-marketing | 94.25% | 94.12% | +0.13 pp |
| cc-fraud | 99.89% | 99.39% | +0.50 pp |
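On three of the five datasets (telco-churn, bank-marketing, cc-fraud), TabPFN3 outperforms fine-tuned TabICL, by as much as +0.73 pp on telco-churn. Fine-tuned TabICL is ahead on credit-g (by 0.26 pp) and essentially tied on default-credit (ahead by 0.02 pp). This is a reminder that choosing the right model can matter more than fine-tuning the wrong one.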
Limitations
- 30 epochs may not be enough. A 60-epoch run is currently in progress; results will be updated if longer training changes the pattern.
- Single learning rate. Only lr=1e-5 was tested. Higher rates might yield faster convergence; lower rates might prevent the seed-44 credit-g degradation.
- Single fine-tuning configuration. n_estimators_finetune=2 is a memory-saving default. Using more estimators during fine-tuning might improve stability.
- No data augmentation. The tutorial’s synthetic dataset benefits from the specific structure of the problem. Real datasets may benefit more from engineered features during fine-tuning.
- Fixed random seeds. The high variance on credit-g suggests that fine-tuning is sensitive to initialization; more seeds would clarify the true distribution of outcomes.
Reproduction
Note: FinetunedTabICLClassifier requires the transformers package (pip install transformers). Categorical string columns must be ordinal-encoded before fitting (OrdinalEncoder was used here); the fine-tuning pipeline does not handle them automatically.
Full benchmark script: tabicl_finetune_benchmark.py
Bottom line
Fine-tuning TabICL is not a free lunch. On real tabular data, the pretrained prior is already strong enough that 30 epochs of gradient descent rarely move AUC by more than a fraction of a percentage point. Fine-tuning helps on every seed for only two datasets: telco-churn (+0.16 to +0.59 pp) and bank-marketing (+0.06 to +0.32 pp). Everywhere else, the gain is inconsistent or nonexistent.
Before you fine-tune, ask:
- Is zero-shot already near-perfect? If yes, don’t bother.
- Do you have a validation set to verify the improvement? If no, don’t bother.
- Can you afford the GPU time for a “maybe” gain? If no, don’t bother.
If you need more accuracy from TabICL, the pragmatic path is not fine-tuning — it is using more estimators, better preprocessing, or choosing a model whose license fits your deployment constraints.