Machine: airig — AMD Ryzen 9 9900X, NVIDIA RTX 5090 FE, 64 GB RAM, Debian trixie
Software: Python 3.13.5, torch 2.12+cu130, tabpfn 8.0.3, tabicl 2.1.1
Datasets: Amazon Science fraud-dataset-benchmark (FDB), 4 fraud-detection datasets [1]

TL;DR

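  1. PFN tends to win below ~10k rows, but not uniformly. I ran stratified subsamples from 1k to 10k rows on sparknov, ieeecis, and fakejob. PFN reached higher ROC-AUC than ICL at every size on sparknov, at 1k on ieeecis, and at 10k on fakejob. ICL led at the remaining points in that range.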

  2. By ~100k rows the gap vanishes. I ran both models on sparknov up to 100k rows. Both plateaued around that point. Differences at that scale fell inside run-to-run noise.

  3. PFN degrades beyond 200k on sparknov. I trained PFN on sparknov up to 1M rows. Its ROC-AUC peaked at 97.22% at 200k rows. Then it dropped to 95.93% at 1M rows. That 1.3 percentage point decline suggests its attention mechanism gets swamped by noise at extreme scale.

  4. PFN is consistently slower, about 2× on sparknov. I timed inference on identical hardware for both models. PFN took roughly twice as long on sparknov, 1.2–1.6× as long on ieeecis, and about 4× as long on malurl above 1k rows. The likely driver is its transformer backbone, which has 1.93× as many trainable parameters: 53.15M for PFN versus 27.55M for ICL.

  5. torch.inference_mode() + torch.autocast(bfloat16) gives a clean +21% TabPFN speedup with negligible ROC-AUC change. I wrapped PFN prediction calls in torch.inference_mode() plus torch.autocast with bfloat16 dtype. That gave a 21.3% speedup at a cost of 0.01 pp ROC-AUC.


Why this benchmark matters

Fraud detection is the canonical rare-event prediction problem. Fraud rates sit between 0.1% and 10%. Class imbalance is severe. Production teams need to rank risky transactions correctly.

ROC-AUC is the standard headline metric. It measures the probability that a random fraud case scores higher than a random legitimate case. It is insensitive to class imbalance.

But fraud teams actually care about precision at the top of the queue. They want to know how many alerts an analyst must review to catch 80% of fraud. That is a different question.

Average Precision answers that question. It is the area under the precision-recall curve. AP is sensitive to the positive class. It directly reflects alert-queue quality.

I report both metrics in this sweep. A model can have high ROC-AUC but low AP. That model still ranks most positives above most negatives. Yet it is imprecise at the decision threshold that matters.
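
To make the gap between the two metrics concrete, here is a minimal sketch on synthetic scores. Nothing in it comes from the benchmark: the 1% fraud rate, the Gaussian score distributions, and the sample sizes are assumptions chosen only to illustrate the effect.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic scores with 1% positives (assumed, not benchmark data):
# legitimate cases ~ N(0, 1), fraud cases ~ N(2, 1).
n_neg, n_pos = 99_000, 1_000
y_true = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])
scores = np.concatenate([rng.normal(0.0, 1.0, n_neg), rng.normal(2.0, 1.0, n_pos)])

auc = roc_auc_score(y_true, scores)
ap = average_precision_score(y_true, scores)
print(f"ROC-AUC={auc:.3f}  AP={ap:.3f}")
```

With these assumed distributions the ROC-AUC should land near 0.92, while AP comes out far lower because the few positives compete with a much larger pool of high-scoring negatives.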

Metrics I report

  • ROC-AUC — probability a random fraud case scores higher than a random legitimate case. Standard metric, imbalance-agnostic.
  • Average Precision (AP) — area under the precision-recall curve. More informative than ROC-AUC for rare events because it weights precision at each recall level.
  • Accuracy / Precision / Recall — standard definitions available in raw results. I treat them as secondary here.

If you have predictions, AP is a one-liner:

from sklearn.metrics import average_precision_score
ap = average_precision_score(y_true, y_proba[:, 1])

The models

TabPFN3 is a pretrained transformer designed for small tabular datasets. Its sweet spot is roughly 100 to 10,000 rows. TabICL uses in-context learning and is designed to scale to larger tables.

Both models support GPU inference. I ran identical stratified subsamples with random_state=42. I used identical FDB preprocessing. Both models ran on device="cuda" on the RTX 5090.

Method Glossary

| Method | One-sentence explanation |
| --- | --- |
| TabPFN3 | Pretrained transformer for small tabular classification, typically 100–10k rows. |
| TabICL | In-context learning tabular classifier designed to scale to larger tables. |
| FDB | Amazon Science fraud-dataset-benchmark preprocessing pipeline. |
| torch.profiler | PyTorch profiler for kernel-level GPU and CPU tracing. |
| torch.inference_mode | PyTorch context that disables gradient bookkeeping for faster inference. |
| torch.autocast | PyTorch mixed-precision wrapper that runs ops in bfloat16 where safe. |
| PerformanceOptions | TabPFN3 configuration object for compile and memory toggles. |

Method in two sentences

For each dataset I drew stratified subsamples of 1k, 5k, 10k, 20k, 50k, and 100k rows, plus larger sizes where the dataset had enough rows, and both classifiers saw exactly the same rows. Feature preprocessing was identical via FDB: metadata columns were dropped, categoricals were label-encoded, and train and test columns were aligned.
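
The subsampling step can be sketched as follows. This is an illustration, not the benchmark script: the helper name `stratified_subsample` and the toy data are hypothetical, and it assumes scikit-learn's `train_test_split` is an acceptable way to draw a stratified sample with `random_state=42`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def stratified_subsample(X, y, n_rows, seed=42):
    """Return n_rows rows of (X, y) with the class ratio of y preserved."""
    if n_rows >= len(y):
        return X, y  # nothing to drop
    X_sub, _, y_sub, _ = train_test_split(
        X, y, train_size=n_rows, stratify=y, random_state=seed
    )
    return X_sub, y_sub

# Toy data (assumed): 20,000 rows, 5% positives, 4 features.
rng = np.random.default_rng(0)
y = np.zeros(20_000, dtype=int)
y[:1_000] = 1
X = rng.normal(size=(20_000, 4))

X_1k, y_1k = stratified_subsample(X, y, 1_000)
print(X_1k.shape, y_1k.mean())
```

Fixing the seed means both classifiers can be handed the same `X_1k`, `y_1k` pair, which is the property the comparison relies on.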


Results

sparknov: the clearest story

sparknov is the dataset with the most complete size ladder. I ran subsamples up to 1M training rows. It anchors the narrative.

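| Size | PFN AUC | PFN pred (s) | ICL AUC | ICL pred (s) | Ratio (PFN/ICL time) |
| --- | --- | --- | --- | --- | --- |
| 1,000 | 87.08% | 1.11 | 85.71% | 0.73 | 1.5× |
| 5,000 | 90.98% | 0.90 | 87.12% | 0.54 | 1.7× |
| 10,000 | 94.75% | 1.35 | 93.48% | 0.78 | 1.7× |
| 20,000 | 94.12% | 2.64 | 94.69% | 1.42 | 1.9× |
| 50,000 | 95.77% | 8.75 | 96.24% | 4.47 | 2.0× |
| 100,000 | 96.92% | 26.3 | 96.73% | 13.3 | 2.0× |
| 200,000 | 97.22% | 92.0 | 96.86% | 44.9 | 2.1× |
| 500,000 | 96.93% | 506.9 | 97.01% | 252.9 | 2.0× |
| 1,000,000 | 95.93% | 1,956.3 | 96.70% | 991.8 | 2.0× |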

I measured PFN ROC-AUC by fitting TabPFNClassifier on each subsample and scoring on the held-out test set via sklearn.metrics.roc_auc_score. PFN won at 1k, 5k, and 10k rows. ICL caught up by 20k rows. Both plateaued around 100k rows.

That is the headline pattern.

I ran PFN on sparknov beyond 200k rows. Its ROC-AUC peaked at 97.22% at 200k rows. Then it dropped to 95.93% at 1M rows. That is a 1.3 percentage point decline. More data actively hurt PFN on this dataset.

ICL stayed more stable. Its ROC-AUC went from 96.86% at 200k to 97.01% at 500k to 96.70% at 1M. The most plausible explanation is that PFN’s attention mechanism gets swamped by noise when the context grows too large.

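I timed inference on identical hardware for both models. PFN was consistently slower: 1.5–1.7× up to 10k rows and about 2× from 20k rows up. Absolute predict time grows steeply with training size for both models, but the ratio between them stays nearly flat. That points to the fixed difference in model size, not the number of training rows, as the source of the gap.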

ieeecis: corroboration

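| Size | PFN AUC | PFN pred (s) | ICL AUC | ICL pred (s) | Ratio (PFN/ICL time) |
| --- | --- | --- | --- | --- | --- |
| 1,000 | 83.33% | 2.0 | 82.47% | 1.6 | 1.2× |
| 5,000 | 86.88% | 2.5 | 87.24% | 2.0 | 1.2× |
| 10,000 | 87.85% | 3.1 | 88.36% | 2.4 | 1.3× |
| 20,000 | 88.84% | 5.0 | 88.79% | 3.7 | 1.3× |
| 50,000 | 90.92% | 13.0 | 91.11% | 8.9 | 1.5× |
| 100,000 | 92.30% | 33.8 | 92.23% | 28.8 | 1.2× |
| 200,000 | 93.49% | 103.2 | 93.36% | 64.6 | 1.6× |
| 500,000 | 95.20% | 534.0 | 94.85% | 360.9 | 1.5× |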

PFN won at 1k on ieeecis. The models traded blows between 5k and 20k rows. By 100k they were neck-and-neck. PFN scored 92.30% and ICL scored 92.23%. PFN kept improving through 500k on ieeecis. That differs from sparknov, where PFN peaked and degraded. Dataset structure matters.

malurl: the ICL advantage

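| Size | PFN AUC | PFN pred (s) | ICL AUC | ICL pred (s) | Ratio (PFN/ICL time) |
| --- | --- | --- | --- | --- | --- |
| 1,000 | 89.65% | 1.7 | 90.23% | 0.9 | 1.9× |
| 5,000 | 90.80% | 2.3 | 91.60% | 0.6 | 4.1× |
| 10,000 | 91.91% | 3.2 | 93.03% | 0.8 | 4.1× |
| 20,000 | 92.57% | 5.4 | 93.63% | 1.3 | 4.2× |
| 50,000 | 92.88% | 14.1 | 93.91% | 3.4 | 4.2× |
| 100,000 | 93.02% | 36.0 | 94.06% | 8.7 | 4.1× |
| 200,000 | 93.07% | 108.3 | 94.06% | 26.4 | 4.1× |
| 500,000 | 93.12% | 545.9 | 94.18% | 135.3 | 4.0× |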

ICL led at every size on malurl. The gap was roughly 1 percentage point. The gap did not widen with more data. That suggests a genuine architectural advantage for ICL on this feature structure. It is not a pure scaling effect.

The speed gap was also largest here at about 4×. I suspect that is because malurl has a 65k-row test set, while sparknov has only 20k.

fakejob: the small-data case

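| Size | PFN AUC | PFN pred (s) | ICL AUC | ICL pred (s) | Ratio (PFN/ICL time) |
| --- | --- | --- | --- | --- | --- |
| 1,000 | 90.24% | 0.3 | 92.12% | 0.2 | 1.3× |
| 5,000 | 98.06% | 0.4 | 98.13% | 0.2 | 1.8× |
| 10,000 | 98.78% | 0.8 | 98.76% | 0.5 | 1.7× |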

Fakejob has only 14,304 training rows. I skipped sizes above 10k.

At 1k, ICL led. By 5k they were virtually tied. At 10k PFN edged ahead by 0.02 pp. Fakejob therefore does not reproduce the clear small-data PFN lead seen on sparknov. From 5k up the two models are effectively tied.

{{< themed-img src="roc_auc_all_datasets.png" alt="ROC-AUC vs training size, all four datasets" >}}


Timing

{{< themed-img src="predict_time_all_datasets.png" alt="Predict time vs training size" >}}

{{< themed-img src="per_row_latency_all_datasets.png" alt="Per-row inference cost vs training size" >}}

Both models scale super-linearly in predict time with training size. PFN has a larger constant factor. Per-row cost for PFN on sparknov grows from roughly 0.06 ms per row at 1k to roughly 25 ms per row at 500k.
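
The per-row figures and the super-linear claim can be checked directly from the sparknov table. The sketch below uses the PFN predict times reported there and the 20k-row sparknov test set. The helper names are mine, and the log-log slope is just one simple way to summarise scaling between two sizes.

```python
import math

TEST_ROWS = 20_000  # sparknov test-set size

# Training rows -> PFN predict seconds, copied from the sparknov table.
pfn_predict = {1_000: 1.11, 100_000: 26.3, 500_000: 506.9, 1_000_000: 1_956.3}

def per_row_ms(predict_seconds, test_rows=TEST_ROWS):
    """Average predict cost per test row, in milliseconds."""
    return predict_seconds / test_rows * 1_000

def scaling_exponent(n_small, n_large, times):
    """Slope of log(time) vs log(training rows); above 1 means super-linear."""
    return math.log(times[n_large] / times[n_small]) / math.log(n_large / n_small)

for n, secs in pfn_predict.items():
    print(f"{n:>9,} train rows: {per_row_ms(secs):.3f} ms per test row")

exponent = scaling_exponent(100_000, 1_000_000, pfn_predict)
print(f"exponent 100k -> 1M: {exponent:.2f}")
```

An exponent of 1 would mean predict time grows in proportion to training rows. Between 100k and 1M rows the PFN numbers give a value well above that.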

Why the 2–4× gap? TabPFN has 1.93× as many parameters. It issues roughly 30× more attention-layer calls per prediction. Flash Attention kernel time accounts for about 20% of GPU time in both models. See the Profiling section for the full breakdown.


Profiling: why TabPFN is ~2× slower

I profiled on airig with RTX 5090, torch 2.12.0+cu130, tabpfn 8.0.3, and tabicl 2.1.1. I used torch.profiler with ProfilerActivity.CPU and ProfilerActivity.CUDA. I enabled record_shapes=True and with_flops=True.

Step 1: confirm model size difference

I measured parameter counts on fitted models. I used sum(p.numel() for p in model.parameters() if p.requires_grad).
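
As a sketch, that expression can be wrapped in a small helper. The toy `nn.Linear` is only a sanity check. How to reach the underlying `torch.nn.Module` inside a fitted TabPFN or TabICL estimator is not shown here, because the attribute name depends on the library version.

```python
import torch.nn as nn

def count_trainable(module: nn.Module) -> int:
    """Number of trainable scalar parameters in a torch module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Toy check: Linear(128, 64) has 128 * 64 weights plus 64 biases.
toy = nn.Linear(128, 64)
print(count_trainable(toy))  # 8256

# Relative size from the two counts reported in this section.
print(round(53_153_136 / 27_552_250, 2))  # 1.93
```

Filtering on `requires_grad` matters if any weights are frozen, since frozen tensors would otherwise inflate the count.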

| Model | Trainable parameters | Relative size |
| --- | --- | --- |
| TabPFN | 53,153,136 | 1.93× |
| TabICL | 27,552,250 | 1.00× |

{{< themed-img src="/posts/tabpfn-vs-tabicl-fdb/model_size_comparison.png" alt="TabPFN3 vs TabICL model size comparison" >}}

TabPFN is essentially 2× larger. For a transformer that is compute-bound, this immediately predicts roughly 2× wall-clock time per forward pass.

What the architectures actually look like

The parameter gap is not a single big layer. It is a deeper stack. Both models share the same high-level pattern. That pattern is column embedding, then row interaction, then ICL transformer, then output. TabPFN3 doubles the ICL transformer depth.

{{< themed-img src="architecture_diagram.png" alt="TabPFN3 vs TabICL architecture comparison" >}}

TabPFN3 uses 24 ICL transformer layers. TabICL uses 12. The profiler shows roughly 30× more attention calls for PFN: I counted 2,192 scaled_dot_product_attention calls versus 72 for ICL. Depth accounts for only 2× of that, so the rest presumably comes from how each model splits its work into calls. The per-layer dimensions are similar. Both use 128-dim embeddings and 8-head attention in early stages. TabPFN3’s decoder adds an extra many-class attention head with 6 heads and dim 64. TabICL does not have that.

TabICL compensates for its shallower stack by concatenating 4 row-CLS tokens. That gives its ICL transformer a 512-dim input. That is 128 times 4. TabPFN3 keeps the ICL dimension at 128. But it processes that through twice as many layers. The product of depth times width times heads ends up at 1.93× the parameters. That maps almost exactly to the 2× wall-clock gap.

Step 2: torch.profiler trace on sparknov 50k

I traced a single predict_proba call on sparknov with 50k training rows and 20k test rows. I chose that size because it is large enough to saturate the GPU. It is small enough that the trace fits in memory.

Profiler setup:

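import torch
from tabpfn import TabPFNClassifier

# 1. Fit the model on a subsample
#    (X_train, y_train, X_test come from the FDB preprocessing step)
pfn = TabPFNClassifier(device='cuda')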
pfn.fit(X_train, y_train)

# 2. Warm-up (exclude compilation / CUDA init from trace)
_ = pfn.predict_proba(X_test[:100])
torch.cuda.synchronize()

# 3. Profile the real run
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True,
    with_flops=True,
    profile_memory=True,
) as prof:
    _ = pfn.predict_proba(X_test)
    torch.cuda.synchronize()

# 4. Export for Chrome trace viewer
prof.export_chrome_trace("tabpfn3_trace.json")
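
Call counts like the ones in the tables below can be read off the profiler's aggregated events. The following is a minimal CPU-only sketch on a toy workload, not the benchmark trace. The helper `count_op_calls` is hypothetical, and the op name `aten::scaled_dot_product_attention` is what I expect PyTorch to record for this call, so check it against your own trace.

```python
import torch
import torch.nn.functional as F
from torch.profiler import ProfilerActivity, profile

def count_op_calls(prof, op_name):
    """Sum call counts of aggregated profiler events named op_name."""
    return sum(evt.count for evt in prof.key_averages() if evt.key == op_name)

# Toy workload: three attention calls on small CPU tensors.
q = k = v = torch.randn(1, 2, 16, 8)  # (batch, heads, seq_len, head_dim)
with profile(activities=[ProfilerActivity.CPU]) as toy_prof:
    for _ in range(3):
        F.scaled_dot_product_attention(q, k, v)

n_calls = count_op_calls(toy_prof, "aten::scaled_dot_product_attention")
print(n_calls)
```

The same helper should work on the `prof` object from the profiled `predict_proba` run above.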

Top GPU kernels by device time (TabPFN3):

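| Rank | Kernel / Op | Device time | % of GPU | Count |
| --- | --- | --- | --- | --- |
| 1 | flash_fwd_kernel (Flash Attention) | 6,728 ms | 20.6% | 392 |
| 2 | scaled_dot_product_attention | 6,931 ms | 21.2% | 2,192 |
| 3 | linear (Q/K/V + FFN projections) | 1,125 ms | 3.4% | 28,256 |
| 4 | matmul | 449 ms | 1.4% | 11,960 |
| 5 | mm | 367 ms | 1.1% | 11,576 |
| 6 | copy_ | 297 ms | 0.9% | 46,209 |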

Top GPU kernels by device time (TabICL):

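| Rank | Kernel / Op | Device time | % of GPU | Count |
| --- | --- | --- | --- | --- |
| 1 | flash_fwd_kernel (Flash Attention) | 3,192 ms | 18.6% | 24 |
| 2 | scaled_dot_product_attention | 3,386 ms | 19.7% | 72 |
| 3 | linear | 659 ms | 3.8% | 944 |
| 4 | layer_norm | 454 ms | 2.6% | 358 |
| 5 | copy_ | 260 ms | 1.5% | 1,842 |
| 6 | addmm | 253 ms | 1.5% | 440 |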

Interpretation

Flash Attention is the largest single line item for both models at about 20% of GPU time. The kernel name is explicit: pytorch_flash::flash_fwd_kernel. This fused attention forward pass performs Q times K transpose, softmax, and attention times V in one CUDA kernel.

The critical observation is the call count. TabPFN3 issues 2,192 scaled_dot_product_attention calls. TabICL issues 72 calls for the same test set. That is a 30× difference in attention-layer executions. It translates to roughly 2× total Flash Attention kernel time: 6.9 seconds versus 3.4 seconds.

Similarly, linear is called 28,256 times by TabPFN3. TabICL calls it 944 times. The ratio is again about 30× in call count. It is about 1.7× in total time: 1.13 seconds versus 0.66 seconds.

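Why 30× in calls but only about 2× in wall time? Each TabPFN3 call is much cheaper on average. Dividing device time by call count gives about 3.2 ms per scaled_dot_product_attention call for TabPFN3 versus about 47 ms for TabICL, and about 0.04 ms versus 0.70 ms per linear call. That is consistent with TabICL pushing its 512-dim ICL input through fewer, larger calls. The product of calls times cost per call ends up at roughly 2×, which matches the wall-clock gap on sparknov in the benchmark tables.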
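
The arithmetic behind that comparison is short enough to write out. The sketch below divides the device times by the call counts from the two kernel tables above. Only the table values are from the profiler, and the helper name is mine.

```python
# (total device time in ms, call count) from the two kernel tables above.
sdpa = {"TabPFN3": (6_931, 2_192), "TabICL": (3_386, 72)}
linear = {"TabPFN3": (1_125, 28_256), "TabICL": (659, 944)}

def per_call_ms(total_ms, calls):
    """Average device time per call, in milliseconds."""
    return total_ms / calls

for op_name, rows in (("sdpa", sdpa), ("linear", linear)):
    (pfn_ms, pfn_calls), (icl_ms, icl_calls) = rows["TabPFN3"], rows["TabICL"]
    print(
        f"{op_name}: calls x{pfn_calls / icl_calls:.1f}, "
        f"total time x{pfn_ms / icl_ms:.2f}, "
        f"per call {per_call_ms(pfn_ms, pfn_calls):.3f} ms (PFN) vs "
        f"{per_call_ms(icl_ms, icl_calls):.3f} ms (ICL)"
    )
```

The call-count ratio and the total-time ratio differ by more than an order of magnitude, so the average TabPFN3 call has to be correspondingly cheaper than the average TabICL call.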

Bottom line: the roughly 2× slowdown is not a mysterious constant factor. TabPFN3’s transformer backbone executes roughly 30× more, individually cheaper, attention-layer operations per prediction. That comes from a deeper architecture with 1.93× as many total parameters.

How to reproduce the trace

The full profiler script is available in the companion repo. The key lines are above. After running, open tabpfn3_trace.json in Chrome’s about:tracing or Edge’s edge://tracing. You can see a visual timeline of every CUDA kernel launch.

Download traces:

| Trace | Size | Download |
| --- | --- | --- |
| TabPFN3 (sparknov 50k) | 26 MB (gz) | tabpfn3_trace.json.gz |
| TabICL (sparknov 50k) | 1.3 MB (gz) | tabicl_trace.json.gz |

Unzip with gunzip, then load the JSON into the trace viewer as described above.


Average Precision: a second lens

ROC-AUC tells us how well the model ranks fraud cases overall. But a fraud desk cares about precision at the top of the queue. They want to know how many alerts an analyst must review to catch the bulk of fraud. Average Precision answers this directly.
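
The analyst-workload question can also be answered directly from scores. This is a small sketch with made-up labels and scores. The function name `alerts_for_recall` is hypothetical and not part of the benchmark code.

```python
import numpy as np

def alerts_for_recall(y_true, scores, target_recall=0.8):
    """Alerts to review, highest score first, to reach target_recall."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(scores), kind="stable")
    hits = np.cumsum(y_true[order])  # frauds caught after reviewing k alerts
    needed = int(np.ceil(target_recall * hits[-1]))
    return int(np.searchsorted(hits, needed)) + 1

# Made-up example: 10 transactions, 5 of them fraud.
y_true = [0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
scores = [0.10, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.05]
print(alerts_for_recall(y_true, scores))  # 7 alerts catch 4 of the 5 frauds
```

AP summarises this trade-off across all recall levels, whereas this helper reports it at a single operating point.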

I re-ran the full size sweep with AP recording. I ran one job at a time to eliminate GPU contention. TabPFN3 used torch.inference_mode(). It also used torch.autocast with bfloat16 where supported. TabICL used inference_mode plus bfloat16 throughout. The tables below report ROC-AUC, AP, and wall-clock predict time from the same clean runs.

sparknov AP

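| Size | PFN ROC-AUC | PFN AP | PFN pred (s) | ICL ROC-AUC | ICL AP | ICL pred (s) | AP leader |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1,000 | 0.8722 | 0.1858 | 0.94 | 0.8571 | 0.1111 | 0.73 | PFN |
| 5,000 | 0.9090 | 0.3096 | 0.88 | 0.8712 | 0.3045 | 0.57 | PFN |
| 10,000 | 0.9472 | 0.3358 | 1.40 | 0.9348 | 0.3773 | 0.79 | ICL |
| 20,000 | 0.9384 | 0.3274 | 2.60 | 0.9469 | 0.3910 | 1.40 | ICL |
| 50,000 | 0.9570 | 0.3585 | 8.70 | 0.9624 | 0.4335 | 4.50 | ICL |
| 100,000 | 0.9695 | 0.3702 | 26.1 | 0.9673 | 0.4518 | 13.4 | ICL |
| 200,000 | 0.9720 | 0.3672 | 91.4 | 0.9686 | 0.4771 | 45.1 | ICL |
| 500,000 | 0.9696 | 0.3086 | 502 | 0.9662 | 0.4131 | 249 | ICL |
| 1,000,000 | 0.9587 | 0.2496 | 1,929 | 0.9529 | 0.3533 | 976 | ICL |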

At 1k and 5k, PFN won on both metrics. At 10k, ICL took the AP lead even though PFN still had the higher ROC-AUC. PFN degraded beyond 200k. Its AP dropped from 0.3672 at 200k to 0.2496 at 1M. That mirrors the ROC-AUC decline. ICL’s AP also fell, from 0.4771 to 0.4131 to 0.3533, but it kept a clear AP lead at every size from 10k up.

ieeecis AP

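| Size | PFN ROC-AUC | PFN AP | PFN pred (s) | ICL ROC-AUC | ICL AP | ICL pred (s) | AP leader |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1,000 | 0.8331 | 0.3595 | 1.90 | 0.8247 | 0.2875 | 1.60 | PFN |
| 5,000 | 0.8688 | 0.4586 | 2.50 | 0.8724 | 0.4843 | 2.00 | ICL |
| 10,000 | 0.8782 | 0.4905 | 3.00 | 0.8836 | 0.5170 | 2.40 | ICL |
| 20,000 | 0.8882 | 0.5144 | 5.30 | 0.8879 | 0.5276 | 3.70 | ICL |
| 50,000 | 0.9091 | 0.5819 | 12.9 | 0.9111 | 0.5940 | 8.90 | ICL |
| 100,000 | 0.9233 | 0.6173 | 33.8 | 0.9223 | 0.6200 | 29.1 | ICL |
| 200,000 | 0.9348 | 0.6423 | 103 | 0.3616 | 0.0312 | 69.2 | PFN |
| 500,000 | 0.9519 | 0.6867 | 534 | OOM | OOM | OOM | PFN |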

PFN improved steadily through 500k. Its AP went from 0.3595 to 0.6867. ICL was competitive up to 100k. At 200k it produced worse-than-chance predictions: ROC-AUC was 0.36 and AP was 0.03. That is a reproducible anomaly. It may point to a dataset-specific failure mode in TabICL’s batching at that size. ICL OOMs at 500k on my 32 GB GPU.

malurl AP

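| Size | PFN ROC-AUC | PFN AP | PFN pred (s) | ICL ROC-AUC | ICL AP | ICL pred (s) | AP leader |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1,000 | 0.8966 | 0.8747 | 1.70 | 0.9023 | 0.8798 | 0.90 | ICL |
| 5,000 | 0.9085 | 0.8872 | 2.30 | 0.9160 | 0.8966 | 0.56 | ICL |
| 10,000 | 0.9173 | 0.8981 | 3.20 | 0.9303 | 0.9140 | 0.78 | ICL |
| 20,000 | 0.9236 | 0.9042 | 5.30 | 0.9363 | 0.9202 | 1.30 | ICL |
| 50,000 | 0.9273 | 0.9088 | 13.9 | 0.9391 | 0.9250 | 3.40 | ICL |
| 100,000 | 0.9281 | 0.9094 | 35.6 | 0.9406 | 0.9265 | 8.70 | ICL |
| 200,000 | OOM | OOM | OOM | OOM | OOM | OOM | |
| 500,000 | OOM | OOM | OOM | OOM | OOM | OOM | |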

ICL led at every size. Both models OOM at 200k and above on malurl. The test set is unusually large at 65k rows. That exhausts 32 GB GPU memory. This is a hard ceiling. It is not model-specific.

fakejob AP

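| Size | PFN ROC-AUC | PFN AP | PFN pred (s) | ICL ROC-AUC | ICL AP | ICL pred (s) | AP leader |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1,000 | 0.9049 | 0.5510 | 0.27 | 0.9212 | 0.5733 | 0.22 | ICL |
| 5,000 | 0.9800 | 0.8405 | 0.44 | 0.9813 | 0.8513 | 0.24 | ICL |
| 10,000 | 0.9877 | 0.8905 | 0.77 | 0.9876 | 0.8862 | 0.46 | PFN |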

At 10k PFN edged ahead on both metrics, after trailing ICL at 1k and 5k. The margins at 5k and 10k are small.


Engineering: speeding up inference

The experiments below are separate validation runs. They do not modify the main benchmark numbers in the tables above.

I tested several PyTorch inference optimizations. I used a realistic imbalanced dataset with 20k samples, roughly 8% positive class, and 30 features. I fixed the random seed.

The fast path: inference_mode + bfloat16 autocast

| Config | Speedup vs baseline | ROC-AUC change |
| --- | --- | --- |
| baseline (plain no_grad) | | |
| torch.inference_mode() | +20.9% | +0.00 pp |
| torch.autocast("cuda", bfloat16) | +18.5% | −0.01 pp |
| inference_mode + autocast | +21.3% | −0.01 pp |

TabICL showed smaller gains at about 1.2% combined. Its backbone already runs near peak throughput.

Caveat: bfloat16 triggers "geqrf_cuda" not implemented for 'BFloat16' on TabPFN3 for some datasets. That happens specifically on ieeecis. It is likely due to QR-decomposition in preprocessing. When this occurs, fall back to inference_mode only.

torch.compile via PerformanceOptions

TabPFN3 exposes PerformanceOptions with enable_torch_compile=True as a first-class toggle since version 8.0.3. I compiled on the full production shape and then measured steady-state runs.

| Config | Median pred (50k train / 20k test) | Speedup | One-time compile tax |
| --- | --- | --- | --- |
| inference_mode + bfloat16 | 9.10 s | 1.00× | |
| enable_torch_compile=True | 8.93 s | 1.02× | 17.9 s |

torch.compile compiles correctly. But the steady-state gain is about 2%. That is inside run-to-run variance. The 18-second upfront compile tax is not amortized in a single-prediction-per-shape workload. It is not worth enabling for fraud-benchmark-style tasks.
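
A quick break-even estimate makes the amortization point concrete. The sketch uses the three numbers from the table above and assumes the steady-state saving per call stays constant, which the run-to-run variance makes optimistic.

```python
import math

BASELINE_S = 9.10     # inference_mode + bfloat16, median predict
COMPILED_S = 8.93     # enable_torch_compile=True, steady state
COMPILE_TAX_S = 17.9  # one-time compile cost

def break_even_calls(baseline_s, compiled_s, tax_s):
    """Predictions at one shape needed before compiling pays for itself."""
    saved_per_call = baseline_s - compiled_s
    if saved_per_call <= 0:
        return math.inf
    return math.ceil(tax_s / saved_per_call)

print(break_even_calls(BASELINE_S, COMPILED_S, COMPILE_TAX_S))  # 106
```

Under that assumption the compile tax is recovered only after about a hundred predictions at the same shape.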

Other PerformanceOptions findings:

| Option | Default v3 | Tested effect |
| --- | --- | --- |
| use_chunkwise_inference | True | Already default; no free win left |
| save_peak_memory_factor | 8 (when memory_saving_mode triggers) | Reduces peak memory; may already help at 500k+ |
| force_recompute_layer | False | Training-only; no-op under inference_mode |
| enable_torch_compile | False | 2% speedup after compile; not worth it |

Recommended inference wrapper:

import torch

def predict_fast(model, X_test, use_bfloat16=True):
    torch.backends.cudnn.benchmark = True
    with torch.inference_mode():
        if use_bfloat16:
            with torch.autocast("cuda", dtype=torch.bfloat16):
                return model.predict_proba(X_test)
        else:
            return model.predict_proba(X_test)
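
For datasets that hit the bfloat16 caveat described earlier, one option is to retry without autocast. The variant below is a sketch of that idea, not part of the benchmark code. The function name `predict_with_fallback` and the string check on the error message are my assumptions, and the `device_type` parameter exists mainly so the logic can be exercised on a CPU-only machine.

```python
import torch

def predict_with_fallback(model, X_test, device_type="cuda"):
    """Try inference_mode + bfloat16 autocast; retry without autocast
    if an op reports that it has no BFloat16 kernel."""
    with torch.inference_mode():
        try:
            with torch.autocast(device_type, dtype=torch.bfloat16):
                return model.predict_proba(X_test)
        except RuntimeError as err:
            if "BFloat16" not in str(err):
                raise  # unrelated failure: do not mask it
            # Autocast context has exited; this run uses full precision.
            return model.predict_proba(X_test)
```

The retry repeats the whole prediction, so on a dataset that always fails under bfloat16 it is cheaper to pass `use_bfloat16=False` to the wrapper above than to pay for the failed attempt each time.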

What about fraudecom?

I ran both models on the full fraudecom dataset. It has 120,889 train rows and 30,223 test rows. I obtained ROC-AUCs of roughly 50.6% for TabPFN3 and roughly 50.4% for TabICL. Those look like coin-flip performance. But the dataset itself is the bottleneck. The models are not the problem.

The FDB baselines confirm this. Auto-sklearn scores 51.5%. H2O scores 51.8%. AutoGluon scores 52.2%. AFD OFI scores 51.9%. Only AFD TFI breaks out at 63.6%. AFD TFI is an Amazon-internal model engineered for temporal fraud signals. The foundation models sit in the same cluster as the general-purpose AutoML tools.

The root cause is extreme temporal distribution shift. Fraudecom uses an out-of-time train/test split. The training period has a 10.6% fraud rate. The test period drops to roughly 4.6%. I measured Pearson correlations between every feature and the label. I compared training window values against test window values.

  • time_since_signup: r = −0.299 in train, r = 0.003 in test
  • purchase_value, source, browser, age, ip_address: all |r| < 0.005 in test

In other words, every predictive signal that exists in training evaporates in the test window. The models are not failing. The data distribution is.
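
The per-window correlation check can be sketched as follows. The data here is synthetic and stands in for an out-of-time split. It is not fraudecom, and the feature name `feature_a` and helper `label_correlations` are hypothetical.

```python
import numpy as np

def label_correlations(features, y):
    """Pearson r between each feature column and the binary label."""
    y = np.asarray(y, dtype=float)
    return {
        name: float(np.corrcoef(np.asarray(col, dtype=float), y)[0, 1])
        for name, col in features.items()
    }

# Synthetic stand-in for an out-of-time split (not fraudecom data):
# the feature tracks the label in "train" and is independent in "test".
rng = np.random.default_rng(0)
n = 20_000
y_train = (rng.random(n) < 0.10).astype(int)
y_test = (rng.random(n) < 0.05).astype(int)
train = {"feature_a": rng.normal(size=n) - 1.0 * y_train}
test = {"feature_a": rng.normal(size=n)}

r_train = label_correlations(train, y_train)["feature_a"]
r_test = label_correlations(test, y_test)["feature_a"]
print(f"train r={r_train:+.3f}  test r={r_test:+.3f}")
```

Running the same function on the train window and the test window, feature by feature, is enough to surface the kind of collapse described above.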


Caveats

  • I excluded fraudecom from the main sweep tables. See the section above. Extreme temporal distribution shift collapses every feature-label correlation in the test window.
  • ipblock and twitterbot errored due to zero usable features after FDB preprocessing. Those are data-pipeline failures. They are not model failures.
  • I used a single seed with random_state=42 for stratified subsampling. Results could shift with different seeds.
  • ieeecis 200k TabICL shows near-random predictions. Its ROC-AUC is 0.36 and AP is 0.03. This is a reproducible anomaly. It is not a corrupted run.
  • malurl 200k and above and ieeecis 500k TabICL OOM on a 32 GB GPU. Those are hard memory ceilings.

Take

| # | Finding |
| --- | --- |
| 1 | PFN wins through 10k rows on sparknov and at 1k on ieeecis. Start with PFN if labeled data is expensive. |
| 2 | ICL catches up by 20k–50k rows. The gap is usually 1–2 pp and disappears by 100k. |
| 3 | At 200k+ the winner depends on the dataset. PFN peaks then degrades on sparknov. ICL is more stable. On ieeecis PFN improves through 500k. |
| 4 | Inference cost is the bottleneck, not fit time. Fit time is a few seconds. Predict cost grows super-linearly. PFN costs 2–4× more per prediction on sparknov and malurl, and 1.2–1.6× more on ieeecis. Its backbone is about 2× larger. |
| 5 | torch.inference_mode() plus torch.autocast(bfloat16) gives PFN a clean +21% speedup with negligible (0.01 pp) ROC-AUC change. Enable it by default, and fall back to inference_mode alone where bfloat16 fails. |
| 6 | Dataset structure matters more than model hype. malurl favors ICL consistently. sparknov and ieeecis are close at 100k and diverge at 500k+. fraudecom is hard for everyone due to extreme temporal shift. |

References


  1. Amazon Science. Fraud Dataset Benchmark. https://github.com/amazon-science/fraud-dataset-benchmark