Machine: airig — AMD Ryzen 9 9900X, NVIDIA RTX 5090 FE, 64 GB RAM, Debian trixie
Software: Python 3.13.5, torch 2.12+cu130, tabpfn 8.0.3, tabicl 2.1.1
Datasets: Amazon Science fraud-dataset-benchmark (FDB), 4 fraud-detection datasets [1]

TL;DR

  1. Below ~10k rows, TabPFN usually wins. It reaches higher ROC-AUC with less data on sparknov (1k–10k) and at 1k on ieeecis; fakejob is mixed, with TabICL ahead at 1k.
  2. By ~100k rows the gap vanishes. Both models plateau; differences are inside run-to-run noise.
  3. TabPFN degrades beyond 200k on sparknov. Its attention mechanism appears to get swamped by noise at extreme scale.
  4. TabPFN is consistently slower, about 2× on sparknov and 1.2–4× on the other datasets; its transformer backbone has 1.93× as many parameters (53.15M vs 27.55M).
  5. torch.inference_mode() + torch.autocast(bfloat16) gives a clean +21% TabPFN speedup with negligible ROC-AUC change (−0.01 pp).

Why this benchmark matters

Fraud detection is the canonical rare-event prediction problem: fraud rates of 0.1–10%, severe class imbalance, and a production requirement to rank risky transactions correctly. ROC-AUC is the standard headline metric because it is insensitive to class imbalance, but in practice a fraud team cares about precision at the top of the queue: how many alerts must an analyst review to catch 80% of fraud?

This is why Average Precision (AP) — the area under the precision-recall curve — is often more informative for fraud than ROC-AUC. AP is sensitive to the positive class and directly reflects the quality of the alert queue. We report both in this sweep. A model with high ROC-AUC but low AP is still a bad fraud detector: it may rank most positives above most negatives while being imprecise at the decision threshold that matters.
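The analyst question above ("how many alerts to catch 80% of fraud?") can be answered directly from a scored test set. The following is a minimal stdlib sketch, not code from the benchmark; the function name `alerts_for_recall` and the toy labels and scores are made up for illustration.

```python
def alerts_for_recall(y_true, scores, target_recall=0.8):
    """Number of top-scored alerts to review before target_recall of fraud is caught."""
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("no positive cases in y_true")
    # Review queue: highest score first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    caught = 0
    for reviewed, idx in enumerate(order, start=1):
        caught += y_true[idx]
        if caught >= target_recall * total_pos:
            return reviewed
    return len(order)


# Toy example (hypothetical data): 3 fraud cases among 10 transactions.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.1, 0.7, 0.3, 0.2, 0.05, 0.4, 0.6, 0.15]
n_alerts = alerts_for_recall(y_true, scores)
print(n_alerts)
```

Two models with similar ROC-AUC can differ noticeably on this count, which is the practical reason for reporting AP next to ROC-AUC.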

Metrics we report

  • ROC-AUC — probability a random fraud case scores higher than a random legitimate case. Standard metric, imbalance-agnostic.
  • Average Precision (AP) — area under the precision-recall curve. More informative than ROC-AUC for rare events because it weights precision at each recall level.
  • Accuracy / Precision / Recall — standard definitions; available in raw results but secondary here.

If you have the predictions, AP is a one-liner:

from sklearn.metrics import average_precision_score
ap = average_precision_score(y_true, y_proba[:, 1])
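For intuition about what that one-liner measures, AP can also be computed by hand. The sketch below assumes there are no tied scores (scikit-learn groups ties by threshold, which this simplified version does not), and the toy data is hypothetical.

```python
def average_precision(y_true, scores):
    """Mean of the precision values observed at the rank of each positive."""
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("no positive cases in y_true")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if y_true[idx] == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / total_pos


# Toy example (hypothetical data): positives land at ranks 1, 3 and 5.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.1, 0.7, 0.3, 0.2, 0.05, 0.4, 0.6, 0.15]
ap = average_precision(y_true, scores)
print(round(ap, 4))
```

Each negative ranked above a positive lowers the precision recorded at that positive, which is why AP reacts so strongly to mistakes near the top of the queue.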

The models

TabPFN3 is a pretrained transformer designed for small tabular datasets (its sweet spot is ~100–10k rows). TabICL uses in-context learning and is designed to scale to larger tables.

Both models support GPU inference. We ran on identical stratified subsamples with random_state=42, identical FDB preprocessing, and device="cuda" on the RTX 5090.

Average Precision (AP) is reported alongside ROC-AUC in the AP tables further below. Under heavy class imbalance AP is typically much lower than ROC-AUC for the same model: a random ranker scores about 0.5 on ROC-AUC but only about the positive-class rate on AP. We recorded AP during the re-run so that both metrics come from the same predictions.


Method in brief

For each dataset we drew stratified subsamples of size 1k, 5k, 10k, 20k, 50k, 100k (and beyond where the dataset had rows). Both classifiers saw exactly the same rows. Feature preprocessing was identical via FDB: metadata columns dropped, categoricals label-encoded, train/test columns aligned.
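A stratified subsample keeps the fraud rate of the full training set at every size. The sketch below shows the idea with the standard library only; it is an illustration, not the benchmark's actual sampling code, and `stratified_subsample` is a hypothetical helper. In practice scikit-learn's `train_test_split(..., stratify=y, random_state=42)` is a common way to get the same effect.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, n, seed=42):
    """Return sorted row indices of a subsample of about n rows with the same class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    picked = []
    for y, idx in sorted(by_class.items()):
        # Per-class quota; rounding can make the total differ from n by a row or two.
        k = round(n * len(idx) / len(labels))
        picked.extend(rng.sample(idx, min(k, len(idx))))
    return sorted(picked)


# Toy example (hypothetical data): 5% fraud rate in 1,000 rows.
labels = [1] * 50 + [0] * 950
rows = stratified_subsample(labels, 100)
fraud_in_sample = sum(labels[i] for i in rows)
print(len(rows), fraud_in_sample)
```

Computing the index list once and passing the same rows to both classifiers is what guarantees that the comparison is like for like.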


Results

sparknov: the clearest story

sparknov is the dataset with the most complete size ladder (up to 1M train rows), so it anchors the narrative.

Size        PFN AUC   PFN pred (s)   ICL AUC   ICL pred (s)   Time ratio (PFN/ICL)
1,000       87.08%    1.11           85.71%    0.73           1.5×
5,000       90.98%    0.90           87.12%    0.54           1.7×
10,000      94.75%    1.35           93.48%    0.78           1.7×
20,000      94.12%    2.64           94.69%    1.42           1.9×
50,000      95.77%    8.75           96.24%    4.47           2.0×
100,000     96.92%    26.3           96.73%    13.3           2.0×
200,000     97.22%    92.0           96.86%    44.9           2.1×
500,000     96.93%    506.9          97.01%    252.9          2.0×
1,000,000   95.93%    1,956.3        96.70%    991.8          2.0×

PFN wins up to 10k; ICL catches up by 20k; both plateau around 100k. This is the headline pattern.

The most striking finding is what happens after 200k: PFN peaks at 97.22% and then drops to 95.93% at 1M — a 1.3 pp decline. More data actively hurt PFN on this dataset. ICL is more stable (96.86% → 97.01% → 96.70%). The most plausible explanation is that PFN’s attention mechanism gets swamped by noise when the context grows too large.

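Inference is roughly 2× slower for PFN from 20k rows up (1.5–1.7× below that). The ratio stays flat because both models' predict time grows with training size in the same way, so the gap comes down to a per-model constant factor rather than the number of training rows.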

ieeecis: corroboration

Size        PFN AUC   PFN pred (s)   ICL AUC   ICL pred (s)   Time ratio (PFN/ICL)
1,000       83.33%    2.0            82.47%    1.6            1.2×
5,000       86.88%    2.5            87.24%    2.0            1.2×
10,000      87.85%    3.1            88.36%    2.4            1.3×
20,000      88.84%    5.0            88.79%    3.7            1.3×
50,000      90.92%    13.0           91.11%    8.9            1.5×
100,000     92.30%    33.8           92.23%    28.8           1.2×
200,000     93.49%    103.2          93.36%    64.6           1.6×
500,000     95.20%    534.0          94.85%    360.9          1.5×

PFN wins at 1k, the models trade blows between 5k–20k, and by 100k they are neck-and-neck (92.30% vs 92.23%). Unlike sparknov, PFN keeps improving through 500k here. Dataset structure matters.

malurl: the ICL advantage

Size        PFN AUC   PFN pred (s)   ICL AUC   ICL pred (s)   Time ratio (PFN/ICL)
1,000       89.65%    1.7            90.23%    0.9            1.9×
5,000       90.80%    2.3            91.60%    0.6            4.1×
10,000      91.91%    3.2            93.03%    0.8            4.1×
20,000      92.57%    5.4            93.63%    1.3            4.2×
50,000      92.88%    14.1           93.91%    3.4            4.2×
100,000     93.02%    36.0           94.06%    8.7            4.1×
200,000     93.07%    108.3          94.06%    26.4           4.1×
500,000     93.12%    545.9          94.18%    135.3          4.0×

malurl is the exception: ICL leads at every size by roughly 1 pp. The gap does not widen with data, suggesting a genuine architectural advantage for ICL on this feature structure rather than a pure scaling effect.

The speed gap is also largest here (4×), likely because malurl has a larger test set (65k rows) relative to sparknov (20k).

fakejob: the small-data case

Size        PFN AUC   PFN pred (s)   ICL AUC   ICL pred (s)   Time ratio (PFN/ICL)
1,000       90.24%    0.3            92.12%    0.2            1.3×
5,000       98.06%    0.4            98.13%    0.2            1.8×
10,000      98.78%    0.8            98.76%    0.5            1.7×

fakejob has only 14,304 training rows so sizes >10k were skipped.

At 1k ICL leads; by 5k they are virtually tied; at 10k PFN edges ahead. This only partly matches the small-data pattern seen on sparknov and ieeecis, where PFN leads at 1k.

[Figure: ROC-AUC vs training size, all four datasets]

Timing

[Figure: Predict time vs training size]
[Figure: Per-row inference cost vs training size]

Both models scale super-linearly in predict time with training size, but PFN has a larger constant factor. Per-row cost for PFN on sparknov grows from ~0.06 ms/row at 1k to ~25 ms/row at 500k.
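The per-row figures and the "super-linear" claim can be checked from the sparknov table with a few lines of arithmetic. This is a sketch for verification only: the helper names are made up, the timings are copied from the sparknov table, and the 20,000-row test set is the size stated for sparknov in this article.

```python
import math

SPARKNOV_TEST_ROWS = 20_000  # sparknov test-set size as stated in the article


def ms_per_row(predict_seconds, n_test_rows=SPARKNOV_TEST_ROWS):
    """Average predict cost per test row, in milliseconds."""
    return predict_seconds * 1000.0 / n_test_rows


def scaling_exponent(n1, t1, n2, t2):
    """Exponent k in t ~ n**k between two (train size, predict time) points."""
    return math.log(t2 / t1) / math.log(n2 / n1)


# PFN predict times (seconds) from the sparknov table.
pfn_pred_s = {1_000: 1.11, 100_000: 26.3, 500_000: 506.9}
per_row = {size: ms_per_row(s) for size, s in pfn_pred_s.items()}
exponent = scaling_exponent(100_000, 26.3, 500_000, 506.9)
print(per_row, round(exponent, 2))
```

An exponent above 1 between 100k and 500k means predict time grows faster than the training set does, which is what "super-linear" refers to here.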

Why the 2–4× gap? See the Profiling section below. The short version: TabPFN has 1.93× as many parameters (53.15M vs 27.55M) and issues ~30× more attention-layer calls per prediction, with Flash Attention kernel time accounting for ~20% of GPU time in both models.


Profiling: why TabPFN is ~2× slower

All profiling below was done on airig (RTX 5090, torch 2.12.0+cu130, tabpfn 8.0.3, tabicl 2.1.1). We used torch.profiler with ProfilerActivity.CPU and ProfilerActivity.CUDA, plus record_shapes=True and with_flops=True.

Step 1: confirm model size difference

We measured parameter counts on fitted models using sum(p.numel() for p in model.parameters() if p.requires_grad):

Model    Trainable parameters   Relative size
TabPFN   53,153,136             1.93×
TabICL   27,552,250             1.00×

[Figure: TabPFN3 vs TabICL model size comparison]

TabPFN is essentially 2× larger. For a transformer that is compute-bound, this immediately predicts roughly 2× the wall-clock time per forward pass.

What the architectures actually look like

The parameter gap is not a single big layer — it is a deeper stack. Both models share the same high-level pattern (column embedding → row interaction → ICL transformer → output), but TabPFN3 doubles the ICL transformer depth:

[Figure: TabPFN3 vs TabICL architecture comparison]

TabPFN3 uses 24 ICL transformer layers against TabICL’s 12. That 2× depth contributes to the ~30× attention-call gap we measured in the profiler (2,192 scaled_dot_product_attention calls vs 72), but it cannot account for all of it; the rest presumably comes from how each model splits a prediction into separate forward passes, which we did not isolate. The per-layer dimensions are similar: both use 128-dim embeddings and 8-head attention in their early stages, but TabPFN3’s decoder adds an extra many-class attention head (6 heads, dim 64) that TabICL does not have.

TabICL compensates for its shallower ICL stack by concatenating the 4 row-CLS tokens, giving its ICL transformer a 512-dim input (128 × 4). TabPFN3 keeps the ICL dimension at 128 but processes it through twice as many layers. Summed over all components, TabPFN3 ends up with 1.93× the parameters, which is close to the ~2× wall-clock gap on sparknov.

Step 2: torch.profiler trace on sparknov 50k

We traced a single predict_proba(X_test) call on sparknov with 50k training rows and 20k test rows.

Why this size? Large enough that GPU is saturated, small enough that the trace fits in memory.

Profiler setup:

import torch
from tabpfn import TabPFNClassifier

# X_train, y_train, X_test: the sparknov 50k-train / 20k-test split,
# already preprocessed via FDB (loading code omitted here).

# 1. Fit the model on a subsample
pfn = TabPFNClassifier(device="cuda")
pfn.fit(X_train, y_train)

# 2. Warm-up (exclude compilation / CUDA init from trace)
_ = pfn.predict_proba(X_test[:100])
torch.cuda.synchronize()

# 3. Profile the real run
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True,
    with_flops=True,
    profile_memory=True,
) as prof:
    _ = pfn.predict_proba(X_test)
    torch.cuda.synchronize()  # wait for all kernels so they land inside the trace

# 4. Export for Chrome trace viewer
prof.export_chrome_trace("tabpfn3_trace.json")

Top GPU kernels by device time (TabPFN3):

Rank   Kernel / Op                          Device time   % of GPU   Count
1      flash_fwd_kernel (Flash Attention)   6,728 ms      20.6%      392
2      scaled_dot_product_attention         6,931 ms      21.2%      2,192
3      linear (Q/K/V + FFN projections)     1,125 ms      3.4%       28,256
4      matmul                               449 ms        1.4%       11,960
5      mm                                   367 ms        1.1%       11,576
6      copy_                                297 ms        0.9%       46,209

Top GPU kernels by device time (TabICL):

Rank   Kernel / Op                          Device time   % of GPU   Count
1      flash_fwd_kernel (Flash Attention)   3,192 ms      18.6%      24
2      scaled_dot_product_attention         3,386 ms      19.7%      72
3      linear                               659 ms        3.8%       944
4      layer_norm                           454 ms        2.6%       358
5      copy_                                260 ms        1.5%       1,842
6      addmm                                253 ms        1.5%       440

Interpretation

Flash Attention dominates for both models (~20% of GPU time). The kernel name is explicit: pytorch_flash::flash_fwd_kernel. This is the fused attention forward pass that performs Q·K^T, softmax, and attention·V in one CUDA kernel.

The critical observation is the call count: TabPFN3 issues 2,192 scaled_dot_product_attention calls vs TabICL’s 72 calls for the same test set. That’s a 30× difference in attention-layer executions, which translates to roughly 2× the total Flash Attention kernel time (6.9s vs 3.4s).

Similarly, linear (the Q/K/V projection and FFN matmuls) is called 28,256 times by TabPFN3 vs 944 times by TabICL. The ratio is again ~30× in call count and ~1.7× in total time (1.13s vs 0.66s).

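Why 30× in calls but only ~2× in wall time? Because each TabPFN3 call is much cheaper: dividing device time by call count gives about 3.2 ms per scaled_dot_product_attention call for TabPFN3 versus about 47 ms for TabICL. TabPFN3 splits the work into many small calls, while TabICL issues a few large ones. The product of (calls × time per call) ends up at roughly 2×, which matches the wall-clock gap in the sparknov benchmark table.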
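The call-count versus wall-time relationship can be read straight off the two kernel tables. The sketch below only redoes that arithmetic; the dictionary layout and the `compare` helper are illustrative, and the numbers are copied from the profiler tables above.

```python
# (total device time in ms, call count), copied from the two profiler tables.
sdpa = {"TabPFN3": (6931, 2192), "TabICL": (3386, 72)}
linear = {"TabPFN3": (1125, 28256), "TabICL": (659, 944)}


def compare(op):
    """Ratios of call count and total time, plus average time per call."""
    pfn_ms, pfn_calls = op["TabPFN3"]
    icl_ms, icl_calls = op["TabICL"]
    return {
        "call_ratio": pfn_calls / icl_calls,
        "time_ratio": pfn_ms / icl_ms,
        "pfn_ms_per_call": pfn_ms / pfn_calls,
        "icl_ms_per_call": icl_ms / icl_calls,
    }


sdpa_stats = compare(sdpa)
linear_stats = compare(linear)
for name, stats in (("sdpa", sdpa_stats), ("linear", linear_stats)):
    print(name, {k: round(v, 3) for k, v in stats.items()})
```

For both ops the call ratio is near 30× while the time ratio stays near 2×, so the average TabPFN3 call is roughly fifteen times cheaper than the average TabICL call.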

Bottom line: The ~2× slowdown is not a mysterious constant factor. It follows from the total attention and projection work: TabPFN3 issues ~30× more attention-layer calls per prediction and has 1.93× as many parameters, and its summed kernel time comes out at roughly 2× TabICL’s.

How to reproduce the trace

The full profiler script is available in the companion repo. The key lines are above. After running, open tabpfn3_trace.json in Chrome’s about:tracing or Edge’s edge://tracing to see a visual timeline of every CUDA kernel launch.

Download our traces:

Trace                    Size          Download
TabPFN3 (sparknov 50k)   26 MB (gz)    tabpfn3_trace.json.gz
TabICL (sparknov 50k)    1.3 MB (gz)   tabicl_trace.json.gz

Unzip with gunzip and load into Chrome’s about:tracing to explore every CUDA kernel launch interactively.
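If you prefer a table to a timeline, the exported trace can also be summarized with the standard library. This sketch assumes the usual Chrome trace layout (a top-level "traceEvents" list, complete events marked "ph": "X", durations in microseconds) and assumes that GPU kernels carry the category "kernel"; check the category names in your own trace before relying on it. `kernel_totals` and `load_trace` are hypothetical helpers, and the demo events are made up.

```python
import gzip
import json
from collections import defaultdict


def load_trace(path):
    """Load a Chrome trace from a .json or .json.gz file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)


def kernel_totals(trace, category="kernel"):
    """Return (name, total_ms, count) rows for one event category, slowest first."""
    totals = defaultdict(lambda: [0.0, 0])
    for ev in trace.get("traceEvents", []):
        if ev.get("ph") == "X" and ev.get("cat") == category:
            entry = totals[ev["name"]]
            entry[0] += ev.get("dur", 0) / 1000.0  # microseconds -> milliseconds
            entry[1] += 1
    rows = [(name, ms, n) for name, (ms, n) in totals.items()]
    return sorted(rows, key=lambda r: r[1], reverse=True)


# Tiny in-memory trace (hypothetical events) to show the output shape.
demo_trace = {
    "traceEvents": [
        {"ph": "X", "cat": "kernel", "name": "flash_fwd_kernel", "dur": 2000},
        {"ph": "X", "cat": "kernel", "name": "flash_fwd_kernel", "dur": 1000},
        {"ph": "X", "cat": "cpu_op", "name": "aten::linear", "dur": 500},
        {"ph": "X", "cat": "kernel", "name": "gemm", "dur": 500},
    ]
}
rows = kernel_totals(demo_trace)
print(rows)
```

On a real file the call would look like `kernel_totals(load_trace("tabpfn3_trace.json.gz"))`.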


Average Precision: a second lens

ROC-AUC tells us how well the model ranks fraud cases overall, but a fraud desk cares about precision at the top of the queue: how many alerts must an analyst review to catch the bulk of fraud? Average Precision (AP) answers this directly.

We re-ran the full size sweep with AP recording, one job at a time to eliminate GPU contention. TabPFN3 used torch.inference_mode() (and torch.autocast(bfloat16) where supported; see caveats below). TabICL used inference_mode + bfloat16 throughout. The tables below report both ranking quality (ROC-AUC and AP) and wall-clock predict time from the same clean runs.

sparknov AP

Size        PFN ROC-AUC   PFN AP   PFN pred (s)   ICL ROC-AUC   ICL AP   ICL pred (s)   AP leader
1,000       0.8722        0.1858   0.94           0.8571        0.1111   0.73           PFN
5,000       0.9090        0.3096   0.88           0.8712        0.3045   0.57           PFN
10,000      0.9472        0.3358   1.40           0.9348        0.3773   0.79           ICL
20,000      0.9384        0.3274   2.60           0.9469        0.3910   1.40           ICL
50,000      0.9570        0.3585   8.70           0.9624        0.4335   4.50           ICL
100,000     0.9695        0.3702   26.1           0.9673        0.4518   13.4           ICL
200,000     0.9720        0.3672   91.4           0.9686        0.4771   45.1           ICL
500,000     0.9696        0.3086   502            0.9662        0.4131   249            ICL
1,000,000   0.9587        0.2496   1,929          0.9529        0.3533   976            ICL

At 1k–5k PFN wins on both metrics. By 10k, ICL takes the AP lead even though PFN still has the higher ROC-AUC. PFN degrades beyond 200k: its AP drops from 0.3672 at 200k to 0.2496 at 1M, mirroring the ROC-AUC decline. ICL’s AP also falls over that range (0.4771 → 0.4131 → 0.3533) but stays well ahead of PFN’s at every size from 10k up.

ieeecis AP

Size        PFN ROC-AUC   PFN AP   PFN pred (s)   ICL ROC-AUC   ICL AP   ICL pred (s)   AP leader
1,000       0.8331        0.3595   1.90           0.8247        0.2875   1.60           PFN
5,000       0.8688        0.4586   2.50           0.8724        0.4843   2.00           ICL
10,000      0.8782        0.4905   3.00           0.8836        0.5170   2.40           ICL
20,000      0.8882        0.5144   5.30           0.8879        0.5276   3.70           ICL
50,000      0.9091        0.5819   12.9           0.9111        0.5940   8.90           ICL
100,000     0.9233        0.6173   33.8           0.9223        0.6200   29.1           ICL
200,000     0.9348        0.6423   103            0.3616        0.0312   69.2           PFN
500,000     0.9519        0.6867   534            OOM           OOM      OOM            PFN

PFN improves steadily through 500k (AP 0.3595 → 0.6867). ICL is competitive up to 100k but produces near-random predictions at 200k (ROC-AUC 0.36, AP 0.03), a reproducible anomaly that suggests a dataset-specific failure mode in TabICL’s batching at that size. The earlier ROC-AUC sweep recorded 93.36% for ICL at 200k, so the failure appears specific to this re-run’s configuration. ICL OOMs at 500k on our 32 GB GPU.

malurl AP

Size        PFN ROC-AUC   PFN AP   PFN pred (s)   ICL ROC-AUC   ICL AP   ICL pred (s)   AP leader
1,000       0.8966        0.8747   1.70           0.9023        0.8798   0.90           ICL
5,000       0.9085        0.8872   2.30           0.9160        0.8966   0.56           ICL
10,000      0.9173        0.8981   3.20           0.9303        0.9140   0.78           ICL
20,000      0.9236        0.9042   5.30           0.9363        0.9202   1.30           ICL
50,000      0.9273        0.9088   13.9           0.9391        0.9250   3.40           ICL
100,000     0.9281        0.9094   35.6           0.9406        0.9265   8.70           ICL
200,000     OOM           OOM      OOM            OOM           OOM      OOM            n/a
500,000     OOM           OOM      OOM            OOM           OOM      OOM            n/a

ICL leads at every size. Both models OOM at 200k+ on malurl in this re-run because the test set is unusually large (65k rows), exhausting 32 GB GPU memory. This is a hard ceiling, not a model-specific issue.

fakejob AP

Size        PFN ROC-AUC   PFN AP   PFN pred (s)   ICL ROC-AUC   ICL AP   ICL pred (s)   AP leader
1,000       0.9049        0.5510   0.27           0.9212        0.5733   0.22           ICL
5,000       0.9800        0.8405   0.44           0.9813        0.8513   0.24           ICL
10,000      0.9877        0.8905   0.77           0.9876        0.8862   0.46           PFN

ICL leads on AP at 1k and 5k; at 10k PFN edges ahead on both metrics. With only three sizes, fakejob gives weak support at best for a PFN small-data advantage.


Engineering: speeding up inference

The experiments below are separate validation runs. They do not modify the main benchmark numbers reported in the tables above.

We tested several PyTorch inference optimizations on a realistic imbalanced dataset (20k samples, ~8% positive class, 30 features) with a fixed random seed.

The fast path: inference_mode + bfloat16 autocast

Config                             Speedup vs baseline   ROC-AUC change
baseline (plain no_grad)           -                     -
torch.inference_mode()             +20.9%                +0.00 pp
torch.autocast("cuda", bfloat16)   +18.5%                −0.01 pp
inference_mode + autocast          +21.3%                −0.01 pp

TabICL showed smaller gains (~1.2% combined), likely because its backbone already runs near peak throughput.

Caveat: bfloat16 triggers "geqrf_cuda" not implemented for 'BFloat16' on TabPFN3 for some datasets (specifically ieeecis, likely due to QR-decomposition in preprocessing). When this occurs, fall back to inference_mode only.

torch.compile via PerformanceOptions

TabPFN3 exposes PerformanceOptions(enable_torch_compile=True) as a first-class toggle (since 8.0.3). We tested it properly: compile on the full production shape, then measure steady-state runs.

Config                      Median pred (50k train / 20k test)   Speedup   One-time compile tax
inference_mode + bfloat16   9.10 s                               1.00×     -
enable_torch_compile=True   8.93 s                               1.02×     17.9 s

Verdict: torch.compile compiles correctly, but the steady-state gain (~2%) is inside run-to-run variance. The 18-second upfront compile tax is not amortized in a single-prediction-per-shape workload. Not worth enabling for fraud-benchmark-style tasks.
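The amortization argument can be made concrete with a break-even count: how many same-shape predictions are needed before the compile tax is paid back. The helper name below is made up; the three timings come from the table above.

```python
import math


def breakeven_calls(compile_s, baseline_s, compiled_s):
    """Smallest number of predictions after which compiling is a net win."""
    saving_per_call = baseline_s - compiled_s
    if saving_per_call <= 0:
        return math.inf  # compiled path is not faster, so it never pays off
    return math.ceil(compile_s / saving_per_call)


calls = breakeven_calls(compile_s=17.9, baseline_s=9.10, compiled_s=8.93)
print(calls)
```

Even taking the 2% gain at face value, the compile tax only pays off after roughly a hundred predictions at the same shape, which a one-prediction-per-size sweep never reaches.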

Other PerformanceOptions findings:

Option                    Default (v3)                          Tested effect
use_chunkwise_inference   True                                  Already default; no free win left
save_peak_memory_factor   8 (when memory_saving_mode triggers)  Reduces peak memory; may already help at 500k+
force_recompute_layer     False                                 Training-only; no-op under inference_mode
enable_torch_compile      False                                 2% speedup after compile; not worth it

Recommended inference wrapper:

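import torch

def predict_fast(model, X_test, use_bfloat16=True):
    """Run predict_proba under inference_mode, optionally with bfloat16 autocast."""
    # Let cuDNN pick the fastest kernels for the observed shapes.
    torch.backends.cudnn.benchmark = True
    # enabled=False makes autocast a no-op, so one code path covers both cases.
    with torch.inference_mode(), torch.autocast(
        "cuda", dtype=torch.bfloat16, enabled=use_bfloat16
    ):
        return model.predict_proba(X_test)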

What about fraudecom?

We ran both models on the full fraudecom dataset (120,889 train / 30,223 test) and obtained ROC-AUCs of ~50.6% (TabPFN3) and ~50.4% (TabICL). These look like coin-flip performance, but the dataset itself is the bottleneck — not the models.

The FDB baselines confirm this. Auto-sklearn scores 51.5%, H2O 51.8%, AutoGluon 52.2%, and AFD OFI 51.9%. Only AFD TFI, an Amazon-internal model engineered specifically for temporal fraud signals, breaks out at 63.6%. The foundation models sit squarely in the same cluster as the general-purpose AutoML tools.

The root cause is extreme temporal distribution shift. Fraudecom uses an out-of-time train/test split. The training period has a 10.6% fraud rate; the test period drops to ~4.6%. We measured Pearson correlations between every feature and the label in the training window versus the test window:

  • time_since_signup: r = −0.299 in train, r = 0.003 in test
  • purchase_value, source, browser, age, ip_address: all |r| < 0.005 in test

In other words, every predictive signal that exists in training evaporates in the test window. The models are not failing — the data distribution is.
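The train-versus-test correlation check is easy to repeat on any out-of-time split. The sketch below uses a plain-Python Pearson correlation on made-up toy columns; `pearson_r`, `correlation_shift`, and the data are illustrative and are not the fraudecom measurements.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (sx * sy)


def correlation_shift(train_cols, train_y, test_cols, test_y):
    """Map each feature name to (r in train, r in test)."""
    return {
        name: (pearson_r(train_cols[name], train_y), pearson_r(test_cols[name], test_y))
        for name in train_cols
    }


# Toy data (hypothetical): the feature predicts the label in train but not in test.
train_cols = {"time_since_signup": [1, 2, 3, 4]}
test_cols = {"time_since_signup": [1, 2, 3, 4]}
shift = correlation_shift(train_cols, [1, 1, 0, 0], test_cols, [1, 0, 0, 1])
r_train, r_test = shift["time_since_signup"]
print(round(r_train, 3), round(r_test, 3))
```

A feature whose correlation is clearly non-zero in train and near zero in test is a sign of the kind of shift described here, although Pearson correlation only captures linear relationships.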


Caveats

  • fraudecom is excluded from the main sweep tables. See the section above. Extreme temporal distribution shift collapses every feature-label correlation in the test window.
  • ipblock and twitterbot errored due to zero usable features after FDB preprocessing — data-pipeline failures, not model failures.
  • Single seed (random_state=42) for stratified subsampling. Results could shift with different seeds.
  • ieeecis 200k TabICL shows near-random predictions (ROC-AUC 0.36, AP 0.03). This is a reproducible anomaly, not a corrupted run.
  • malurl 200k+ (both models) and ieeecis 500k (TabICL) OOM on a 32 GB GPU in the AP re-run. These are hard memory ceilings.

Take

  1. PFN is the better pick below 10k rows on most datasets. It is more sample-efficient than ICL on sparknov, and at 1k on ieeecis. If labeled data is expensive, start with PFN.

  2. By ~20k–50k rows ICL catches up and often leads. The gap is usually under ~1 pp and disappears by 100k on most datasets.

  3. At 200k+ the picture depends on the dataset. PFN peaks and then degrades on sparknov; ICL is more stable. On ieeecis PFN keeps improving through 500k. There is no universal winner at scale.

  4. Inference cost, not fit time, is the bottleneck. Fit time is usually a few seconds. Predict cost grows super-linearly, and PFN costs 1.2–4× more per prediction than ICL, consistent with a backbone that is about 2× larger.

  5. inference_mode + autocast(bfloat16) gives a clean +21% TabPFN speedup with negligible ROC-AUC change (−0.01 pp). Enable it by default, falling back to inference_mode alone where bfloat16 is unsupported.

  6. Dataset structure matters more than model hype. malurl consistently favors ICL; sparknov and ieeecis are close at 100k and diverge differently at 500k+. fraudecom is hard for everyone due to extreme temporal shift. There is no universal winner.


References


  1. Amazon Science. Fraud Dataset Benchmark. https://github.com/amazon-science/fraud-dataset-benchmark