TabPFN3 vs TabICL: a matched-size fraud-benchmark sweep

Machine: airig — AMD Ryzen 9 9900X, NVIDIA RTX 5090 FE, 64 GB RAM, Debian trixie
Software: Python 3.13.5, torch 2.12+cu130, tabpfn 8.0.3, tabicl 2.1.1
Datasets: Amazon Science fraud-dataset-benchmark (FDB) — 4 fraud-detection datasets¹

TL;DR

PFN wins below ~10k rows, most clearly on sparknov. I ran stratified subsamples from 1k to 10k rows on sparknov, ieeecis, and fakejob. PFN reached higher ROC-AUC than ICL at every size in that range on sparknov, but only at 1k on ieeecis and only at 10k (by 0.02 pp) on fakejob.
By ~100k rows the gap vanishes. I ran both models on sparknov at 100k rows. PFN scored 96.92% and ICL scored 96.73%, and both curves flattened around that point. With a single seed I cannot separate a 0.19 pp gap from run-to-run noise.
PFN degrades beyond 200k on sparknov. I trained PFN on sparknov up to 1M rows. Its ROC-AUC peaked at 97.22% at 200k rows. Then it dropped to 95.93% at 1M rows. One hypothesis for that 1.3 percentage point decline, which I did not test, is that its attention mechanism gets swamped by noise at extreme scale.
PFN is consistently slower, typically ~2×. I timed inference on identical hardware for both models. PFN took roughly twice as long on sparknov, 1.2–1.6× as long on ieeecis, and about 4× as long on malurl. Its transformer backbone has 1.93× as many parameters. The exact counts are 53.15M trainable parameters for PFN versus 27.55M for ICL.
torch.inference_mode() + torch.autocast(bfloat16) gives a clean +21% TabPFN speedup with negligible ROC-AUC change. I wrapped PFN prediction calls in torch.inference_mode() plus torch.autocast with bfloat16 dtype. That gave a 21.3% speedup over the plain no_grad baseline, with a ROC-AUC change of −0.01 pp.

Why this benchmark matters

Fraud detection is the canonical rare-event prediction problem. Fraud rates sit between 0.1% and 10%. Class imbalance is severe. Production teams need to rank risky transactions correctly.

ROC-AUC is the standard headline metric. It measures the probability that a random fraud case scores higher than a random legitimate case. It is insensitive to class imbalance.

But fraud teams actually care about precision at the top of the queue. They want to know how many alerts an analyst must review to catch 80% of fraud. That is a different question.

Average Precision answers that question. It is the area under the precision-recall curve. AP is sensitive to the positive class. It directly reflects alert-queue quality.

I report both metrics in this sweep. A model can have high ROC-AUC but low AP. That model still ranks most positives above most negatives. Yet it is imprecise at the decision threshold that matters.

Metrics I report

ROC-AUC — probability a random fraud case scores higher than a random legitimate case. Standard metric, imbalance-agnostic.
Average Precision (AP) — area under the precision-recall curve. More informative than ROC-AUC for rare events because it weights precision at each recall level.
Accuracy / Precision / Recall — standard definitions available in raw results. I treat them as secondary here.

If you have predictions, AP is a one-liner:

from sklearn.metrics import average_precision_score
ap = average_precision_score(y_true, y_proba[:, 1])
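
To make the ROC-AUC versus AP distinction concrete, here is a minimal dependency-free sketch of what the two metrics compute on a toy alert queue. The helper names `roc_auc` and `average_precision` and the 100-row toy data are illustrative assumptions, not part of the benchmark code. The AP helper matches the usual definition only when no scores are tied.

```python
def roc_auc(y_true, scores):
    """Fraction of (fraud, legit) pairs where the fraud case scores higher."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # Ties count as half a win, the usual convention.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of precision-at-rank over the ranks where a fraud case appears."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, precision_sum = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits


# Toy queue (synthetic): 100 alerts, 2 frauds, ranked 5th and 6th by score.
scores = [1.0 - i / 100 for i in range(100)]
y_true = [0, 0, 0, 0, 1, 1] + [0] * 94

auc = roc_auc(y_true, scores)
ap = average_precision(y_true, scores)
print(f"ROC-AUC = {auc:.3f}, AP = {ap:.3f}")
```

Both frauds outrank 94 of the 98 legitimate rows, so ROC-AUC comes out near 0.96. An analyst working from the top of the queue still sees four false alerts first, and AP comes out near 0.27.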

The models

TabPFN3 is a pretrained transformer designed for small tabular datasets. Its sweet spot is roughly 100 to 10,000 rows. TabICL uses in-context learning and is designed to scale to larger tables.

Both models support GPU inference. I ran identical stratified subsamples with random_state=42. I used identical FDB preprocessing. Both models ran on device="cuda" on the RTX 5090.

Method Glossary

Method	One-sentence explanation
TabPFN3	Pretrained transformer for small tabular classification, typically 100–10k rows.
TabICL	In-context learning tabular classifier designed to scale to larger tables.
FDB	Amazon Science fraud-dataset-benchmark preprocessing pipeline.
torch.profiler	PyTorch profiler for kernel-level GPU and CPU tracing.
torch.inference_mode	PyTorch context that disables gradient bookkeeping for faster inference.
torch.autocast	PyTorch mixed-precision wrapper that runs ops in bfloat16 where safe.
PerformanceOptions	TabPFN3 configuration object for compile and memory toggles.

Method in two sentences

For each dataset I drew stratified subsamples of size 1k, 5k, 10k, 20k, 50k, and 100k, plus 200k, 500k, and 1M where the dataset had enough rows. Both classifiers saw exactly the same rows. Feature preprocessing was identical via FDB: metadata columns were dropped, categoricals were label-encoded, and train and test columns were aligned.
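
As an illustration of the subsampling step, here is a minimal stdlib sketch of a seeded stratified subsample. The function name `stratified_subsample` and the synthetic labels are assumptions for this example. The sweep itself may well have used scikit-learn's stratified splitting instead.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, n, seed=42):
    """Return about n row indices whose class proportions match the full label list."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    total = len(labels)
    picked = []
    for _, idxs in sorted(by_class.items()):
        # Proportional allocation; keep at least one row of every class.
        k = max(1, round(n * len(idxs) / total))
        picked.extend(rng.sample(idxs, min(k, len(idxs))))
    rng.shuffle(picked)
    return picked  # rounding can make the length differ slightly from n


# Synthetic labels with a 5% fraud rate (not FDB data).
labels = [1] * 50 + [0] * 950
rows = stratified_subsample(labels, 100)
print(len(rows), sum(labels[i] for i in rows))
```

Passing the same index list to both classifiers is what guarantees that they see exactly the same rows.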

Results

sparknov: the clearest story

sparknov is the dataset with the most complete size ladder. I ran subsamples up to 1M training rows. It anchors the narrative.

Size	PFN AUC	PFN pred (s)	ICL AUC	ICL pred (s)	Ratio (PFN/ICL time)
1,000	87.08%	1.11	85.71%	0.73	1.5×
5,000	90.98%	0.90	87.12%	0.54	1.7×
10,000	94.75%	1.35	93.48%	0.78	1.7×
20,000	94.12%	2.64	94.69%	1.42	1.9×
50,000	95.77%	8.75	96.24%	4.47	2.0×
100,000	96.92%	26.3	96.73%	13.3	2.0×
200,000	97.22%	92.0	96.86%	44.9	2.1×
500,000	96.93%	506.9	97.01%	252.9	2.0×
1,000,000	95.93%	1,956.3	96.70%	991.8	2.0×

I measured PFN ROC-AUC by fitting TabPFNClassifier on each subsample and scoring on the held-out test set via sklearn.metrics.roc_auc_score. PFN won at every size through 10k rows. ICL caught up by 20k rows. Both plateaued around 100k rows.

That is the headline pattern.

I ran PFN on sparknov beyond 200k rows. Its ROC-AUC peaked at 97.22% at 200k rows. Then it dropped to 95.93% at 1M rows. That is a 1.3 percentage point decline. More data actively hurt PFN on this dataset.

ICL stayed more stable. Its ROC-AUC went from 96.86% at 200k to 97.01% at 500k to 96.70% at 1M. One hypothesis, which I did not test, is that PFN’s attention mechanism gets swamped by noise when the context grows too large.

I timed inference on identical hardware for both models. PFN was about 1.5× slower at 1k rows and about 2× slower from 20k rows onward. Absolute predict time grows steeply with training size for both models, but the ratio between them stays flat once the GPU is busy. That points to the relative model size, not the number of training rows, as the driver of the gap.

ieeecis: corroboration

Size	PFN AUC	PFN pred (s)	ICL AUC	ICL pred (s)	Ratio (PFN/ICL time)
1,000	83.33%	2.0	82.47%	1.6	1.2×
5,000	86.88%	2.5	87.24%	2.0	1.2×
10,000	87.85%	3.1	88.36%	2.4	1.3×
20,000	88.84%	5.0	88.79%	3.7	1.3×
50,000	90.92%	13.0	91.11%	8.9	1.5×
100,000	92.30%	33.8	92.23%	28.8	1.2×
200,000	93.49%	103.2	93.36%	64.6	1.6×
500,000	95.20%	534.0	94.85%	360.9	1.5×

PFN won at 1k on ieeecis. The models traded blows between 5k and 20k rows. By 100k they were neck-and-neck. PFN scored 92.30% and ICL scored 92.23%. PFN kept improving through 500k on ieeecis. That differs from sparknov, where PFN peaked and degraded. Dataset structure matters.

malurl: the ICL advantage

Size	PFN AUC	PFN pred (s)	ICL AUC	ICL pred (s)	Ratio (PFN/ICL time)
1,000	89.65%	1.7	90.23%	0.9	1.9×
5,000	90.80%	2.3	91.60%	0.6	4.1×
10,000	91.91%	3.2	93.03%	0.8	4.1×
20,000	92.57%	5.4	93.63%	1.3	4.2×
50,000	92.88%	14.1	93.91%	3.4	4.2×
100,000	93.02%	36.0	94.06%	8.7	4.1×
200,000	93.07%	108.3	94.06%	26.4	4.1×
500,000	93.12%	545.9	94.18%	135.3	4.0×

ICL led at every size on malurl. The gap was roughly 1 percentage point. The gap did not widen with more data. That suggests a genuine architectural advantage for ICL on this feature structure. It is not a pure scaling effect.

The speed gap was also largest here at 4×. I suspect that is because malurl has a 65k-row test set. Sparknov has only 20k.

fakejob: the small-data case

Size	PFN AUC	PFN pred (s)	ICL AUC	ICL pred (s)	Ratio (PFN/ICL time)
1,000	90.24%	0.3	92.12%	0.2	1.3×
5,000	98.06%	0.4	98.13%	0.2	1.8×
10,000	98.78%	0.8	98.76%	0.5	1.7×

Fakejob has only 14,304 training rows. I skipped sizes above 10k.

At 1k, ICL led by 1.9 pp. By 5k they were virtually tied. At 10k PFN edged ahead by 0.02 pp. Fakejob therefore does not reproduce the sparknov pattern of PFN winning below 10k rows. It looks more like ieeecis, where the two models are close from 5k onward.

{{< themed-img src="roc_auc_all_datasets.png" alt="ROC-AUC vs training size, all four datasets" >}}

Timing

{{< themed-img src="predict_time_all_datasets.png" alt="Predict time vs training size" >}}

{{< themed-img src="per_row_latency_all_datasets.png" alt="Per-row inference cost vs training size" >}}

Both models scale super-linearly in predict time with training size. PFN has a larger constant factor. Per-row cost for PFN on sparknov grows from roughly 0.06 ms per row at 1k to roughly 25 ms per row at 500k.
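
The per-row figures can be reproduced from the sparknov table with a few lines of arithmetic. This is a sketch that assumes each reported predict time covers one pass over the 20k-row sparknov test set. The log-log slope is my own illustration of "super-linear", not a number from the benchmark.

```python
import math

TEST_ROWS = 20_000  # sparknov test-set size quoted in this post

# PFN predict times in seconds, copied from the sparknov table.
pfn_pred_s = {1_000: 1.11, 100_000: 26.3, 500_000: 506.9, 1_000_000: 1956.3}

# Milliseconds of predict time per test row at each training size.
per_row_ms = {n: 1000 * t / TEST_ROWS for n, t in pfn_pred_s.items()}


def scaling_exponent(n_small, n_large, times):
    """Slope of log(time) against log(train rows); 1.0 would be linear scaling."""
    return math.log(times[n_large] / times[n_small]) / math.log(n_large / n_small)


exponent = scaling_exponent(100_000, 1_000_000, pfn_pred_s)

for n, ms in per_row_ms.items():
    print(f"{n:>9,} train rows: {ms:.3f} ms per test row")
print(f"scaling exponent, 100k to 1M: {exponent:.2f}")
```

An exponent close to 2 between 100k and 1M rows is what you would expect if attention over the training context dominates at that scale.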

Why the 2–4× gap? TabPFN has 1.93× as many parameters. It issues roughly 30× as many attention-layer calls per prediction. Flash Attention is the largest single kernel for both models at about 20% of GPU time. See the Profiling section for the full breakdown.

Profiling: why TabPFN is ~2× slower

I profiled on airig with RTX 5090, torch 2.12.0+cu130, tabpfn 8.0.3, and tabicl 2.1.1. I used torch.profiler with ProfilerActivity.CPU and ProfilerActivity.CUDA. I enabled record_shapes=True, with_flops=True, and profile_memory=True.

Step 1: confirm model size difference

I measured parameter counts on fitted models. I used sum(p.numel() for p in model.parameters() if p.requires_grad).

Model	Trainable parameters	Relative size
TabPFN	53,153,136	1.93×
TabICL	27,552,250	1.00×

{{< themed-img src="/posts/tabpfn-vs-tabicl-fdb/model_size_comparison.png" alt="TabPFN3 vs TabICL model size comparison" >}}

TabPFN has 1.93× as many trainable parameters, essentially 2×. For a compute-bound transformer, that alone would predict roughly 2× wall-clock time per forward pass.

What the architectures actually look like

The parameter gap is not a single big layer. It is a deeper stack. Both models share the same high-level pattern. That pattern is column embedding, then row interaction, then ICL transformer, then output. TabPFN3 doubles the ICL transformer depth.

{{< themed-img src="architecture_diagram.png" alt="TabPFN3 vs TabICL architecture comparison" >}}

TabPFN3 uses 24 ICL transformer layers. TabICL uses 12. The profiler shows roughly 30× as many attention calls for PFN: I counted 2,192 scaled_dot_product_attention calls versus 72 for ICL. The 2× depth explains only part of that. The rest presumably comes from how TabPFN3 splits its forward pass into many smaller calls, which I did not isolate. The per-layer dimensions are similar. Both use 128-dim embeddings and 8-head attention in early stages. TabPFN3’s decoder adds an extra many-class attention head with 6 heads and dim 64. TabICL does not have that.

TabICL compensates for its shallower stack by concatenating 4 row-CLS tokens. That gives its ICL transformer a 512-dim input. That is 128 times 4. TabPFN3 keeps the ICL dimension at 128. But it processes that through twice as many layers. The product of depth times width times heads ends up at 1.93× the parameters. That maps almost exactly to the 2× wall-clock gap.

Step 2: torch.profiler trace on sparknov 50k

I traced a single predict_proba call on sparknov with 50k training rows and 20k test rows. I chose that size because it is large enough to saturate the GPU. It is small enough that the trace fits in memory.

Profiler setup:

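import torch
from tabpfn import TabPFNClassifier

# X_train, y_train, X_test: the FDB-preprocessed sparknov 50k subsample and test set

# 1. Fit the model on a subsample
pfn = TabPFNClassifier(device='cuda')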
pfn.fit(X_train, y_train)

# 2. Warm-up (exclude compilation / CUDA init from trace)
_ = pfn.predict_proba(X_test[:100])
torch.cuda.synchronize()

# 3. Profile the real run
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True,
    with_flops=True,
    profile_memory=True,
) as prof:
    _ = pfn.predict_proba(X_test)
    torch.cuda.synchronize()

# 4. Export for Chrome trace viewer
prof.export_chrome_trace("tabpfn3_trace.json")

Top GPU kernels by device time (TabPFN3):

Rank	Kernel / Op	Device time	% of GPU	Count
1	`scaled_dot_product_attention`	6,931 ms	21.2%	2,192
2	`flash_fwd_kernel` (Flash Attention)	6,728 ms	20.6%	392
3	`linear` (Q/K/V + FFN projections)	1,125 ms	3.4%	28,256
4	`matmul`	449 ms	1.4%	11,960
5	`mm`	367 ms	1.1%	11,576
6	`copy_`	297 ms	0.9%	46,209

Top GPU kernels by device time (TabICL):

Rank	Kernel / Op	Device time	% of GPU	Count
1	`scaled_dot_product_attention`	3,386 ms	19.7%	72
2	`flash_fwd_kernel` (Flash Attention)	3,192 ms	18.6%	24
3	`linear`	659 ms	3.8%	944
4	`layer_norm`	454 ms	2.6%	358
5	`copy_`	260 ms	1.5%	1,842
6	`addmm`	253 ms	1.5%	440

Interpretation

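Flash Attention is the largest single cost for both models at about 20% of GPU time. The kernel name is explicit: pytorch_flash::flash_fwd_kernel. This fused attention forward pass performs Q times K transpose, softmax, and attention times V in one CUDA kernel. The scaled_dot_product_attention row is the operator that launches this kernel, so the two rows overlap rather than add up.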

The critical observation is the call count. TabPFN3 issues 2,192 scaled_dot_product_attention calls. TabICL issues 72 calls for the same test set. That is a 30× difference in attention-layer executions. It translates to roughly 2× total Flash Attention kernel time: 6.9 seconds versus 3.4 seconds.

Similarly, linear is called 28,256 times by TabPFN3. TabICL calls it 944 times. The ratio is again about 30× in call count. It is about 1.7× in total time: 1.13 seconds versus 0.66 seconds.

Why 30× in calls but only about 2× in wall time? Because TabPFN3’s calls are individually much smaller. Dividing device time by call count, a TabPFN3 attention call averages about 3.2 ms versus about 47 ms for TabICL. A TabPFN3 linear call averages about 0.04 ms versus about 0.70 ms. The product of calls times time per call ends up at roughly 2×. That matches the wall-clock gap in the sparknov benchmark table.

Bottom line: the roughly 2× slowdown is not a mysterious constant factor. TabPFN3’s transformer backbone spends roughly twice as much total time in attention and projection kernels, spread across roughly 30× as many smaller calls. That is consistent with a deeper architecture with 1.93× as many total parameters.
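
The per-call arithmetic behind this interpretation is easy to check. The sketch below only divides the device times by the call counts from the two kernel tables. The dictionary layout and the `summarize` helper are illustrative assumptions.

```python
# (device time in ms, call count), copied from the two kernel tables.
profile = {
    "scaled_dot_product_attention": {"TabPFN3": (6931, 2192), "TabICL": (3386, 72)},
    "linear": {"TabPFN3": (1125, 28256), "TabICL": (659, 944)},
}


def summarize(op):
    """Call-count ratio, total-time ratio, and mean time per call for one op."""
    pfn_ms, pfn_calls = profile[op]["TabPFN3"]
    icl_ms, icl_calls = profile[op]["TabICL"]
    return {
        "call_ratio": pfn_calls / icl_calls,
        "time_ratio": pfn_ms / icl_ms,
        "pfn_ms_per_call": pfn_ms / pfn_calls,
        "icl_ms_per_call": icl_ms / icl_calls,
    }


for op in profile:
    s = summarize(op)
    print(
        f"{op}: {s['call_ratio']:.1f}x calls, {s['time_ratio']:.2f}x time, "
        f"{s['pfn_ms_per_call']:.3f} vs {s['icl_ms_per_call']:.3f} ms per call"
    )
```

For both ops the call ratio is about 30× while the time ratio is about 2×, so each TabICL call does far more work than each TabPFN3 call.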

How to reproduce the trace

The full profiler script is available in the companion repo. The key lines are above. After running, open tabpfn3_trace.json in Chrome’s about:tracing or Edge’s edge://tracing. You can see a visual timeline of every CUDA kernel launch.

Download traces:

Trace	Size	Download
TabPFN3 (sparknov 50k)	26 MB (gz)	`tabpfn3_trace.json.gz`
TabICL (sparknov 50k)	1.3 MB (gz)	`tabicl_trace.json.gz`

Unzip each file with gunzip before loading it into the trace viewer.

Average Precision: a second lens

ROC-AUC tells us how well the model ranks fraud cases overall. But a fraud desk cares about precision at the top of the queue. They want to know how many alerts an analyst must review to catch the bulk of fraud. Average Precision answers this directly.

I re-ran the full size sweep with AP recording. I ran one job at a time to eliminate GPU contention. TabPFN3 used torch.inference_mode(). It also used torch.autocast with bfloat16 where supported. TabICL used inference_mode plus bfloat16 throughout. The tables below report ROC-AUC, AP, and wall-clock predict time from the same clean runs.

sparknov AP

Size	PFN ROC-AUC	PFN AP	PFN pred (s)	ICL ROC-AUC	ICL AP	ICL pred (s)	AP leader
1,000	0.8722	0.1858	0.94	0.8571	0.1111	0.73	PFN
5,000	0.9090	0.3096	0.88	0.8712	0.3045	0.57	PFN
10,000	0.9472	0.3358	1.40	0.9348	0.3773	0.79	ICL
20,000	0.9384	0.3274	2.60	0.9469	0.3910	1.40	ICL
50,000	0.9570	0.3585	8.70	0.9624	0.4335	4.50	ICL
100,000	0.9695	0.3702	26.1	0.9673	0.4518	13.4	ICL
200,000	0.9720	0.3672	91.4	0.9686	0.4771	45.1	ICL
500,000	0.9696	0.3086	502	0.9662	0.4131	249	ICL
1,000,000	0.9587	0.2496	1,929	0.9529	0.3533	976	ICL

At 1k and 5k, PFN won on both metrics. At 10k, ICL took the AP lead even though PFN was still ahead on ROC-AUC. PFN degraded beyond 200k. Its AP dropped from 0.3672 at 200k to 0.2496 at 1M. That mirrors the ROC-AUC decline. ICL’s AP fell by a similar absolute amount, from 0.4771 to 0.4131 to 0.3533, but it stayed roughly 0.10 above PFN throughout.

ieeecis AP

Size	PFN ROC-AUC	PFN AP	PFN pred (s)	ICL ROC-AUC	ICL AP	ICL pred (s)	AP leader
1,000	0.8331	0.3595	1.90	0.8247	0.2875	1.60	PFN
5,000	0.8688	0.4586	2.50	0.8724	0.4843	2.00	ICL
10,000	0.8782	0.4905	3.00	0.8836	0.5170	2.40	ICL
20,000	0.8882	0.5144	5.30	0.8879	0.5276	3.70	ICL
50,000	0.9091	0.5819	12.9	0.9111	0.5940	8.90	ICL
100,000	0.9233	0.6173	33.8	0.9223	0.6200	29.1	ICL
200,000	0.9348	0.6423	103	0.3616	0.0312	69.2	PFN
500,000	0.9519	0.6867	534	—	—	—	PFN

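PFN improved steadily through 500k. Its AP went from 0.3595 to 0.6867. ICL was competitive up to 100k. At 200k it produced worse-than-chance predictions in this re-run: its ROC-AUC was 0.36 and its AP was 0.03. The anomaly is reproducible under the re-run configuration. The first sweep scored 93.36% ROC-AUC for ICL at the same size, so this looks like a failure mode of this configuration on this dataset, possibly in TabICL’s batching, rather than a limit of the model. ICL also OOMs at 500k on my 32 GB GPU in this re-run, although it completed 500k in the first sweep.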

malurl AP

Size	PFN ROC-AUC	PFN AP	PFN pred (s)	ICL ROC-AUC	ICL AP	ICL pred (s)	AP leader
1,000	0.8966	0.8747	1.70	0.9023	0.8798	0.90	ICL
5,000	0.9085	0.8872	2.30	0.9160	0.8966	0.56	ICL
10,000	0.9173	0.8981	3.20	0.9303	0.9140	0.78	ICL
20,000	0.9236	0.9042	5.30	0.9363	0.9202	1.30	ICL
50,000	0.9273	0.9088	13.9	0.9391	0.9250	3.40	ICL
100,000	0.9281	0.9094	35.6	0.9406	0.9265	8.70	ICL
200,000	—	—	—	—	—	—	—
500,000	—	—	—	—	—	—	—

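ICL led at every size that completed. In this re-run both models OOM at 200k and above on malurl. The test set is unusually large at 65k rows, and that exhausts 32 GB of GPU memory. The ceiling is not model-specific, but it is specific to this re-run: the first sweep completed 200k and 500k for both models.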

fakejob AP

Size	PFN ROC-AUC	PFN AP	PFN pred (s)	ICL ROC-AUC	ICL AP	ICL pred (s)	AP leader
1,000	0.9049	0.5510	0.27	0.9212	0.5733	0.22	ICL
5,000	0.9800	0.8405	0.44	0.9813	0.8513	0.24	ICL
10,000	0.9877	0.8905	0.77	0.9876	0.8862	0.46	PFN

At 10k PFN edged ahead on both metrics, while ICL led at 1k and 5k. The margins at 5k and 10k are small, so fakejob neither confirms nor refutes the small-data advantage.

Engineering: speeding up inference

The experiments below are separate validation runs. They do not modify the main benchmark numbers in the tables above.

I tested several PyTorch inference optimizations. I used a realistic imbalanced dataset with 20k samples, roughly 8% positive class, and 30 features. I fixed the random seed.

The fast path: inference_mode + bfloat16 autocast

Config	Speedup vs baseline	ROC-AUC change
baseline (plain `no_grad`)	—	—
`torch.inference_mode()`	+20.9%	+0.00 pp
`torch.autocast("cuda", bfloat16)`	+18.5%	−0.01 pp
`inference_mode` + `autocast`	+21.3%	−0.01 pp

TabICL showed smaller gains at about 1.2% combined. Its backbone already runs near peak throughput.

Caveat: on some datasets bfloat16 makes TabPFN3 raise the error `"geqrf_cuda" not implemented for 'BFloat16'`. I hit this specifically on ieeecis. It is likely due to a QR decomposition in preprocessing. When this occurs, fall back to inference_mode only.
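
One way to automate that fallback is to retry under the safe contexts when the fast path raises. The sketch below is deliberately framework-agnostic so it runs without a GPU. `predict_with_fallback` and its arguments are hypothetical names, and matching on the substring "BFloat16" is an assumption based on the error text quoted above.

```python
from contextlib import ExitStack


def predict_with_fallback(predict, fast_contexts, safe_contexts):
    """Run predict() under the fast contexts; on a BFloat16 error, retry under the safe ones.

    predict is a zero-argument callable. Each entry in the two context lists
    is a zero-argument factory that returns a context manager.
    """

    def run(context_factories):
        with ExitStack() as stack:
            for make_context in context_factories:
                stack.enter_context(make_context())
            return predict()

    try:
        return run(fast_contexts)
    except RuntimeError as err:
        # Only swallow the unsupported-dtype failure; re-raise anything else (e.g. OOM).
        if "BFloat16" not in str(err):
            raise
    return run(safe_contexts)


# Intended wiring with PyTorch (not executed here):
#   fast = [torch.inference_mode, lambda: torch.autocast("cuda", dtype=torch.bfloat16)]
#   safe = [torch.inference_mode]
#   proba = predict_with_fallback(lambda: model.predict_proba(X_test), fast, safe)
```

Catching RuntimeError also covers NotImplementedError, which is a subclass of it in Python. The cost of the fallback is one wasted prediction attempt on datasets where bfloat16 fails.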

torch.compile via PerformanceOptions

TabPFN3 exposes PerformanceOptions with enable_torch_compile=True as a first-class toggle since version 8.0.3. I tested it properly. I compiled on the full production shape. Then I measured steady-state runs.

Config	Median pred (50k train / 20k test)	Speedup	One-time compile tax
`inference_mode` + `bfloat16`	9.10 s	1.00×	—
`enable_torch_compile=True`	8.93 s	1.02×	17.9 s

torch.compile compiles correctly. But the steady-state gain is about 2%. That is inside run-to-run variance. The 18-second upfront compile tax is not amortized in a single-prediction-per-shape workload. It is not worth enabling for fraud-benchmark-style tasks.
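
The amortization argument can be made explicit with a break-even calculation. This sketch uses only the three numbers from the table above and assumes every later prediction reuses the compiled shape.

```python
import math

baseline_s = 9.10     # inference_mode + bfloat16, median predict time
compiled_s = 8.93     # enable_torch_compile=True, median predict time
compile_tax_s = 17.9  # one-time compile cost

saving_per_call_s = baseline_s - compiled_s
break_even_calls = math.ceil(compile_tax_s / saving_per_call_s)

print(f"saving per call: {saving_per_call_s:.2f} s")
print(f"same-shape predictions needed to recoup the compile tax: {break_even_calls}")
```

Under these numbers the compile step only pays for itself after roughly a hundred same-shape predictions. Because the 2% gain sits inside run-to-run variance, the real break-even point could be much later.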

Other PerformanceOptions findings:

Option	Default v3	Tested effect
`use_chunkwise_inference`	`True`	Already default; no free win left
`save_peak_memory_factor`	`8` (when `memory_saving_mode` triggers)	Reduces peak memory; may already help at 500k+
`force_recompute_layer`	`False`	Training-only; no-op under `inference_mode`
`enable_torch_compile`	`False`	2% speedup after compile; not worth it

Recommended inference wrapper:

import torch

def predict_fast(model, X_test, use_bfloat16=True):
    torch.backends.cudnn.benchmark = True
    with torch.inference_mode():
        if use_bfloat16:
            with torch.autocast("cuda", dtype=torch.bfloat16):
                return model.predict_proba(X_test)
        else:
            return model.predict_proba(X_test)

What about fraudecom?

I ran both models on the full fraudecom dataset. It has 120,889 train rows and 30,223 test rows. I obtained ROC-AUCs of roughly 50.6% for TabPFN3 and roughly 50.4% for TabICL. Those look like coin-flip performance. But the dataset itself is the bottleneck. The models are not the problem.

The FDB baselines confirm this. Auto-sklearn scores 51.5%. H2O scores 51.8%. AutoGluon scores 52.2%. AFD OFI scores 51.9%. Only AFD TFI breaks out at 63.6%. AFD TFI is an Amazon-internal model engineered for temporal fraud signals. The foundation models sit in the same cluster as the general-purpose AutoML tools.

The root cause is extreme temporal distribution shift. Fraudecom uses an out-of-time train/test split. The training period has a 10.6% fraud rate. The test period drops to roughly 4.6%. I measured Pearson correlations between every feature and the label. I compared training window values against test window values.

time_since_signup: r = −0.299 in train, r = 0.003 in test
purchase_value, source, browser, age, ip_address: all |r| < 0.005 in test

In other words, every predictive signal that exists in training evaporates in the test window. The models are not failing. The data distribution is.
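
The check itself is simple to reproduce. Below is a minimal stdlib sketch of a per-feature train-versus-test correlation comparison. The helper names, the `is_fraud` label column, and the six-row toy columns are assumptions for illustration. The toy values are synthetic and are not fraudecom data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def correlation_shift(train, test, label="is_fraud"):
    """Map each feature to (r in train, r in test) against the label column."""
    features = [col for col in train if col != label]
    return {
        f: (pearson_r(train[f], train[label]), pearson_r(test[f], test[label]))
        for f in features
    }


# Synthetic toy columns: the feature predicts the label in train but not in test.
train = {"time_since_signup": [1, 2, 3, 4, 5, 6], "is_fraud": [1, 1, 1, 0, 0, 0]}
test = {"time_since_signup": [1, 2, 3, 4, 5, 6], "is_fraud": [1, 0, 0, 0, 0, 1]}

shift = correlation_shift(train, test)
for feature, (r_train, r_test) in shift.items():
    print(f"{feature}: r_train = {r_train:+.3f}, r_test = {r_test:+.3f}")
```

A feature whose correlation is clearly non-zero in train and near zero in test is the signature of the shift described above. Pearson correlation only captures linear signal, so a fuller diagnosis would also look at non-linear dependence.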

Caveats

I excluded fraudecom from the main sweep tables. See the section above. Extreme temporal distribution shift collapses every feature-label correlation in the test window.
ipblock and twitterbot errored due to zero usable features after FDB preprocessing. Those are data-pipeline failures. They are not model failures.
I used a single seed with random_state=42 for stratified subsampling. Results could shift with different seeds.
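ieeecis 200k TabICL shows worse-than-chance predictions in the AP re-run. Its ROC-AUC is 0.36 and AP is 0.03. This is a reproducible anomaly in that configuration, not a corrupted run. The first sweep scored 93.36% at the same size.
malurl 200k and above and ieeecis 500k TabICL OOM on a 32 GB GPU in the AP re-run. The first sweep completed those sizes, so the memory ceiling depends on the run configuration.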

Take

#	Finding
1	PFN wins at every size through 10k rows on sparknov and at 1k on ieeecis. Start with PFN if labeled data is expensive.
2	ICL catches up by 20k–50k rows. The gap is usually 1–2 pp and disappears by 100k.
3	At 200k+ the winner depends on the dataset. PFN peaks then degrades on sparknov. ICL is more stable. On ieeecis PFN improves through 500k.
4	Inference cost is the bottleneck, not fit time. Fit time is a few seconds. Predict cost grows super-linearly. PFN costs roughly 1.2–4× more per prediction depending on the dataset, and about 2× more on sparknov, in line with its 1.93× larger backbone.
5	`torch.inference_mode()` plus `torch.autocast(bfloat16)` gives PFN a clean +21% speedup with a ROC-AUC change of only −0.01 pp. Enable it by default, and fall back to `inference_mode` alone where bfloat16 is unsupported.
6	Dataset structure matters more than model hype. malurl favors ICL consistently. sparknov and ieeecis are close at 100k and diverge at 500k+. fraudecom is hard for everyone due to extreme temporal shift.

References

¹ Amazon Science. Fraud Dataset Benchmark. https://github.com/amazon-science/fraud-dataset-benchmark

This post was researched and drafted with AI assistance. I picked the ideas, reviewed every claim, and verified all benchmarks.
