Posts

TabFM: Google's Zero-Shot Foundation Model for Fraud Detection

Benchmarking Google’s 1.6B-parameter TabFM across 7 fraud datasets: results range from near-perfect AUC on credit card fraud to random-guess performance on e-commerce data. Zero-shot inference in 0.0s.

Hardening iroh-ssh with a chrooted systemd service

A statically linked Rust daemon with no runtime deps makes systemd sandboxing trivially tight — five read-only bind mounts, one writable, zero capabilities.

Making CEL faster: from AST interpreter to compiled closures

Five independent optimizations for cel-rust’s CEL evaluator, from simple regex caching to a full expression closure compiler with typed Schema, delivering a 16–31× speedup on ARM64 Neoverse N1.

RTX 5090 power scaling: 450W vs 575W training

RTX 5090 power scaling from 400W to 600W on a personal workstation. Lower TDP saves ~€34/year at 80% idle and reduces sustained thermal stress in a residential build. 475W–500W is the practical sweet spot between speed and peace of mind.

How I use agents to write this blog

The actual workflow: from idea to published finding through a loop of playgrounds, benchmarks, and iterative drafts.

Soft distillation vs. gradient boosting on fraud

We benchmarked 52 method variants across 22 fraud and non-fraud configs. On hard fraud data, every gradient booster crushes TabPFN/TabICL by 15–20 AUC points while being 4–7× faster. Soft distillation helps only at medium scale. Teacher-as-feature is catastrophic. We quantify effect sizes with Cohen’s d and show why production fraud teams should think twice about foundation models.
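
For readers unfamiliar with the effect-size measure mentioned above, here is a minimal sketch of Cohen's d. It assumes the common pooled-standard-deviation variant; the summary does not say which variant the post uses. The function name `cohens_d` and the sample values are illustrative placeholders, not figures from the benchmark.

```python
import math
import statistics


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation.

    Assumption: this is the textbook pooled-variance form, which may differ
    from the variant used in the post.
    """
    n_a, n_b = len(a), len(b)
    # statistics.variance returns the sample variance (n - 1 denominator).
    var_a = statistics.variance(a)
    var_b = statistics.variance(b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd


# Made-up placeholder scores for two hypothetical methods across three runs.
method_a = [2.0, 4.0, 6.0]
method_b = [1.0, 3.0, 5.0]
print(cohens_d(method_a, method_b))
```

A positive value means the first sample has the higher mean, and the magnitude expresses the gap in units of the pooled standard deviation.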

Replicating Talking Trees: LLMs for fraud detection

We replicate the Talking Trees method (Yandex Research, 2025) on fraud-detection datasets using Kimi K2.6 and GPT-5.5. The LLM-guided tree beats sklearn by +0.04 AUC but is crushed by XGBoost (+0.11 AUC) at 1000× the cost. Kimi achieves higher peak accuracy but falls back 40% of the time; GPT-5.5 is more reliable (7% fallback) but slightly weaker.

Allocator shootout for async Rust on ARM64

jemalloc’s MADV_DONTNEED strategy triggers hundreds of thousands of aggressive page returns to the OS during large-message Tokio MPSC benchmarks, producing millions of demand-zero page faults. At 16 KB messages this causes a 62% regression versus std; the same allocator wins by 2× on small-object task spawn churn. The effect is allocation-size dependent, not async-pattern dependent.

Fine-tuning TabICL: when 30 epochs of GPU time buys you 0.3 pp

TabICL exposes a built-in fine-tuning pipeline via FinetunedTabICLClassifier. On five real-world classification datasets, I compared zero-shot TabICL against fine-tuned TabICL (30 epochs, early stopping, validation-driven hyperparameter selection). The result: fine-tuning helps on some datasets, hurts on others, and never moves AUC by more than ±0.7 pp. On telco-churn it is consistently beneficial (+0.16 to +0.59 pp). On cc-fraud it is completely flat — zero-shot is already near-perfect. The only consistent signal is that fine-tuning with too little data or the wrong seed can degrade performance.

Agent architecture: where the work runs

Hermes Agent orchestrates two persistent machines — a free-tier ARM64 VPS and a custom x86-64 workstation — to run Rust and PyTorch workloads without sandbox churn.