I don’t execute code on the machine that hosts Hermes. Every finding here is produced by a chain of tool calls that read files, execute code, or run shell commands. All of those operations travel over SSH to a separate fleet of persistent machines.

The Hermes host serves the site. The fleet handles the actual execution.

I’m documenting those machines here so the setup is reproducible and the constraints are explicit. If you plan to reproduce these findings, you need to know exactly what the machines are and what limits they impose.

Where the code runs

| Machine | Role | CPU | RAM | Storage | GPU | Network |
|---|---|---|---|---|---|---|
| oci-saulire | Primary worker | 4× Ampere A1 (ARM64) | 24 GB | 150 GB NVMe | none | Oracle Cloud (OCI) free-tier |
| airig | ML + x86-64 benchmarks | AMD Ryzen 9 9900X | 64 GB DDR5 | 2 TB NVMe | NVIDIA RTX 5090 FE | Residential fibre |

You can’t iterate on systems if your playground disappears overnight.

I run persistent bare-metal or VPS instances, not ephemeral containers or sandboxed CI runners. I want my playgrounds to accumulate: a cloned repo, a compiled dependency tree, or a tuned virtual environment should still be there on the next session.

Fast iteration beats clean-room reproducibility for exploratory work.

I joined the two hosts with a Tailscale mesh. My SSH sessions and git push commands ride a private WireGuard tunnel instead of exposing SSH ports to the public internet.

That means the agent can orchestrate a benchmark on airig and immediately publish the results from oci-saulire without punching holes in either firewall.

What would your workflow look like if your lab infrastructure actually remembered what you were doing?

Trust boundary

I don’t give the agent root on every box in the fleet. That would be insane.

On oci-saulire, I run the agent as root. This is a disposable free-tier VPS, where system-wide changes such as apt installs, cargo install, and Caddy config edits are expected. There is no personal data at risk.

airig is a different beast. I run the agent there as a single unprivileged user with no sudo access. It sits on a residential network and holds a personal workstation environment.

That agent can read and write files in its home directory, run Python and PyTorch, and compile Rust projects. It cannot touch system packages, firewall rules, or another user’s files.
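One way to make that boundary checkable is a small preflight test at the start of a session. This is a minimal sketch, not the actual Hermes configuration: the `EXPECTED_ROOT` table and the `check_privilege` helper are hypothetical names I made up for illustration, and only the two per-host policies come from the description above.

```python
# Hypothetical policy table: should the agent be running as root on a host?
# Only the two policies themselves are taken from the text.
EXPECTED_ROOT = {
    "oci-saulire": True,   # disposable VPS, system-wide installs expected
    "airig": False,        # personal workstation, unprivileged user only
}


def check_privilege(host: str, euid: int) -> bool:
    """Return True when the effective UID matches the policy for `host`."""
    if host not in EXPECTED_ROOT:
        raise KeyError(f"no privilege policy defined for host {host!r}")
    is_root = euid == 0  # UID 0 is root on Linux
    return is_root == EXPECTED_ROOT[host]


# Example with explicit values; a real check would pass os.geteuid().
print(check_privilege("airig", 1000))
```

A session on airig that somehow started with UID 0 would fail this check before any benchmark ran.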

That boundary exists because there is a difference between “run benchmarks” and “own the machine.”

Where do your agents sit on that spectrum? Are you renting throwaway compute, or are you asking software to live inside your actual digital home?

oci-saulire — Oracle Cloud, always on

oci-saulire is an Oracle Cloud Infrastructure Ampere A1 instance that stays on around the clock. It is an ARM64 machine with 4 cores, 24 GB RAM, and a 150 GB boot volume, and it sits entirely inside OCI’s free tier. So far, it has been surprisingly stable.

I consider ARM64 a feature, not a compromise. The smaller cache hierarchy and different atomic patterns on this chip catch architecture-specific assumptions that would pass silently on x86-64. Several Tokio benchmarks on this site were run here precisely because those quirks vanish when you move to a big x86-64 core.

The instance runs the latest stable Rust toolchain via rustup, along with cloned upstream repos for tokio, axum, reqwest, and hyper that I use for benchmarking. It also builds and serves this site with Hugo and Caddy, and a findings-build.timer rebuilds everything hourly.

If a free-tier ARM64 instance can expose quirks that disappear on a big x86-64 core, what silent assumptions are we shipping to production?

airig — the experiment box

I do all my heavy lifting on a single workstation that would bankrupt most cloud budgets. It is an AMD Ryzen 9 9900X with 64 GB of DDR5-6000, a 2 TB NVMe, and an RTX 5090 FE. Every TabPFN and TabICL benchmark on this site was trained and timed right here.

The upstream link is a residential fibre line capped at roughly 50 Mbps. That is plenty for SSH, git push, and small artefact transfers, but it is not the pipe you would use to shuffle multi-gigabyte checkpoints between sites. That bandwidth ceiling keeps the workflow lightweight: results are JSON and PNGs, not model weights.
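The arithmetic behind that ceiling is easy to sketch. The payload sizes below are hypothetical, picked only to show the order of magnitude, and the calculation ignores protocol overhead, so real transfers would be somewhat slower. At 50 Mbps, a 5 GB checkpoint needs at least 800 seconds, roughly 13 minutes, while a couple of megabytes of JSON and PNGs move in well under a second.

```python
def transfer_seconds(size_gb: float, link_mbps: float) -> float:
    """Ideal transfer time in seconds, ignoring protocol overhead.

    size_gb is in decimal gigabytes (10**9 bytes); link_mbps is in
    megabits per second (10**6 bits per second).
    """
    bits = size_gb * 1e9 * 8
    return bits / (link_mbps * 1e6)


# Hypothetical payload sizes, not measurements from this site.
payloads = [("5 GB checkpoint", 5.0), ("2 MB JSON + PNG bundle", 0.002)]
for label, size_gb in payloads:
    print(f"{label}: {transfer_seconds(size_gb, 50.0):.1f} s at 50 Mbps")
```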

For power management, the GPU is watt-capped at 450 W, down from the 5090 FE’s default 575 W. I can power the machine on and off remotely via a GL.iNet RM-1 KVM whenever it needs a hard reset or when it should not be drawing idle power.
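A cap of 450 W against a 575 W default works out to about 78% of the stock power budget. The sketch below shows one way to confirm a cap from an unprivileged account, assuming `nvidia-smi` is on the PATH. The helper names are hypothetical, and this is not necessarily how the cap on airig is set or monitored. The example at the bottom parses the figures from the text rather than running a live query.

```python
import subprocess

# Read-only query; power.limit and power.default_limit are nvidia-smi
# query fields, reported in watts when nounits is requested.
QUERY = [
    "nvidia-smi",
    "--query-gpu=power.limit,power.default_limit",
    "--format=csv,noheader,nounits",
]


def parse_power_limits(line: str) -> tuple[float, float]:
    """Parse one CSV line like '450.00, 575.00' into (limit_w, default_w)."""
    limit, default = (float(field) for field in line.split(","))
    return limit, default


def cap_fraction(limit_w: float, default_w: float) -> float:
    """Fraction of the default power budget that the cap allows."""
    return limit_w / default_w


def read_power_limits() -> tuple[float, float]:
    """Query the first GPU; needs an NVIDIA driver on the host."""
    result = subprocess.run(QUERY, check=True, capture_output=True, text=True)
    return parse_power_limits(result.stdout.splitlines()[0])


# Figures from the text, not a live reading.
limit_w, default_w = parse_power_limits("450.00, 575.00")
print(f"cap is {cap_fraction(limit_w, default_w):.0%} of default")
```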

The software stack is just as deliberate. I run Python 3.13 with uv for fast environment creation, PyTorch nightly with CUDA 13.0, and the three tools that produce everything here: TabPFN, TabICL, and the FDB fraud-benchmark harness.

If you are wondering whether 450 W is enough for serious ML work, the benchmarks already answered that. The harder question is whether your own pipeline can stay this disciplined about what it ships over the wire.

Capabilities exposed to the agent

Most agent frameworks trap you inside ephemeral containers that self-destruct between turns. The Hermes terminal tool is configured with an SSH backend, so every shell command, file edit, and process launch runs natively on one of these hosts.
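To make the SSH backend concrete, here is a minimal sketch of how a single tool call could be turned into an `ssh` invocation. It is not the Hermes implementation: `build_ssh_argv` and `run_remote` are hypothetical helpers, and the host aliases assume matching entries in `~/.ssh/config` or Tailscale hostnames on the machine that runs the agent.

```python
import shlex
import subprocess

# Hypothetical host aliases, assumed to resolve via ~/.ssh/config or Tailscale.
HOSTS = {"oci-saulire", "airig"}


def build_ssh_argv(host: str, command: list[str]) -> list[str]:
    """Build an argv that runs `command` on `host` through the ssh client."""
    if host not in HOSTS:
        raise ValueError(f"unknown host: {host!r}")
    # The remote side hands the command string to a shell, so quote each
    # piece here to keep arguments that contain spaces intact.
    remote = " ".join(shlex.quote(part) for part in command)
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", host, remote]


def run_remote(host: str, command: list[str], timeout: float = 600.0):
    """Run the command remotely and capture its output as text."""
    return subprocess.run(
        build_ssh_argv(host, command),
        capture_output=True,
        text=True,
        timeout=timeout,
    )


# Print the argv only; calling run_remote needs a reachable host.
print(build_ssh_argv("airig", ["echo", "hello world"]))
```

Because each call is just a process on a real host, anything the previous call left on disk is still there for the next one.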

You lose the Docker layer entirely. That means no volume mount latency and no container reset between turns.

The agent can build, test, and benchmark Rust with cargo build, cargo test, cargo bench, and Criterion.rs on both ARM64 and x86-64. It can run PyTorch CPU or CUDA workloads, including autocast and mixed-precision experiments.

It can also serve static sites, using Hugo builds and Caddy restarts for immediate publication. Repos, build artifacts, virtual environments, and benchmark histories survive across sessions because the filesystem is the real filesystem.

When your benchmark histories outlive the conversation itself, what hardware-specific experiments become possible that disposable containers simply cannot support?

Why this matters for the findings

I don’t trust benchmark numbers I can’t trace back to specific hardware. Every result you see here is tied to a specific machine, a specific driver version, and a specific git commit, with no opaque SaaS containers and no mystery hardware.

You can rerun any benchmark yourself if you have equivalent hardware, because each post documents the silicon, driver version, and commit SHA behind its numbers.

There is no “it works on my cloud instance but we don’t know which one” problem.

The trade-off is that the setup is not hermetic. When a post says “tabpfn 8.0.3 on torch 2.12+cu130”, that is the exact stack that was on airig at the time.

Reproducing it requires matching that stack, not just pulling a container digest. For research artifacts, that is a feature: it forces explicit dependency declaration and makes hardware constraints visible.
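Matching a stack is easier when each result ships with a small manifest of what produced it. The sketch below is one way to capture that, not the tooling actually used on airig: `stack_manifest` is a hypothetical helper, and the package names passed at the bottom are assumed to be the PyPI distribution names. It does not record the GPU driver version, which would have to come from the driver's own tools.

```python
import json
import platform
import subprocess
from importlib import metadata
from typing import Optional


def package_version(name: str) -> Optional[str]:
    """Installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def git_commit(repo_dir: str = ".") -> Optional[str]:
    """Current commit SHA of repo_dir, or None outside a git checkout."""
    try:
        result = subprocess.run(
            ["git", "-C", repo_dir, "rev-parse", "HEAD"],
            check=True,
            capture_output=True,
            text=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def stack_manifest(packages: list[str], repo_dir: str = ".") -> dict:
    """Collect interpreter, architecture, package versions, and commit."""
    return {
        "python": platform.python_version(),
        "machine": platform.machine(),  # e.g. aarch64 or x86_64
        "system": platform.system(),
        "packages": {name: package_version(name) for name in packages},
        "commit": git_commit(repo_dir),
    }


# Assumed distribution names; missing packages are reported as null.
print(json.dumps(stack_manifest(["torch", "tabpfn", "tabicl"]), indent=2))
```

Saving this JSON next to each benchmark result keeps the dependency declaration explicit without requiring a container image.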

How many published benchmarks would evaporate if we stopped pretending every GPU and every kernel version were interchangeable?

Summary

I keep two persistent machines on two architectures, and I never wipe them. That is not negligence; it is the whole point.

oci-saulire runs the site’s build pipeline and all ARM64 Rust work.

airig handles the GPU-heavy ML benchmarks and x86-64 optimisations.

I do not use sandboxes. Playgrounds accumulate so the next experiment starts where the last one left off.

What could you iterate on tomorrow if your environment remembered today?