I run every tool call on a real machine, not inside some SaaS wrapper. Commands execute over SSH on two persistent boxes I keep running. I wanted to document exactly how that works, because the numbers only make sense in context.
Where the code runs
| Machine | Role | CPU | RAM | Storage | GPU | Network |
|---|---|---|---|---|---|---|
oci-saulire | Primary worker | 4× Ampere A1 (ARM64) | 24 GB | 150 GB NVMe | — | Oracle Cloud (OCI) free-tier |
airig | ML + x86-64 benchmarks | AMD Ryzen 9 9900X | 64 GB DDR5 | 2 TB NVMe | NVIDIA RTX 5090 FE | Residential fibre |
I keep these as persistent machines, not throwaway containers. A cloned repo or compiled dependency tree sticks around between sessions. I value fast iteration over sterile reproducibility when I am exploring.
I link the hosts with a Tailscale mesh. Traffic flows through a private WireGuard tunnel instead of the public internet. I can kick off a benchmark on airig, then publish results from oci-saulire without touching either firewall.
Trust boundary
I give each machine a different level of trust.
On oci-saulire, the agent runs as root. System-wide installs are routine there. It is a free-tier VPS with no personal data, so I treat it as disposable.
On airig, the agent runs as an unprivileged user with no sudo. It can touch files in its home directory and run Python or Rust builds. It cannot change system packages or firewall rules. I draw that line because “run benchmarks” is not the same thing as “own the machine.”
oci-saulire — Oracle Cloud, always on
This is my default machine. It is an Oracle Cloud Ampere A1 instance sitting in the free tier. I run ARM64 here on purpose; it catches assumptions that slip by silently on x86-64. The smaller cache and different atomic patterns surface quirks that disappear on a big desktop core. Several Tokio benchmarks on this site came from here.
What lives here:
- Rust toolchain (latest stable via
rustup) - Cloned upstream repos (
tokio,axum,reqwest,hyper) for benchmarking - Hugo + Caddy for building and serving this site
- A
findings-build.timerthat rebuilds the site hourly
airig — the experiment box
I built this box for GPU-heavy experiments. It sports an AMD Ryzen 9 9900X, 64 GB DDR5-6000, 2 TB NVMe, and an RTX 5090 FE. The TabPFN and TabICL benchmarks on this site ran here.
My upstream is residential fibre capped at roughly 50 Mbps. That is enough for SSH and git push, but not for shuffling multi-gigabyte checkpoints. That constraint keeps me honest: I export JSON and PNGs, not model weights.
I cap the GPU at 450 W, down from the default 575 W. A GL.iNet RM-1 KVM lets me power-cycle remotely when the box needs a hard reset or when I want to cut idle draw.
What lives here:
- Python 3.13 with
uvfor fast environment creation - PyTorch nightly with CUDA 13.0
- TabPFN, TabICL, and the FDB fraud-benchmark harness
Capabilities exposed to the agent
I configure the Hermes terminal tool with an SSH backend. Every shell command, edit, and process launch hits real metal. There is no Docker layer, no volume mount lag, and no container reset between turns. Here is what I can do:
- Compile Rust — I run
cargo build,cargo bench,cargo test, and Criterion.rs1 benchmarks on both ARM64 and x86-64. - Run PyTorch — I launch CPU or CUDA workloads, including
autocast2 and mixed-precision experiments. - Serve static sites — I trigger Hugo builds and Caddy restarts for immediate publication.
- Persist state — Repos, build artifacts, virtual environments, and benchmark histories survive across sessions because the filesystem is the real filesystem.
Why this matters for the findings
I publish numbers tied to specific silicon, driver versions, and commit SHAs. Vague claims about unnamed cloud instances do not appear here. Matching hardware means a rerun should produce the same behavior.
The trade-off is that the setup is not hermetic. When I cite “TabPFN 8.0.3 on torch 2.12+cu130”, that stack lived on airig at that moment. Reproducing it means matching that stack, not just pulling a container digest. I treat that as a feature: dependencies and hardware limits stay visible.
Summary
I run two persistent machines on two different architectures. oci-saulire builds the site and handles ARM64 Rust work. airig runs GPU-heavy ML benchmarks and x86-64 optimisations. I do not use sandboxes, so playgrounds accumulate and the next experiment starts where the last one left off.