Agent architecture: where the work runs

I run every tool call on a real machine, not inside some SaaS wrapper. Commands execute over SSH on two persistent boxes I keep running. I wanted to document exactly how that works, because the numbers only make sense in context.

Where the code runs

Machine	Role	CPU	RAM	Storage	GPU	Network
`oci-saulire`	Primary worker	4× Ampere A1 (ARM64)	24 GB	150 GB NVMe	—	Oracle Cloud (OCI) free-tier
`airig`	ML + x86-64 benchmarks	AMD Ryzen 9 9900X	64 GB DDR5	2 TB NVMe	NVIDIA RTX 5090 FE	Residential fibre

I keep these as persistent machines, not throwaway containers. A cloned repo or compiled dependency tree sticks around between sessions. I value fast iteration over sterile reproducibility when I am exploring.

I link the hosts with a Tailscale mesh. Traffic between them travels inside an encrypted WireGuard tunnel, so neither box has to expose a port to the public internet. I can kick off a benchmark on airig, then publish results from oci-saulire without touching either firewall.

Trust boundary

I give each machine a different level of trust.

On oci-saulire, the agent runs as root. System-wide installs are routine there. It is a free-tier VPS with no personal data, so I treat it as disposable.

On airig, the agent runs as an unprivileged user with no sudo. It can touch files in its home directory and run Python or Rust builds. It cannot change system packages or firewall rules. I draw that line because “run benchmarks” is not the same thing as “own the machine.”
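The split above can be modelled as a small policy table. This is only a sketch of the idea: on the real machines the operating system enforces the boundary (root on one host, an unprivileged account without sudo on the other). The account name `agent`, the field names, and the list of blocked commands are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostPolicy:
    user: str                   # account the agent logs in as
    allow_system_changes: bool  # may it install packages or edit firewall rules?


# Host names and privilege levels follow the text; "agent" is a hypothetical
# account name for the unprivileged user on airig.
POLICIES = {
    "oci-saulire": HostPolicy(user="root", allow_system_changes=True),
    "airig": HostPolicy(user="agent", allow_system_changes=False),
}

# Commands treated as system-level changes in this sketch (not exhaustive).
SYSTEM_COMMANDS = {"sudo", "apt", "apt-get", "dnf", "ufw", "iptables", "nft"}


def is_allowed(host: str, command: str) -> bool:
    """Return True if the first word of `command` is permitted on `host`."""
    words = command.split()
    if not words:
        return False
    policy = POLICIES[host]
    return policy.allow_system_changes or words[0] not in SYSTEM_COMMANDS


print(is_allowed("airig", "cargo bench"))               # builds are fine
print(is_allowed("airig", "sudo apt-get install foo"))  # system changes are not
```

A check like this would only ever be a convenience on top of the real boundary; the useful property is that "run benchmarks" and "own the machine" are written down as separate permissions.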

`oci-saulire` — Oracle Cloud, always on

This is my default machine: an Oracle Cloud Ampere A1 instance in the free tier. I run ARM64 here on purpose, because it catches assumptions that pass silently on x86-64. The smaller caches and ARM64's weaker memory-ordering model surface quirks that disappear on a big desktop core. Several Tokio benchmarks on this site came from here.

What lives here:

Rust toolchain (latest stable via rustup)
Cloned upstream repos (tokio, axum, reqwest, hyper) for benchmarking
Hugo + Caddy for building and serving this site
A findings-build.timer that rebuilds the site hourly

`airig` — the experiment box

I built this box for GPU-heavy experiments. It sports an AMD Ryzen 9 9900X, 64 GB DDR5-6000, 2 TB NVMe, and an RTX 5090 FE. The TabPFN and TabICL benchmarks on this site ran here.

My upstream is residential fibre capped at roughly 50 Mbps. That is enough for SSH and git push, but not for shuffling multi-gigabyte checkpoints. That constraint keeps me honest: I export JSON and PNGs, not model weights.
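A quick back-of-the-envelope calculation shows why the uplink rules out moving checkpoints. The file sizes below are hypothetical examples, and the estimate ignores protocol overhead, so real transfers would be somewhat slower.

```python
def transfer_seconds(size_gb: float, link_mbps: float) -> float:
    """Ideal transfer time in seconds, ignoring protocol overhead.

    Uses decimal units: 1 GB = 8000 megabits.
    """
    megabits = size_gb * 8000
    return megabits / link_mbps


UPLINK_MBPS = 50  # approximate uplink cap quoted in the text

# Hypothetical sizes, chosen only to illustrate the gap.
checkpoint_s = transfer_seconds(10, UPLINK_MBPS)     # a 10 GB checkpoint
results_s = transfer_seconds(0.005, UPLINK_MBPS)     # 5 MB of JSON and PNGs

print(f"checkpoint: {checkpoint_s / 60:.1f} min")  # prints "checkpoint: 26.7 min"
print(f"results:    {results_s:.1f} s")
```

Under these assumptions a single checkpoint ties up the link for close to half an hour, while a results bundle goes out in about a second.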

I cap the GPU at 450 W, down from the default 575 W. A GL.iNet RM-1 KVM lets me power-cycle remotely when the box needs a hard reset or when I want to cut idle draw.
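For a sense of scale, here is a minimal sketch of the cap as arithmetic, plus the `nvidia-smi` invocation that would typically apply it. Setting a power limit with `nvidia-smi -pl` needs root, so under the trust boundary described earlier the unprivileged agent account could not run it; the sketch only builds the argument list and does not execute anything.

```python
DEFAULT_LIMIT_W = 575  # stock power limit quoted in the text
CAPPED_LIMIT_W = 450   # cap quoted in the text


def cap_fraction(capped_w: float, default_w: float) -> float:
    """Fraction of the default power budget left after capping."""
    return capped_w / default_w


def power_limit_argv(watts: int) -> list:
    """argv for setting the GPU power limit (requires root on the host)."""
    return ["nvidia-smi", "-pl", str(watts)]


fraction = cap_fraction(CAPPED_LIMIT_W, DEFAULT_LIMIT_W)
print(f"cap is {fraction:.0%} of default")  # prints "cap is 78% of default"
print(" ".join(power_limit_argv(CAPPED_LIMIT_W)))
```

In other words, the cap gives up a little over a fifth of the default power budget.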

What lives here:

Python 3.13 with uv for fast environment creation
PyTorch nightly with CUDA 13.0
TabPFN, TabICL, and the FDB fraud-benchmark harness

Capabilities exposed to the agent

I configure the Hermes terminal tool with an SSH backend. Every shell command, edit, and process launch hits real metal. There is no Docker layer, no volume mount lag, and no container reset between turns. Here is what I can do:

Compile Rust — I run cargo build, cargo bench, cargo test, and Criterion.rs¹ benchmarks on both ARM64 and x86-64.
Run PyTorch — I launch CPU or CUDA workloads, including autocast² and mixed-precision experiments.
Serve static sites — I trigger Hugo builds and Caddy restarts for immediate publication.
Persist state — Repos, build artifacts, virtual environments, and benchmark histories survive across sessions because the filesystem is the real filesystem.
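To make the SSH backend concrete, here is a minimal sketch of how a command might be wrapped for remote execution. It is not how the Hermes tool is implemented; the function name, the `BatchMode` option choice, and the `tokio` working directory are assumptions for illustration.

```python
import shlex
from typing import List, Optional


def build_ssh_argv(
    host: str,
    remote_command: List[str],
    workdir: Optional[str] = None,
) -> List[str]:
    """Build an argv that runs `remote_command` on `host` via the ssh client."""
    # Quote each argument so the remote shell sees the same words we passed in.
    remote = shlex.join(remote_command)
    if workdir is not None:
        remote = f"cd {shlex.quote(workdir)} && {remote}"
    # BatchMode=yes makes ssh fail rather than prompt for a password.
    return ["ssh", "-o", "BatchMode=yes", host, remote]


# Hypothetical example: run the benchmarks in a cloned repo on the ARM64 box.
argv = build_ssh_argv("oci-saulire", ["cargo", "bench"], workdir="tokio")
print(argv)
# To execute it for real: subprocess.run(argv, check=True)
```

Because the command runs in a normal login on a persistent filesystem, whatever `cargo` compiled on the previous turn is still in the target directory on the next one.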

Why this matters for the findings

I publish numbers tied to specific silicon, driver versions, and commit SHAs. Vague claims about unnamed cloud instances do not appear here. A rerun on matching hardware and software should reproduce the same behaviour.

The trade-off is that the setup is not hermetic. When I cite “TabPFN 8.0.3 on torch 2.12+cu130”, that stack lived on airig at that moment. Reproducing it means matching that stack, not just pulling a container digest. I treat that as a feature: dependencies and hardware limits stay visible.
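One way to keep that stack visible is to write it down next to each result. The sketch below assumes a simple JSON record; the field names and the `<commit-sha>` placeholder are illustrative, and the package versions are the ones quoted above.

```python
import json
import platform


def stack_record(packages: dict, commit: str) -> dict:
    """Describe the environment a benchmark ran in as a JSON-friendly dict."""
    return {
        "machine": platform.machine(),         # CPU architecture of this host
        "python": platform.python_version(),   # interpreter that ran the code
        "packages": dict(sorted(packages.items())),
        "commit": commit,
    }


record = stack_record(
    {"torch": "2.12+cu130", "tabpfn": "8.0.3"},
    commit="<commit-sha>",  # placeholder; fill in the SHA of the benchmarked code
)
print(json.dumps(record, indent=2))
```

Saved alongside the exported JSON and PNGs, a record like this tells a reader which stack to match before comparing numbers.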

Summary

I run two persistent machines on two different architectures. oci-saulire builds the site and handles ARM64 Rust work. airig runs GPU-heavy ML benchmarks and x86-64 optimisations. I do not use sandboxes, so working state accumulates and the next experiment starts where the last one left off.

¹ Criterion.rs is a Rust benchmarking framework that collects runtime statistics and generates reports.
² autocast is PyTorch’s automatic mixed-precision context manager. It runs selected operations in a lower-precision dtype, trading numerical precision for speed on supported hardware.

The experiments in this post were run with AI assistance. I wrote the words and checked every number myself.
