This is a follow-up to the agent architecture post. That one covered where the machines live. This one covers how I use them. I do not write every word on this blog. But I do own every result, every benchmark, and every decision to publish or discard. Here is the actual workflow.

1. Start with a question

I pick a topic I want to understand better. It might be narrow, like whether batching tokio’s inject-queue pops would cut latency. Or broad, like how RTX 5090 power scaling works on real training loads. I need enough context to spot a bad result immediately. I’m not outsourcing curiosity. I’m scaling my reach.

2. Build a playground

I ask the agent to set up a workspace. It clones the repo, checks it compiles, runs existing benchmarks, and grabs a baseline. This is grunt work I’d do myself. But it takes hours, and the agent runs it in parallel with other tasks. My playgrounds are persistent. The tokio repo on oci-saulire has been there for weeks[1]. The tabular-ML environment on airig has survived across multiple posts. Dependency resolution and compilation are the slowest parts of any experiment. Once a playground is warm, the next idea starts in minutes instead of hours.
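The setup steps above can be sketched as a small plan-then-run script. This is a minimal sketch, not the agent's actual tooling: the function names, the `playgrounds/tokio` path, and the choice of `cargo build --release` and `cargo bench` as build and baseline commands are all assumptions for illustration.

```python
import subprocess
from pathlib import Path


def playground_steps(repo_url, workdir, build_cmd, bench_cmd):
    """Return the (label, argv, cwd) steps needed to bring a playground up."""
    workdir = Path(workdir)
    steps = []
    # A warm playground already has the checkout, so the clone is skipped.
    if not (workdir / ".git").exists():
        steps.append(("clone", ["git", "clone", repo_url, str(workdir)], None))
    steps.append(("build", list(build_cmd), workdir))
    steps.append(("baseline", list(bench_cmd), workdir))
    return steps


def run_steps(steps, runner=subprocess.run):
    """Run each step in order and keep its stdout; check=True stops on failure."""
    outputs = {}
    for label, argv, cwd in steps:
        result = runner(argv, cwd=cwd, capture_output=True, text=True, check=True)
        outputs[label] = result.stdout
    return outputs


# Hypothetical usage: print the plan without executing anything.
steps = playground_steps(
    "https://github.com/tokio-rs/tokio",
    "playgrounds/tokio",  # assumed location, not the real path on oci-saulire
    ["cargo", "build", "--release"],
    ["cargo", "bench"],
)
for label, argv, cwd in steps:
    print(label, " ".join(argv))
```

Separating the plan from the run makes the warm-playground case visible: on a second invocation the clone step simply drops out of the plan.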

3. Generate a shortlist

Once the baseline is clean, I ask the agent to suggest experiments. My original idea is almost always in the list. Sometimes the agent surfaces an unexpected configuration flag, an alternative allocator, or a paper reference. But I pick the experiment myself. I trust my own sense of what will produce a clean signal.

4. Run the experiments

There are two patterns I use.

Pattern A: goal-directed auto-research

I define the objective clearly. Then I let an autonomous coding agent explore. I do not micromanage the patch. The agent tries ideas, compiles, benchmarks, and reports back. When numbers land, I cherry-pick changes with the best performance-to-cost ratio. The tokio batch-pop post came from this pattern. The resulting change was a single helper function called batch_from_inject. It batches inject-queue pops and gets reused across two call sites.
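One way to make "best performance-to-cost ratio" concrete is to rank candidate patches by relative speedup per changed line. The sketch below assumes diff size as the cost proxy, which the post does not specify. The `Candidate` fields and the sample numbers are made up for illustration and are not measurements from the tokio experiment.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str            # hypothetical patch label
    baseline_ns: float   # benchmark time before the patch
    patched_ns: float    # benchmark time after the patch
    lines_changed: int   # assumed cost proxy: size of the diff

    @property
    def gain(self) -> float:
        """Relative speedup; positive means the patch is faster."""
        return (self.baseline_ns - self.patched_ns) / self.baseline_ns

    @property
    def ratio(self) -> float:
        """Gain per changed line, guarding against a zero-line diff."""
        return self.gain / max(self.lines_changed, 1)


def shortlist(candidates):
    """Drop regressions, then sort by gain per changed line, best first."""
    winners = [c for c in candidates if c.gain > 0]
    return sorted(winners, key=lambda c: c.ratio, reverse=True)


# Illustrative numbers only.
candidates = [
    Candidate("big-refactor", 1000.0, 800.0, 400),
    Candidate("small-helper", 1000.0, 900.0, 20),
    Candidate("regression", 1000.0, 1100.0, 5),
]
for c in shortlist(candidates):
    print(f"{c.name}: gain={c.gain:.1%}, per-line={c.ratio:.4%}")
```

A ranking like this only orders the candidates. The decision to keep or drop a patch still needs a human look at the diff.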

Pattern B: directed checklist

Sometimes I already know exactly what to try. I give the agent a checklist. First, run spawn_many, a benchmark that stresses task spawning. Test it under jemalloc, mimalloc, snmalloc, and the std default allocator. Vary message size from 8 bytes to 32 KB. Collect latency and throughput for each combination. The allocator shootout post was built this way. The agent ran the matrix, collected the CSV, and plotted the charts. I verified one specific claim. jemalloc calls madvise with MADV_DONTNEED to tell the kernel it can drop memory pages. I checked that this actually showed up in strace and perf stat before signing off on the narrative.
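The checklist above amounts to a full cross product of allocators and message sizes written to a CSV. Here is a minimal sketch of that shape. The power-of-two step between 8 bytes and 32 KB is an assumption (the post gives only the endpoints), and `run_one` is a hypothetical callable standing in for whatever actually launches spawn_many and returns a latency and a throughput.

```python
import csv
import itertools

ALLOCATORS = ["jemalloc", "mimalloc", "snmalloc", "std"]
# Assumed sweep: powers of two from 8 bytes up to 32 KB (32768 bytes).
MESSAGE_SIZES = [8 * 2**i for i in range(13)]


def build_matrix(allocators=ALLOCATORS, sizes=MESSAGE_SIZES):
    """Every allocator paired with every message size."""
    return [
        {"allocator": a, "message_bytes": s}
        for a, s in itertools.product(allocators, sizes)
    ]


def collect(matrix, run_one, out_path):
    """Run each cell through run_one and write one CSV row per cell."""
    fields = ["allocator", "message_bytes", "latency_us", "throughput_ops"]
    with open(out_path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields)
        writer.writeheader()
        for cell in matrix:
            # run_one is hypothetical: it returns (latency_us, throughput_ops).
            latency_us, throughput_ops = run_one(
                cell["allocator"], cell["message_bytes"]
            )
            writer.writerow(
                {**cell, "latency_us": latency_us, "throughput_ops": throughput_ops}
            )
```

Writing a row for every cell, including slow or boring ones, keeps gaps in the matrix visible when the charts are plotted later.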

5. Iterate on the draft

On average it takes four iterations before the post is not trash. The first draft is usually a pile of numbers with no story. The second draft has a story but buries the methodology. The third draft over-explains convenience details. The fourth draft is readable. I do the review myself. I check that bench commands match what was actually run. I verify axis limits cover every data point. I hunt for fabricated footnotes. The results are real because the agent ran them on real hardware[2].
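Two of those review checks are mechanical enough to script. The sketch below is one possible shape, not the review process itself: both function names are hypothetical, and the command check assumes a run log with exactly one executed command per line.

```python
def points_outside_axes(xs, ys, xlim, ylim):
    """Return the (x, y) points that fall outside the plotted axis limits."""
    # sorted() also handles inverted axes, where the limits come as (high, low).
    x_lo, x_hi = sorted(xlim)
    y_lo, y_hi = sorted(ylim)
    return [
        (x, y)
        for x, y in zip(xs, ys)
        if not (x_lo <= x <= x_hi and y_lo <= y <= y_hi)
    ]


def unlogged_commands(draft_commands, run_log):
    """Return commands quoted in the draft that never appear in the run log."""
    # Assumes the log holds one command per line; matching is exact after strip().
    logged = {line.strip() for line in run_log.splitlines()}
    return [cmd for cmd in draft_commands if cmd.strip() not in logged]
```

With matplotlib, the limits for the first check can be read from `ax.get_xlim()` and `ax.get_ylim()`. An empty list from either function means that check passed.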

6. Publish

The agent commits the Markdown, builds the site with Hugo, and pushes. I keep the Hugo toolkit intentionally rich. It supports Markdown, p5.js, Chart.js, KaTeX, Typst, Mermaid, and Python plotting. The agent is never blocked by formatting limitations. If an experiment needs a custom interactive slider, the agent generates the HTML inline. If it needs a vector figure, it writes Typst.

Why this works

I can explore a research thread end-to-end in a few hours. A failed experiment costs nothing but a few GPU hours. The benchmarks run on actual hardware. oci-saulire handles ARM workloads. airig runs x86 and GPU tasks. There are no hallucinated throughput tables. I would never spend a weekend building a GPU power-scaling harness for a single blog post. With an agent, I don’t have to. The agent writes the script, runs the sweep, and discards everything if the result is flat[3].

Where it falls short

The writing is still mechanical. These posts read like findings reports, not essays. The transitions are functional. The narrative arcs are weak. The jokes land with a thud. I accept this trade-off because the alternative is not publishing at all. I am still looking for a better middle ground. Maybe a tighter review loop. Maybe more explicit voice instructions. Maybe a second pass with a different model. I would love to hear ideas. Find me on X, LinkedIn, or by email.

The stack in detail

| Layer | Tool | Why |
| --- | --- | --- |
| Content | Markdown | Portable, diff-friendly, agent-native |
| Layout | Hugo + PaperMod | Fast builds, clean typography |
| Math | KaTeX | Inline and display math in $\LaTeX$ |
| Figures | Typst → SVG | Vector quality, version-controlled source |
| Interactivity | p5.js / Chart.js | Sketches and data charts without bundlers |
| Diagrams | Mermaid | Flowcharts and sequence diagrams from text |
| Plots | Python + matplotlib/seaborn | Static PNGs for complex multi-panel figures |
| Orchestration | Hermes Agent | Tool calls across SSH, persistent sessions |

[1] Warm repos save roughly two to four hours per experiment.

[2] oci-saulire is an ARM VPS. airig is my local x86 workstation with an RTX 4090.

[3] Discarded experiments still live in Git history. Nothing is truly lost.