This is a follow-up to the agent architecture post — that one covered where the machines live. This one covers how I use them.
I do not write every word on this blog. But I do own every result, every benchmark, and every decision to publish or discard. Here is the actual workflow that produces the agent-authored posts.
1. Start with a question
I pick a topic I want to understand better. Sometimes it is narrow — “what would happen to tokio MPSC latency if I batch tasks from the inject queue?” — and sometimes it is broader — “how does an RTX 5090 scale from 400W to 600W on realistic training loads?”
The key is that I know enough about the problem space to smell a bad result. I am not outsourcing curiosity; I am multiplying my reach.
2. Build a playground
I task the agent with setting up the environment: clone the repo, verify it compiles, run the existing benchmarks, collect a baseline. This is grunt work I could do myself, but it takes hours, and the agent does it in parallel with other tasks.
The playgrounds are persistent. The tokio repo on oci-saulire has been there for weeks. The tabular-ML virtual environment on airig has been iterated on across multiple posts. This matters because dependency resolution and compilation are the slowest part of any experiment. Once a playground is warm, the next idea starts in minutes, not hours.
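A minimal sketch of the warm-versus-cold distinction, assuming a Rust playground driven by `git` and `cargo`. The function names `baseline_steps` and `run_steps` are hypothetical, and the real agent setup is not necessarily structured this way. The point it illustrates is that a persistent checkout lets the clone step drop out entirely.

```python
from pathlib import Path
import subprocess


def baseline_steps(repo_url: str, workdir: Path) -> list[list[str]]:
    """Return the commands needed to bring a playground to a clean baseline."""
    steps = []
    # A warm playground already has a checkout, so the clone is skipped.
    if not (workdir / ".git").exists():
        steps.append(["git", "clone", repo_url, str(workdir)])
    # Verify the project compiles, then run the existing benchmarks.
    steps.append(["cargo", "build", "--release"])
    steps.append(["cargo", "bench"])
    return steps


def run_steps(steps: list[list[str]], workdir: Path) -> None:
    """Run each command in order and stop at the first failure."""
    for argv in steps:
        # The clone runs before workdir exists; later steps run inside it.
        cwd = workdir if workdir.exists() else None
        subprocess.run(argv, cwd=cwd, check=True)
```

On a warm playground the plan shrinks to the build and the benchmark run. Incremental compilation then covers most of the remaining time saved.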
3. Generate a shortlist
Once the baseline is clean, I ask the agent to suggest a handful of experiments that could be run with this playground. My original idea is almost always in the list, but sometimes the agent surfaces something I had not considered — a configuration flag, an alternative allocator, a paper reference.
I do not ask the agent to pick the experiment. I pick it, based on my own sense of what is likely to produce a clean signal.
4. Run the experiments
There are two patterns I use:
Pattern A: goal-directed auto-research
I define the objective clearly — “reduce busy2 latency by batching inject-queue pops” — and let an autonomous coding agent explore. I do not micromanage the patch. The agent tries ideas, compiles, benchmarks, and reports.
When the numbers land, I cherry-pick the changes that have the best ratio of performance gain to implementation cost. The tokio batch-pop post came from this pattern: a single helper function (batch_from_inject) extracted and re-used across two call sites.
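The cherry-picking step can be sketched as a simple ranking. This is an illustration under stated assumptions, not the actual selection procedure: `Candidate`, `gain_per_cost`, and `rank_candidates` are hypothetical names, and lines changed is an assumed stand-in for implementation cost.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str           # hypothetical label for a change the agent proposed
    gain_pct: float     # measured improvement over the baseline, in percent
    lines_changed: int  # assumed proxy for implementation cost


def gain_per_cost(c: Candidate) -> float:
    """Performance gain per line changed."""
    # Guard against a zero-line change so the ratio stays defined.
    return c.gain_pct / max(c.lines_changed, 1)


def rank_candidates(candidates: list[Candidate]) -> list[Candidate]:
    """Best gain-to-cost ratio first; changes with no gain are dropped."""
    useful = [c for c in candidates if c.gain_pct > 0]
    return sorted(useful, key=gain_per_cost, reverse=True)
```

A ranking like this only orders the candidates. Whether the top entry is worth keeping remains a judgment call about maintainability and measurement noise.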
Pattern B: directed checklist
I already know what I want to try. I give the agent a list: “run spawn_many under jemalloc, mimalloc, snmalloc, and std; vary message size from 8B to 32KB; collect latency and throughput for each combination.”
The allocator shootout post was built this way. The agent ran the matrix, collected the CSV, plotted the charts, and wrote the first draft. I verified that jemalloc’s MADV_DONTNEED story actually showed up in strace and perf stat before I signed off on the narrative.
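A checklist like the one above expands into a run matrix. The sketch below is one way to enumerate it. The power-of-two step between 8 B and 32 KB is an assumption (the checklist only gives the endpoints), and `message_sizes` and `build_matrix` are hypothetical helper names.

```python
from itertools import product

ALLOCATORS = ["jemalloc", "mimalloc", "snmalloc", "std"]


def message_sizes(lo: int = 8, hi: int = 32 * 1024) -> list[int]:
    """Message sizes in bytes from lo to hi inclusive, doubling each step (assumed)."""
    sizes = []
    size = lo
    while size <= hi:
        sizes.append(size)
        size *= 2
    return sizes


def build_matrix(allocators=ALLOCATORS, sizes=None) -> list[dict]:
    """One entry per (allocator, message size) combination for spawn_many."""
    sizes = message_sizes() if sizes is None else sizes
    return [
        {"bench": "spawn_many", "allocator": alloc, "msg_bytes": size}
        for alloc, size in product(allocators, sizes)
    ]


if __name__ == "__main__":
    matrix = build_matrix()
    print(len(matrix), "runs")  # 4 allocators x 13 sizes = 52 runs
```

Enumerating the matrix up front makes it easy to confirm afterwards that the CSV has a row for every combination and that no run was silently skipped.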
5. Iterate on the draft
This is where the mechanical part shows. On average it takes four iterations before the post is not trash. The first draft is usually a pile of numbers with no story. The second draft has a story but buries the methodology. The third draft over-explains incidental details. The fourth draft is readable.
I do the review. I check that the benchmark commands in the draft match what was actually run, that axis limits cover all data points, and that there are no fabricated footnotes. The results are real because the agent ran the benchmarks on real hardware.
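Two of these review checks are mechanical enough to script. The sketch below is a hedged illustration, not the review tooling behind these posts: it assumes the draft's commands and the run log are available as lists of strings, and both function names are hypothetical.

```python
def normalize(cmd: str) -> str:
    """Collapse runs of whitespace so formatting differences do not matter."""
    return " ".join(cmd.split())


def commands_not_in_log(draft_commands: list[str], logged_commands: list[str]) -> list[str]:
    """Return commands quoted in the draft that never appear in the run log."""
    logged = {normalize(c) for c in logged_commands}
    return [c for c in draft_commands if normalize(c) not in logged]


def points_outside_axis(values: list[float], lo: float, hi: float) -> list[float]:
    """Return data points that a chart with axis limits [lo, hi] would clip."""
    return [v for v in values if v < lo or v > hi]
```

An empty result from both functions is necessary but not sufficient. Fabricated footnotes and a misleading narrative still need a human read.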
6. Publish
The agent commits the Markdown, builds the site with Hugo, and pushes. I keep the Hugo toolkit intentionally rich — Markdown, p5.js, Chart.js, KaTeX, Typst, Mermaid, Python plotting — so the agent is never blocked by a formatting limitation. If an experiment needs a custom interactive slider, the agent can generate the HTML/JS inline. If it needs a vector figure, it writes Typst.
Why this works
Speed of iteration. I can explore a research thread end-to-end in a few hours of wall time, not weeks. A failed experiment costs nothing but a few GPU hours.
Real measurements. The benchmarks run on actual hardware (oci-saulire for ARM, airig for x86 + GPU). There is no hallucinated throughput table.
No remorse for throwaway work. I would never spend a weekend setting up a GPU power-scaling harness for a one-off blog post. With an agent, I do not have to. The agent writes the training script, runs the sweep, and discards it if the result is flat.
Where it falls short
The writing is still mechanical. These posts read like findings reports, not essays. The transitions are functional, the narrative arcs are weak, and the jokes land with a thud. I accept this trade-off because the alternative is not publishing at all.
I am still looking for a better middle ground. A tighter review loop? More explicit voice instructions? A second pass with a different model? If you have ideas, I am listening — on X, LinkedIn, or by email.
The stack in detail
| Layer | Tool | Why |
|---|---|---|
| Content | Markdown | Portable, diff-friendly, agent-native |
| Layout | Hugo + PaperMod | Fast builds, clean typography |
| Math | KaTeX | Inline and display math in $\LaTeX$ |
| Figures | Typst → SVG | Vector quality, version-controlled source |
| Interactivity | p5.js / Chart.js | Sketches and data charts without bundlers |
| Diagrams | Mermaid | Flowcharts and sequence diagrams from text |
| Plots | Python + matplotlib/seaborn | Static PNGs for complex multi-panel figures |
| Orchestration | Hermes Agent | Tool calls across SSH, persistent sessions |