What is the point of owning a cluster if you still have to edit everything by hand? I mapped out the hardware in the agent architecture post; this is how I put it to work.

I do not write every word on this blog. I do own every result, every benchmark, and every decision to publish or discard.

Here is the actual workflow that produces the agent-authored posts. How do I decide which drafts survive?

1. Start with a question

I only build playgrounds around questions that genuinely annoy me. One day that’s surgical—what would happen to tokio MPSC latency if I batch tasks from the inject queue? The next day it’s massive—how does an RTX 5090 scale from 400W to 600W on realistic training loads?

But I never pick a topic cold. I need to know the space well enough to smell a bad result the instant it appears. I’m not outsourcing my curiosity. I’m multiplying my reach.

What part of your stack have you taken on faith that a weekend playground could finally put to the test?

2. Build a playground

I used to lose entire afternoons to dependency resolution before I could run a single benchmark. Now I hand the setup to an agent: clone the repo, verify it compiles, run the existing benchmarks, and collect a baseline. It is the same grunt work I would do myself, except that it costs me hours, while the agent runs it in parallel with other tasks.

My playgrounds stick around. The tokio repo on oci-saulire has been sitting there for weeks, and the tabular-ML virtual environment on airig has been iterated on across multiple posts. That persistence matters because dependency resolution and compilation are still the slowest part of any experiment. Once a playground is warm, the next idea starts in minutes, not hours.
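The setup steps above (clone, compile, run the existing benchmarks) can be sketched as a small script. This is only an illustration, assuming a Rust project driven by `cargo`; the function names `playground_steps` and `warm_playground` are hypothetical, and saving the benchmark output as a baseline file is left out for brevity.

```python
import subprocess
from pathlib import Path


def playground_steps(repo_url: str, workdir: Path) -> list[list[str]]:
    """Return the setup commands in the order described above."""
    repo = str(workdir)
    manifest = f"{repo}/Cargo.toml"  # assumes a cargo-based project
    return [
        ["git", "clone", repo_url, repo],                            # clone the repo
        ["cargo", "build", "--release", "--manifest-path", manifest],  # verify it compiles
        ["cargo", "bench", "--manifest-path", manifest],              # run existing benchmarks
    ]


def warm_playground(repo_url: str, workdir: Path, dry_run: bool = True) -> list[str]:
    """Run (or just list) each step, skipping the clone for an existing playground."""
    executed = []
    for cmd in playground_steps(repo_url, workdir):
        if cmd[1] == "clone" and workdir.exists():
            continue  # persistent playground: reuse the warm checkout
        executed.append(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step
    return executed
```

With `dry_run=True` the function only reports what it would run, which is a cheap way to check the plan before handing it to an agent. On a playground that already exists, the clone is skipped and only the build and benchmark steps remain.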

What would you test if your environment were already waiting?

3. Generate a shortlist

I used to think I knew exactly which experiment to run next. Then I started asking the agent for a menu of options, and it keeps handing me ideas I didn’t put there—an obscure configuration flag, a different allocator, a paper I never bookmarked.

My original hunch almost always shows up in that list. But the real value is in the one item I didn’t expect.

I never let the agent choose for me. I scan the list and pick the one that feels like it will cut through the noise and give me the cleanest signal.

The next time you have a clean baseline, try asking for five experiments instead of one. What shows up in that list that you didn’t already have in mind?

4. Run the experiments

I’ve got exactly two patterns for auto-research. Everything else is just expensive procrastination.

The first one requires a goal before you start. What happens when you give a research loop a destination instead of a topic?

Pattern A: goal-directed auto-research

I give the agent a single sentence of intent — “reduce busy2 latency by batching inject-queue pops” — and then I get out of the way. No micromanagement, no patch review. It tries ideas, compiles, benchmarks, and reports back on its own.

When the numbers land, I cherry-pick the changes with the best ratio of performance gain to implementation cost. The tokio batch-pop post came from this exact pattern. One helper function, batch_from_inject, extracted once and reused across two call sites.
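To make the batching idea concrete, here is a toy Python model. It is not tokio's code and not the helper from the post: it only borrows the name `batch_from_inject` and assumes that the cost being amortized is one lock acquisition per pop, which is an assumption on my part. The `InjectQueue` class and `drain` function are hypothetical.

```python
import threading
from collections import deque


class InjectQueue:
    """Toy shared queue that counts how often its lock is taken."""

    def __init__(self, tasks):
        self._tasks = deque(tasks)
        self._lock = threading.Lock()
        self.lock_acquisitions = 0

    def pop_one(self):
        """Baseline: one lock acquisition per task."""
        with self._lock:
            self.lock_acquisitions += 1
            return self._tasks.popleft() if self._tasks else None

    def batch_from_inject(self, max_batch):
        """Batched: take up to max_batch tasks under a single lock acquisition."""
        with self._lock:
            self.lock_acquisitions += 1
            n = min(max_batch, len(self._tasks))
            return [self._tasks.popleft() for _ in range(n)]


def drain(queue, max_batch):
    """Drain the queue in batches and return the tasks in FIFO order."""
    drained = []
    while True:
        batch = queue.batch_from_inject(max_batch)
        if not batch:
            return drained
        drained.extend(batch)
```

Draining ten tasks with `max_batch=4` touches the lock four times (three batches plus the final empty check) rather than once per task. Whether that translates into lower latency in a real runtime is exactly what the benchmark has to show.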

If a single sentence of context can surface a reusable helper that cuts latency across two hot paths, what else in your codebase is waiting for an agent with permission to explore?

Pattern B: directed checklist

I didn’t need an ideation session. I needed a lab tech. I handed the agent a shopping list: “run spawn_many under jemalloc, mimalloc, snmalloc, and std; vary message size from 8B to 32KB; collect latency and throughput for each combination.”

The allocator shootout post was built exactly this way. The agent ran the matrix, collected the CSV, plotted the charts, and wrote the first draft.
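The shopping list above is a Cartesian product, so it can be written down as data before anything runs. A minimal sketch, assuming message sizes step in powers of two from 8B to 32KB (the post gives only the endpoints) and that a hypothetical `measure` callable returns a `(latency, throughput)` pair for each combination; the column names are illustrative too.

```python
import csv
import io
from itertools import product

ALLOCATORS = ["jemalloc", "mimalloc", "snmalloc", "std"]


def message_sizes(lo=8, hi=32 * 1024):
    """Powers of two from lo to hi inclusive (the step is an assumption)."""
    size = lo
    while size <= hi:
        yield size
        size *= 2


def build_matrix():
    """One row per (allocator, message size) combination."""
    return [
        {"benchmark": "spawn_many", "allocator": alloc, "msg_bytes": size}
        for alloc, size in product(ALLOCATORS, message_sizes())
    ]


def to_csv(rows, measure):
    """Call `measure` on each row and return all results as CSV text."""
    fields = ["benchmark", "allocator", "msg_bytes", "latency_us", "throughput_ops"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields)
    writer.writeheader()
    for row in rows:
        latency_us, throughput_ops = measure(row)  # hypothetical benchmark runner
        writer.writerow({**row, "latency_us": latency_us, "throughput_ops": throughput_ops})
    return out.getvalue()
```

Under the powers-of-two assumption this yields 13 sizes and 52 combinations. Writing the matrix out first makes it easy to confirm that the CSV the agent hands back has a row for every cell.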

I didn’t blindly trust the output. I pulled up strace and perf stat to verify that jemalloc’s MADV_DONTNEED story actually showed up in the traces before I signed off on the narrative.

If the agent can run the full matrix unsupervised, the real question becomes how much verification I can realistically do before the next dataset lands.

5. Iterate on the draft

I iterate four times before a post stops being trash. The first draft is just a pile of numbers with no story. The second draft finds the story, then immediately buries the methodology.

The third draft over-explains every incidental detail. The fourth draft is finally readable.

I do the review. I check that bench commands match what was actually run, that axis limits cover all data points, and that there are no fabricated footnotes.
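Two of those checks are mechanical enough to script. A rough sketch, assuming the run log holds one executed command per line and the plotted values are available as plain numbers; both function names are hypothetical, and the footnote check still needs a human.

```python
def commands_not_in_log(draft_commands, run_log):
    """Return draft commands that never appear verbatim in the run log."""
    logged = {line.strip() for line in run_log.splitlines()}
    return [cmd for cmd in draft_commands if cmd.strip() not in logged]


def points_outside_axis(values, axis_min, axis_max):
    """Return data points that fall outside the plotted axis range."""
    return [v for v in values if not (axis_min <= v <= axis_max)]
```

An empty list from both functions means the draft quotes only commands that were actually run and no data point is clipped by the axis limits. Anything else is a concrete item to fix before the next draft.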

The results are real because the agent ran them on real hardware. When was the last time you audited every footnote against the raw logs?

6. Publish

I refuse to let an agent stall because the toolkit is too thin. It commits the Markdown, builds the site with Hugo, and pushes without me babysitting the pipeline.

I keep the Hugo stack intentionally rich: Markdown, p5.js, Chart.js, KaTeX, Typst, Mermaid, Python plotting. The agent never has to ask permission to render an equation, sketch a diagram, or script an interaction.

If an experiment needs a custom interactive slider, the agent generates the HTML and JS inline. If it needs a crisp vector figure, it writes Typst and moves on.

Once formatting stops being the bottleneck, the only limit is how fast you can feed it the next experiment.

Why this works

The real bottleneck in research is never the GPU. It’s the calendar. I can explore a research thread end-to-end in a few hours of wall time, not weeks, and a failed experiment costs nothing but a few GPU hours.

The benchmarks run on actual hardware. oci-saulire handled the ARM numbers, and airig covered x86 + GPU. There is no hallucinated throughput table.

I would never spend a weekend setting up a GPU power-scaling harness for a one-off blog post. With an agent, I don’t have to—it writes the training script, runs the sweep, and discards it if the result is flat.

That throwaway freedom changes what questions I even bother asking.
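The "discard it if the result is flat" decision can be made explicit rather than eyeballed. A minimal sketch, assuming the sweep produces one throughput number per power limit; the 50W step, the 2% tolerance, and both function names are assumptions for illustration.

```python
def power_limits(lo=400, hi=600, step=50):
    """Power caps in watts to sweep, endpoints included (step is an assumption)."""
    return list(range(lo, hi + 1, step))


def is_flat(throughputs, tolerance=0.02):
    """True when the spread across the sweep is within `tolerance` of the mean."""
    if len(throughputs) < 2:
        raise ValueError("need at least two sweep points")
    mean = sum(throughputs) / len(throughputs)
    if mean == 0:
        return True  # nothing measured anywhere, so there is no trend to report
    return (max(throughputs) - min(throughputs)) / abs(mean) <= tolerance
```

A relative spread is a blunt criterion that ignores run-to-run noise, so the tolerance should sit above whatever variation repeated runs at a single power limit show.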

Where it falls short

I publish findings reports and call them essays. The transitions barely hold together, the narrative arcs sag in the middle, and the jokes land with a thud.

I accept this trade-off. Mechanical prose still beats the void of not shipping at all.

I am still hunting for a better middle ground. Maybe a tighter review loop, more explicit voice instructions, or a second pass with a different model is the answer.

If you have ideas, I am listening on X, LinkedIn, or by email — what would you try first?

The stack in detail

| Layer | Tool | Why |
| --- | --- | --- |
| Content | Markdown | Portable, diff-friendly, agent-native |
| Layout | Hugo + PaperMod | Fast builds, clean typography |
| Math | KaTeX | Inline and display math in $\LaTeX$ |
| Figures | Typst → SVG | Vector quality, version-controlled source |
| Interactivity | p5.js / Chart.js | Sketches and data charts without bundlers |
| Diagrams | Mermaid | Flowcharts and sequence diagrams from text |
| Plots | Python + matplotlib/seaborn | Static PNGs for complex multi-panel figures |
| Orchestration | Hermes Agent | Tool calls across SSH, persistent sessions |