§ architecture note

When the agent is in one container and its definition is in another

2026-05-22 · by Dennis Gubsky · ~7 min read

For most of loomcycle's early life, the way an agent got into the runtime was deeply boring: someone wrote .claude/agents/job-searcher.md with YAML frontmatter and a system-prompt body; loomcycle, configured with a path to that directory, read the file off disk on every /v1/runs call. That works beautifully when loomcycle and the application live on the same VM and share a working-copy checkout.

It stops working the moment you put them in different containers.

Which is what we did at the end of week 33. JobEmber.ai moved into its own Docker image. Loomcycle moved into its own. The shared-filesystem assumption that the entire agent-loading path implicitly depended on quietly evaporated. The runtime had no agents to run; the app had agents but no way to install them into the runtime; and the legacy "read from disk" code path pointed at a directory that didn't exist inside the loomcycle container.

This post is about the substrate that replaced it - the AgentDef / SkillDef / MCPServerDef trio - and the design decisions behind it, which took longer to get right than the code itself did.

What the substrate actually is

Three database tables. Each one stores a versioned, signed definition of one thing the runtime can dispatch:

agent_defs stores AgentDef rows - the things you used to write in .claude/agents/*.md. Name, description, system prompt, allowed tools, default model, tier, max iterations, memory scopes.
skill_defs stores SkillDef rows - the things you used to write in .claude/skills/<name>/SKILL.md. Name, description, body, allowed tools.
mcp_server_defs stores MCPServerDef rows - runtime registrations of MCP servers, replacing the static mcp_servers block in loomcycle.yaml when you need server-set membership to change without a config redeploy.

Each row carries a content_sha256 column computed from a fixed, ordered subset of the row's fields. For an AgentDef the digest covers name, description, system_prompt, allowed_tools, skills, model, provider, tier, effort, max_tokens, max_iterations, providers, models, memory_scopes, and memory_quota_bytes. Two consumers computing the digest the same way over the same content get the same hash. That's the whole point - it lets the consumer ask, in one round-trip, "do you already have this definition?" and act accordingly.
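
A minimal sketch of that digest in Python. The field list and its order come from the paragraph above; the canonical encoding (a compact JSON array of the field values, UTF-8 encoded) is an assumption made for illustration, not loomcycle's documented format, and content_sha256 is a hypothetical helper name.

```python
import hashlib
import json

# Ordered content fields for an AgentDef, as listed in the text above.
AGENT_HASH_FIELDS = (
    "name", "description", "system_prompt", "allowed_tools", "skills",
    "model", "provider", "tier", "effort", "max_tokens", "max_iterations",
    "providers", "models", "memory_scopes", "memory_quota_bytes",
)


def content_sha256(agent_def: dict) -> str:
    """Hash only the content fields, in a fixed order.

    Assumption: the canonical form is a compact JSON array of field values.
    Missing fields hash as null; keys outside the list are ignored entirely.
    """
    values = [agent_def.get(field) for field in AGENT_HASH_FIELDS]
    canonical = json.dumps(
        values, separators=(",", ":"), ensure_ascii=False, sort_keys=True
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the function reads only the listed fields, a dict that also carries row metadata such as version or created_at produces the same digest as the bare content, while any change to a listed field produces a different one.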

Boot-time push, not boot-time pull

The first design that didn't work was the obvious one: have loomcycle pull definitions from the consumer over HTTP at boot. "Loomcycle, please fetch your agents from https://jobember.com/agents.json on startup." That collapses a lot of complexity into one direction of dependency, and it sounds nice on a whiteboard.

It dies on the operational side. Loomcycle would refuse to come up if JobEmber.ai's HTTP surface was down. The dependency arrow points the wrong direction - the runtime that holds the loop is now blocked by the consumer that wants to use the loop. CI gets uglier; staging gets uglier; deployment order starts to matter; and the runtime's "I am a small Go binary that can also run on its own with a single yaml" story dies.

So we inverted it. The consumer pushes. At boot, the JobEmber.ai process scans its own bundled .claude/ directory (still the source of truth, still living next to the application code in the same Docker image), computes a SHA-256 over each definition's content fields, and for each one calls a tiny sequence against loomcycle's substrate:

POST /v1/_substrate/agent_defs/verify
  { name: "job-searcher", content_sha256: "ab12…" }
  → 200 { exists: true, matches: true }
    ↓ skip - already in sync

POST /v1/_substrate/agent_defs/verify
  { name: "job-searcher", content_sha256: "cd34…" }
  → 200 { exists: true, matches: false }
    ↓ POST /v1/_substrate/agent_defs/fork - new active version

POST /v1/_substrate/agent_defs/verify
  { name: "new-agent",   content_sha256: "ef56…" }
  → 200 { exists: false }
    ↓ POST /v1/_substrate/agent_defs/create

One round-trip per name in the steady state - verify returns matches: true and the consumer moves on. Boot is fast and idempotent. Multi-replica races on first boot (two JobEmber.ai pods trying to create the same agent simultaneously) are handled by re-verify after a 409: whoever wrote first wins, everyone else's verify call now matches, and the pod proceeds. No coordinator, no leader election.
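
The verify / create / fork sequence, including the re-verify after a 409, can be sketched as follows. This is an illustration under stated assumptions, not the JobEmber.ai client: post is a hypothetical transport callable returning (status, body), the create/fork payload shape is assumed, and sync_agent is an invented name. The endpoint paths are the ones shown in the trace above.

```python
from typing import Callable, Tuple

# Hypothetical transport: post(path, payload) -> (status_code, json_body).
Post = Callable[[str, dict], Tuple[int, dict]]

BASE = "/v1/_substrate/agent_defs"


def sync_agent(post: Post, name: str, content: dict, sha: str,
               max_attempts: int = 3) -> str:
    """Bring one definition into sync.

    Returns "in_sync", "created", or "forked".
    """
    for _ in range(max_attempts):
        status, body = post(f"{BASE}/verify",
                            {"name": name, "content_sha256": sha})
        if status != 200:
            raise RuntimeError(f"verify failed for {name!r}: HTTP {status}")
        if body.get("exists") and body.get("matches"):
            return "in_sync"  # steady state: one round-trip, nothing to do

        action = "fork" if body.get("exists") else "create"
        # Assumed payload shape: the content fields plus the name.
        status, _ = post(f"{BASE}/{action}", {"name": name, **content})
        if status == 409:
            # Another replica wrote first; loop and verify again,
            # which should now report a match.
            continue
        if 200 <= status < 300:
            return "forked" if action == "fork" else "created"
        raise RuntimeError(f"{action} failed for {name!r}: HTTP {status}")
    raise RuntimeError(f"could not sync {name!r} after {max_attempts} attempts")
```

Injecting the transport keeps the loop testable against an in-memory fake, and the bounded retry means a pod that keeps losing the race fails loudly instead of spinning.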

Static + dynamic, resolved through one path

Substrate-backed isn't the only way an agent can exist in loomcycle. A standalone deployment with a fixed agent roster still wants loomcycle.yaml with a static agents: block - no Postgres, no consumer pushes, no machinery the workload doesn't need. Both modes have to coexist cleanly in one binary.

The first cut had two separate load paths. Static agents went through one resolver. Dynamic substrate-backed agents went through a different resolver. Sub-agent spawning (Agent.runSubAgent) initially only knew about the static path, and the introduction of dynamic agents meant sub-spawns of dynamic agents resolved to nil at run time - visible only when a higher-tier orchestrator tried to spawn one of the new agents and got a confusing "agent not found" error.

The fix was the lookup.Agent canonical resolver (PR #188). One function. Takes a name, returns the resolved AgentDef regardless of where it lives. YAML-static agents and substrate-backed agents go through the same function, in the same order: substrate first (so dynamic overrides static at runtime, an operator's most likely intent), YAML second as a fallback. The two preceding sub-spawn fixes - fix(resolver): substrate-registered agents must resolve via /v1/runs and fix(api): runSubAgent resolves through lookup.Agent - patched a class of bug that stops being possible once the canonical resolver lands. There's no second path to forget to update.
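
The resolution order can be sketched in a few lines. This is a Python illustration of the idea, not the Go implementation behind lookup.Agent: lookup_agent, the two mapping arguments, and the added "source" key are all assumptions made for the example.

```python
from typing import Mapping, Optional


def lookup_agent(name: str,
                 substrate: Mapping[str, dict],
                 static: Mapping[str, dict]) -> Optional[dict]:
    """Resolve an agent by name: substrate first, YAML-static as fallback.

    Every caller (top-level runs and sub-agent spawns alike) goes through
    this one function, so there is no second path to keep in sync.
    """
    if name in substrate:
        # Dynamic definitions override static ones with the same name.
        return {**substrate[name], "source": "DYNAMIC"}
    if name in static:
        return {**static[name], "source": "STATIC"}
    return None  # caller turns this into an "agent not found" error
```

Returning the source alongside the definition is one way a UI could label entries as static or dynamic without a second lookup.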

In the Web UI Library view, each agent / skill / MCP-server entry now carries a STATIC or DYNAMIC chip showing which path it came from. Operators can see at a glance which definitions are baked in and which are runtime-managed. We added that because we kept getting confused ourselves.

The hash is a contract, not a checksum

Picking what goes into the content_sha256 looks like a minor implementation detail. It is not. The hash is the consumer's statement of equivalence: "if this hash matches what you have, you and I agree on what this agent is, and you do not need to update anything." If the hash covers too little, the consumer pushes new content and the runtime silently keeps the old version. If the hash covers too much - say, it includes created_at or version - every consumer boot looks like a change and the substrate fills with identical-content fork rows.

What the AgentDef hash explicitly does not cover: version, created_at, updated_at, active, retired, retired_at, the row's primary key. Those are metadata about this stored version of the content, not facts about the content itself. The consumer doesn't care which version row it's looking at; it cares whether the content matches.

The first-pass implementation had the consumer compute the hash and include it in the create/fork payload. That was a lot of trust to place in the consumer side. Two days later we flipped it: the server computes the hash from the content fields it receives, stores that, and returns it. The consumer's content_sha256 in the verify payload becomes a claim the server checks against its own computation - "I think this should hash to X; do you agree?" The loomcycle hash CLI subcommand exists for local dev and CI but isn't a deployment dependency anymore. JobEmber.ai's build pipeline lost a step. We deleted a few hundred lines of sidecar-baking code.

The rule of thumb that fell out: if two services need to agree about whether a piece of content is the same, compute the digest server-side from the content's canonical form, not consumer-side from whatever the consumer thinks the content is. The latter is a foot-gun the moment the consumer's serialization, key order, or trailing whitespace handling drifts a little.
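
A small in-memory sketch of that split between "the server derives the digest" and "the consumer makes a claim". AgentDefStore, its method names, and the canonical form (a compact JSON array of ordered field values) are assumptions for illustration; the real substrate is a set of database tables.

```python
import hashlib
import json


class AgentDefStore:
    """In-memory stand-in for the agent_defs table (illustration only)."""

    def __init__(self, hash_fields):
        self._hash_fields = tuple(hash_fields)
        self._active = {}  # name -> (content_sha256, content)

    def _digest(self, content: dict) -> str:
        # Assumed canonical form: compact JSON array of ordered field values.
        values = [content.get(field) for field in self._hash_fields]
        blob = json.dumps(values, separators=(",", ":"), sort_keys=True)
        return hashlib.sha256(blob.encode("utf-8")).hexdigest()

    def put(self, content: dict) -> str:
        """Create or fork: the server derives the hash from the content.

        Any hash the client happens to send is not a hashed field,
        so it has no influence on the stored digest.
        """
        sha = self._digest(content)
        self._active[content["name"]] = (sha, content)
        return sha

    def verify(self, name: str, claimed_sha: str) -> dict:
        """Check the consumer's claim against the server's stored digest."""
        if name not in self._active:
            return {"exists": False}
        stored_sha, _ = self._active[name]
        return {"exists": True, "matches": stored_sha == claimed_sha}
```

With this shape, a consumer whose serialization drifts simply gets matches: false and re-pushes its content; it cannot cause the server to record a digest that disagrees with what is stored.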

One substrate, every wire surface

Once the substrate exists, the question becomes how humans and other systems poke at it. The answer turns out to be the same as it is for everything else in loomcycle: offer the same operation on every wire surface, so callers can pick whichever fits their context.

HTTP. POST /v1/_substrate/agent_defs, /verify, /fork, /retire. Bearer-auth'd. Stable wire contract.
MCP. Server-side tools mcp__loomcycle__substrate_agent_create, _verify, _fork, _retire. Same parameters, same auth surface, model-callable when an agent needs to install or update another agent at runtime.
gRPC. One SubstrateService with symmetric methods. Used by the n8n connector under the hood.
TypeScript adapter. client.createAgentDef(…), client.verifyAgentDef(…), client.forkAgentDef(…), client.listLibraryAgents(…). Same shape, idiomatic TS types.
Web UI. The Library admin modal, which is a fully structured form per definition kind - agent system prompts as markdown textareas, allowed-tools as multi-select, the lot. (We had a JSON-textarea phase. Operators with raw newlines in their system prompts hit "JSON parse error" on submit. The structured form replaced it on the next iteration.)

There are roughly twelve substrate operations. Each surface gets all twelve, with identical semantics. That's the rule we keep arriving back at: the runtime's primitives should be visible from every wire dialect, because no team has the same wire preference and we'd rather not litigate it.
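
One way to keep semantics identical across surfaces is to have each surface be a thin adapter over a single operation table. The sketch below shows the idea for two surfaces only; make_surfaces and the dispatch scheme are hypothetical, and only the HTTP path prefix and MCP tool-name prefix are taken from the list above.

```python
from typing import Callable, Dict

Handler = Callable[[dict], dict]

HTTP_PREFIX = "/v1/_substrate/agent_defs/"
MCP_PREFIX = "mcp__loomcycle__substrate_agent_"


def make_surfaces(operations: Dict[str, Handler]):
    """Build HTTP and MCP entry points over one shared operation table."""

    def dispatch(op: str, params: dict) -> dict:
        if op not in operations:
            raise KeyError(f"unknown substrate operation: {op}")
        return operations[op](params)

    def http(path: str, body: dict) -> dict:
        # e.g. "/v1/_substrate/agent_defs/verify" -> "verify"
        if not path.startswith(HTTP_PREFIX):
            raise KeyError(f"not a substrate path: {path}")
        return dispatch(path[len(HTTP_PREFIX):], body)

    def mcp(tool: str, args: dict) -> dict:
        # e.g. "mcp__loomcycle__substrate_agent_verify" -> "verify"
        if not tool.startswith(MCP_PREFIX):
            raise KeyError(f"not a substrate tool: {tool}")
        return dispatch(tool[len(MCP_PREFIX):], args)

    return http, mcp
```

Because both adapters end in the same dispatch call, adding an operation to the table exposes it on every surface at once, and a surface cannot drift in behavior without changing the shared handler.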

What we'd do differently

Three things.

One: ship MCPServerDef sooner. AgentDef landed at v0.8.5, SkillDef at v0.8.22, MCPServerDef finally at v0.9.x. For three weeks operators could update agents and skills at runtime but had to redeploy loomcycle to change which MCP servers got mounted. The asymmetry felt wrong every time it came up, and the eventual implementation was a straightforward mirror of the AgentDef one - no real reason to wait.

Two: the consumer-side hash was a wrong turn we should have caught at design review. Anyone who's ever written a content-addressed storage system has seen the "two clients hash the same content differently" failure mode. We didn't think about it on day one; the cleanup PR was a multi-hundred-line revert plus an apology in the commit message.

Three: the canonical resolver should have been there before the second loading path was. Each new path you add to a system that already has one path doubles the surface area for fix-one-forget-the-other bugs. The two pre-resolver sub-spawn fixes were both six-line patches that wouldn't have been needed at all.

None of these are surprising in hindsight. All three were forecastable at design time. The lesson, again, is that the shape of the substrate matters more than the substrate itself. Once you have something that's content-addressed, monotonic, verifiable, and pushed-not-pulled, you can hang almost anything off it. The substrate becomes the part of the system that stays still while everything around it changes shape.

Companion writeup, coming next: Scrubbing the model's incoming mail - a PostTool hook that strips known injection patterns from WebFetch / WebSearch / Brave results before they hit the agent's context. Different problem, same overall move: do the work outside the model where you can.