When the agent is in one container and its definition is in another
For most of loomcycle's early life, the way an agent got into the
runtime was deeply boring: someone wrote
.claude/agents/job-searcher.md with YAML frontmatter
and a system-prompt body; loomcycle, configured with a path to that
directory, read the file off disk on every /v1/runs
call. That works beautifully when loomcycle and the application
live on the same VM and share a working-copy checkout.
It stops working the moment you put them in different containers.
Which is what we did at the end of week 33. JobEmber moved into its own Docker image. Loomcycle moved into its own. The shared filesystem assumption that the entire agent-loading path implicitly depended on quietly evaporated. The runtime had no agents to run; the app had agents but no way to install them into the runtime; and the legacy "read from disk" code path pointed at a directory that didn't exist inside the loomcycle container.
This post is about the substrate that replaced that — the
AgentDef / SkillDef / MCPServerDef
trio — and the design decisions behind it that took longer to
get right than the code itself did.
What the substrate actually is
Three database tables. Each one stores a versioned, signed definition of one thing the runtime can dispatch:
- agent_defs stores AgentDef rows: the things you used to write in .claude/agents/*.md. Name, description, system prompt, allowed tools, default model, tier, max iterations, memory scopes.
- skill_defs stores SkillDef rows: the things you used to write in .claude/skills/<name>/SKILL.md. Name, description, body, allowed tools.
- mcp_server_defs stores MCPServerDef rows: runtime registrations of MCP servers, replacing the static mcp_servers block in loomcycle.yaml when you need server-set membership to change without a config redeploy.
Each row carries a content_sha256 column computed
from a fixed, ordered subset of the row's fields. For an AgentDef
the digest covers name, description,
system_prompt, allowed_tools,
skills, model, provider,
tier, effort, max_tokens,
max_iterations, providers,
models, memory_scopes, and
memory_quota_bytes. Two consumers computing the
digest the same way over the same content get the same hash.
That's the whole point — it lets the consumer ask, in one
round-trip, "do you already have this definition?"
and act accordingly.
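As a rough illustration of that digest, here is a minimal sketch. The field order comes from the list above; the serialization (a compact JSON array of field/value pairs), the function name agent_def_digest, and the sample values are assumptions made for this example, not loomcycle's actual canonical form.

```python
import hashlib
import json

# Field order taken from the list above. The serialization is an assumption.
AGENT_DEF_HASH_FIELDS = (
    "name", "description", "system_prompt", "allowed_tools", "skills",
    "model", "provider", "tier", "effort", "max_tokens", "max_iterations",
    "providers", "models", "memory_scopes", "memory_quota_bytes",
)


def agent_def_digest(row: dict) -> str:
    """Hash only the listed content fields, in a fixed order.

    Missing fields are encoded as null so every row yields the same
    number of entries. Keys that are not in the list are ignored.
    """
    ordered = [[name, row.get(name)] for name in AGENT_DEF_HASH_FIELDS]
    # sort_keys only affects nested objects; the outer list keeps our order.
    canonical = json.dumps(
        ordered, separators=(",", ":"), sort_keys=True, ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical sample rows.
row_a = {"name": "job-searcher", "system_prompt": "Find jobs.", "max_iterations": 8}
# Same content, different key order, plus a column the digest does not cover.
row_b = {"max_iterations": 8, "version": 7, "system_prompt": "Find jobs.", "name": "job-searcher"}
# Different content.
row_c = {**row_a, "system_prompt": "Find better jobs."}

print(agent_def_digest(row_a) == agent_def_digest(row_b))
print(agent_def_digest(row_a) == agent_def_digest(row_c))
```

Under these assumptions the first comparison holds and the second does not: key order and uncovered columns do not move the hash, while a change to any covered field does.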
Boot-time push, not boot-time pull
The first design that didn't work was the obvious one: have
loomcycle pull definitions from the consumer over HTTP at boot.
"Loomcycle, please fetch your agents from
https://jobember.com/agents.json on startup."
That collapses a lot of complexity into one direction of dependency,
and it sounds nice on a whiteboard.
It dies on the operational side. Loomcycle would refuse to come up if JobEmber's HTTP surface was down. The dependency arrow points the wrong direction: the runtime that holds the loop is now blocked by the consumer that wants to use the loop. CI got uglier; staging got uglier; deployment order started to matter; and the runtime's "I am a small Go binary that can also run on its own with a single yaml" story died.
So we inverted it. The consumer pushes. At boot, the JobEmber
process scans its own bundled .claude/ directory
(still the source of truth, still living next to the application
code in the same Docker image), computes a SHA-256 over each
definition's content fields, and for each one calls a tiny
sequence against loomcycle's substrate:
POST /v1/_substrate/agent_defs/verify
  { name: "job-searcher", content_sha256: "ab12…" }
  → 200 { exists: true, matches: true }
  ↓ skip (already in sync)

POST /v1/_substrate/agent_defs/verify
  { name: "job-searcher", content_sha256: "cd34…" }
  → 200 { exists: true, matches: false }
  ↓ POST /v1/_substrate/agent_defs/fork (new active version)

POST /v1/_substrate/agent_defs/verify
  { name: "new-agent", content_sha256: "ef56…" }
  → 200 { exists: false }
  ↓ POST /v1/_substrate/agent_defs/create
One round-trip per name in the steady state — verify
returns matches: true and the consumer moves on.
Boot is fast and idempotent. Multi-replica races on first boot
(two JobEmber pods trying to create the same agent simultaneously)
are handled by re-verify after a 409: whoever wrote first wins,
everyone else's verify call now matches, and the
pod proceeds. No coordinator, no leader election.
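The verify, then create-or-fork, then re-verify-on-409 flow can be sketched on the consumer side roughly as follows. FakeSubstrate, Conflict, and sync_one are hypothetical names invented for this illustration; the in-memory class stands in for the HTTP endpoints and simply stores whatever hash it is handed, which is a simplification.

```python
from dataclasses import dataclass, field


class Conflict(Exception):
    """Stands in for an HTTP 409 from create or fork."""


@dataclass
class FakeSubstrate:
    """In-memory stand-in for the substrate endpoints (hypothetical)."""
    active: dict = field(default_factory=dict)  # name -> active content_sha256
    calls: list = field(default_factory=list)   # log of (operation, name)

    def verify(self, name, sha):
        self.calls.append(("verify", name))
        if name not in self.active:
            return {"exists": False}
        return {"exists": True, "matches": self.active[name] == sha}

    def create(self, name, sha):
        self.calls.append(("create", name))
        if name in self.active:
            raise Conflict(name)
        self.active[name] = sha

    def fork(self, name, sha):
        self.calls.append(("fork", name))
        self.active[name] = sha


def sync_one(substrate, name, sha):
    """Bring one definition in sync and report which action was taken."""
    status = substrate.verify(name, sha)
    if status["exists"] and status["matches"]:
        return "skipped"
    try:
        if status["exists"]:
            substrate.fork(name, sha)
            return "forked"
        substrate.create(name, sha)
        return "created"
    except Conflict:
        # Another replica wrote first. Re-verify: if its write matches
        # ours, there is nothing left to do.
        again = substrate.verify(name, sha)
        if again.get("exists") and again.get("matches"):
            return "skipped"
        raise


substrate = FakeSubstrate()
print(sync_one(substrate, "job-searcher", "ab12"))  # first boot
print(sync_one(substrate, "job-searcher", "ab12"))  # steady state
print(sync_one(substrate, "job-searcher", "cd34"))  # content changed
```

The steady-state call costs exactly one verify, which is what keeps repeated boots cheap in this sketch.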
Static + dynamic, resolved through one path
Substrate-backed isn't the only way an agent can exist
in loomcycle. A standalone deployment with a fixed agent roster
still wants loomcycle.yaml with a static
agents: block — no Postgres, no consumer pushes,
no machinery the workload doesn't need. Both modes have to
coexist cleanly in one binary.
The first cut had two separate load paths. Static agents went
through one resolver. Dynamic substrate-backed agents went
through a different resolver. Sub-agent spawning
(Agent.runSubAgent) initially only knew about
the static path, and the introduction of dynamic agents meant
sub-spawns of dynamic agents resolved nil at
run time — visible only when a higher-tier orchestrator tried
to spawn one of the new agents and got a confusing
"agent not found" error.
The fix was the lookup.Agent canonical resolver
(PR
#188). One function. Takes a name, returns the resolved
AgentDef regardless of where it lives. Yaml-static agents and
substrate-backed agents go through the same function, in the
same order: substrate first (so dynamic overrides static at
runtime, an operator's most likely intent), yaml second as a
fallback. The bugs behind the two preceding sub-spawn fixes,
fix(resolver): substrate-registered agents must resolve
via /v1/runs and fix(api): runSubAgent resolves
through lookup.Agent, both become impossible once
the canonical resolver lands. There's no second path to forget
to update.
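The shape of a single canonical resolver can be sketched in a few lines. This is not loomcycle's Go implementation: lookup_agent, run_agent, and the two dictionaries are hypothetical stand-ins for the substrate tables and the yaml agents: block.

```python
from typing import Mapping, Optional


def lookup_agent(name: str,
                 substrate: Mapping[str, dict],
                 static: Mapping[str, dict]) -> Optional[dict]:
    """Resolve an agent by name: substrate first, yaml-static second."""
    for source, defs in (("DYNAMIC", substrate), ("STATIC", static)):
        found = defs.get(name)
        if found is not None:
            # Tag the result with the path it came from.
            return {**found, "source": source}
    return None


def run_agent(name, substrate, static):
    """Every entry point resolves through lookup_agent, never around it."""
    agent = lookup_agent(name, substrate, static)
    if agent is None:
        raise LookupError(f"agent not found: {name}")
    return agent


# Top-level runs and sub-agent spawns share the same entry point.
run_sub_agent = run_agent

# Hypothetical definitions for illustration.
static_defs = {"job-searcher": {"model": "model-a"}, "summarizer": {"model": "model-a"}}
substrate_defs = {"job-searcher": {"model": "model-b"}}

print(run_sub_agent("job-searcher", substrate_defs, static_defs))
print(run_agent("summarizer", substrate_defs, static_defs))
```

Because the sub-spawn path is literally the same function here, a definition that resolves for a top-level run cannot fail to resolve for a sub-agent.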
In the Web UI Library view, each agent / skill / MCP-server
entry now carries a STATIC or DYNAMIC
chip showing which path it came from. Operators can see at a
glance which definitions are baked in and which are
runtime-managed. We added that because we kept getting confused
ourselves.
The hash is a contract, not a checksum
Picking what goes into the content_sha256 looks like
a minor implementation detail. It is not. The hash is the
consumer's statement of equivalence: "if this hash
matches what you have, you and I agree on what this agent is, and
you do not need to update anything." If the hash covers too
little, the consumer pushes new content and the runtime silently
keeps the old version. If the hash covers too much — say, it
includes created_at or version — every
consumer boot looks like a change and the substrate fills with
identical-content fork rows.
What the AgentDef hash explicitly does not cover:
version, created_at,
updated_at, active, retired,
retired_at, the row's primary key. Those are
metadata about this stored version of the content, not
facts about the content itself. The consumer doesn't care which
version row it's looking at; it cares whether the content
matches.
The first-pass implementation had the consumer compute the hash and
include it in the create/fork payload. That was a lot of trust
to place in the consumer side. Two days later we flipped it: the
server computes the hash from the content fields it receives,
stores that, and returns it. The consumer's
content_sha256 in the verify payload becomes a
claim the server checks against its own computation —
"I think this should hash to X; do you agree?" The
loomcycle hash CLI subcommand exists for local dev
and CI but isn't a deployment dependency anymore. JobEmber's
build pipeline lost a step. We deleted a few hundred lines of
sidecar-baking code.
The rule of thumb that fell out: if two services need to agree about whether a piece of content is the same, compute the digest server-side from the content's canonical form, not consumer-side from whatever the consumer thinks the content is. The latter is a foot-gun the moment the consumer's serialization, key order, or trailing whitespace handling drifts a little.
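A minimal sketch of that rule, with the server as the only place a digest is computed. SubstrateServer, canonical_digest, and the abbreviated field list are assumptions made for this example; the real field set is longer and the real canonical form may differ.

```python
import hashlib
import json

# Abbreviated field list for the sketch.
CONTENT_FIELDS = ("name", "description", "system_prompt", "allowed_tools")


def canonical_digest(content: dict) -> str:
    """Digest over a fixed, ordered subset of fields in one canonical form."""
    ordered = [[name, content.get(name)] for name in CONTENT_FIELDS]
    blob = json.dumps(
        ordered, separators=(",", ":"), sort_keys=True, ensure_ascii=False
    )
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


class SubstrateServer:
    """Toy server: it computes every stored digest itself."""

    def __init__(self):
        self._active = {}  # name -> digest of the active version

    def create(self, content: dict) -> str:
        # Any hash the client put in the payload is ignored here.
        digest = canonical_digest(content)
        self._active[content["name"]] = digest
        return digest

    def verify(self, name: str, claimed_sha256: str) -> dict:
        # The client's hash is a claim checked against the stored digest.
        stored = self._active.get(name)
        if stored is None:
            return {"exists": False}
        return {"exists": True, "matches": stored == claimed_sha256}


server = SubstrateServer()
returned = server.create({
    "name": "job-searcher",
    "system_prompt": "Find jobs.",
    "content_sha256": "client-side-guess",  # not trusted, not stored
})
print(server.verify("job-searcher", returned))
print(server.verify("job-searcher", "client-side-guess"))
```

A consumer whose local hashing has drifted gets matches: false in this sketch and re-pushes the content, rather than silently disagreeing with the server about what is stored.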
One substrate, every wire surface
Once the substrate exists, the question becomes how humans and other systems poke at it. The answer turns out to be the same as it is for everything else in loomcycle: offer the same operation on every wire surface, so callers can pick whichever fits their context.
- HTTP. POST /v1/_substrate/agent_defs, /verify, /fork, /retire. Bearer-auth'd. Stable wire contract.
- MCP. Server-side tools mcp__loomcycle__substrate_agent_create, _verify, _fork, _retire. Same parameters, same auth surface, model-callable when an agent needs to install or update another agent at runtime.
- gRPC. One SubstrateService with symmetric methods. Used by the n8n connector under the hood.
- TypeScript adapter. client.createAgentDef(…), client.verifyAgentDef(…), client.forkAgentDef(…), client.listLibraryAgents(…). Same shape, idiomatic TS types.
- Web UI. The Library admin modal, which is a fully structured form per definition kind: agent system prompts as markdown textareas, allowed-tools as multi-select, the lot. (We had a JSON-textarea phase. Operators with raw newlines in their system prompts hit "JSON parse error" on submit. The structured form replaced it on the next iteration.)
The substrate operations are roughly twelve. Each surface gets all twelve, with identical semantics. That's the rule we keep arriving back at: the runtime's primitives should be visible from every wire dialect, because no team has the same wire preference and we'd rather not litigate it.
What we'd do differently
Three things.
One: ship MCPServerDef sooner. AgentDef landed at v0.8.5, SkillDef at v0.8.22, MCPServerDef finally at v0.9.x. For three weeks operators could update agents and skills at runtime but had to redeploy loomcycle to change which MCP servers got mounted. The asymmetry felt wrong every time it came up, and the eventual implementation was a straightforward mirror of the AgentDef one — no real reason to wait.
Two: the consumer-side hash was a wrong turn we should have caught at design review. Anyone who's ever written a content-addressed storage system has seen the "two clients hash the same content differently" failure mode. We didn't think about it on day one; the cleanup PR was a multi-hundred-line revert plus an apology in the commit message.
Three: the canonical resolver should have been there before the second loading path was. Each new path you add to a system that already has one path doubles the surface area for fix-one-forget-the-other bugs. The two pre-resolver sub-spawn fixes were both six-line patches that wouldn't have been needed at all.
None of these are surprising in hindsight. All three were forecastable at design time. The lesson, again, is that the shape of the substrate matters more than the substrate itself. Once you have something that's content-addressed, monotonic, verifiable, and pushed-not-pulled — you can hang almost anything off it. The substrate becomes the part of the system that stays still while everything around it changes shape.
Companion writeup, coming next: Scrubbing the model's incoming mail — a PostTool hook that strips known injection patterns from WebFetch / WebSearch / Brave results before they hit the agent's context. Different problem, same overall move: do the work outside the model where you can.