This is an idea for structured AI loop execution.
- Foreword
- The Problem
- Initial Attempt: SkillSpecs in Choir
- Programmable Agents
- Skill to Schema
- Taking It Further
- Closing Thoughts
Foreword
While designing choir-agent, I noticed an asymmetry:
tool calling was structured and typed, but skills were not. This blog
will focus on the latter, proposing a structured system for LLM
orchestration of “skills”.
This post is a result of my reflections on Choir (repo). In the first half, I will talk about the friction I felt when designing the system and how I sought to resolve it; in the second half, I will detail the idea that arose from that resolution.
After some thought, I arrived at the design of the schema: a way to provide tiered orchestration of agent loops, like chords in a song.
Infinite songs are orchestrated by finite chords and rhythm.
The Problem
When designing choir, there were a lot of existing tools
I could leverage to simplify the design. The one that caught my eye was
structured tool calling: defined in a JSON format and occupying a
separate field in the API request body.
And I noticed the asymmetry: there was no comparable definition for skills.
After a quick search online, I found that most “skill frameworks” act
like prompt templates that are optionally injected. They
usually provide a sequence of actions for the agent to follow, and
encode a pipeline with potentially implicit control flow. But that’s not
what I wanted for choir-agent: injected templates can
pollute the prompt, and the next step in the control flow may be
ambiguous.
I wanted something that can describe what gets injected into the prompts for a given step, and describe the control flow explicitly.
Initial Attempt: SkillSpecs in Choir
The SkillSpec
design in choir was my initial approach to solving this
problem: encode skill execution as a state machine schema.
Skills are encoded as state machines, and each state has an objective
with a list of allowed tools and transitions. The agent starts the
execution of the skill on a specific state, and terminates on another
pre-defined state, with max_steps encoding how many steps
it may take before being terminated by the choir-agent
orchestrator.
The immediate benefit is explicit transitions between states, which
means explicit control flow: the agent knows exactly which next steps
can and cannot be taken at any given state. And since this was
co-designed with choir-agent, the choir-agent
logic builds prompts with only the information necessary for the
current state, preventing unintended prompt pollution.
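To make the shape concrete, here is a minimal sketch of what a SkillSpec-like definition could look like in Python. The field names (objective, allowed_tools, max_steps, and so on) are my illustration, not choir’s actual API:

```python
from dataclasses import dataclass, field


@dataclass
class SkillState:
    # Objective injected into the prompt while this state is active.
    objective: str
    # Tools the agent may call in this state.
    allowed_tools: list[str] = field(default_factory=list)
    # Names of states the agent may transition to next.
    transitions: list[str] = field(default_factory=list)


@dataclass
class SkillSpec:
    name: str
    states: dict[str, SkillState]
    start_state: str
    end_state: str
    # Hard cap before the orchestrator terminates the skill.
    max_steps: int = 16


# Example: a two-state "research" skill.
research = SkillSpec(
    name="research",
    states={
        "gather": SkillState(
            objective="Collect relevant sources.",
            allowed_tools=["web_search"],
            transitions=["summarize"],
        ),
        "summarize": SkillState(
            objective="Summarize the gathered sources.",
        ),
    },
    start_state="gather",
    end_state="summarize",
)

assert research.states[research.start_state].transitions == ["summarize"]
```

Because the transitions are data rather than prose, the orchestrator can validate every step the agent proposes before executing it.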
Emergent Problem
This system solved the problems of skills being neither deterministically mediated nor explicitly explainable. But it was missing something important: there is no scoping for skills.
If we take a step back and think about our own skills, we usually don’t rely on a single skill when completing tasks; we compose the workflow with them, and we embed smaller skills in larger ones.
Why can’t the agents do the same? How can agents do the same?
Programmable Agents
When thinking more deeply about composition, a new idea emerged.
Let’s think about this with an analogy to computers:
- The agent is a computer in execution.
- The agent loop orchestrator is the machine and OS kernel.
- The agent loop itself is the program.
- The tools are system calls.
Where do skills fit?
And my answer is: in the current sense, skills are routines that the program or other routines use. If we treat them this way, then the structure becomes clear: skills are, in essence, functions.
Then I remembered that the main() function, the program
entry point and root of all execution, is itself just a
function.
Skills encode the same thing as agent loops; the only difference is that agent loops use deterministic logic, while skills are mostly text-based and loosely defined.
So, how can we abstract skills into functions?
Skill to Schema
At this point, “skill” is an understatement for the concept of “functions” in agent systems: functions cover the entirety of execution, while the name “skill” only suggests a subroutine within it. So, I propose a new name: “schema”.
A schema externalizes control flow, capability boundaries, and prompt structure into a deterministic state machine.
State Machine at Its Core
A schema defines execution by providing a state machine, with explicit states and transitions.
Each state contains a set of transitions to next states, the set of sub-schemas it can spawn, and the set of tools it can use.
Embedding state machines is how schemas should compose with each other, exactly like how you would embed functions into other functions as subroutines. A parent schema may spawn sub-schemas, and its state completes only when the child schema exits.
The cleanest way to think about it is to treat schemas as descriptive functional programming languages, where the imperative execution is handled by the LLM while the orchestrator increments the step.
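A tiny sketch of that composition discipline, using a plain stack with schema names standing in for full schema objects:

```python
# The orchestrator keeps a stack of active schemas. Spawning a
# sub-schema pushes it; the parent's current state may only complete
# and transition once the child exits and is popped — exactly the
# call/return discipline of nested functions.

stack: list[str] = []


def spawn(schema: str) -> None:
    # Enter a sub-schema, like a function call.
    stack.append(schema)


def exit_schema() -> str:
    # Child done; control returns to the parent schema's state.
    return stack.pop()


spawn("root_loop")
spawn("research")              # the root's state blocks on this child
assert exit_schema() == "research"
assert stack == ["root_loop"]
```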
Necessary Components
Each schema should, at minimum, contain the following components:
- Schema prompt: the overall objective of the current state machine.
- States: the states of the state machine.
- Some protection that provides provable termination or hands control back to the user.
Unlike prompt-based skills, schemas can enforce bounded execution by construction, allowing the orchestrator to guarantee termination within a defined scope.
And each state in a schema should contain the following components:
- State prompt: the objective of the current state.
- Set of transitions with descriptions.
- Set of tools with descriptions.
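Put together, the components above can be sketched as two small data types. This is a hypothetical encoding, with max_steps standing in for the termination protection:

```python
from dataclasses import dataclass, field


@dataclass
class State:
    prompt: str  # objective of the current state
    transitions: dict[str, str] = field(default_factory=dict)  # target -> description
    tools: dict[str, str] = field(default_factory=dict)        # tool -> description
    sub_schemas: list[str] = field(default_factory=list)       # schemas this state may spawn


@dataclass
class Schema:
    prompt: str  # overall objective of this state machine
    states: dict[str, State]
    entry: str
    # Termination protection: the orchestrator force-exits, or hands
    # control back to the user, once this budget is exhausted.
    max_steps: int = 32


triage = Schema(
    prompt="Route the user request to the right workflow.",
    states={
        "route": State(
            prompt="Pick a destination workflow.",
            transitions={"done": "Routing complete."},
            tools={"lookup": "Consult the routing table."},
        ),
    },
    entry="route",
)

assert triage.states[triage.entry].transitions == {"done": "Routing complete."}
```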
Prompt Construction
Runtime prompt construction is simple and hierarchical, and could be implemented like the following:
prompt := master_instructions
for schema in active_schemas sorted from top to bottom:
    prompt.append(schema.prompt)
    prompt.append(schema.current_state.prompt)
prompt.append(user_message)
prompt.append(execution_context)
transitions := active_schemas[bottom].transitions
tools := active_schemas[bottom].tools
# load transitions and tools into their respective places
The core idea is that the LLM should receive the hierarchical execution stack of the layered schemas and use it as direction for navigating the user task. Flexibility remains in optionally masking the prompts of upper-level schemas and in where the context and user message are injected.
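The pseudocode above can be turned into a small runnable sketch. The dict shapes here are assumptions for illustration, not choir-agent’s real data model:

```python
def build_prompt(master_instructions, active_schemas, user_message, execution_context):
    """Assemble the hierarchical prompt; a sketch, not an actual orchestrator API."""
    parts = [master_instructions]
    for schema in active_schemas:  # ordered top (root) to bottom (innermost)
        parts.append(schema["prompt"])
        parts.append(schema["current_state"]["prompt"])
    parts.append(user_message)
    parts.append(execution_context)
    # The innermost (bottom) schema decides the legal moves for this step.
    bottom = active_schemas[-1]["current_state"]
    return "\n\n".join(parts), bottom["transitions"], bottom["tools"]


schemas = [
    {"prompt": "Root loop.",
     "current_state": {"prompt": "Dispatch the task.",
                       "transitions": ["done"], "tools": ["spawn"]}},
    {"prompt": "Research.",
     "current_state": {"prompt": "Gather sources.",
                       "transitions": ["summarize"], "tools": ["web_search"]}},
]

prompt, transitions, tools = build_prompt(
    "You are an agent.", schemas, "Find X.", "Step 3 of 10.")
assert tools == ["web_search"]
```

Only the bottom schema contributes transitions and tools, while every layer contributes prompt text, which matches the function-call analogy: the innermost frame executes, the outer frames provide context.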
Taking It Further
As mentioned earlier, the base agent loop itself is, in the analogy,
the main() function. This means that we can encode
different agent loops as different schemas.
Furthermore, when considering user interactions via non-LLM processed commands in the agent orchestration system (“the kernel”), the user may switch between different agent loops during agent runtime when needed.
This modularity gives us hot-swappable agent loops, letting the user change the fundamental behavior of the agent at runtime.
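As a sketch, such a kernel command might look like the following; the “/loop” command syntax is purely hypothetical:

```python
def handle_command(cmd: str, stack: list[str]) -> list[str]:
    # Non-LLM command path: "/loop <name>" replaces the root schema,
    # hot-swapping the agent loop while the session keeps running.
    if cmd.startswith("/loop "):
        stack[0] = cmd.split(" ", 1)[1]
    return stack


stack = ["chat"]                       # root agent loop is itself a schema
handle_command("/loop research", stack)
assert stack == ["research"]
```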
Closing Thoughts
Reflection, as usual, is the main driver that propels me toward my next idea.
This is an idea that arose from reflecting on existing systems and on building my own. When you build, think afterward about which boundaries you challenged; doing so often grants new insight into the frictions you faced.
Systems are rarely born complete. They are refined through friction and reflection. Schema was one such attempt at refinement in the world of agent systems.
Keep building, keep reflecting, and keep building more.
