Engineering · Product Deep Dive · Data Integration

Building the tool layer: how Korso writes back to your ERP

A deep dive on Korso's tool layer — twenty-seven tools across eight categories, with idempotency, tool-tier policy, and automatic logging built in.

By The Korso Team · 8 min read

The tool layer is where the agent stops talking and starts doing. Everything upstream — case state, operating model, classification, reasoning — is preparation. The tool call is the moment the world changes. This post is a walk-through of how Korso's tool layer is built, why it looks the way it does, and what trade-offs we made along the way.

If you are building an operations agent and wondering how to organize tools that write to systems your customer depends on, this is the architecture review we wish someone had published a year ago.

What the tool layer is for

Korso's tool layer has twenty-seven tools today, grouped into eight category files: sales, customer, product, manufacturing, purchasing, inventory, accounting, and write. Some examples by category:

  • Sales: query_sales_order, update_sales_order_status, list_open_quotes
  • Customer: query_customer, propose_customer_update
  • Purchasing: create_purchase_order, confirm_purchase_order, update_po_line_quantity
  • Inventory: query_stock, query_reservation
  • Write: generic propose_write for novel operations not yet first-classed

A tool is the unit of action. Each one has a typed input schema, a typed output, a tier, idempotency requirements, and automatic logging back into the case history. The agent does not call ERP APIs directly; it calls tools, and the tool layer talks to Hermes (our connector layer) on the agent's behalf.
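As a sketch, that contract might look like the following. The type and field names here are illustrative, not Korso's actual definitions:

```typescript
// Hypothetical sketch of the tool contract described above.
type ToolTier =
  | 'tier_0_safe'
  | 'tier_1_auto'
  | 'tier_2_approval_required'
  | 'tier_3_blocked';

interface ToolContext {
  caseId: string;
  customerId: string;
}

interface Tool<In = unknown, Out = unknown> {
  name: string;
  category: string;
  tier: ToolTier;       // default tier; per-customer policy can override
  idempotent: boolean;  // executor enforces the idempotency-key check
  schema: unknown;      // typed input schema (e.g. a Zod schema)
  execute: (input: In, ctx: ToolContext) => Promise<Out>;
}

// A read-only tool as a sample value (inputs and outputs invented):
const queryStock: Tool<{ sku: string }, { onHand: number }> = {
  name: 'query_stock',
  category: 'inventory',
  tier: 'tier_0_safe',
  idempotent: true,
  schema: null,
  execute: async () => ({ onHand: 0 }),
};
```

The point of the shape is that tier, idempotency, and logging metadata travel with the tool definition, so the executor can enforce them uniformly.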

Why tools, not direct API calls

The most obvious alternative was letting the agent call connector functions directly. We considered it for about a week. Three reasons we rejected it:

Policy. Tool-tier classification needs to apply uniformly. If half the agent's actions go through tools and half go directly to the connector, the policy surface is full of holes. Every write goes through tools, no exceptions.

Idempotency. Every tool call carries an idempotency key derived from the case id, the tool name, and a hash of inputs. The tool executor checks this key against a tool_executions table before firing. Retrying the same tool with the same inputs is a no-op that returns the original result. The agent can retry confidently; the ERP never sees a duplicate. This pattern is intrusive enough that you want it in one place, not twenty-seven.

Logging. Every tool call writes a case_event row before and after execution, capturing inputs, outputs, errors, and rationale. This is the data behind the audit trail. Mandating that every action goes through the tool layer is what makes the audit trail complete, not partial.

Tool tiers and approval

Each tool declares a tier in its definition:

export const createPurchaseOrder: Tool = {
  name: 'create_purchase_order',
  tier: 'tier_2_approval_required',
  schema: PurchaseOrderInputSchema,
  idempotent: true,
  category: 'purchasing',
  execute: async (input, ctx) => {
    // ...
  },
};

Tiers are coarse — tier_0_safe, tier_1_auto, tier_2_approval_required, tier_3_blocked — but the resolution against a specific customer's policy is fine-grained. The tool_policy table maps (customer_id, tool_name) to an effective tier, with per-tool overrides for thresholds. For instance, create_purchase_order is tier_2 by default, but customer Acme has a rule that auto-approves POs under $5,000 with a pre-approved supplier — that rule lives in the policy table, not in the tool's code.
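A minimal sketch of that resolution, assuming an in-memory rule list standing in for the tool_policy table (the rule shape and function names are invented for illustration):

```typescript
// Hypothetical sketch of per-customer tier resolution. In production the
// lookup would read the tool_policy table, not an array.
type Tier = 'tier_0_safe' | 'tier_1_auto' | 'tier_2_approval_required' | 'tier_3_blocked';

interface PolicyRule {
  customerId: string;
  toolName: string;
  defaultTier: Tier;
  // Optional threshold override, e.g. auto-approve small POs.
  autoApprove?: (inputs: Record<string, unknown>) => boolean;
}

function resolveEffectiveTier(
  rules: PolicyRule[],
  customerId: string,
  toolName: string,
  inputs: Record<string, unknown>,
): Tier {
  const rule = rules.find(
    (r) => r.customerId === customerId && r.toolName === toolName,
  );
  if (!rule) return 'tier_2_approval_required'; // conservative fallback
  if (rule.autoApprove?.(inputs)) return 'tier_1_auto';
  return rule.defaultTier;
}

// Acme's rule: auto-approve POs under $5,000 from a pre-approved supplier.
const rules: PolicyRule[] = [{
  customerId: 'acme',
  toolName: 'create_purchase_order',
  defaultTier: 'tier_2_approval_required',
  autoApprove: (i) =>
    (i.total as number) < 5000 && i.supplierPreApproved === true,
}];
```

Note the conservative fallback: an unknown (customer, tool) pair resolves to approval-required rather than auto-execute.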

The flow at execution time is:

  1. Agent decides to call create_purchase_order with inputs.
  2. Tool executor resolves effective tier from tool_policy(customer_id, 'create_purchase_order', inputs).
  3. If tier_2_approval_required and not auto-approved → case transitions to awaiting_approval, an approval-pending event is logged, the agent returns from this turn. Temporal will reassess when the approval signal arrives.
  4. If tier_1_auto (either default or auto-approved by policy) → idempotency check, then execute, then log.
  5. If tier_3_blocked → tool throws a structured error; the agent learns it cannot do this and updates the case rationale accordingly.
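The branching in steps 3–5 can be sketched as a single dispatch inside the executor. This is a simplified illustration (the outcome type and event handling are invented), not the production code path:

```typescript
// Hypothetical sketch of the executor's dispatch on the resolved tier.
// The executor branches; the agent never does.
type Outcome =
  | { kind: 'executed'; result: unknown }
  | { kind: 'awaiting_approval' }
  | { kind: 'blocked'; error: string };

function dispatch(tier: string, execute: () => unknown): Outcome {
  switch (tier) {
    case 'tier_0_safe':
    case 'tier_1_auto':
      // idempotency check would run here, before execute()
      return { kind: 'executed', result: execute() };
    case 'tier_2_approval_required':
      // case transitions to awaiting_approval; Temporal reassesses
      // when the approval signal arrives
      return { kind: 'awaiting_approval' };
    default:
      // tier_3_blocked: structured error the agent can reason about
      return { kind: 'blocked', error: `tool blocked by policy (${tier})` };
  }
}
```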

The agent code does not branch on tier. The agent just calls the tool. The tool layer enforces the policy. This is the abstraction we want — agents reason about what to do, tools reason about whether it is allowed.

Idempotency in practice

Idempotency is one of those things every system claims and very few actually implement correctly. We learned this the hard way.

Our key derivation is:

const idempotencyKey = hash(
  caseId,
  toolName,
  toolVersion,
  canonicalize(inputs),
);

canonicalize sorts object keys, normalizes timestamps to UTC, and strips fields explicitly marked as non-discriminating (updated_at, free-text notes that vary across retries, etc.). The tool's input schema declares which fields are part of the idempotency surface and which are not; this is a design decision we make explicitly per tool.
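A minimal sketch of that canonicalization, covering key sorting and the removal of non-discriminating fields (timestamp normalization is omitted; the signature is invented for illustration):

```typescript
// Hypothetical sketch of canonicalize: sort object keys recursively and
// drop fields declared non-discriminating for the idempotency surface.
function canonicalize(
  value: unknown,
  nonDiscriminating: Set<string> = new Set(),
): unknown {
  if (Array.isArray(value)) {
    return value.map((v) => canonicalize(v, nonDiscriminating));
  }
  if (value !== null && typeof value === 'object') {
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(value as object).sort()) {
      if (nonDiscriminating.has(key)) continue; // e.g. updated_at, free-text notes
      out[key] = canonicalize(
        (value as Record<string, unknown>)[key],
        nonDiscriminating,
      );
    }
    return out;
  }
  return value;
}
```

Because keys are inserted in sorted order, `JSON.stringify(canonicalize(x))` is a stable serialization, suitable as input to the hash.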

The tool_executions table stores the key, the original result, and the timestamp. On a retry with the same key, we return the original result and log a tool_retry_deduplicated event. The ERP is never re-hit.
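The dedup behavior can be sketched like this, with an in-memory Map standing in for the tool_executions table (in production this is a keyed database lookup plus insert, not a Map):

```typescript
// Hypothetical sketch of executor-side deduplication.
interface ExecutionResult {
  result: unknown;
  deduplicated: boolean; // true → a tool_retry_deduplicated event is logged
}

const toolExecutions = new Map<string, unknown>();

async function executeOnce(
  idempotencyKey: string,
  run: () => Promise<unknown>,
): Promise<ExecutionResult> {
  const prior = toolExecutions.get(idempotencyKey);
  if (prior !== undefined) {
    // Retry with a known key: return the original result; ERP never re-hit.
    return { result: prior, deduplicated: true };
  }
  const result = await run();
  toolExecutions.set(idempotencyKey, result);
  return { result, deduplicated: false };
}
```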

What we got wrong the first time

Our v1 idempotency was naive: hash all the inputs. This sounds correct and is wrong. Real inputs from the agent always contain something timestamp-flavored — a "drafted at" field on a quote, a free-text note that varies because the LLM phrasing varies. Two functionally identical tool calls produced different keys, and the ERP saw duplicates.

The fix was the explicit per-tool declaration of which fields discriminate. It is more work to write a tool, but it is the kind of work you want to be explicit about. We have not had a duplicate-write incident in production since.

Logging and the case-event contract

Every tool call writes two events to atlas_case_events:

// Before execution
{
  case_id: 'case_018f3a...',
  type: 'tool_call_pending',
  tool: 'create_purchase_order',
  inputs: { ... canonicalized inputs ... },
  rationale: 'Supplier Alpha confirmed price at $4,200 < auto-approve threshold $5,000',
  skill: 'po-fanout-on-quote-acceptance',
  step: 'create-po-for-supplier-alpha',
  rule: 'auto-approve-po-under-5k-preapproved-supplier',
}
 
// After execution
{
  case_id: 'case_018f3a...',
  type: 'tool_call_complete',
  tool: 'create_purchase_order',
  result: { po_id: 'PO-2026-04-0173', external_url: '...' },
  duration_ms: 412,
  retry_count: 0,
}

Failures get a tool_call_failed event with the structured error. Approval-required pauses get tool_call_pending_approval. The skill, step, and rule columns are populated automatically by the executor from the agent's current context. The operator's audit experience — open a case, scroll the history — is the direct projection of this table.

Schema mapping, briefly

Tools accept agent-native input shapes and translate them to ERP-native shapes before calling Hermes. The translation is driven by the customer's operating model. For example:

// Agent calls this:
createPurchaseOrder({
  supplier: 'sup_alpha',          // agent uses our customer ids
  line_items: [{ product: 'p_widget_a', quantity: 100, unit_price: 12.50 }],
  delivery_date: '2026-05-01',
  cost_center: 'project_x',       // agent-native, mapped per customer
});
 
// Hermes receives this (for an Odoo customer with custom cost-center field):
{
  partner_id: 1042,
  order_line: [[0, 0, { product_id: 8821, product_qty: 100, price_unit: 12.50 }]],
  date_planned: '2026-05-01 00:00:00',
  x_studio_cost_center: 'project_x',   // the custom field this customer added
};

The mapping is declarative, lives in the operating model, and is versioned per customer. When a customer extends their ERP with a new field, we add a mapping rule — not a code change. This was the single most-valuable architecture decision we made in the tool layer: keep the agent's vocabulary stable, push the per-customer variation into mappings, never let either bleed into the other.
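One way to picture a declarative mapping like this, with rules as data rather than code (the rule shape and the Acme/Odoo rule set below are invented for illustration):

```typescript
// Hypothetical sketch of a declarative field mapping. The real rules live
// in the customer's operating model and are versioned per customer.
interface FieldMapping {
  from: string;                        // agent-native field
  to: string;                          // ERP-native field
  transform?: (v: unknown) => unknown; // e.g. id lookup, date formatting
}

function applyMapping(
  input: Record<string, unknown>,
  mappingRules: FieldMapping[],
): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const rule of mappingRules) {
    if (rule.from in input) {
      out[rule.to] = rule.transform
        ? rule.transform(input[rule.from])
        : input[rule.from];
    }
  }
  return out;
}

// Per-customer rule set: a new custom ERP field is a new rule, not new code.
const acmeOdooRules: FieldMapping[] = [
  { from: 'supplier', to: 'partner_id', transform: () => 1042 /* id lookup */ },
  { from: 'cost_center', to: 'x_studio_cost_center' },
];
```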

Strict vs. permissive mapping

We support two modes via ATLAS_SCHEMA_MAPPING_STRICT. In strict mode (CI default), a missing mapping throws. In permissive mode (production default while we're maturing), it logs a warning and passes through, so a small mapping gap does not freeze the agent. We migrate customers from permissive to strict once their operating model is fleshed out — usually around the 60–90 day mark.
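The strict/permissive toggle reduces to a small decision at mapping-lookup time, sketched here with invented names:

```typescript
// Hypothetical sketch of the ATLAS_SCHEMA_MAPPING_STRICT behavior.
function mapField(
  field: string,
  mappings: Record<string, string>,
  strict: boolean,
): string {
  const mapped = mappings[field];
  if (mapped !== undefined) return mapped;
  if (strict) {
    // CI default: a missing mapping is a hard failure.
    throw new Error(`no mapping for field "${field}"`);
  }
  // Permissive: warn and pass through so a gap does not freeze the agent.
  console.warn(`no mapping for field "${field}"; passing through`);
  return field;
}
```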

What this gets us

A few properties fall out of this architecture that we did not initially realize were emergent:

  • New tools are cheap. Adding a 28th tool is ~150 lines of code: schema, tier, execute. The infrastructure — idempotency, logging, policy resolution, mapping — is shared.
  • Policy changes do not require deploys. Adjusting a customer's tier overrides is a tool_policy insert.
  • Audit is free. We did not build an audit subsystem. The audit is the event log.
  • Replaying a case is a real operation. Because every tool call is logged with inputs and outputs, we can rebuild a case's full timeline from the event log alone. This is what powers our debug tooling and is invaluable when an operator asks "what would have happened if I had approved last Friday?"
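In spirit, replay is a fold over the event log. A toy sketch (event shape simplified from the examples above; this is not the debug tooling itself):

```typescript
// Hypothetical sketch: rebuild a case's executed-tool timeline from events.
interface CaseEvent {
  type: string;
  tool?: string;
  result?: unknown;
}

function replayExecutedTools(
  events: CaseEvent[],
): { tool: string; result: unknown }[] {
  return events
    .filter((e) => e.type === 'tool_call_complete' && e.tool !== undefined)
    .map((e) => ({ tool: e.tool as string, result: e.result }));
}
```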

What we'd still change

The honest list:

  • The tier-vs-policy boundary is right in principle but feels noisy in practice. We are exploring collapsing them into a single resolved policy that includes the default tier as a fallback.
  • Strict mapping mode should be the default sooner. The "we'll migrate when the customer is mature" stance has a long tail.
  • The propose_write generic-write tool is a useful escape hatch but is over-used. We are working through which patterns should be promoted to first-class tools.

Overall, the tool layer is the part of Korso's architecture we are most confident about. It is also, not coincidentally, the part where we spent the most time on the contract design before writing the implementation. Operations agents live and die on the quality of the tool layer, and getting it right is mostly a design problem, not an engineering one.