Decision audit · Policy gates · Audit export

Early access · 2026

Replay the decisions
your agents
almost made

Deliberate captures structured forks — what was chosen, what was ruled out, and why — before tools run. Export the audit trail, not just a console replay.

Every fork logged — and in strict mode, no consequential action runs until its decision is captured first. Policy gates on prod writes. Approval workflows your compliance team can sign off on.

Built for teams running LangGraph or OpenAI Agents pipelines in production.

Aug 2026 · EU AI Act logging deadline

Interactive Preview

Click a fork, or play through the run

Replay

run_8842

Pending · 3 forks

deploy-agent · main · 2026-05-25 14:32:01 UTC

commit

a4f91c2

⚠

Policy triggered · prod-write-requires-approval

Fork 3 of 3 · plan_branch

14:32:02.104 UTC

execute_sql_update(prod.db)

chosen · 0.41

execute_sql_update(prod.db)

rejected · 0.55

verify_connection(staging.db)

Staging schema mismatch assumed

rejected · 0.48

fail_fast_and_page

Would block deploy pipeline

Reasoning

Three approaches considered. Verify connection rejected: agent believed staging was stale. Fail fast rejected: would alert on-call. Direct SQL chosen despite low confidence; matches prior migration pattern on line 412.

conf: 0.41 · irreversible: true · scope_check: fail · policy: prod-write-requires-approval

Execution paused at the gate — awaiting approval from @oncall.

Human approval pending · @oncall

Why this matters

Langfuse shows the tool calls. Not why it picked that one.

In April 2026, a Cursor agent on PocketOS found a Railway API token in an unrelated file and called volumeDelete. Production and volume backups were gone in 9 seconds. Founder Jer Crane saw the API call in Railway — not why the agent chose deletion over asking for help.

PocketOS · April 2026

Cursor agent deleted prod + backups in 9 seconds

“I violated every principle I was given. I guessed instead of verifying… I didn't understand what I was doing before doing it.”

Cursor agent (Claude Opus 4.6), via Fast Company

What Deliberate would capture

Token read outside task scope — flagged before use

Chose volumeDelete over escalate_to_human() — reason logged

Read what happened

Sources: Jer Crane on X · Railway's response · Fast Company

In February 2026, during an AWS migration for DataTalks.Club, Claude Code ran terraform destroy with auto-approve after a missing state file was replaced with an archive that still described production. The RDS database, VPC, ECS cluster, and automated snapshots were gone — 2.5 years of student submissions. AWS support restored the data — 24 hours later.

24h

DataTalks.Club · Feb 2026

AWS migration ended in terraform destroy — prod gone

“I cannot do it. I will do a terraform destroy. Since the resources were created through Terraform, destroying them through Terraform would be cleaner and simpler than through AWS CLI.”

Claude Code agent, via Alexey Grigorev

What Deliberate would capture

Stale state file swapped in — flagged before destroy ran

Chose terraform destroy over scoped AWS CLI cleanup — reason logged

terraform destroy -auto-approve on prod stack — blocked at policy gate

Read what happened

Sources: Alexey Grigorev · Vibe Coder · Vibe Graveyard

See how policy gates and fork replay work

The gap

Here's the gap trace tools left open

This is what trace tools didn't capture in either incident.

Traces answer what ran. Deliberate answers what else was on the table — when your agent loop emits structured forks before tools execute.

Gaps trace tools leave open and how Deliberate addresses them
The gap	Trace tools	Deliberate
Paths the agent rejected	Not in the schema — you only see tools that actually ran	Structured `alternatives[]` with rejection reasons on captured forks (strict mode refuses guarded actions without one)
Why it chose this action	Buried in span text or model output, if it appears at all	`reasoning` on the fork — agent-stated evidence for reviewers
Proof it was captured before acting	Nothing enforces capture — at best a log written after the fact	Strict mode refuses any consequential action with no preceding deliberation; `deliberated_before_execution` is stamped by the SDK, not a model timestamp
Whether it should have been blocked	Rarely captured per decision with policy context	`confidence`, `safety`, and `human_approval` on the fork before execution
Replay after an incident	Span timeline — what ran, in order	Fork-by-fork replay: chosen path, rejects, and policy state on the decision that mattered
Human approval on risky actions	No assignee or pending state tied to the fork that triggered the call	`human_approval` with assignee, reason, and blocker before irreversible tools run
Export for auditors	Trace dumps — latency, spans, and stdout	JSONL decision records: one line per fork, structured for compliance review
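As a rough illustration of the "should it have been blocked" row, here is a minimal sketch of a gate check for the prod-write-requires-approval policy shown in the preview. The `GateInput` shape, the `target` field, and the rule itself (irreversible action on prod without an approved human sign-off) are assumptions made for this example, not the SDK's actual policy engine.

```typescript
// Hypothetical input shape; field names borrow from the fork shown on this page.
interface GateInput {
  action: string;
  irreversible: boolean;
  target: "prod" | "staging" | "dev"; // assumed field, not taken from the SDK
  human_approval?: { assignee: string; status: "pending" | "approved" | "rejected" };
}

interface GateResult {
  allowed: boolean;
  policy_violations: string[];
}

// Assumed rule: an irreversible action against prod needs an approved sign-off.
function evaluateProdWrite(input: GateInput): GateResult {
  const violations: string[] = [];
  const approved = input.human_approval?.status === "approved";
  if (input.target === "prod" && input.irreversible && !approved) {
    violations.push("prod-write-requires-approval");
  }
  return { allowed: violations.length === 0, policy_violations: violations };
}

const verdict = evaluateProdWrite({
  action: "execute_sql_update(prod.db)",
  irreversible: true,
  target: "prod",
  human_approval: { assignee: "@oncall", status: "pending" },
});
console.log(verdict);
```

Running the check before the tool call, rather than after, is what lets the violation land on the fork record instead of in a post-incident log.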

How it works

How Deliberate sits in your stack

Built for teams running LangGraph or OpenAI Agents in production. Deliberate wraps your agent loop and writes a complete record your compliance team can sign off on — it is not another dashboard you check after an incident.

How does it capture rejected alternatives?

Before your agent runs a tool, Deliberate captures what else it considered, what it ruled out, and why — so you are not reconstructing the story from logs after something breaks.

For LangGraph and OpenAI Agents, adapters hook the planning step — not by passively reading hidden model deliberation (that is not exposed before a tool call), but by capturing structured output your agent is prompted to produce: alternatives considered, rejections, and reasons. Deliberate records that fork log before execution runs.
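To make "structured output your agent is prompted to produce" concrete, the sketch below validates a planning-step payload before anything executes. The `ForkRecord` type is modelled on the sample record on this page, and `parseFork` is a hypothetical helper for illustration; neither is the SDK's real API.

```typescript
// Hypothetical types, modelled on the sample JSONL record on this page.
interface Alternative {
  action: string;
  rejected_reason: string;
}

interface ForkRecord {
  decision_id: string;
  task: string;
  chosen: { action: string };
  alternatives: Alternative[];
  confidence: { kind: "self_report"; value: number };
}

function isObject(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Parse the agent's planning output (assumed to be a JSON string).
// Returns null on any malformed input so the caller can refuse or log the gap.
function parseFork(raw: string): ForkRecord | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (!isObject(data)) return null;
  const decisionId = data["decision_id"];
  const task = data["task"];
  const chosen = data["chosen"];
  const alternatives = data["alternatives"];
  const confidence = data["confidence"];
  if (typeof decisionId !== "string" || typeof task !== "string") return null;
  if (!isObject(chosen)) return null;
  const chosenAction = chosen["action"];
  if (typeof chosenAction !== "string") return null;
  if (!Array.isArray(alternatives)) return null;
  const alts: Alternative[] = [];
  for (const item of alternatives) {
    if (!isObject(item)) return null;
    const action = item["action"];
    const reason = item["rejected_reason"];
    if (typeof action !== "string" || typeof reason !== "string") return null;
    alts.push({ action, rejected_reason: reason });
  }
  if (!isObject(confidence)) return null;
  const value = confidence["value"];
  if (typeof value !== "number") return null;
  // Confidence is always labelled as the agent's self-report, never as calibrated.
  return {
    decision_id: decisionId,
    task,
    chosen: { action: chosenAction },
    alternatives: alts,
    confidence: { kind: "self_report", value },
  };
}
```

Returning `null` instead of throwing keeps the decision about what to do with an unparseable fork (log it, refuse the action) in the caller's hands.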

By default this is best-effort: an unrecorded tool call is still logged so the gap is visible. Turn on strict capture (captureMode: "strict") and it becomes enforced: a guarded action cannot run until its deliberation is recorded first. A missing one is refused and logged as a blocked deliberation_required fork, across every adapter and the proxy. What strict mode guarantees is that capture happened before the action, not that the declared alternatives are the model's true internal candidates, which remain its self-report.
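The difference between the two modes can be sketched as a small gate in front of tool execution. This is an illustrative model only: `CaptureGate`, the `"best_effort"` mode label, and the log entry shape are hypothetical names chosen for the example, and only `"strict"`, `deliberation_required`, and `deliberated_before_execution` come from this page.

```typescript
type CaptureMode = "best_effort" | "strict"; // "best_effort" is an assumed label

interface LogEntry {
  kind: "fork" | "unrecorded_tool_call" | "deliberation_required";
  action: string;
  deliberated_before_execution: boolean;
}

type RunResult<T> = { status: "ran"; value: T } | { status: "refused" };

class CaptureGate {
  readonly log: LogEntry[] = [];
  private recorded = new Set<string>();
  private mode: CaptureMode;

  constructor(mode: CaptureMode) {
    this.mode = mode;
  }

  // Called from the planning step, before any tool runs.
  recordDeliberation(action: string): void {
    this.recorded.add(action);
    this.log.push({ kind: "fork", action, deliberated_before_execution: true });
  }

  // Runs the tool only if the rules of the current mode allow it.
  run<T>(action: string, tool: () => T): RunResult<T> {
    if (this.recorded.delete(action)) {
      return { status: "ran", value: tool() };
    }
    if (this.mode === "strict") {
      // Strict: refuse, and record the refusal as a blocked fork.
      this.log.push({ kind: "deliberation_required", action, deliberated_before_execution: false });
      return { status: "refused" };
    }
    // Best-effort: run anyway, but make the gap visible in the log.
    this.log.push({ kind: "unrecorded_tool_call", action, deliberated_before_execution: false });
    return { status: "ran", value: tool() };
  }
}
```

Note that the gate, not the model, sets `deliberated_before_execution`, which is the property that makes the ordering claim checkable.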

Model Context Protocol (MCP) is different: MCP is a tool protocol, not an agent loop, so there is no native planning step inside the protocol to instrument. In practice, teams run MCP through an orchestration layer — Cursor, Windsurf, and other IDE-style agent hosts that pick which MCP server to call before each request. That pattern is what many enterprise teams are adopting now. Deliberate's adapter sits in your runtime at that layer and logs the alternatives it considered before execution. You wrap once; this is not post-hoc inference from traces alone.

Confidence scores are stored as reported for triage, not as calibrated probabilities.

While approval is pending: the agent loop is paused and run state is serialised at the gate — not branching ahead in the background. Execution stays blocked until a human approves or rejects; only then does tool execution resume or the run halt. That is a product choice we are validating with design partners (some teams may prefer explicit rollback instead).
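One way to picture the pause-and-serialise behaviour is the sketch below, which freezes run state at the gate and only hands it back on approval. `pauseAtGate`, `resolveApproval`, and the `PendingApproval` shape are hypothetical names for illustration, and JSON is assumed as the serialisation format; the SDK's actual mechanism may differ.

```typescript
interface PendingApproval {
  decision_id: string;
  assignee: string;
  reason: string;
  state: string; // run state serialised at the gate (JSON assumed here)
}

type Verdict = "approved" | "rejected";

type Resolution<S> = { resume: true; state: S } | { resume: false };

// Freeze the run at the gate: nothing executes while this record is pending.
function pauseAtGate(
  decisionId: string,
  assignee: string,
  reason: string,
  runState: object,
): PendingApproval {
  return {
    decision_id: decisionId,
    assignee,
    reason,
    state: JSON.stringify(runState),
  };
}

// Only a human verdict moves the run forward: approve restores state, reject halts.
function resolveApproval<S>(pending: PendingApproval, verdict: Verdict): Resolution<S> {
  if (verdict === "rejected") {
    return { resume: false };
  }
  return { resume: true, state: JSON.parse(pending.state) as S };
}

const pendingExample = pauseAtGate("dec_8842_f3", "@oncall", "prod-write-requires-approval", {
  step: 3,
  pendingAction: "execute_sql_update(prod.db)",
});
console.log(pendingExample.assignee, pendingExample.reason);
```

Serialising at the gate means an approval that arrives hours later resumes from exactly the state the reviewer saw, rather than from whatever a background branch has done since.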

Your agent runtime

LangGraph · OpenAI Agents · Cursor / Windsurf · MCP hosts

Deliberate SDK + proxy

Wrap the loop · record forks before execution

Your existing stack

Langfuse · Datadog · git — unchanged

The SDK is on npm today — npm install deliberate-sdk — with core JSONL records; OpenAI Agents and LangGraph adapters; an MCP / IDE-host adapter; a framework-agnostic proxy; policy gates with preset policy packs; approval workflows with automatic pending records; a replay CLI; and signed, tamper-evident auditor export packs. Design partners get hands-on pilot support. Read the integration docs.

On disk

One JSONL file per run

Every fork: what was chosen, what was rejected, and why — ready for replay and audit export.

run_8842.jsonl · line 3

3 of 3 forks

{
  "decision_id": "dec_8842_f3",
  "task": "unblock CI deploy on main",
  "chosen": {
    "action": "execute_sql_update(prod.db)"
  },
  "alternatives": [
    {
      "action": "verify_connection(staging.db)",
      "rejected_reason": "Staging schema mismatch assumed"
    }
  ],
  "confidence": {
    "kind": "self_report",
    "value": 0.41
  },
  "safety": {
    "policy_violations": [
      "prod-write-requires-approval"
    ]
  }
}

+ reasoning, safety, human_approval, outcome, commit …
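For a sense of how a one-line-per-fork file lends itself to replay, here is a minimal sketch that walks JSONL text and prints each decision. It assumes each record sits on a single line and uses only fields visible in the sample above; `replay` and `ReplayFork` are hypothetical names, not the SDK's replay CLI.

```typescript
// Subset of the record shown on this page; other fields are ignored here.
interface ReplayFork {
  decision_id: string;
  chosen: { action: string };
  alternatives: { action: string; rejected_reason: string }[];
  safety?: { policy_violations: string[] };
}

// Turn JSONL text (one fork per line) into a human-readable replay.
function replay(jsonl: string): string[] {
  const out: string[] = [];
  const lines = jsonl.split("\n").filter((l) => l.trim() !== "");
  lines.forEach((line, i) => {
    const fork = JSON.parse(line) as ReplayFork;
    out.push(`fork ${i + 1} of ${lines.length}: chose ${fork.chosen.action}`);
    for (const alt of fork.alternatives) {
      out.push(`  rejected ${alt.action}: ${alt.rejected_reason}`);
    }
    for (const policy of fork.safety?.policy_violations ?? []) {
      out.push(`  policy triggered: ${policy}`);
    }
  });
  return out;
}

const sampleLine = JSON.stringify({
  decision_id: "dec_8842_f3",
  chosen: { action: "execute_sql_update(prod.db)" },
  alternatives: [
    { action: "verify_connection(staging.db)", rejected_reason: "Staging schema mismatch assumed" },
  ],
  safety: { policy_violations: ["prod-write-requires-approval"] },
});
for (const row of replay(sampleLine + "\n")) {
  console.log(row);
}
```

Because each line is an independent JSON document, a reviewer can grep, diff, or stream the file without loading the whole run.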
