Product design agents need receipts, not theater.
A confession in two commits
There is a commit where the word “receipts” enters the product: agent action receipts, a 140-line change to App.tsx (commit 3be0924, 2026-05-12). There is another commit two weeks later where we deleted the fake ones we had been rendering alongside the real ones (commit 05741d3, 2026-05-26).
Most writing about agent transparency skips the second kind of commit. I think the second kind is where the actual lesson lives, so this post covers both.
What a receipt is for
The premise landed in a single day. The compact run spine (commit beae75c, 2026-05-12, a 300-line change) gave every run a single narrative: prompt, plan, tool calls, files changed, artifacts, result. Receipts are what each step in that narrative renders as.
A command receipt shows the command, working directory, duration, exit code, and files touched, with stdout and stderr behind a details expander. A design receipt shows the artifact title, source run, review state, comments, whether it is marked as context for future runs, and related files or FigJam source.
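As a rough illustration, the two receipt kinds described above could be modeled as a discriminated union. This is a minimal sketch, not the types in App.tsx: every field name here is an assumption chosen to mirror the prose, and only the three review state labels come from the product itself.

```typescript
// Hypothetical shapes: field names are illustrative, not taken from the codebase.
type ReviewState = "Unreviewed" | "Looks good" | "Needs work";

interface CommandReceipt {
  kind: "command";
  command: string;
  cwd: string;            // working directory
  durationMs: number;
  exitCode: number;
  filesTouched: string[];
  stdout: string;         // shown only behind the details expander
  stderr: string;         // shown only behind the details expander
}

interface DesignReceipt {
  kind: "design";
  artifactTitle: string;
  sourceRunId: string;
  reviewState: ReviewState;
  comments: string[];
  usedAsContext: boolean;    // marked as context for future runs
  relatedSources: string[];  // related files or FigJam source
}

type Receipt = CommandReceipt | DesignReceipt;

// One scannable line per receipt; the details stay on the object.
function summarize(r: Receipt): string {
  switch (r.kind) {
    case "command":
      return `${r.command} (exit ${r.exitCode}, ${r.durationMs} ms, ${r.filesTouched.length} files)`;
    case "design":
      return `${r.artifactTitle} [${r.reviewState}]`;
  }
}
```

Every field in the summary line is a claim someone can check against the run, which is the property the next paragraph depends on.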
The point is falsifiability. A paragraph saying “the agent updated the component” can be vibes. An exit code, a file list, and a duration are checkable claims. Raw logs are also checkable, but they bury the claim; a designer hunting through a transcript for the command that mattered is doing the interface’s job for it. Receipts compress agent behavior into objects a product team can scan, dispute, and reopen.
Theater wears the same outfit
Here is the uncomfortable part. Because receipts are compact and structured, a fabricated receipt is indistinguishable from a real one at a glance. Fake logs at least look fake. Fake receipts look like evidence.
We learned this from our own screens. Weeks after real runs worked, demo-era mock workbench actions were still rendering in the spine: 97 lines of convincing theater, deleted in one pass on the hardening day (commit 05741d3, 2026-05-26). The next day went deeper. The design-system preview had been showing fabricated data, including invented token specs like “H1 / 32 / 700” and button variant lists that no agent had ever extracted. All of it was removed in a 114-line change (commit 7232cf3, 2026-05-27).
Nobody set out to deceive anyone. Mock data is how UIs get built, and it simply never got evicted. But intent does not matter to the person trusting the screen. A receipt UI that ever renders fiction has forfeited the only thing it was for.
Real or absent, enforced by CI
The deletion commit did not just delete. It created scripts/assert-no-mock-ui.mjs, a CI gate that enumerates forbidden strings per file and fails the build if any of them reappears. The banned list includes placeholder copy like “No tool calls yet.” and fake identifiers like “codex-gpt-5-5”, matched as exact strings. The fabricated token specs joined the list when they were purged.
The operating rule the gate encodes: receipts must be real or absent. An empty state says nothing rather than something plausible. This turned “no fake UI” from a value statement into a lint rule, which is the only form of value statement that survives a deadline. The gate has grown across at least seven later commits, because the pressure to fill empty space with plausible filler never goes away. It just keeps getting caught in CI now.
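The core of such a gate fits in a few lines. The sketch below is not the contents of scripts/assert-no-mock-ui.mjs. It is a hypothetical reconstruction of the idea, written as a pure function over in-memory file contents so it can be tested without touching disk. The names `findViolations`, `BannedMap`, and `Violation` are made up for this example.

```typescript
// Hypothetical sketch of a per-file forbidden-string check.
// A real gate would read the files from disk and exit non-zero on any violation.
type BannedMap = Record<string, string[]>; // file path -> exact strings that must not appear

interface Violation {
  file: string;
  banned: string;
  line: number; // 1-based line number of the match
}

function findViolations(
  sources: Record<string, string>, // file path -> file contents
  banned: BannedMap
): Violation[] {
  const violations: Violation[] = [];
  for (const file of Object.keys(banned)) {
    const text = sources[file];
    if (text === undefined) continue; // file no longer exists, nothing to check
    const lines = text.split("\n");
    for (const s of banned[file]) {
      lines.forEach((content, i) => {
        // Exact substring match, no regex: the banned list is literal copy.
        if (content.indexOf(s) !== -1) {
          violations.push({ file, banned: s, line: i + 1 });
        }
      });
    }
  }
  return violations;
}
```

A CI wrapper would print each violation and fail the build when the returned array is non-empty. Matching literal strings rather than patterns keeps the gate easy to extend: purging a piece of mock copy and banning it are the same edit.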
Degrade honestly
The same principle has a softer edge: prefer artifact-backed evidence, and degrade honestly when the artifact is unavailable. The critique starter was made screenshot-backed on the hardening day (commit 3a0cd1d, 2026-05-26), so a critique opens from what is actually rendered rather than from the model’s imagination. A follow-up (commit 4581fe5) kept it runnable without screenshot capture, and when capture is unavailable it says so instead of pretending.
That is the full pattern in miniature. Best case: evidence attached. Degraded case: clearly labeled as degraded. Never: fabricated evidence to keep the UI looking busy.
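One way to hold that line in code is to give the UI only the states it is allowed to show. The sketch below is an assumption about how this could look, not the critique starter's actual implementation. `Evidence` and `critiqueHeader` are hypothetical names.

```typescript
// Hypothetical model: evidence is attached, explicitly degraded, or absent (null).
// There is deliberately no variant for placeholder or sample evidence.
type Evidence =
  | { status: "attached"; screenshotPath: string }
  | { status: "degraded"; reason: string };

function critiqueHeader(evidence: Evidence | null): string | null {
  if (evidence === null) {
    return null; // absent: render nothing rather than something plausible
  }
  if (evidence.status === "attached") {
    return `Critique based on screenshot: ${evidence.screenshotPath}`;
  }
  // Degraded: still runnable, but the label says what is missing.
  return `Critique without screenshot (${evidence.reason})`;
}
```

The type does not make fabrication impossible, but it removes the convenient place to put it: a component that wants to show something has to claim one of two honest states.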
Transparency reshaped, not maximized
The counterintuitive move came the same week. While making receipts stricter, we made verification quieter: verification run chrome was compacted and history counts were quieted (commits 434805e and e58f1f6, 2026-05-26).
If transparency meant volume, this would be backwards. But a sidebar full of check counts is theater of a politer kind: it performs diligence without communicating anything decision-relevant. We traded meters for receipts. Less surface area showing verification, more guarantee that what is shown actually happened. Trust tracked the guarantee, not the surface area.
Boundaries and review close the loop
Two product consequences follow from treating receipts as evidence.
First, external writes become boundaries. Studio generates Mermaid and FigJam-ready source locally, shows the board state, and keeps direct sync behind an explicit action with its sync state tracked, because design files are shared spaces and a surprise write into a team’s FigJam is the opposite of a receipt. You inspect the source, then you choose to send it.
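A minimal sketch of that boundary, under stated assumptions: the state names and the `requestSync` function below are hypothetical, and the real sync flow is certainly more involved. The point is only that generation never moves a board out of its local state, and nothing but an explicit user action can.

```typescript
// Hypothetical sync states; the product's actual state names may differ.
type SyncState = "local-only" | "sync-requested" | "synced" | "sync-failed";

interface Board {
  source: string; // Mermaid or FigJam-ready source, generated locally
  sync: SyncState;
}

// Generation is purely local: a new board never starts out synced.
function generateBoard(source: string): Board {
  return { source, sync: "local-only" };
}

// The only path out of "local-only" requires an explicit user confirmation.
function requestSync(board: Board, userConfirmed: boolean): Board {
  if (!userConfirmed) {
    return board; // no confirmation, no external write
  }
  return { ...board, sync: "sync-requested" };
}
```

Because the agent's code path only ever calls something like `generateBoard`, a surprise write would require someone to route around the boundary on purpose.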
Second, review is part of the product, not an afterthought. A design agent’s output is a product decision: this spacing rule, this component state, this critique. The artifact review surface gives each section OK and Fix actions, a review state (Unreviewed, Looks good, Needs work), and editable summaries, and reviewed artifacts can feed the next run as context. The receipt is not just a record of what the agent did. It is the unit the team’s judgment gets applied to, and what survives review becomes the memory the next run starts from.
Theater would have been easier to ship. It demos identically and costs nothing to render. The entire difference is what happens the first time a designer checks, and receipts exist for exactly that moment.