---
title: "A working agent app needs OS proof."
description: "The gates behind Studio's definition of working: a 236-line live e2e that boots the packaged runtime binary, a real Codex and Claude Code smoke test with a 240s bound, and a no-mock-ui rule enforced in CI."
publishDate: 2026-06-07T10:00:00.000Z
author: Sarvesh Chidambaram
tags: ["macos", "studio", "runtime", "release", "agents"]
canonical: https://memoire.cv/blog/a-working-agent-app-needs-os-proof
---
When we say Studio works, we mean one specific sentence: the macOS app launches, materializes its runtime on local disk, attaches to or restarts that runtime cleanly, runs Codex and Claude Code for real, streams receipts, quits without orphaning the sidecar, and installs from a signed public artifact.

None of that is visible in a clean React build. We learned this by burning through two weaker definitions of "working" first.

## Definition one: the UI renders

The earliest workbench had mock action cards. They made screenshots look alive and proved nothing. On 2026-05-26 we deleted the mock workbench actions and wrote a gate so they could never return: `assert-no-mock-ui.mjs`, which enumerates forbidden placeholder strings per file (commit 05741d3). It started as a cleanup and became policy. The gate has grown in at least seven later commits, because fake UI does not arrive as a decision; it arrives as a convenient stub during some unrelated fix.
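To make "enumerates forbidden placeholder strings per file" concrete, here is a minimal sketch of that kind of check. It is not the contents of `assert-no-mock-ui.mjs`: the file paths, the placeholder strings, and the `findMockUi` name are all made up for illustration.

```javascript
// Hypothetical rule table: file path -> placeholder strings that must not appear.
// The real gate's rules are not reproduced in this post.
const FORBIDDEN_BY_FILE = {
  "src/Workbench.tsx": ["Mock action", "Lorem ipsum"],
  "src/Inspector.tsx": ["TODO: fake data"],
};

// Return one violation per forbidden string found in a file's contents.
// `sources` maps file path -> file text; a missing file is treated as empty.
function findMockUi(sources, rules = FORBIDDEN_BY_FILE) {
  const violations = [];
  for (const [file, needles] of Object.entries(rules)) {
    const text = sources[file] ?? "";
    for (const needle of needles) {
      if (text.includes(needle)) violations.push({ file, needle });
    }
  }
  return violations;
}
```

A gate built this way would read the listed files from disk, call the check, and exit non-zero if the returned list is not empty. Keeping the rules as data is what lets the list grow one commit at a time.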

An agent app with mock cards is worse than one with empty states. The mock teaches you to trust a surface that has never touched a process.

## Definition two: the tests pass

The second trap is subtler: a green suite that exercises the wrong artifact. A gate that runs against your dev build proves only your dev build. So CI now fetches the runtime before the hygiene checks run (commit 69a8e38), and the product hygiene gates are enforced in CI rather than suggested locally (commit ce65b82, 2026-06-06). The thing being tested is the thing being shipped.

The release weekend made the cost of definition two concrete. CI was green when the v0.18.0 DMGs went out signed but unnotarized, and spctl rejected them on real Macs (commit 9dcf306, 2026-05-10). CI was green again when v1.0.0 failed at "missing updater signature(s)" because the update tarballs for both architectures shared one filename and the second overwrote the first (commit ccdded8, 2026-05-10). Both times, "working" had quietly excluded "installs and updates on a stranger's machine". Both fixes went into the release workflow, and the definition of working absorbed the install path permanently.
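The filename collision is the kind of failure a few lines of pre-upload checking can catch. The sketch below is an illustration, not the fix from commit ccdded8: the `{ arch, file }` shape, the `findFilenameCollisions` name, and the tarball names in the usage note are assumptions.

```javascript
// Group release artifacts by filename and report any name claimed by
// more than one architecture. Input shape { arch, file } is hypothetical.
function findFilenameCollisions(artifacts) {
  const archesByFile = new Map();
  for (const { arch, file } of artifacts) {
    const arches = archesByFile.get(file) ?? [];
    arches.push(arch);
    archesByFile.set(file, arches);
  }
  return [...archesByFile]
    .filter(([, arches]) => arches.length > 1)
    .map(([file, arches]) => ({ file, arches }));
}
```

With two entries that both claim an illustrative `Studio.app.tar.gz`, the function returns one collision naming both architectures. Once each filename carries its architecture, it returns an empty list.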

## Definition three: the OS path holds

The current bar is encoded in `scripts/studio-live-e2e.mjs`, 236 lines, landed on 2026-06-06 (commit 3d1d2fd). It does not mount a component tree. It boots the real packaged runtime binary, `memi-studio-runtime-{arch}-apple-darwin`, then drives the product surfaces against it. Surface checks get a default 45-second timeout. Live agent runs get 240 seconds. There are explicit flags for skip-surfaces and skip-live-agents, so when the gate is narrowed it is narrowed deliberately, on the command line, not silently in a config nobody reads.
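As a rough sketch of how those numbers and flags could fit together (not the actual script): the `--skip-surfaces` and `--skip-live-agents` spellings, the mapping from Node's `process.arch` to the `{arch}` slot, and the function names below are assumptions. Only the binary name pattern and the 45 and 240 second values come from the gate itself.

```javascript
const SURFACE_TIMEOUT_MS = 45_000;     // default bound for surface checks
const LIVE_AGENT_TIMEOUT_MS = 240_000; // bound for live agent runs

// Assumed mapping from Node's process.arch to the {arch} slot in the binary name.
const ARCH_BY_NODE_ARCH = { arm64: "aarch64", x64: "x86_64" };

function runtimeBinaryName(nodeArch) {
  const arch = ARCH_BY_NODE_ARCH[nodeArch];
  if (!arch) throw new Error(`unsupported architecture: ${nodeArch}`);
  return `memi-studio-runtime-${arch}-apple-darwin`;
}

// Turn command-line arguments into the list of steps to run.
// Flag spellings are hypothetical; `narrowed` records that something was skipped
// so the gate can say so in its output.
function planGate(argv) {
  const skipSurfaces = argv.includes("--skip-surfaces");
  const skipLiveAgents = argv.includes("--skip-live-agents");
  const steps = [];
  if (!skipSurfaces) steps.push({ name: "surfaces", timeoutMs: SURFACE_TIMEOUT_MS });
  if (!skipLiveAgents) steps.push({ name: "live-agents", timeoutMs: LIVE_AGENT_TIMEOUT_MS });
  return { steps, narrowed: skipSurfaces || skipLiveAgents };
}
```

Called with `process.argv.slice(2)`, the plan defaults to running everything. Skipping a half takes an explicit flag that shows up in shell history and CI logs.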

The live agent half started earlier: `scripts/studio-live-agent-smoke.mjs`, 210 lines (commit c9e7776, 2026-05-26), exists to prove that real Codex and Claude Code sessions complete from the app. Not that a transcript component renders. A session starts, markers are watched for, and the gate fails when the runtime is unreachable.

That gate originally had a flaw worth admitting: it could hang. A hanging smoke test is almost as bad as no smoke test, because it leaves the release state ambiguous, and ambiguity always gets resolved in favor of shipping. It was bounded on 2026-06-06 (commit 7cc3ba2). The pattern we settled on for live proof: start from a known runtime URL, cap request time, capture the session id, assert the expected marker, and fail loudly past the bound.
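That pattern can be sketched in a few lines. This is an illustration under assumptions, not the smoke script: `startSession` and `readOutput` are hypothetical callbacks standing in for whatever requests the real gate makes against the runtime URL, and `withDeadline` and `proveLiveRun` are names invented here.

```javascript
// Race a promise against a timer so a stalled run rejects instead of hanging.
// `describe` is called lazily so the error can include state learned mid-run.
function withDeadline(promise, ms, describe) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${describe()} exceeded ${ms}ms`)), ms);
  });
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// Start a session, capture its id, read its output, and assert the marker,
// all under one overall bound.
async function proveLiveRun({ startSession, readOutput, marker, boundMs }) {
  let sessionId = "(not started)";
  const run = (async () => {
    sessionId = await startSession();
    const output = await readOutput(sessionId);
    if (!output.includes(marker)) {
      throw new Error(`session ${sessionId}: marker "${marker}" not found`);
    }
    return { sessionId };
  })();
  return withDeadline(run, boundMs, () => `live run, session ${sessionId}`);
}
```

The point of the single outer bound is that every outcome is either a resolved result or a thrown error with a session id in it. There is no third state where the job just sits there.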

## The suite, named

The whole proof surface is wired as npm scripts in the Studio package.json: `test:no-mock-ui`, `test:live-agents`, `test:live-e2e`, `test:e2e-surfaces`, `test:runtime-fetch`, `test:workbench-context`, `check:rust`. Seven gates, and each one maps to a failure class we actually hit: fabricated UI, agents that render but do not run, a runtime that does not arrive, surfaces that drift, context that leaks across sessions, and the Tauri shell itself.

I like that the list is boring to read. Every entry used to be an incident.

## Stability before design

There was a tempting version of this spring where we polished design memory, the FigJam source flow, and contextual inspectors first, because that work is more fun than process supervision. The actual sequence went the other way, on purpose: stabilize the runtime and OS behavior, prove live Codex and Claude Code runs, then clean the workbench, then make design context editable, then publish the release chain.

Feature work got faster after the OS layer stopped surprising us, for a mundane reason: every feature sits on launch, spawn, attach, and quit. When those four verbs are proven by gates, a feature bug is a feature bug. When they are not, every bug report starts as a lifecycle investigation.

That is what OS proof buys. Not confidence as a feeling, but a short list of named commands that pass against the packaged artifact before anyone says the word "working" out loud.