Observability for AI-generated apps: what to trace when an agent writes the code
AI agents ship more code than they remember. Here's a practical observability checklist for full-stack TypeScript apps where humans and agents share the keyboard.
The traditional pitch for observability tools - Datadog, Honeycomb, New Relic, OpenTelemetry - assumes a stable team of engineers who built the system and remember why each service exists. You instrument once, you ship dashboards, you page on what matters.
That assumption breaks the moment an AI agent starts writing the code. The author of any given function might be Claude, Cursor, Codex, or you at 2am two months ago. Nobody remembers what a service does, because nobody had to write it.
AI-generated apps don't need more observability. They need observability where the runtime is the source of truth - not the code, not the docs, not the commit message.
The new failure modes
When agents are writing meaningful chunks of your codebase, you start seeing:
- Phantom dependencies. A function calls a service or endpoint the agent invented. The call is loosely typed enough to compile, then fails at request time.
- Schema drift. The agent writes queries against the table it assumed exists; the real table has different columns.
- Silent retry storms. An agent wraps a flaky call in a generous retry loop, doesn't notice the underlying error, and burns through your rate limit at 3am.
- Confidently wrong fixes. The agent reads a stack trace, picks a plausible-looking cause, and ships a fix that suppresses the symptom.
None of these are exotic. They're the same bugs humans ship - just at higher volume and with less context attached to each one.
What to actually trace
For a full-stack TypeScript app where agents and humans share the keyboard, the minimum useful trace covers:
- The HTTP request - method, path, route handler, status, duration, user agent, region.
- Every cross-service call - RPC name, arguments (or a hash of them), response, error if any. This is the one humans skip and regret.
- Every database call - query, parameters, row count, duration. Slow queries get a flag.
- Every external API call - URL, status, duration. Retries linked to the original attempt.
- Every queue / workflow step - input, output, attempt count, duration.
- Errors with the actual stack and the actual inputs. Not "an error occurred."
That's a normal observability checklist. The thing that changes for AI-generated apps is who reads the traces.
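As a rough sketch of the "retries linked to the original attempt" item, here is one way a call wrapper could record a span per attempt. Everything in it is an assumption made for illustration: the `Span` shape, the `traced` helper, and the in-memory `spans` array are hypothetical, not the API of any real tracing SDK.

```typescript
// Hypothetical span shape; field names are illustrative only.
type SpanKind = "http" | "rpc" | "db" | "external" | "queue";

interface Span {
  id: number;
  kind: SpanKind;
  name: string;
  attempt: number;          // 1 for the original call
  retryOf: number | null;   // id of the first attempt's span; null on the first attempt
  durationMs: number;
  status: "ok" | "error";
  error?: string;
}

const spans: Span[] = [];   // in-memory sink; a real app would export these somewhere
let nextId = 1;

// Run `fn` up to `maxAttempts` times and record one span per attempt.
// No backoff here, to keep the sketch short; a real retry loop would add one.
async function traced<T>(
  kind: SpanKind,
  name: string,
  fn: () => Promise<T>,
  maxAttempts = 3,
): Promise<T> {
  let firstId: number | null = null;
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const id = nextId++;
    const started = Date.now();
    const base = { id, kind, name, attempt, retryOf: firstId };
    try {
      const result = await fn();
      spans.push({ ...base, durationMs: Date.now() - started, status: "ok" });
      return result;
    } catch (err) {
      lastError = err;
      spans.push({
        ...base,
        durationMs: Date.now() - started,
        status: "error",
        // Keep the actual error, not "an error occurred".
        error: err instanceof Error ? `${err.name}: ${err.message}` : String(err),
      });
    }
    if (firstId === null) firstId = id;
  }
  throw lastError;
}
```

Because every failed attempt leaves its own span, a retry storm shows up as a run of error spans sharing one `retryOf` id instead of disappearing inside the loop.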
Traces as the agent's memory
When you ask an agent to fix a bug, the highest-leverage context isn't the code - the agent can read the code in seconds. It's a record of what actually happened the last time this ran:
POST /api/checkout → 500 (1.4s)
├─ services.cart.getCart 15ms ok
├─ services.pricing.compute 22ms ok
├─ db.orders.insert 8ms ok
├─ services.payments.charge 1.2s ERROR StripeCardError: card_declined
└─ services.email.sendReceipt skipped
That trace tells the agent four things the code can't:
- The bug is downstream of payments.charge, not in the route.
- The order was already inserted - there's a cleanup question now.
- The email step never ran, so users won't get a confusing receipt for a payment that failed.
- The error is a card decline, not an integration bug. Fix is a UX message, not a retry.
An agent given this trace plus the code is far more likely to write a correct fix on the first try. An agent given only the code writes a plausible-looking guess.
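To make that concrete, here is a minimal sketch of turning a trace like the checkout one into the handful of facts an agent needs before touching the code. The `Step` shape, the `writes` flag, and `summarizeFailure` are hypothetical names invented for this example; a real tracing backend would supply its own span format.

```typescript
// Hypothetical, simplified view of one step in a request trace.
interface Step {
  name: string;
  status: "ok" | "error" | "skipped";
  durationMs?: number;
  error?: string;
  writes?: boolean; // assumed flag: true if the step has a side effect
}

interface FailureSummary {
  failedStep: string | null;
  error: string | null;
  committedWrites: string[]; // side effects that completed before the failure
  skipped: string[];         // steps that never ran
}

function summarizeFailure(steps: Step[]): FailureSummary {
  const failedIndex = steps.findIndex((s) => s.status === "error");
  const failed = failedIndex >= 0 ? steps[failedIndex] : undefined;
  const before = failedIndex >= 0 ? steps.slice(0, failedIndex) : steps;
  return {
    failedStep: failed?.name ?? null,
    error: failed?.error ?? null,
    committedWrites: before
      .filter((s) => s.status === "ok" && s.writes === true)
      .map((s) => s.name),
    skipped: steps.filter((s) => s.status === "skipped").map((s) => s.name),
  };
}

// The checkout trace from above, expressed as data.
const checkout: Step[] = [
  { name: "services.cart.getCart", status: "ok", durationMs: 15 },
  { name: "services.pricing.compute", status: "ok", durationMs: 22 },
  { name: "db.orders.insert", status: "ok", durationMs: 8, writes: true },
  { name: "services.payments.charge", status: "error", durationMs: 1200, error: "card_declined" },
  { name: "services.email.sendReceipt", status: "skipped" },
];

const summary = summarizeFailure(checkout);
console.log(summary);
```

The summary names the failing step, the order insert that already committed, and the receipt email that never ran, which is the same reading a human would take from the tree.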
Generation made code cheap to produce. Runtime understanding is what's now scarce.
What this looks like in practice
The observability stack for an AI-generated TypeScript app needs three properties:
- Tracing on by default, no SDK gymnastics. If a service has to be instrumented manually, the agent will forget. New services start traced.
- Typed spans, not free-text. The trace knows this span is a database call, this one is a queue producer, this one is an outbound HTTP request. Tools (and agents) can reason about it.
- Live, queryable from inside the workflow. An agent should be able to ask "show me the last 10 failed requests for this route" the same way a human asks a dashboard.
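The second and third properties can be sketched together: a discriminated union gives each span a machine-readable kind, and a small query answers "show me the last 10 failed requests for this route". The span kinds, field names, and the `lastFailedRequests` helper are assumptions made for this sketch, and "failed" is taken to mean a 5xx status; none of it is Vy's or OpenTelemetry's actual schema.

```typescript
// Hypothetical typed spans: `kind` is the discriminant tools and agents can switch on.
interface BaseSpan {
  traceId: string;
  startedAt: number; // epoch milliseconds
}
interface HttpServerSpan extends BaseSpan {
  kind: "http.server";
  method: string;
  route: string;
  status: number;
}
interface DbQuerySpan extends BaseSpan {
  kind: "db.query";
  table: string;
  rowCount: number;
}
interface QueuePublishSpan extends BaseSpan {
  kind: "queue.publish";
  topic: string;
}
interface HttpClientSpan extends BaseSpan {
  kind: "http.client";
  url: string;
  status: number;
}
type TypedSpan = HttpServerSpan | DbQuerySpan | QueuePublishSpan | HttpClientSpan;

// Most recent failed requests for one route, newest first.
// Assumption: "failed" means the handler returned a 5xx status.
function lastFailedRequests(
  spans: TypedSpan[],
  route: string,
  limit = 10,
): HttpServerSpan[] {
  return spans
    .filter(
      (s): s is HttpServerSpan =>
        s.kind === "http.server" && s.route === route && s.status >= 500,
    )
    .sort((a, b) => b.startedAt - a.startedAt)
    .slice(0, limit);
}

const recorded: TypedSpan[] = [
  { kind: "http.server", traceId: "t1", method: "POST", route: "/api/checkout", status: 500, startedAt: 100 },
  { kind: "db.query", traceId: "t1", table: "orders", rowCount: 1, startedAt: 110 },
  { kind: "http.server", traceId: "t2", method: "POST", route: "/api/checkout", status: 200, startedAt: 200 },
  { kind: "http.server", traceId: "t3", method: "POST", route: "/api/checkout", status: 502, startedAt: 300 },
  { kind: "http.server", traceId: "t4", method: "GET", route: "/api/cart", status: 500, startedAt: 400 },
];

const failures = lastFailedRequests(recorded, "/api/checkout");
console.log(failures.map((s) => s.traceId));
```

With free-text spans the same question turns into string matching on log lines; with a `kind` field the compiler narrows the type and the query stays a few lines long.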
Where Vy fits
Vy gives every service in your TypeScript app a typed runtime span without instrumentation. Vy for AI agents exposes those traces over MCP so the agent writing the code can read what its previous version actually did. The runtime is the source of truth - for the team, and for whatever's writing tomorrow's PR.
If you're already running OpenTelemetry, that data flows through too. The point isn't a new tracing protocol. The point is making the traces the artifact that everyone - humans, CI, agents - actually reads.
Build on a framework that's made for this.
Vy is the TypeScript framework for full-stack edge apps.