There's a version of this job I remember clearly. A bug gets reported. You open New Relic, start building queries, copy transaction IDs into a Slack message, open the codebase in a separate tab, grep around, form a hypothesis, test it locally, go back to New Relic to validate, write a fix, deploy, and cross your fingers. On a good day, that's two hours. On a complicated day, it's two days.
New Relic was already a superpower. Having real production telemetry — actual response times, error rates, slow transaction traces — meant you were operating on evidence instead of instinct. I've been a believer for years. But even with great observability data, there was always friction: the context switch between the dashboard and the codebase, between the telemetry and the database, between what the system was telling you and where in the code it was actually happening. You'd accumulate the picture piece by piece, holding it all in your head.
What's changed recently is that the picture assembles itself.
The Setup
I run engineering at CasaPerks, a residential engagement and rewards platform built on NestJS, MongoDB, and React. Our stack is instrumented end-to-end with New Relic. Last year I started using Claude Code — Anthropic's agentic CLI tool — as my primary development environment. And once New Relic released their MCP server in beta, something clicked into place that I'm still processing.
MCP (Model Context Protocol) is a standard that lets AI assistants connect to external data sources and tools. In practice, it means Claude Code can query your New Relic account directly from the terminal, with no dashboard switching and no manual NRQL construction. It can also connect to MongoDB's MCP server, which means it can inspect your actual schemas, run explain plans, and audit indexes. Combine that with full access to the codebase and you have an agent that can simultaneously see your production telemetry, your database, and your code.
That's not a modest upgrade. That's a different category of debugging.
The Incident
During internal testing, we caught an error on the CSV user import flow. The upload appeared to fail — the frontend showed an error after about 60 seconds. But something was off: when we checked the system, the users were there. The welcome emails had gone out. The operation had completed. The error was a lie.
In the old workflow, this kind of misleading symptom is expensive. The error message doesn't point anywhere useful. You have to start from scratch: reproduce it, find the relevant trace in New Relic, identify which part of the operation is slow, cross-reference with the code, trace the call chain from controller to service to external dependency.
Here's what actually happened instead.
I opened a Claude Code session with the New Relic MCP and MongoDB MCP both active, and the full codebase in context. I described the symptoms we'd seen. Claude queried New Relic production data directly (no dashboard, no copy-pasting) and surfaced the relevant transaction traces for the bulk import endpoint. The endpoint was completing successfully on the backend. Response times for larger imports were regularly exceeding 60 seconds, which was tripping the load balancer timeout. The frontend never received the response and interpreted silence as failure.
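Claude built and ran the queries itself, but for context, the manual version of that check would look something like the NRQL below. The transaction name filter is a placeholder, not our actual route.

```
// Placeholder endpoint name; substitute your real transaction name.
SELECT percentile(duration, 50, 95), max(duration), count(*)
FROM Transaction
WHERE name LIKE '%users/import%'
SINCE 1 week ago
```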
That's the what. Then we went to the why.
Claude read through the user service and traced the bulk import method. The problem was immediately visible: for every user in the CSV, the code was calling the welcome email function individually inside the loop. Each call was a separate HTTP request to the Mailgun API. For a 50-user import, that's 50 sequential round trips to an external service; at 200-500ms each, that's 10 to 25 seconds of email latency alone, and the math only gets worse as imports grow.
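Roughly, the shape of the problem looked like this. The names here (`ImportedUserRow`, `MailService`, `importUsers`) are illustrative stand-ins, not our actual code.

```typescript
// Illustrative sketch only: hypothetical names, not the real CasaPerks service.
interface ImportedUserRow {
  email: string;
}

interface MailService {
  // Each call is one HTTP round trip to the email provider.
  sendWelcomeEmail(email: string): Promise<void>;
}

async function importUsers(rows: ImportedUserRow[], mail: MailService): Promise<void> {
  for (const row of rows) {
    // ...create the user record...

    // The problem: a sequential external call inside the loop.
    // 50 rows x 200-500ms per call = 10-25 seconds of email latency alone.
    await mail.sendWelcomeEmail(row.email);
  }
}
```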
The fix was straightforward: collect all users flagged for welcome emails into an array, then make a single bulk email call after the loop completes. The bulk method already existed in the codebase — it was actively used by a resend-welcome-email endpoint and a cron job, with full test coverage. It was never wired up here.
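In the same hypothetical terms, the fixed shape looks like this, with `sendBulkWelcomeEmails` standing in for that pre-existing bulk method.

```typescript
// Illustrative sketch only: hypothetical names, not the real CasaPerks service.
interface ImportedUserRow {
  email: string;
}

interface MailService {
  sendWelcomeEmail(email: string): Promise<void>;
  // Stand-in for the bulk method that already existed elsewhere in the codebase.
  sendBulkWelcomeEmails(emails: string[]): Promise<void>;
}

async function importUsers(rows: ImportedUserRow[], mail: MailService): Promise<void> {
  const welcomeRecipients: string[] = [];

  for (const row of rows) {
    // ...create the user record...
    welcomeRecipients.push(row.email);
  }

  // One round trip to the email provider for the whole import, not one per user.
  if (welcomeRecipients.length > 0) {
    await mail.sendBulkWelcomeEmails(welcomeRecipients);
  }
}
```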
That fix was identified, implemented, and deployed in a single session.
The Bonus Find
Here's the part that still gets me.
While Claude was reading through the call chain, it noticed something else. The frontend CSV parser was already handling a column that let property managers control whether a welcome email should fire for each individual user. That data was being parsed, included in the HTTP request, and sent to the backend.
The backend was ignoring it entirely.
The field wasn't wired up on the receiving end, so the value was silently dropped. Every imported user got a welcome email regardless of what the CSV said. This wasn't in any alert. New Relic had no reason to flag it — from the system's perspective, emails were sending successfully. It was invisible unless you were reading the code in the context of the full request lifecycle.
We wired it up in the same session. Property managers now have the control they thought they already had.
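In code terms, the change is small. This sketch assumes a class-validator DTO and hypothetical field names, and it assumes we default to sending when the column is absent; the real names and defaults in our codebase may differ.

```typescript
import { IsBoolean, IsEmail, IsOptional } from 'class-validator';

// Hypothetical DTO for one CSV row. Declaring the flag is what lets it reach
// the service layer; with validation whitelisting on, undeclared fields are
// stripped, which is one way a value the frontend sends gets silently dropped.
export class ImportUserRowDto {
  @IsEmail()
  email!: string;

  @IsOptional()
  @IsBoolean()
  sendWelcomeEmail?: boolean; // the column the frontend was already parsing and sending
}

// In the import loop: only queue users whose row asked for an email.
// Assumption: rows without the column keep the old behavior and get one.
function shouldSendWelcome(row: ImportUserRowDto): boolean {
  return row.sendWelcomeEmail !== false;
}
```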
No monitoring tool would have caught that. It wasn't an error; it was a silent behavior gap, the kind that lives in a codebase for years and only surfaces when someone eventually asks why the column doesn't seem to do anything. The only reason it came up is that Claude was tracing the entire path, not just the part that was broken.
What's Actually Different
I want to be precise about what changed here, because "AI writes code faster" undersells it and also misses the point.
The speed gain isn't primarily about code generation. It's about context elimination.
Traditional debugging involves constant context switching: telemetry tool to codebase, codebase to database client, database client back to the codebase. Every switch costs you — you lose the thread, you have to re-establish what you knew, you hold an increasingly fragile mental model of what's happening where. The bigger the incident, the more expensive this becomes.
When New Relic MCP and Claude Code are running in the same session alongside the full codebase, there's no switching. The agent queries production data, reads the relevant service files, and traces the call chain from the NestJS controller through to the external API call — in one continuous thread of investigation. The investigation and the fix happen in the same context, with the same agent that understands both the telemetry and the code.
New Relic was already one of the best tools in my stack for understanding what my system was doing. This integration makes it part of a feedback loop that closes in minutes instead of hours.
From "user reported an error" to "fix deployed" was a single session. The bonus bug fix came free.
The Pattern
This wasn't a one-time thing. It's become a repeatable workflow.
Report comes in. Claude Code session with New Relic MCP + MongoDB MCP + full codebase. Query the telemetry, read the code, trace the path, implement the fix. And consistently, when you're reading the full path in context, you find things that weren't in the original report.
If you're already running New Relic, the MCP integration is worth trying the moment it's available to you. What you'll find is that the observability data you've been collecting becomes dramatically more useful when an agent can act on it in the same breath it reads it.