How to Build a System Graph From the Tracing and Catalogs You Already Have
A platform engineer's guide to bootstrapping a live system graph from service catalogs, traces, CI/CD config, and ownership data, then curating typed edges.
Start from the catalog: nodes before edges
Your service catalog (Backstage, an internal registry, or even a maintained spreadsheet) is the cheapest seed for the node set. It already enumerates services, their repos, and usually a nominal owner. Import it first and resist the urge to enrich. You want a complete, boring inventory of nodes before you reason about anything connecting them.
Two things go wrong here, and both are worth naming up front:
- The catalog lies by omission. It lists only the services someone remembered to register. Shadow services, such as a Lambda someone stood up last quarter or the legacy monolith nobody wants to claim, are missing, and they tend to be the riskiest ones. Plan to discover them from traces, not from the catalog.
- Granularity is inconsistent. One team registers a single "platform" service; another registers nine microservices. Pick a node grain (the deployable unit is usually the right one) and normalize toward it. A graph whose nodes mean different things produces blast-radius estimates you can't trust.
Treat the catalog as a claim, not ground truth. It tells you what should exist. The next two sources tell you what actually does.
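The import-and-normalize step above can be sketched in a few lines. This is a minimal illustration, not a Backstage integration: the entry keys (`name`, `repo`, `owner`, `deployables`) and the `Node` shape are assumptions invented for the example, so map them onto whatever your catalog actually exports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: str         # one deployable unit (the chosen node grain)
    repo: str
    declared_owner: str  # a claim from the catalog, not yet verified
    source: str = "catalog"


def nodes_from_catalog(entries):
    """Expand each catalog entry into one node per deployable unit.

    `entries` is a list of dicts with hypothetical keys: name, repo, owner,
    and an optional `deployables` list for coarse entries such as "platform".
    """
    nodes = {}
    for entry in entries:
        # A coarse entry is split into its deployables; a fine one is kept as-is.
        units = entry.get("deployables") or [entry["name"]]
        for unit in units:
            node_id = unit.strip().lower()
            # First registration wins; later duplicates are ignored.
            nodes.setdefault(
                node_id,
                Node(node_id, entry["repo"], entry.get("owner", "unknown")),
            )
    return nodes


catalog = [
    {"name": "platform", "repo": "org/platform", "owner": "team-core",
     "deployables": ["auth-api", "billing-api"]},
    {"name": "Checkout", "repo": "org/checkout", "owner": "team-pay"},
]
nodes = nodes_from_catalog(catalog)
print(sorted(nodes))
```

Keeping `source="catalog"` on every node is deliberate: it records that the node is a claim, so nodes discovered later from traces can be told apart.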
Derive runtime edges from traces
Catalogs describe intent. Traces describe behavior. This is where the graph gets real, because a distributed trace is a literal record of which service called which, in production, under real load. If you run OpenTelemetry, Jaeger, Zipkin, or a commercial APM, you are already sitting on the highest-fidelity edge source you will ever get.
The mechanism is straightforward: a parent span and a child span on opposite sides of a service boundary form a directed call edge. Aggregate spans over a representative window (a week that includes a weekday peak and a weekend, not a single quiet Tuesday) and you get a weighted call graph: who calls whom, how often, and on which paths. That weight matters later, because it separates a revenue-critical request path from a once-a-day batch job.
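As a rough sketch of that aggregation, assuming spans have already been exported as plain dicts: the keys `span_id`, `parent_id`, and `service` are placeholders for this example, not the field names of any particular tracing backend.

```python
from collections import Counter


def call_edges(spans):
    """Count cross-service parent->child span pairs as directed call edges.

    Each span is a dict with hypothetical keys: span_id, parent_id, service.
    Returns a Counter mapping (caller, callee) to the number of observed calls.
    """
    service_of = {s["span_id"]: s["service"] for s in spans}
    edges = Counter()
    for span in spans:
        parent_service = service_of.get(span.get("parent_id"))
        # Skip root spans, spans whose parent was not exported,
        # and child spans that stay inside the same service.
        if parent_service is None or parent_service == span["service"]:
            continue
        edges[(parent_service, span["service"])] += 1
    return edges


spans = [
    {"span_id": "a", "parent_id": None, "service": "checkout"},
    {"span_id": "b", "parent_id": "a", "service": "checkout"},
    {"span_id": "c", "parent_id": "b", "service": "payments"},
    {"span_id": "d", "parent_id": "a", "service": "inventory"},
    {"span_id": "e", "parent_id": "c", "service": "ledger"},
]
edges = call_edges(spans)
for (caller, callee), weight in sorted(edges.items()):
    print(caller, "->", callee, weight)
```

Run over a full window, the counts become the edge weights. With sampled traces they are relative frequencies at best, not true call volumes.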
A few engineering realities to design around:
- Sampling distorts the tail. Head-based sampling drops rare paths, which are often the dangerous ones. If your sampling is aggressive, low-frequency edges will be missing or under-weighted. Flag this on the graph, and admit those edges at lower confidence rather than pretending they don't exist.
- Async breaks the trace. A queue, an event bus, or a cron-triggered job severs span lineage. You will see the producer and the consumer as disconnected nodes when they are tightly coupled. These gaps are the single biggest source of wrong blast-radius math, and they are exactly the edges you will curate by hand later.
- Traces miss the build-time graph entirely. A shared library forty services compile against generates zero spans until something calls it. Runtime data alone will undercount that dependency badly.
That last point is why traces are necessary but not sufficient.
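The async gap in particular can often be narrowed from configuration you already maintain. The sketch below pairs producers with consumers per topic to propose candidate edges for a human to confirm. The `topics` shape is an assumption made up for this example; real broker or event-bus config will need its own adapter.

```python
def async_edges(topics):
    """Propose producer->consumer edges from topic membership.

    `topics` has a hypothetical shape:
        {topic_name: {"producers": [...], "consumers": [...]}}
    Returns sorted (producer, consumer, topic) tuples for human review.
    """
    edges = set()
    for topic, roles in topics.items():
        for producer in roles.get("producers", []):
            for consumer in roles.get("consumers", []):
                if producer != consumer:  # ignore self-consumption
                    edges.add((producer, consumer, topic))
    return sorted(edges)


topics = {
    "order.placed": {"producers": ["checkout"],
                     "consumers": ["fulfilment", "email"]},
    "invoice.ready": {"producers": ["billing"], "consumers": ["email"]},
}
candidate_edges = async_edges(topics)
for producer, consumer, topic in candidate_edges:
    print(f"{producer} -> {consumer} via {topic}")
```

These are candidates, not facts: a service can be subscribed to a topic it no longer processes, which is why they still go through curation.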
Add the build graph from CI/CD and dependency manifests
Your CI/CD config and package manifests carry the dependencies that never show up in a trace: shared libraries, base images, infrastructure modules, and the deploy ordering your pipeline already encodes. Parse package.json, go.mod, pom.xml, Dockerfiles, and your pipeline definitions, and you recover the static structure runtime data is blind to.
This is also where you capture the *kind* of coupling, which is more useful than the fact of it. A change to a service you call over HTTP fails gracefully if you designed for it. A change to a library you statically link forces a rebuild and redeploy of every dependent. Same diff, very different blast radius. The build graph is what lets the model tell those apart, and lets Testing Fleets scope validation to what a given change can actually reach rather than re-running a static suite that ignores the topology.
Fold the CI/CD layer in last because it is the cheapest to parse and the easiest to keep fresh: it lives in version control, so every merge is a free update.
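To make the manifest parsing concrete, here is a minimal sketch for a single `package.json`. The `@acme/` scope used to recognise internal packages is a hypothetical convention, and treating `devDependencies` as build edges is a choice made for this example, not a rule.

```python
import json


def build_edges(service, package_json_text, internal_scope="@acme/"):
    """Return (dependent, dependency, kind) build edges for internal packages.

    `internal_scope` is an assumed naming convention for in-house packages;
    third-party dependencies are ignored in this sketch.
    """
    manifest = json.loads(package_json_text)
    deps = {}
    for section in ("dependencies", "devDependencies"):
        deps.update(manifest.get(section, {}))
    return sorted(
        (service, name, "depends-on")
        for name in deps
        if name.startswith(internal_scope)
    )


package_json = """
{
  "name": "@acme/checkout",
  "dependencies": {"@acme/auth-client": "^2.1.0", "express": "^4.19.2"},
  "devDependencies": {"@acme/test-kit": "^1.0.0", "jest": "^29.7.0"}
}
"""
manifest_edges = build_edges("checkout", package_json)
for edge in manifest_edges:
    print(edge)
```

The same pattern extends to go.mod, pom.xml, and Dockerfile base images, each with its own parser. Running it in CI on every merge is what keeps this layer fresh.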
Curate typed edges, because a blank edge is a lie
Here is the step teams skip, and it is the one that decides whether the graph is an asset or a wall poster. An untyped edge, "A connects to B", is almost useless for reasoning about risk. The control layer needs to know *how* A depends on B to decide what breaks if B changes.
Curate edges into a small, deliberate type system. A workable starter set:
- `calls` (sync): A makes a blocking request to B. B's latency and errors propagate to A in real time.
- `calls` (async): A publishes; B consumes. Failures are deferred and often silent. These are the edges your traces missed; add them by hand from your queue and topic config.
- `depends-on` (build): A compiles or links against B. A change to B requires rebuilding A.
- `reads-from` / `writes-to`: A shares a datastore or schema with B. Schema changes here are high-blast-radius and easy to underestimate.
- `provisions`: infrastructure ownership, captured from your IaC.
Attach a confidence source to each edge: derived-from-trace, parsed-from-config, or human-asserted. When the graph drives an auto-merge decision later, a low-confidence edge should route to a human rather than be silently trusted. This is the same discipline behind reachability-based prioritization, where ranking findings by what's actually reachable can mean 70-90% less exploitable exposure to triage. Precise edges are what make reachability computable instead of guessed.
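One way to picture typed, confidence-tagged edges driving a routing decision is the sketch below. The `Edge` fields, the `kind` strings, and especially the `trusted` set (which confidence sources may be trusted without review) are assumptions for illustration; the real policy is yours to define.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    src: str         # the dependent (A)
    dst: str         # the dependency (B)
    kind: str        # e.g. "calls-sync", "calls-async", "depends-on"
    confidence: str  # "derived-from-trace" | "parsed-from-config" | "human-asserted"


def blast_radius(edges, changed):
    """Walk edges backwards from `changed` to find everything that depends on it."""
    dependents = {}
    for edge in edges:
        dependents.setdefault(edge.dst, []).append(edge)
    affected, used, queue = set(), [], deque([changed])
    while queue:
        node = queue.popleft()
        for edge in dependents.get(node, []):
            used.append(edge)
            if edge.src != changed and edge.src not in affected:
                affected.add(edge.src)
                queue.append(edge.src)
    return affected, used


def route(edges, changed,
          trusted=frozenset({"parsed-from-config", "human-asserted"})):
    """Send the change to a human if any traversed edge falls outside `trusted`."""
    affected, used = blast_radius(edges, changed)
    needs_human = any(edge.confidence not in trusted for edge in used)
    return affected, ("human-review" if needs_human else "auto")


graph = [
    Edge("checkout", "payments", "calls-sync", "derived-from-trace"),
    Edge("payments", "shared-auth", "depends-on", "parsed-from-config"),
    Edge("reports", "shared-log", "depends-on", "human-asserted"),
]
print(route(graph, "shared-auth"))
print(route(graph, "shared-log"))
```

A change to `shared-auth` reaches `checkout` only through a trace-derived edge, so under this assumed policy it goes to review, while the `shared-log` change touches only asserted edges. A fuller version would also branch on `kind`, since a build edge and a sync call imply different failure modes.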
Attach ownership, and make it the source of truth
A dependency graph without ownership tells you what broke but not who decides. Fuse three ownership signals and reconcile them: CODEOWNERS (repo-level, accurate, often stale), on-call rotations (who answers at 3 a.m.), and your catalog's declared team (aspirational). Where they disagree, the graph should surface the conflict rather than pick silently: a node with three different owners across three systems is a reliability incident waiting to happen.
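A minimal sketch of that reconciliation, assuming each signal has already been reduced to a mapping from node to team name. Resolving CODEOWNERS patterns or paging-tool schedules into that form is out of scope here, and the status labels are invented for the example.

```python
def reconcile_owners(codeowners, on_call, catalog):
    """Compare three node->team mappings and surface disagreement.

    Status is "agreed" when all three sources name the same team,
    "conflict" when at least two sources disagree, and "incomplete"
    when a source is missing but the rest do not disagree.
    """
    sources = {"codeowners": codeowners, "on_call": on_call, "catalog": catalog}
    report = {}
    for node in sorted(set().union(*sources.values())):
        claims = {name: mapping.get(node) for name, mapping in sources.items()}
        teams = {team for team in claims.values() if team is not None}
        if len(teams) > 1:
            status = "conflict"
        elif None in claims.values():
            status = "incomplete"
        else:
            status = "agreed"
        report[node] = {
            "status": status,
            # Only an undisputed, fully attested owner is recorded.
            "owner": next(iter(teams)) if status == "agreed" else None,
            "claims": claims,
        }
    return report


ownership = reconcile_owners(
    codeowners={"payments": "team-pay", "ledger": "team-fin"},
    on_call={"payments": "team-pay", "ledger": "team-sre"},
    catalog={"payments": "team-pay", "ledger": "team-fin", "reports": "team-data"},
)
for node, result in ownership.items():
    print(node, result["status"], result["owner"])
```

The function never picks a winner. A "conflict" or "incomplete" node keeps `owner` empty so that downstream approval routing has to escalate instead of guessing.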
Ownership is what turns the graph from a map into a control surface. It routes an approval to a named human, it decides whose policy applies, and it produces the audit trail that records who authorized what against which evidence. This is the foundation of governed autonomy: agents propose changes scoped by the graph, but Governance puts a named owner on every authorization. Agents propose; humans authorize, and ownership metadata is what makes "humans" resolve to a specific accountable person rather than a queue.
Keep it live, or watch it rot
A graph is a depreciating asset. The moment you stop reconciling it, it drifts from reality, and a stale graph produces confidently wrong blast-radius math, which is worse than no graph because people trust it. Three reconciliation loops keep it honest:
- CI/CD on every merge refreshes the build graph for free.
- A rolling trace window (daily or weekly) re-derives runtime edges and flags new or vanished services.
- Ownership drift alerts when CODEOWNERS, on-call, and catalog disagree.
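The rolling-window loop amounts to a set difference between two snapshots of the runtime graph. The sketch below assumes edges are `(caller, callee)` tuples and that the catalog is available as a set of node names; both shapes are placeholders for this example.

```python
def diff_windows(previous_edges, current_edges, catalog_nodes):
    """Compare two trace windows and flag drift.

    Edges are (caller, callee) tuples. `catalog_nodes` is the set of
    registered node names, used to spot services seen only in traces.
    """
    previous, current = set(previous_edges), set(current_edges)
    prev_services = {service for edge in previous for service in edge}
    curr_services = {service for edge in current for service in edge}
    return {
        "new_edges": sorted(current - previous),
        "vanished_edges": sorted(previous - current),
        "new_services": sorted(curr_services - prev_services),
        "vanished_services": sorted(prev_services - curr_services),
        "unregistered": sorted(curr_services - set(catalog_nodes)),
    }


last_week = [("checkout", "payments"), ("checkout", "inventory")]
this_week = [("checkout", "payments"), ("payments", "fraud-lambda")]
drift = diff_windows(last_week, this_week,
                     catalog_nodes={"checkout", "payments", "inventory"})
for key, value in drift.items():
    print(key, value)
```

A vanished edge is a prompt to investigate, not proof of removal: with sampled traces, a rare path can simply fail to appear in a given window.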
What to do Monday morning: import your catalog as nodes, point a week of traces at it to derive call edges, and parse one repo's manifests to prove the build layer. Then hand-type the ten highest-traffic edges and attach owners. You will have a usable graph by Friday, and the gaps you find will tell you exactly where your real coupling hides. The integrations you already run (APM, CI, source control) are the ingestion surface; you are not adding instrumentation, you are fusing what you already have. See how it works for the end-to-end model, and the glossary if any of these terms need pinning down.
The bottom line
