Observability Guide
How to monitor governance behavior, debug policy outcomes, and export evidence for audit/compliance.
1) Core Runtime Signals
Track these dimensions for each governed action:
- Intervention (ok, nudge, flag, escalate, block, halt)
- Stage path (local_terminal, local_ok_tier0, remote_stage2)
- Latency (stage1_ms, stage2_ms, total)
- Policy version (blueprint_id, blueprint_hash)
- Trust debt (current_debt, delta, retier_triggered)
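One way to keep these dimensions together is a single record per governed action. The sketch below is illustrative only: the GovernanceSignal class is a hypothetical container, not a shipped type, and it assumes that total latency is the sum of the two stage latencies. Only the field names and enumerated values come from the list above.

```python
from dataclasses import dataclass

# Allowed values taken from the signal list above.
INTERVENTIONS = {"ok", "nudge", "flag", "escalate", "block", "halt"}
STAGE_PATHS = {"local_terminal", "local_ok_tier0", "remote_stage2"}


@dataclass(frozen=True)
class GovernanceSignal:  # hypothetical container, not a library type
    intervention: str
    stage_path: str
    stage1_ms: float
    stage2_ms: float
    blueprint_id: str
    blueprint_hash: str
    current_debt: float
    delta: float
    retier_triggered: bool

    def __post_init__(self):
        # Reject values outside the documented enumerations early.
        if self.intervention not in INTERVENTIONS:
            raise ValueError(f"unknown intervention: {self.intervention!r}")
        if self.stage_path not in STAGE_PATHS:
            raise ValueError(f"unknown stage path: {self.stage_path!r}")

    @property
    def total_ms(self) -> float:
        # Assumption: total latency is stage 1 plus stage 2.
        return self.stage1_ms + self.stage2_ms
```

Validating the enumerated values at construction time keeps malformed signals out of downstream dashboards and exports.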
2) Using get_metrics()
metrics = steward.get_metrics()
print({
    "total_evaluations": metrics.get("total_evaluations"),
    "interventions": metrics.get("interventions"),
    "avg_latency_ms": metrics.get("avg_latency_ms"),
    "p99_latency_ms": metrics.get("p99_latency_ms"),
    "timeout_count": metrics.get("timeout_count"),
    "tripwire_violations": metrics.get("tripwire_violations"),
    "evidence_policy_failures": metrics.get("evidence_policy_failures"),
    "degraded_scorer_fallback_usage": metrics.get("degraded_scorer_fallback_usage"),
    "scorer_errors": metrics.get("scorer_errors"),
    "tier0_sampled_approvals": metrics.get("tier0_sampled_approvals"),
    "trust_debt_threshold_crossings": metrics.get("trust_debt_threshold_crossings"),
})
Interpretation baseline (starting point, tune by workload):
- timeout_count rising with flat traffic: network or steward saturation risk
- flag growth with stable block: behavior quality drift, review prompts
- block spike after policy rollout: threshold or tripwire regression candidate
- evidence_policy_failures rising: trace evidence collection or source-catalog quality regression candidate
- degraded_scorer_fallback_usage rising: non-deterministic scorer instability or timeout pressure
- scorer_errors rising: fail-closed scorer path is protecting execution, but the model/provider path needs attention
- tier0_sampled_approvals flat at zero in high-volume Tier-0 environments: sampling may be disabled unintentionally
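Most of the "rising" checks above amount to comparing two metric snapshots. The following is a minimal sketch under two assumptions: the watched values are cumulative numeric counters, and each snapshot is a plain dict shaped like the get_metrics() output. The rising_counters helper and the WATCHED tuple are hypothetical names introduced for illustration.

```python
# Counters from the interpretation list whose growth is worth flagging.
WATCHED = (
    "timeout_count",
    "evidence_policy_failures",
    "degraded_scorer_fallback_usage",
    "scorer_errors",
)


def rising_counters(previous: dict, current: dict, watched: tuple = WATCHED) -> dict:
    """Return {name: increase} for watched counters that grew between snapshots."""
    increases = {}
    for name in watched:
        # Missing or None values are treated as zero.
        before = previous.get(name) or 0
        after = current.get(name) or 0
        if after > before:
            increases[name] = after - before
    return increases


# Example snapshots (made-up numbers, not real measurements).
earlier = {"timeout_count": 4, "scorer_errors": 0}
later = {"timeout_count": 9, "scorer_errors": 0, "evidence_policy_failures": 2}
print(rising_counters(earlier, later))
```

Whether an increase is actionable still depends on traffic over the same interval, so pair the result with total_evaluations before alerting.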
2.1 Machine-Readable Export Mapping
Standard-conformant implementations MUST be able to export the required observability metrics in machine-readable form. One recommended mapping is:
| Required Signal | Example Metric Name | Suggested Type | Suggested Dimensions |
|---|---|---|---|
| CTQ score distribution | acgp.ctq.score | histogram | agent_id, blueprint_id, governance_tier |
| Intervention counts | acgp.intervention.total | counter | decision, flagged, governance_tier |
| Evaluation latency p50/p95/p99 | acgp.evaluation.latency_ms | histogram | stage, governance_tier |
| Evidence-policy failures | acgp.evidence_policy_failures_total | counter | blueprint_id, governance_tier |
| Degraded scorer fallback usage | acgp.degraded_scorer_fallback_usage_total | counter | provider, model |
| Scorer fail-closed errors | acgp.scorer_errors_total | counter | provider, model |
| Tier-0 sampled approvals | acgp.tier0_sampled_approvals_total | counter | governance_tier |
| Disconnect fallback counts | acgp.fallback.disconnect.total | counter | profile, decision |
| Trust-debt threshold triggers | acgp.trust_debt.threshold.total | counter | threshold, agent_id, blueprint_id |
OpenTelemetry, Prometheus, log-derived metrics, or equivalent telemetry backends are all acceptable as long as the exported signals remain machine-readable and auditable.
Reference Observability Export
The reference evaluator service exports an acgp_observability_v1 report that includes:
- CTQ score distribution (p50, p95, p99)
- intervention counts
- evidence-policy failure counts
- degraded scorer fallback usage
- scorer error counts
- Tier-0 sampled approvals
- trust-debt threshold crossings
- evaluation latency percentiles
These fields describe the shipped reference export surface. They do not imply that every deployment must expose identical observability formats outside the reference path.
3) Decision Logging (Structured)
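import logging

logger = logging.getLogger(__name__)


def log_decision(trace_id: str, result):
    # "message" is a reserved LogRecord attribute; passing it in `extra`
    # raises KeyError, so the decision text goes under a different key.
    logger.info(
        "acgp.decision",
        extra={
            "trace_id": trace_id,
            "intervention": result.intervention,
            "decision_message": result.message,
            "metadata": result.metadata,
        },
    )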
Recommended fields:
- Identity: trace_id, agent_id, session_id
- Policy: blueprint_id, bundle_hash, governance_tier
- Decision: intervention, tripwires_triggered, risk_score, ctq_score
- Runtime: stage, latency_ms, fallback_applied, runtime_posture, review_required
- Trust debt: trust_debt.pre, trust_debt.delta, trust_debt.post, trust_debt.thresholds_crossed
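Fields passed through `extra` only reach the log output if the formatter emits them. The sketch below shows one possible approach with the standard logging module: a formatter that serializes the recommended fields as one JSON object per line. The DecisionJsonFormatter class is hypothetical, and it assumes the dotted trust_debt.* fields are carried as a nested dict under a single trust_debt key.

```python
import json
import logging

# Field names from the recommended list; trust_debt is assumed to be a dict
# holding pre, delta, post, and thresholds_crossed.
RECOMMENDED_FIELDS = (
    "trace_id", "agent_id", "session_id",
    "blueprint_id", "bundle_hash", "governance_tier",
    "intervention", "tripwires_triggered", "risk_score", "ctq_score",
    "stage", "latency_ms", "fallback_applied", "runtime_posture",
    "review_required", "trust_debt",
)


class DecisionJsonFormatter(logging.Formatter):  # hypothetical helper
    def format(self, record: logging.LogRecord) -> str:
        payload = {"event": record.getMessage(), "level": record.levelname}
        # Copy only the recommended fields that were supplied via `extra`.
        for field in RECOMMENDED_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload, sort_keys=True, default=str)


handler = logging.StreamHandler()
handler.setFormatter(DecisionJsonFormatter())
decision_logger = logging.getLogger("acgp.decisions.example")
decision_logger.addHandler(handler)
decision_logger.setLevel(logging.INFO)
decision_logger.info(
    "acgp.decision",
    extra={"trace_id": "t-1", "intervention": "flag", "trust_debt": {"pre": 0.2, "delta": 0.1}},
)
```

Restricting the output to a known field list keeps the log schema stable even when callers attach additional attributes.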
4) Trust Debt Monitoring
Alert candidates:
- Debt crossing 70% of the Governance Tier review threshold
- Repeated flag + escalate within a rolling window
- Re-tier triggered more than once per policy window
Operational action:
- Inspect traces with highest debt contribution
- Validate tripwire severity and threshold boundaries
- Re-run conformance vectors before rollout changes
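The first two alert candidates can be sketched in a few lines. Everything here is an assumption made for illustration: the helper names, the idea that debt and threshold share the same unit, and the use of caller-supplied timestamps in seconds. The 70% ratio comes from the list above.

```python
from collections import deque


def debt_near_review(current_debt: float, review_threshold: float, ratio: float = 0.7) -> bool:
    """True once debt reaches the given fraction of the review threshold."""
    return current_debt >= ratio * review_threshold


class RepeatWindow:
    """Tracks flag/escalate interventions inside a rolling time window."""

    def __init__(self, window_seconds: float, max_events: int):
        self.window_seconds = window_seconds
        self.max_events = max_events
        self._times = deque()

    def record(self, intervention: str, now: float) -> bool:
        """Record one intervention; return True if the alert should fire."""
        if intervention in ("flag", "escalate"):
            self._times.append(now)
        # Drop events that have aged out of the window.
        while self._times and now - self._times[0] > self.window_seconds:
            self._times.popleft()
        return len(self._times) >= self.max_events


window = RepeatWindow(window_seconds=60, max_events=3)
for second, intervention in [(0, "flag"), (10, "escalate"), (20, "ok"), (30, "flag")]:
    fired = window.record(intervention, now=second)
print(fired, debt_near_review(current_debt=8.0, review_threshold=10.0))
```

Passing the timestamp in explicitly, rather than reading the clock inside the class, makes the window easy to test and replay against historical traces.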
5) Governance Store Audit Export
Minimal export payload should include:
- TRACE, EVAL, and INTERVENTION linkage via trace_id
- policy hash and bundle hash
- fallback and disconnect metadata
- timestamp and signer/checksum metadata (if enabled)
This supports post-incident review and regulatory evidence collection.
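A minimal sketch of such a payload follows, with a SHA-256 checksum computed over a canonical JSON encoding. The function names, the key names, and the choice of SHA-256 are assumptions made for illustration; the governance store's actual export schema and signing scheme may differ.

```python
import hashlib
import json
from datetime import datetime, timezone


def _canonical(body: dict) -> bytes:
    # Sorted keys and fixed separators give a stable byte encoding.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def build_audit_record(trace_id, records, policy_hash, bundle_hash, fallback, exported_at=None):
    """Assemble one export entry; records maps TRACE/EVAL/INTERVENTION to their data."""
    body = {
        "trace_id": trace_id,
        "records": records,
        "policy_hash": policy_hash,
        "bundle_hash": bundle_hash,
        "fallback": fallback,
        "exported_at": exported_at or datetime.now(timezone.utc).isoformat(),
    }
    return {**body, "checksum_sha256": hashlib.sha256(_canonical(body)).hexdigest()}


def verify_audit_record(record: dict) -> bool:
    """Recompute the checksum over everything except the checksum itself."""
    body = {key: value for key, value in record.items() if key != "checksum_sha256"}
    return hashlib.sha256(_canonical(body)).hexdigest() == record["checksum_sha256"]


entry = build_audit_record(
    trace_id="t-1",
    records={"TRACE": {"id": "t-1"}, "EVAL": {"ctq_score": 0.8}, "INTERVENTION": {"type": "flag"}},
    policy_hash="policy-hash-placeholder",
    bundle_hash="bundle-hash-placeholder",
    fallback={"applied": False, "disconnect": False},
)
print(verify_audit_record(entry))
```

A checksum detects accidental or unsophisticated modification only; evidence that must withstand adversarial tampering needs a real signature from a key held outside the export path.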
6) Suggested Dashboards
At minimum, chart:
- interventions by type (stacked over time)
- p95/p99 governance latency
- timeout and retry rate
- trust debt percentile distribution
- top firing tripwires
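When the telemetry backend does not compute percentiles itself, the latency and trust-debt charts can be fed from raw samples. The sketch below uses the nearest-rank method, which is one of several percentile definitions; the nearest_rank_percentile helper is a hypothetical name, and it assumes the samples fit in memory.

```python
def nearest_rank_percentile(samples: list, percent: int) -> float:
    """Nearest-rank percentile; percent is an integer from 1 to 100."""
    if not samples:
        raise ValueError("no samples")
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of percent * n / 100 avoids float rounding surprises.
    rank = (percent * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Example with synthetic latencies of 1..100 ms.
latencies_ms = list(range(1, 101))
print(nearest_rank_percentile(latencies_ms, 95), nearest_rank_percentile(latencies_ms, 99))
```

Nearest-rank always returns an observed sample, so a reported p99 corresponds to a real evaluation that can be looked up by trace.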