Observability Guide

How to monitor governance behavior, debug policy outcomes, and export evidence for audit/compliance.


1) Core Runtime Signals

Track these dimensions for each governed action:

  • Intervention (ok, nudge, flag, escalate, block, halt)
  • Stage path (local_terminal, local_ok_tier0, remote_stage2)
  • Latency (stage1_ms, stage2_ms, total)
  • Policy version (blueprint_id, blueprint_hash)
  • Trust debt (current_debt, delta, retier_triggered)
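The dimensions above can be carried as one structured record per governed action. A minimal sketch, assuming a Python runtime; the class and field names are illustrative, not a mandated schema:

```python
from dataclasses import dataclass

# Illustrative per-action signal record; field names mirror the dimensions
# above but are not a mandated schema.
@dataclass
class GovernedActionSignal:
    intervention: str        # ok | nudge | flag | escalate | block | halt
    stage_path: str          # e.g. local_terminal, local_ok_tier0, remote_stage2
    stage1_ms: float
    stage2_ms: float
    blueprint_id: str
    blueprint_hash: str
    current_debt: float
    delta: float
    retier_triggered: bool = False

    @property
    def total_ms(self) -> float:
        # Total latency is the sum of the two stage latencies.
        return self.stage1_ms + self.stage2_ms

sig = GovernedActionSignal(
    intervention="flag",
    stage_path="remote_stage2",
    stage1_ms=4.0,
    stage2_ms=19.0,
    blueprint_id="bp-001",
    blueprint_hash="sha256:abc123",
    current_debt=12.0,
    delta=1.5,
)
print(sig.total_ms)  # 23.0
```

Emitting one such record per governed action keeps all five dimensions joinable later by trace.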

2) Using get_metrics()

metrics = steward.get_metrics()

print({
    "total_evaluations": metrics.get("total_evaluations"),
    "interventions": metrics.get("interventions"),
    "avg_latency_ms": metrics.get("avg_latency_ms"),
    "p99_latency_ms": metrics.get("p99_latency_ms"),
    "timeout_count": metrics.get("timeout_count"),
    "tripwire_violations": metrics.get("tripwire_violations"),
    "evidence_policy_failures": metrics.get("evidence_policy_failures"),
    "degraded_scorer_fallback_usage": metrics.get("degraded_scorer_fallback_usage"),
    "scorer_errors": metrics.get("scorer_errors"),
    "tier0_sampled_approvals": metrics.get("tier0_sampled_approvals"),
    "trust_debt_threshold_crossings": metrics.get("trust_debt_threshold_crossings"),
})

Interpretation baselines (starting points; tune per workload):

  • timeout_count rising with flat traffic: network or steward saturation risk
  • flag growth with stable block: behavior quality drift, review prompts
  • block spike after policy rollout: threshold or tripwire regression candidate
  • evidence_policy_failures rising: trace evidence collection or source-catalog quality regression candidate
  • degraded_scorer_fallback_usage rising: non-deterministic scorer instability or timeout pressure
  • scorer_errors rising: fail-closed scorer path is protecting execution, but the model/provider path needs attention
  • tier0_sampled_approvals flat at zero in high-volume Tier-0 environments: sampling may be disabled unintentionally
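One way to operationalize these baselines is a periodic delta check between two get_metrics() snapshots. A hypothetical sketch; the growth factors and the 10,000-evaluation volume cutoff are placeholders to tune per workload:

```python
def metric_alerts(prev: dict, curr: dict, traffic_ratio: float = 1.0) -> list[str]:
    """Compare two get_metrics() snapshots and flag suspicious growth.

    traffic_ratio is current traffic divided by previous traffic; a rising
    counter with flat traffic (ratio near 1.0) is the signal of interest.
    Thresholds below are illustrative starting points, not standard values.
    """
    alerts = []

    def grew(key: str, factor: float = 1.5) -> bool:
        # Require both relative and absolute growth to suppress noise.
        before, after = prev.get(key, 0) or 0, curr.get(key, 0) or 0
        return after > max(before * factor, before + 5)

    if grew("timeout_count") and traffic_ratio < 1.2:
        alerts.append("timeout_count rising with flat traffic: saturation risk")
    if grew("evidence_policy_failures"):
        alerts.append("evidence_policy_failures rising: evidence-collection regression candidate")
    if grew("degraded_scorer_fallback_usage"):
        alerts.append("degraded_scorer_fallback_usage rising: scorer instability or timeout pressure")
    if grew("scorer_errors"):
        alerts.append("scorer_errors rising: fail-closed path active, check provider")
    if curr.get("tier0_sampled_approvals", 0) == 0 and curr.get("total_evaluations", 0) > 10_000:
        alerts.append("tier0_sampled_approvals flat at zero under high volume: sampling may be disabled")
    return alerts

alerts = metric_alerts(
    {"timeout_count": 2, "scorer_errors": 0, "total_evaluations": 50_000},
    {"timeout_count": 40, "scorer_errors": 0, "total_evaluations": 52_000,
     "tier0_sampled_approvals": 0},
)
```

Here the example snapshot trips both the timeout and the Tier-0 sampling checks while the flat scorer counters stay quiet.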

2.1 Machine-Readable Export Mapping

Standard-conformant implementations MUST be able to export the required observability metrics in machine-readable form. One recommended mapping is:

Required Signal                  Example Metric Name                          Suggested Type  Suggested Dimensions
CTQ score distribution           acgp.ctq.score                               histogram       agent_id, blueprint_id, governance_tier
Intervention counts              acgp.intervention.total                      counter         decision, flagged, governance_tier
Evaluation latency p50/p95/p99   acgp.evaluation.latency_ms                   histogram       stage, governance_tier
Evidence-policy failures         acgp.evidence_policy_failures_total          counter         blueprint_id, governance_tier
Degraded scorer fallback usage   acgp.degraded_scorer_fallback_usage_total    counter         provider, model
Scorer fail-closed errors        acgp.scorer_errors_total                     counter         provider, model
Tier-0 sampled approvals         acgp.tier0_sampled_approvals_total           counter         governance_tier
Disconnect fallback counts       acgp.fallback.disconnect.total               counter         profile, decision
Trust-debt threshold triggers    acgp.trust_debt.threshold.total              counter         threshold, agent_id, blueprint_id

OpenTelemetry, Prometheus, log-derived metrics, or equivalent telemetry backends are all acceptable as long as the exported signals remain machine-readable and auditable.
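As a concrete illustration of one acceptable backend, the mapping above can be emitted in the Prometheus text exposition format with plain string formatting. The metric and label names follow the table; the helper itself is hypothetical:

```python
def render_prometheus(metric: str, samples: dict, metric_type: str = "counter") -> str:
    """Render one metric family in Prometheus text exposition format.

    samples maps label tuples like (("decision", "block"), ("governance_tier", "2"))
    to values. Dots in metric names are flattened to underscores, the usual
    convention when mirroring dotted names into Prometheus.
    """
    name = metric.replace(".", "_")
    lines = [f"# TYPE {name} {metric_type}"]
    for labels, value in samples.items():
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines)

text = render_prometheus(
    "acgp.intervention.total",
    {(("decision", "block"), ("governance_tier", "2")): 5},
)
print(text)
# # TYPE acgp_intervention_total counter
# acgp_intervention_total{decision="block",governance_tier="2"} 5
```

A real deployment would normally use an existing client library rather than hand-rolling this; the point is only that the exported signals stay machine-readable.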

2.2 Reference Observability Export

The reference evaluator service exports an acgp_observability_v1 report that includes:

  • CTQ score distribution (p50, p95, p99)
  • intervention counts
  • evidence-policy failure counts
  • degraded scorer fallback usage
  • scorer error counts
  • Tier-0 sampled approvals
  • trust-debt threshold crossings
  • evaluation latency percentiles

These fields describe the shipped reference export surface. They do not imply that every deployment must expose identical observability formats outside the reference path.
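As a sketch only, such a report might look like the dictionary below. The field names follow the bullet list above; the values are invented for illustration, and the actual acgp_observability_v1 schema is owned by the reference service:

```python
# Hypothetical shape of an acgp_observability_v1 report; values are
# invented for illustration and do not come from any real deployment.
report = {
    "schema": "acgp_observability_v1",
    "ctq_score": {"p50": 0.78, "p95": 0.91, "p99": 0.96},
    "interventions": {"ok": 9120, "nudge": 310, "flag": 84,
                      "escalate": 12, "block": 7, "halt": 0},
    "evidence_policy_failures": 3,
    "degraded_scorer_fallback_usage": 11,
    "scorer_errors": 2,
    "tier0_sampled_approvals": 143,
    "trust_debt_threshold_crossings": 1,
    "evaluation_latency_ms": {"p50": 12.4, "p95": 38.9, "p99": 71.0},
}
```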


3) Decision Logging (Structured)

import logging

logger = logging.getLogger("acgp")

def log_decision(trace_id: str, result):
    logger.info(
        "acgp.decision",
        extra={
            "trace_id": trace_id,
            "intervention": result.intervention,
            "message": result.message,
            "metadata": result.metadata,
        },
    )

Recommended fields:

  • Identity: trace_id, agent_id, session_id
  • Policy: blueprint_id, bundle_hash, governance_tier
  • Decision: intervention, tripwires_triggered, risk_score, ctq_score
  • Runtime: stage, latency_ms, fallback_applied, runtime_posture, review_required
  • Trust debt: trust_debt.pre, trust_debt.delta, trust_debt.post, trust_debt.thresholds_crossed
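Pulling the recommended fields together, a hypothetical payload builder; where each field comes from (result.metadata, a caller-supplied runtime dict) is an assumption about the surrounding code, not a defined API:

```python
from types import SimpleNamespace

def decision_log_payload(trace_id: str, result, runtime: dict, debt: dict) -> dict:
    """Assemble the recommended structured-log fields into one flat payload.

    `runtime` and `debt` are assumed to be collected by the caller; the
    attribute names on `result` mirror the logging example above.
    """
    md = result.metadata or {}
    return {
        # Identity
        "trace_id": trace_id,
        "agent_id": md.get("agent_id"),
        "session_id": md.get("session_id"),
        # Policy
        "blueprint_id": md.get("blueprint_id"),
        "bundle_hash": md.get("bundle_hash"),
        "governance_tier": md.get("governance_tier"),
        # Decision
        "intervention": result.intervention,
        "tripwires_triggered": md.get("tripwires_triggered", []),
        "risk_score": md.get("risk_score"),
        "ctq_score": md.get("ctq_score"),
        # Runtime (stage, latency_ms, fallback_applied, ...)
        **runtime,
        # Trust debt, namespaced as trust_debt.pre / .delta / .post / ...
        **{f"trust_debt.{k}": v for k, v in debt.items()},
    }

payload = decision_log_payload(
    "tr-42",
    SimpleNamespace(intervention="flag", metadata={"agent_id": "a1", "risk_score": 0.4}),
    runtime={"stage": "remote_stage2", "latency_ms": 21.5, "fallback_applied": False},
    debt={"pre": 10.0, "delta": 1.5, "post": 11.5, "thresholds_crossed": []},
)
```

Flat, consistently named keys keep the payload easy to index and query downstream.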

4) Trust Debt Monitoring

Alert candidates:

  • Debt crossing 70% of Governance Tier review threshold
  • Repeated flag + escalate within rolling window
  • Re-tier triggered more than once per policy window
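The first alert candidate can be expressed directly. A minimal sketch, assuming the review threshold for the tier is known to the caller:

```python
def trust_debt_alert(current_debt: float, review_threshold: float,
                     warn_fraction: float = 0.70) -> bool:
    """Return True when debt crosses warn_fraction of the tier review threshold.

    warn_fraction=0.70 mirrors the 70% alert candidate above; tune per tier.
    """
    return current_debt >= warn_fraction * review_threshold

print(trust_debt_alert(71.0, 100.0))  # True: 71 >= 70% of 100
print(trust_debt_alert(65.0, 100.0))  # False
```

The other two candidates (repeated flag + escalate, repeated re-tier) are window counts over the decision log rather than point checks.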

Operational action:

  1. Inspect traces with highest debt contribution
  2. Validate tripwire severity and threshold boundaries
  3. Re-run conformance vectors before rollout changes

5) Governance Store Audit Export

A minimal export payload should include:

  • TRACE, EVAL, and INTERVENTION linkage via trace_id
  • policy hash and bundle hash
  • fallback and disconnect metadata
  • timestamp and signer/checksum metadata (if enabled)

This supports post-incident review and regulatory evidence collection.
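A payload along these lines can be assembled with a checksum over its canonical JSON. A hashlib sketch; the record shape and key names are illustrative:

```python
import hashlib
import json

def audit_export(trace_id: str, records: dict, policy_hash: str, bundle_hash: str,
                 timestamp: str, sign: bool = True) -> dict:
    """Bundle TRACE/EVAL/INTERVENTION records under one trace_id for export.

    `records` is expected to hold the linked TRACE, EVAL, and INTERVENTION
    entries; the checksum covers the canonical JSON of the payload body.
    """
    payload = {
        "trace_id": trace_id,
        "records": records,
        "policy_hash": policy_hash,
        "bundle_hash": bundle_hash,
        "fallback": records.get("INTERVENTION", {}).get("fallback_applied"),
        "timestamp": timestamp,
    }
    if sign:
        body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
        payload["checksum"] = "sha256:" + hashlib.sha256(body).hexdigest()
    return payload

export = audit_export(
    "tr-42",
    {"TRACE": {"action": "write_file"},
     "EVAL": {"ctq_score": 0.8},
     "INTERVENTION": {"intervention": "flag", "fallback_applied": False}},
    policy_hash="sha256:p", bundle_hash="sha256:b",
    timestamp="2025-01-01T00:00:00Z",
)
```

Sorting keys and fixing separators before hashing makes the checksum reproducible, which matters when an auditor re-verifies the export later.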


6) Suggested Dashboards

At minimum, chart:

  • interventions by type (stacked over time)
  • p95/p99 governance latency
  • timeout and retry rate
  • trust debt percentile distribution
  • top firing tripwires