Observability Guide

How to monitor governance behavior, debug policy outcomes, and export evidence for audit/compliance.


1) Core Runtime Signals

Track these dimensions for each governed action:

  • Intervention (ok, nudge, flag, escalate, block, halt)
  • Stage path (local_terminal, local_ok_tier0, remote_stage2)
  • Latency (stage1_ms, stage2_ms, total)
  • Policy version (blueprint_id, blueprint_hash)
  • Trust debt (current_debt, delta, retier_triggered)
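The dimensions above can be carried as one structured record per governed action. A minimal sketch, assuming a Python runtime; the class and field names are illustrative, not a mandated schema:

```python
from dataclasses import dataclass

# Illustrative per-action signal record; field names mirror the dimensions
# above but are not a mandated schema.
@dataclass
class GovernedActionSignal:
    intervention: str        # ok | nudge | flag | escalate | block | halt
    stage_path: str          # e.g. local_terminal, local_ok_tier0, remote_stage2
    stage1_ms: float
    stage2_ms: float
    blueprint_id: str
    blueprint_hash: str
    current_debt: float
    delta: float
    retier_triggered: bool = False

    @property
    def total_ms(self) -> float:
        # Total latency is the sum of the two stage latencies.
        return self.stage1_ms + self.stage2_ms

sig = GovernedActionSignal(
    intervention="flag",
    stage_path="remote_stage2",
    stage1_ms=4.0,
    stage2_ms=19.0,
    blueprint_id="bp-001",
    blueprint_hash="sha256:abc123",
    current_debt=12.0,
    delta=1.5,
)
print(sig.total_ms)  # 23.0
```

Emitting one such record per governed action keeps all five dimensions joinable later by trace.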

2) Using get_metrics()

metrics = steward.get_metrics()

print({
    "total_evaluations": metrics.get("total_evaluations"),
    "interventions": metrics.get("interventions"),
    "avg_latency_ms": metrics.get("avg_latency_ms"),
    "p99_latency_ms": metrics.get("p99_latency_ms"),
    "timeout_count": metrics.get("timeout_count"),
    "tripwire_violations": metrics.get("tripwire_violations"),
    "evidence_policy_failures": metrics.get("evidence_policy_failures"),
    "degraded_scorer_fallback_usage": metrics.get("degraded_scorer_fallback_usage"),
    "scorer_errors": metrics.get("scorer_errors"),
    "tier0_sampled_approvals": metrics.get("tier0_sampled_approvals"),
    "trust_debt_threshold_crossings": metrics.get("trust_debt_threshold_crossings"),
})

Interpretation baselines (starting points; tune per workload):

  • timeout_count rising with flat traffic: network or steward saturation risk
  • flag growth with stable block: behavior quality drift, review prompts
  • block spike after policy rollout: threshold or tripwire regression candidate
  • evidence_policy_failures rising: trace evidence collection or source-catalog quality regression candidate
  • degraded_scorer_fallback_usage rising: non-deterministic scorer instability or timeout pressure
  • scorer_errors rising: fail-closed scorer path is protecting execution, but the model/provider path needs attention
  • tier0_sampled_approvals flat at zero in high-volume Tier-0 environments: sampling may be disabled unintentionally
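One way to operationalize these baselines is a periodic delta check between two get_metrics() snapshots. A hypothetical sketch; the growth factors and the 10,000-evaluation volume cutoff are placeholders to tune per workload:

```python
def metric_alerts(prev: dict, curr: dict, traffic_ratio: float = 1.0) -> list[str]:
    """Compare two get_metrics() snapshots and flag suspicious growth.

    traffic_ratio is current traffic divided by previous traffic; a rising
    counter with flat traffic (ratio near 1.0) is the signal of interest.
    Thresholds below are illustrative starting points, not standard values.
    """
    alerts = []

    def grew(key: str, factor: float = 1.5) -> bool:
        # Require both relative and absolute growth to suppress noise.
        before, after = prev.get(key, 0) or 0, curr.get(key, 0) or 0
        return after > max(before * factor, before + 5)

    if grew("timeout_count") and traffic_ratio < 1.2:
        alerts.append("timeout_count rising with flat traffic: saturation risk")
    if grew("evidence_policy_failures"):
        alerts.append("evidence_policy_failures rising: evidence-collection regression candidate")
    if grew("degraded_scorer_fallback_usage"):
        alerts.append("degraded_scorer_fallback_usage rising: scorer instability or timeout pressure")
    if grew("scorer_errors"):
        alerts.append("scorer_errors rising: fail-closed path active, check provider")
    if curr.get("tier0_sampled_approvals", 0) == 0 and curr.get("total_evaluations", 0) > 10_000:
        alerts.append("tier0_sampled_approvals flat at zero under high volume: sampling may be disabled")
    return alerts

alerts = metric_alerts(
    {"timeout_count": 2, "scorer_errors": 0, "total_evaluations": 50_000},
    {"timeout_count": 40, "scorer_errors": 0, "total_evaluations": 52_000,
     "tier0_sampled_approvals": 0},
)
```

Here the example snapshot trips both the timeout and the Tier-0 sampling checks while the flat scorer counters stay quiet.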

2.1 Machine-Readable Export Mapping

Standard-conformant implementations MUST be able to export the required observability metrics in machine-readable form. One recommended mapping is:

Required Signal                  Example Metric Name                          Suggested Type  Suggested Dimensions
CTQ score distribution           acgp.ctq.score                               histogram       agent_id, blueprint_id, governance_tier
Intervention counts              acgp.intervention.total                      counter         decision, flagged, governance_tier
Evaluation latency p50/p95/p99   acgp.evaluation.latency_ms                   histogram       stage, governance_tier
Evidence-policy failures         acgp.evidence_policy_failures_total          counter         blueprint_id, governance_tier
Degraded scorer fallback usage   acgp.degraded_scorer_fallback_usage_total    counter         provider, model
Scorer fail-closed errors        acgp.scorer_errors_total                     counter         provider, model
Tier-0 sampled approvals         acgp.tier0_sampled_approvals_total           counter         governance_tier
Disconnect fallback counts       acgp.fallback.disconnect.total               counter         profile, decision
Trust-debt threshold triggers    acgp.trust_debt.threshold.total              counter         threshold, agent_id, blueprint_id

OpenTelemetry, Prometheus, log-derived metrics, or equivalent telemetry backends are all acceptable as long as the exported signals remain machine-readable and auditable.
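As a concrete illustration of one acceptable backend, the mapping above can be emitted in the Prometheus text exposition format with plain string formatting. The metric and label names follow the table; the helper itself is hypothetical:

```python
def render_prometheus(metric: str, samples: dict, metric_type: str = "counter") -> str:
    """Render one metric family in Prometheus text exposition format.

    samples maps label tuples like (("decision", "block"), ("governance_tier", "2"))
    to values. Dots in metric names are flattened to underscores, the usual
    convention when mirroring dotted names into Prometheus.
    """
    name = metric.replace(".", "_")
    lines = [f"# TYPE {name} {metric_type}"]
    for labels, value in samples.items():
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines)

text = render_prometheus(
    "acgp.intervention.total",
    {(("decision", "block"), ("governance_tier", "2")): 5},
)
print(text)
# # TYPE acgp_intervention_total counter
# acgp_intervention_total{decision="block",governance_tier="2"} 5
```

A real deployment would normally use an existing client library rather than hand-rolling this; the point is only that the exported signals stay machine-readable.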

2.2 Reference Observability Export

The reference evaluator service exports an acgp_observability_v1 report that includes:

  • CTQ score distribution (p50, p95, p99)
  • intervention counts
  • evidence-policy failure counts
  • degraded scorer fallback usage
  • scorer error counts
  • Tier-0 sampled approvals
  • trust-debt threshold crossings
  • evaluation latency percentiles

These fields describe the shipped reference export surface. They do not imply that every deployment must expose identical observability formats outside the reference path.
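As a sketch only, such a report might look like the dictionary below. The field names follow the bullet list above; the values are invented for illustration, and the actual acgp_observability_v1 schema is owned by the reference service:

```python
# Hypothetical shape of an acgp_observability_v1 report; values are
# invented for illustration and do not come from any real deployment.
report = {
    "schema": "acgp_observability_v1",
    "ctq_score": {"p50": 0.78, "p95": 0.91, "p99": 0.96},
    "interventions": {"ok": 9120, "nudge": 310, "flag": 84,
                      "escalate": 12, "block": 7, "halt": 0},
    "evidence_policy_failures": 3,
    "degraded_scorer_fallback_usage": 11,
    "scorer_errors": 2,
    "tier0_sampled_approvals": 143,
    "trust_debt_threshold_crossings": 1,
    "evaluation_latency_ms": {"p50": 12.4, "p95": 38.9, "p99": 71.0},
}
```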


3) Decision Logging (Structured)

import logging

logger = logging.getLogger("acgp")

def log_decision(trace_id: str, result):
    logger.info(
        "acgp.decision",
        extra={
            "trace_id": trace_id,
            "intervention": result.intervention,
            "message": result.message,
            "metadata": result.metadata,
        },
    )

Recommended fields:

  • Identity: trace_id, agent_id, session_id
  • Policy: blueprint_id, bundle_hash, governance_tier
  • Decision: intervention, tripwires_triggered, risk_score, ctq_score
  • Runtime: stage, latency_ms, fallback_applied, runtime_posture, review_required
  • Trust debt: trust_debt.pre, trust_debt.delta, trust_debt.post, trust_debt.thresholds_crossed
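Pulling the recommended fields together, a hypothetical payload builder; where each field comes from (result.metadata, a caller-supplied runtime dict) is an assumption about the surrounding code, not a defined API:

```python
from types import SimpleNamespace

def decision_log_payload(trace_id: str, result, runtime: dict, debt: dict) -> dict:
    """Assemble the recommended structured-log fields into one flat payload.

    `runtime` and `debt` are assumed to be collected by the caller; the
    attribute names on `result` mirror the logging example above.
    """
    md = result.metadata or {}
    return {
        # Identity
        "trace_id": trace_id,
        "agent_id": md.get("agent_id"),
        "session_id": md.get("session_id"),
        # Policy
        "blueprint_id": md.get("blueprint_id"),
        "bundle_hash": md.get("bundle_hash"),
        "governance_tier": md.get("governance_tier"),
        # Decision
        "intervention": result.intervention,
        "tripwires_triggered": md.get("tripwires_triggered", []),
        "risk_score": md.get("risk_score"),
        "ctq_score": md.get("ctq_score"),
        # Runtime (stage, latency_ms, fallback_applied, ...)
        **runtime,
        # Trust debt, namespaced as trust_debt.pre / .delta / .post / ...
        **{f"trust_debt.{k}": v for k, v in debt.items()},
    }

payload = decision_log_payload(
    "tr-42",
    SimpleNamespace(intervention="flag", metadata={"agent_id": "a1", "risk_score": 0.4}),
    runtime={"stage": "remote_stage2", "latency_ms": 21.5, "fallback_applied": False},
    debt={"pre": 10.0, "delta": 1.5, "post": 11.5, "thresholds_crossed": []},
)
```

Flat, consistently named keys keep the payload easy to index and query downstream.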

4) Trust Debt Monitoring

Alert candidates:

  • Debt crossing 70% of Governance Tier review threshold
  • Repeated flag + escalate within rolling window
  • Re-tier triggered more than once per policy window
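The first alert candidate can be expressed directly. A minimal sketch, assuming the review threshold for the tier is known to the caller:

```python
def trust_debt_alert(current_debt: float, review_threshold: float,
                     warn_fraction: float = 0.70) -> bool:
    """Return True when debt crosses warn_fraction of the tier review threshold.

    warn_fraction=0.70 mirrors the 70% alert candidate above; tune per tier.
    """
    return current_debt >= warn_fraction * review_threshold

print(trust_debt_alert(71.0, 100.0))  # True: 71 >= 70% of 100
print(trust_debt_alert(65.0, 100.0))  # False
```

The other two candidates (repeated flag + escalate, repeated re-tier) are window counts over the decision log rather than point checks.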

Operational action:

  1. Inspect traces with highest debt contribution
  2. Validate tripwire severity and threshold boundaries
  3. Re-run conformance vectors before rollout changes

5) Governance Store Audit Export

A minimal export payload should include:

  • TRACE, EVAL, and INTERVENTION linkage via trace_id
  • policy hash and bundle hash
  • fallback and disconnect metadata
  • timestamp and signer/checksum metadata (if enabled)

This supports post-incident review and regulatory evidence collection.
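A payload along these lines can be assembled with a checksum over its canonical JSON. A hashlib sketch; the record shape and key names are illustrative:

```python
import hashlib
import json

def audit_export(trace_id: str, records: dict, policy_hash: str, bundle_hash: str,
                 timestamp: str, sign: bool = True) -> dict:
    """Bundle TRACE/EVAL/INTERVENTION records under one trace_id for export.

    `records` is expected to hold the linked TRACE, EVAL, and INTERVENTION
    entries; the checksum covers the canonical JSON of the payload body.
    """
    payload = {
        "trace_id": trace_id,
        "records": records,
        "policy_hash": policy_hash,
        "bundle_hash": bundle_hash,
        "fallback": records.get("INTERVENTION", {}).get("fallback_applied"),
        "timestamp": timestamp,
    }
    if sign:
        body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
        payload["checksum"] = "sha256:" + hashlib.sha256(body).hexdigest()
    return payload

export = audit_export(
    "tr-42",
    {"TRACE": {"action": "write_file"},
     "EVAL": {"ctq_score": 0.8},
     "INTERVENTION": {"intervention": "flag", "fallback_applied": False}},
    policy_hash="sha256:p", bundle_hash="sha256:b",
    timestamp="2025-01-01T00:00:00Z",
)
```

Sorting keys and fixing separators before hashing makes the checksum reproducible, which matters when an auditor re-verifies the export later.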


6) Suggested Dashboards

At minimum, chart:

  • interventions by type (stacked over time)
  • p95/p99 governance latency
  • timeout and retry rate
  • trust debt percentile distribution
  • top firing tripwires