Troubleshooting

Common implementation issues and practical remediation steps for ACGP deployments.


Fast Triage Checklist

  1. Confirm active profile (dev, standard, or safety-critical) and Governance Tier.
  2. Validate your blueprint against the canonical schema.
  3. Inspect the intervention result (intervention, message, flags, governance_status).
  4. Correlate latency with tripwire/check execution and backend dependency health.
  5. Verify identity, transport, and storage configuration before changing policy thresholds.

Dev Mode Warnings

Problem: Warnings about missing production features while using Dev Mode.

Why this happens: Dev Mode is intentionally lightweight for development and learning. It omits some production controls by design.

What to do:
  - Keep Dev Mode for local iteration and short-lived batch jobs.
  - Move to Standard Conformance for production workloads.
  - Treat warnings as guidance, not runtime failures.

Conformance naming

For v1.0 claims, use profile names from ACGP-6: Standard and Safety-Critical.


High Latency

Problem: Governance evaluation takes longer than expected.

Diagnostics:
  - Check P95/P99 latency per action and hook.
  - Inspect governance_status.completed_tiers for expensive paths.
  - Identify external dependencies (DB, cache, model inference, network hops).

Remediation:
  - Reduce expensive checks on high-frequency actions.
  - Use tighter Tier 0/Tier 1 budgets for interactive actions.
  - Move non-critical analysis to async follow-up where safe.
  - Scale steward workers and database connections.

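# Evaluate a trace and report how much of the latency budget was consumed.
result = steward.evaluate(trace)
status = result.governance_status
if status:  # skip reporting when no governance status is attached
    print("budget_ms", status.budget_consumed_ms)
    print("tiers", status.completed_tiers)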
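
The P95/P99 check from the diagnostics above can be sketched offline once latency samples have been exported. This is a minimal sketch, not part of the ACGP SDK: the function name latency_percentiles and the (action, latency_ms) sample shape are assumptions made for illustration.

```python
import statistics
from collections import defaultdict


def latency_percentiles(samples):
    """Return P95 and P99 latency (ms) per action.

    `samples` is an iterable of (action, latency_ms) pairs. This shape is an
    assumption for illustration, not an ACGP log format.
    """
    by_action = defaultdict(list)
    for action, latency_ms in samples:
        by_action[action].append(latency_ms)

    report = {}
    for action, values in by_action.items():
        if len(values) < 2:
            # statistics.quantiles needs at least two data points.
            continue
        # n=100 yields 99 cut points; index 94 is P95 and index 98 is P99.
        cuts = statistics.quantiles(values, n=100, method="inclusive")
        report[action] = {"p95": cuts[94], "p99": cuts[98]}
    return report


# Synthetic samples: 1 ms through 101 ms for a single hypothetical action.
samples = [("send_email", float(ms)) for ms in range(1, 102)]
print(latency_percentiles(samples))
```

Grouping by action before computing percentiles keeps a slow, rarely used action from being hidden by a fast, frequent one.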

Too Many Escalations or Blocks

Problem: Policy is overly restrictive for current workloads.

Diagnostics:
  - Identify top failing tripwire IDs and checks.
  - Compare trace data quality across successful vs blocked decisions.
  - Verify threshold order in scoring.thresholds.

Remediation:
  - Adjust noisy rule conditions and add clearer guard conditions.
  - Rebalance metric weights in checks.
  - Keep hard-safety tripwires strict; tune soft checks first.
  - Add actionable reason text in rules to speed human review.
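
Ranking the most frequently failing tripwires is usually the quickest way to find the noisy rule. The sketch below assumes decisions have been exported as dictionaries with a failed_tripwires list; that field name and the function top_failing_tripwires are hypothetical, so adapt them to whatever your logs actually contain.

```python
from collections import Counter


def top_failing_tripwires(decisions, limit=5):
    """Count how often each tripwire ID appears among failed decisions.

    Each decision is assumed to be a dict with a "failed_tripwires" list.
    That field name is hypothetical, not an ACGP schema field.
    """
    counts = Counter()
    for decision in decisions:
        for tripwire_id in decision.get("failed_tripwires", []):
            counts[tripwire_id] += 1
    # most_common returns (tripwire_id, count) pairs, highest count first.
    return counts.most_common(limit)


# Hypothetical exported decisions.
decisions = [
    {"trace_id": "t1", "failed_tripwires": ["tw-pii"]},
    {"trace_id": "t2", "failed_tripwires": ["tw-pii", "tw-spend"]},
    {"trace_id": "t3", "failed_tripwires": []},
]
print(top_failing_tripwires(decisions))
```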


Evidence-Policy Failures

Problem: Decisions keep surfacing failed_evidence_policy, or the evidence_policy_failures metric keeps rising.

Diagnostics:
  - Inspect result.metadata["evidence_result"] and result.metadata["evidence_summary"].
  - Check require_citations, min_sources, and certified_only against the trace payload actually produced.
  - Verify source timestamps, verification flags, and category metadata in the source catalog.

Remediation:
  - Fix evidence collection before loosening policy requirements.
  - Keep evidence-policy failure distinct from tripwire fail-closed behavior; if blocking is required, make it explicit with a tripwire or deployment extension.
  - Watch evidence_policy_failures in metrics after rollout to confirm the source pipeline is stable.
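
A local pre-check can show which requirement a payload is likely to miss before the trace reaches the steward. The following is a rough sketch only: it does not reproduce the steward's evaluation logic, and the payload keys (citations, sources, certified), the function name precheck_evidence, and the interpretation of each policy field are assumptions.

```python
def precheck_evidence(policy, payload):
    """Return reasons a payload would likely fail an evidence policy.

    Assumed shapes (hypothetical, for illustration):
      policy  = {"require_citations": bool, "min_sources": int, "certified_only": bool}
      payload = {"citations": [...], "sources": [{"id": str, "certified": bool}, ...]}
    """
    sources = payload.get("sources", [])
    failures = []

    if policy.get("require_citations") and not payload.get("citations"):
        failures.append("require_citations is set but the payload has no citations")

    min_sources = policy.get("min_sources", 0)
    if len(sources) < min_sources:
        failures.append(
            f"min_sources={min_sources} but only {len(sources)} source(s) present"
        )

    if policy.get("certified_only"):
        uncertified = [s.get("id") for s in sources if not s.get("certified")]
        if uncertified:
            failures.append(f"certified_only is set but uncertified sources: {uncertified}")

    return failures


# Hypothetical policy and payload.
policy = {"require_citations": True, "min_sources": 2, "certified_only": True}
payload = {"citations": [], "sources": [{"id": "src-1", "certified": False}]}
for reason in precheck_evidence(policy, payload):
    print(reason)
```

An empty result means the payload satisfies these assumed checks; it does not guarantee the steward will agree.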


Scorer Errors or Degraded Fallbacks

Problem: degraded_scorer_fallback_usage or scorer_errors is increasing.

Diagnostics:
  - Inspect result.metadata["scorer_provenance"] for provider, model, fixture replay, cache, and failure_mode.
  - Check whether the scorer path degraded with fallback or failed closed.
  - Compare the timing of the spike with model-provider latency, timeout, or deployment changes.

Remediation:
  - Treat scorer_errors as production incidents on the non-deterministic scorer path.
  - Use pinned models/providers and approved regression fixtures for known-good replay.
  - If fallback is acceptable, verify the fallback score and review the resulting intervention distribution.


Unexpected Flags

Problem: Actions pass but are frequently marked as flagged.

Background: A flag is orthogonal to the intervention decision. It should appear via result.flags.flagged, not as a primary decision value.

Remediation:
  - Review checks that set flag behavior.
  - Route flagged actions to review queues without blocking happy-path execution.
  - Add dashboard separation for blocked vs flagged outcomes.

# Send flagged actions to a review queue; the action itself is not blocked.
if result.flags and result.flags.flagged:
    queue_for_review(trace_id=result.trace_id, reason=result.flags.reason)

Authentication or Transport Failures

Problem: Steward connection fails intermittently or consistently.

Diagnostics:
  - Validate endpoint URL, certificates, API keys, and clock synchronization.
  - Confirm TLS settings and intermediate certificate chain.
  - Check firewall and service mesh policy.

Remediation:
  - Rotate credentials and test with least-privilege service accounts.
  - Enforce mTLS for production steward endpoints.
  - Add retry/backoff with idempotency safeguards.
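
Retry with backoff and an idempotency safeguard can be sketched as follows. This is a generic pattern, not an ACGP SDK feature: call_with_retry and the send callable are hypothetical, and the sketch assumes the receiving endpoint deduplicates requests that carry the same idempotency key.

```python
import random
import time
import uuid


def call_with_retry(send, payload, max_attempts=4, base_delay_s=0.2,
                    max_delay_s=5.0, sleep=time.sleep):
    """Call send(payload, idempotency_key), retrying transient failures.

    `send` is a hypothetical caller-supplied function that performs the
    request. The same key is reused on every attempt so a server that
    honors it (an assumption) can drop duplicates.
    """
    idempotency_key = str(uuid.uuid4())
    for attempt in range(1, max_attempts + 1):
        try:
            return send(payload, idempotency_key)
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Exponential backoff capped at max_delay_s, with full jitter.
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

Only transport-level errors are retried here. Retrying on every exception would also repeat requests that failed for policy or validation reasons, which backoff cannot fix.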


Trust Debt Not Moving as Expected

Problem: Trust debt appears static or counterintuitive.

Diagnostics:
  - Verify trust_debt.enabled is true.
  - Check accumulation weights and decay settings.
  - Confirm intervention events are persisted.

Remediation:
  - Start from the baseline accumulation/decay profile.
  - Validate decay_fraction and period_hours against expected half-life.
  - Audit missing write paths if events are lost under load.
  - Inspect runtime_posture, trust_debt.pre, trust_debt.delta, trust_debt.post, and trust_debt.thresholds_crossed in structured logs when posture shifts unexpectedly.
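
To sanity-check decay_fraction and period_hours against an expected half-life, the arithmetic can be worked out directly. The sketch below assumes a multiplicative decay model in which trust debt is multiplied by (1 - decay_fraction) once every period_hours; confirm that this matches your deployment's decay semantics before relying on the numbers.

```python
import math


def half_life_hours(decay_fraction, period_hours):
    """Hours for trust debt to halve under an assumed multiplicative decay.

    Assumed model: debt(t) = debt(0) * (1 - decay_fraction) ** (t / period_hours).
    Solving debt(t) = debt(0) / 2 gives
        t = period_hours * ln(0.5) / ln(1 - decay_fraction).
    """
    if not 0 < decay_fraction < 1:
        raise ValueError("decay_fraction must be strictly between 0 and 1")
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    return period_hours * math.log(0.5) / math.log(1.0 - decay_fraction)


# Example: 10% decay every 24 hours.
print(round(half_life_hours(0.10, 24), 1))
```

If the computed half-life is far longer than the window you are observing, trust debt will look static even though decay is working as configured.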


Blueprint Load Errors

Problem: Blueprint fails validation or loads with unexpected behavior.

Diagnostics:
  - Validate required top-level fields.
  - Check inheritance chain resolution.
  - Ensure tripwire IDs are unique unless intentional override.

Remediation:
  - Validate against Blueprint Schema Reference.
  - Use explicit version pins for inherits.
  - Keep a changelog for threshold and tripwire edits.
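
The tripwire ID uniqueness check is easy to run before a blueprint is loaded. This sketch assumes the blueprint has already been parsed into a dictionary with a top-level tripwires list whose entries carry an id key; that layout and the function name duplicate_tripwire_ids are assumptions, so check them against the Blueprint Schema Reference.

```python
from collections import Counter


def duplicate_tripwire_ids(blueprint):
    """Return tripwire IDs that appear more than once, sorted.

    Assumes a parsed blueprint dict shaped like
    {"tripwires": [{"id": "..."}, ...]} (hypothetical layout).
    """
    ids = [tw.get("id") for tw in blueprint.get("tripwires", [])]
    counts = Counter(tid for tid in ids if tid is not None)
    return sorted(tid for tid, n in counts.items() if n > 1)


# Hypothetical parsed blueprint with one repeated ID.
blueprint = {
    "tripwires": [
        {"id": "tw-pii"},
        {"id": "tw-spend"},
        {"id": "tw-pii"},
    ]
}
print(duplicate_tripwire_ids(blueprint))
```

A duplicate is not always an error, since an ID may be repeated as an intentional override, so treat the output as a list to review rather than a hard failure.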


Repro Bundle Template

When reporting a bug, include:
  - Blueprint snippet (minimal failing example)
  - Representative trace payload (redacted)
  - Intervention result JSON
  - Governance Tier and conformance profile
  - Steward version and SDK version
  - Logs for one failing trace and one successful trace