Troubleshooting
Common implementation issues and practical remediation steps for ACGP deployments.
Fast Triage Checklist
- Confirm the active profile (dev, standard, or safety-critical) and Governance Tier.
- Validate your blueprint against the canonical schema.
- Inspect the intervention result (intervention, message, flags, governance_status).
- Correlate latency with tripwire/check execution and backend dependency health.
- Verify identity, transport, and storage configuration before changing policy thresholds.
Dev Mode Warnings
Problem: Warnings about missing production features while using Dev Mode.
Why this happens: Dev Mode is intentionally lightweight for development and learning. It omits some production controls by design.
What to do:
- Keep Dev Mode for local iteration and short-lived batch jobs.
- Move to Standard Conformance for production workloads.
- Treat warnings as guidance, not runtime failures.
Conformance naming
For v1.0 claims, use profile names from ACGP-6: Standard and Safety-Critical.
High Latency
Problem: Governance evaluation takes longer than expected.
Diagnostics:
- Check P95/P99 latency per action and hook.
- Inspect governance_status.completed_tiers for expensive paths.
- Identify external dependencies (DB, cache, model inference, network hops).
Remediation:
- Reduce expensive checks on high-frequency actions.
- Use tighter Tier 0/Tier 1 budgets for interactive actions.
- Move non-critical analysis to async follow-up where safe.
- Scale steward workers and database connections.

result = steward.evaluate(trace)
status = result.governance_status
# governance_status may be absent, so guard before reading its fields.
if status:
    print("budget_ms", status.budget_consumed_ms)
    print("tiers", status.completed_tiers)
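The P95/P99 diagnostic above can be computed offline from collected `budget_consumed_ms` values. The following is a minimal sketch, not part of any ACGP SDK: the `(action, budget_ms)` record shape and the helper names `percentile` and `latency_summary` are assumptions for illustration, and the nearest-rank method is one of several common percentile definitions.

```python
import math
from collections import defaultdict


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers (pct in 1..100)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest-rank: smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(records):
    """Group hypothetical (action, budget_ms) pairs and report P95/P99 per action."""
    by_action = defaultdict(list)
    for action, budget_ms in records:
        by_action[action].append(budget_ms)
    return {
        action: {
            "p95": percentile(values, 95),
            "p99": percentile(values, 99),
            "count": len(values),
        }
        for action, values in by_action.items()
    }
```

Actions whose P99 sits far above their P95 usually point to an occasional slow dependency rather than a uniformly expensive check.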
Too Many Escalations or Blocks
Problem: Policy is overly restrictive for current workloads.
Diagnostics:
- Identify top failing tripwire IDs and checks.
- Compare trace data quality across successful vs blocked decisions.
- Verify threshold order in scoring.thresholds.
Remediation:
- Adjust noisy rule conditions and add clearer guard conditions.
- Rebalance metric weights in checks.
- Keep hard-safety tripwires strict; tune soft checks first.
- Add actionable reason text in rules to speed human review.
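Two of the diagnostics above lend themselves to a quick script: ranking the tripwire IDs that fire most often, and checking that thresholds increase in the intended order. This is a sketch under stated assumptions: the `tripwire_ids` key on exported decisions, the flat name-to-number shape of `scoring.thresholds`, and the level names in the usage note are all hypothetical, so adapt them to your blueprint.

```python
from collections import Counter


def top_tripwires(decisions, n=5):
    """Count tripwire IDs across exported decisions, most frequent first."""
    counts = Counter()
    for decision in decisions:
        # "tripwire_ids" is an assumed field name on the exported record.
        counts.update(decision.get("tripwire_ids", []))
    return counts.most_common(n)


def out_of_order_thresholds(thresholds, order):
    """Return adjacent pairs in `order` whose thresholds are not strictly increasing."""
    present = [name for name in order if name in thresholds]
    return [
        (lower, upper)
        for lower, upper in zip(present, present[1:])
        if thresholds[lower] >= thresholds[upper]
    ]
```

For example, `out_of_order_thresholds({"low": 0.2, "mid": 0.6, "high": 0.5}, ["low", "mid", "high"])` reports the pair `("mid", "high")`, which is the kind of inversion that can make a policy escalate earlier than intended.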
Evidence-Policy Failures
Problem: Decisions keep surfacing failed_evidence_policy, or the evidence_policy_failures metric keeps rising.
Diagnostics:
- Inspect result.metadata["evidence_result"] and result.metadata["evidence_summary"].
- Check require_citations, min_sources, and certified_only against the trace payload actually produced.
- Verify source timestamps, verification flags, and category metadata in the source catalog.
Remediation:
- Fix evidence collection before loosening policy requirements.
- Keep evidence-policy failure distinct from tripwire fail-closed behavior; if blocking is required, make it explicit with a tripwire or deployment extension.
- Watch evidence_policy_failures in metrics after rollout to confirm the source pipeline is stable.
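To compare policy settings against the trace payload actually produced, a local pre-check can list the gaps before the steward evaluates anything. The sketch below assumes a simplified interpretation of `require_citations`, `min_sources`, and `certified_only`, and a hypothetical `certified` boolean on each source; the steward's real evaluation may apply these settings differently.

```python
def evidence_gaps(policy, sources):
    """Return human-readable reasons the sources would miss the policy."""
    gaps = []
    if policy.get("require_citations") and not sources:
        gaps.append("require_citations is set but the trace has no sources")

    usable = sources
    if policy.get("certified_only"):
        # "certified" is an assumed per-source flag, not a confirmed field name.
        usable = [s for s in sources if s.get("certified")]
        if len(usable) < len(sources):
            gaps.append(f"{len(sources) - len(usable)} source(s) are not certified")

    min_sources = policy.get("min_sources", 0)
    if len(usable) < min_sources:
        gaps.append(f"{len(usable)} usable source(s), min_sources is {min_sources}")
    return gaps
```

Running this in the collection pipeline's tests helps keep the fix on the evidence side, in line with the advice to repair collection before loosening policy.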
Scorer Errors or Degraded Fallbacks
Problem: degraded_scorer_fallback_usage or scorer_errors is increasing.
Diagnostics:
- Inspect result.metadata["scorer_provenance"] for provider, model, fixture replay, cache, and failure_mode.
- Check whether the scorer path degraded with fallback or failed closed.
- Compare the timing of the spike with model-provider latency, timeout, or deployment changes.
Remediation:
- Treat scorer_errors as production incidents on the non-deterministic scorer path.
- Use pinned models/providers and approved regression fixtures for known-good replay.
- If fallback is acceptable, verify the fallback score and review the resulting intervention distribution.
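When a spike needs to be attributed to a provider, model, or failure mode, grouping exported `scorer_provenance` entries is often enough. This sketch assumes each entry is a plain dict carrying the `provider`, `model`, and `failure_mode` keys named above; the helper name and the export step are hypothetical.

```python
from collections import Counter


def provenance_breakdown(provenances):
    """Count scorer_provenance entries by (provider, model, failure_mode)."""
    counts = Counter(
        (p.get("provider"), p.get("model"), p.get("failure_mode"))
        for p in provenances
    )
    # most_common() sorts by count, largest first.
    return counts.most_common()
```

A single dominant tuple suggests a provider-side or deployment-side cause; an even spread is more consistent with a general timeout or budget problem.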
Unexpected Flags
Problem: Actions pass but are frequently marked as flagged.
Background: flag is orthogonal to the primary decision. It should appear via result.flags.flagged, not as a primary decision value.
Remediation:
- Review checks that set flag behavior.
- Route flagged actions to review queues without blocking happy-path execution.
- Add dashboard separation for blocked vs flagged outcomes.
# Flagged actions still proceed; they are only queued for review.
if result.flags and result.flags.flagged:
    queue_for_review(trace_id=result.trace_id, reason=result.flags.reason)
Authentication or Transport Failures
Problem: Steward connection fails intermittently or consistently.
Diagnostics:
- Validate endpoint URL, certificates, API keys, and clock synchronization.
- Confirm TLS settings and intermediate certificate chain.
- Check firewall and service mesh policy.
Remediation:
- Rotate credentials and test with least-privilege service accounts.
- Enforce mTLS for production steward endpoints.
- Add retry/backoff with idempotency safeguards.
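Retry with backoff and an idempotency safeguard can be sketched as follows. This is a generic pattern, not the steward client's API: the `send` callable, its `idempotency_key` parameter, and the assumption that the server de-duplicates on that key are all hypothetical.

```python
import random
import time
import uuid


def call_with_retry(send, payload, attempts=4, base_delay=0.2, sleep=time.sleep):
    """Retry `send` on connection or timeout errors, reusing one idempotency key."""
    # One key for all attempts, so a server that honors it can drop duplicates.
    key = str(uuid.uuid4())
    for attempt in range(attempts):
        try:
            return send(payload, idempotency_key=key)
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Exponential backoff with full jitter to avoid synchronized retries.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
```

Only transient transport errors are retried here. Authentication failures should surface immediately, since retrying with the same bad credentials adds load without helping.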
Trust Debt Not Moving as Expected
Problem: Trust debt appears static or counterintuitive.
Diagnostics:
- Verify trust_debt.enabled is true.
- Check accumulation weights and decay settings.
- Confirm intervention events are persisted.
Remediation:
- Start from the baseline accumulation/decay profile.
- Validate decay_fraction and period_hours against expected half-life.
- Audit missing write paths if events are lost under load.
- Inspect runtime_posture, trust_debt.pre, trust_debt.delta, trust_debt.post, and trust_debt.thresholds_crossed in structured logs when posture shifts unexpectedly.
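To check `decay_fraction` and `period_hours` against an expected half-life, it helps to convert the pair into hours. The sketch below assumes the simplest model, in which trust debt is multiplied by `(1 - decay_fraction)` once every `period_hours`; if your steward applies decay differently (for example linearly or continuously), the formula does not apply as written.

```python
import math


def half_life_hours(decay_fraction, period_hours):
    """Hours for debt to halve, assuming debt *= (1 - decay_fraction) each period."""
    if not 0 < decay_fraction < 1:
        raise ValueError("decay_fraction must be strictly between 0 and 1")
    # Solve (1 - f) ** n == 0.5 for n periods, then convert periods to hours.
    periods = math.log(0.5) / math.log(1 - decay_fraction)
    return periods * period_hours
```

Under this assumption, a `decay_fraction` of 0.1 with a 24-hour period gives a half-life of roughly 158 hours, so debt that looks static over a day or two may simply be decaying on a much longer timescale.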
Blueprint Load Errors
Problem: Blueprint fails validation or loads with unexpected behavior.
Diagnostics:
- Validate required top-level fields.
- Check inheritance chain resolution.
- Ensure tripwire IDs are unique unless intentional override.
Remediation:
- Validate against Blueprint Schema Reference.
- Use explicit version pins for inherits.
- Keep a changelog for threshold and tripwire edits.
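Before running full schema validation, two cheap checks catch many load errors: missing top-level fields and accidentally duplicated tripwire IDs. This sketch assumes the blueprint is already parsed into a dict and that each tripwire is a dict with an `id` key. The required-field list is supplied by the caller, since the authoritative list lives in the Blueprint Schema Reference.

```python
from collections import Counter


def missing_fields(blueprint, required):
    """Return required top-level keys that are absent from the parsed blueprint."""
    return [name for name in required if name not in blueprint]


def duplicate_tripwire_ids(tripwires):
    """Return tripwire IDs that appear more than once, sorted for stable output."""
    counts = Counter(t["id"] for t in tripwires)
    return sorted(tripwire_id for tripwire_id, n in counts.items() if n > 1)
```

A duplicate reported here is not always a bug, since an ID may be repeated as an intentional override; the point is to make every repeat a deliberate decision.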
Repro Bundle Template
When reporting a bug, include:
- Blueprint snippet (minimal failing example)
- Representative trace payload (redacted)
- Intervention result JSON
- Governance Tier and conformance profile
- Steward version and SDK version
- Logs for one failing trace and one successful trace