Blueprints, Traces & Evaluation¶
Status: Standard-only Alpha (v1.0.0-alpha.2)
Last Updated: 2026-03-18
Spec ID: ACGP-3
Normative Keywords: MUST, SHOULD, MAY (per RFC 2119 and RFC 8174)
Abstract¶
This specification defines Blueprint source artifacts, resolved Blueprint artifacts, canonical checks[] semantics, cognitive trace semantics, the CTQ scorer framework, score-to-intervention threshold mapping, evidence policy, extension descriptors, and trust-policy semantics. It is the primary normative reference for anyone building a Policy Engine.
ACGP-3 is the authoritative normative reference for runtime evaluation ordering, check semantics, CTQ aggregation, threshold mapping, and trust-policy observable semantics.
Requirements Language¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.
1. Scope [NORMATIVE]¶
ACGP-3 defines governance policy semantics for:
- Blueprint source artifact schema and resolved artifact schema
- Base-resolution and merge behavior
- Cognitive trace model and evaluation ordering
- Canonical checks[] semantics for kind: rule | metric
- CTQ metrics, evaluator types, and weighted aggregation
- Score-to-intervention threshold mapping with Governance Tier interaction
- Evidence-policy / source-grounding requirements
- Trust-policy observable semantics, default provider behavior, and provider metadata
ACGP-3 does not define:
- Transport or envelope mechanics (→ ACGP-2)
- Full tripwire DSL grammar and fail-closed semantics (→ ACGP-4)
- Audit retention or privacy controls (→ ACGP-5)
- Conformance test suites (→ ACGP-6)
Minimal Useful Standard Implementation [INFORMATIVE]¶
A minimal useful ACGP v1.0 Standard implementation does not need to implement every advanced deployment pattern.
A practical starting point is:
- one Governance Steward
- one Operating Agent runtime
- one active baseline blueprint
- deterministic checks
- tripwire evaluation
- CTQ aggregation using the five standard metrics
- threshold mapping
- orthogonal flag support
- default trust debt provider behavior
- Governance Store or equivalent durable governance persistence
For conformance, implementations MUST correctly process scorer outputs for all standard scorer types. Actual model inference for Tier-2 (cognitive-evaluator) and Tier-3 (hybrid) scorers is implementation-defined and not required by v1.0 conformance test vectors.
This is the minimum useful alpha.2 implementation path: a single active policy, one runtime applying governance before action execution, and durable audit storage for emitted evidence and decisions.
Reading Map [INFORMATIVE]¶
- Blueprint schema and artifact model
- Cognitive trace model
- Evaluation ordering
- Check semantics
- CTQ aggregation
- Threshold mapping
- Evidence policy
- Trust debt and provider observables
- Canonical EVAL artifact
2. Blueprint Schema [NORMATIVE]¶
A Blueprint source artifact MUST be a valid YAML 1.2 or JSON document. Blueprints are the primary mechanism for translating organizational policies into enforceable runtime rules. They are authored by domain experts and loaded by the Governance Steward. Evaluation engines consume resolved Blueprint artifacts derived from source artifacts.
2.1 Required Top-Level Fields¶
| Field | Type | Description |
|---|---|---|
| artifact_type | string | Source-artifact discriminator. MUST be acgp.blueprint |
| schema_version | string | Schema version for the Blueprint source artifact |
| id | string | Unique identifier, RECOMMENDED format: domain/name@version |
| version | string | Semantic version for the source artifact |
| title | string | Human-readable policy title |
| description | string | Concise explanation of the blueprint's purpose |
| checks | array | Canonical rules and metrics to evaluate at runtime |
| intervention_policy | object | Threshold mapping from risk score to intervention |
2.2 Optional Top-Level Fields¶
| Field | Type | Description |
|---|---|---|
| base | object | Parent source Blueprint reference and optional digest pin |
| applicability | object | Restricts activation by tier, tool, domain |
| tripwires | array | Safety checks evaluated before checks[] |
| evidence_policy | object | Source-grounding requirements |
| trust_policy | object | Accumulation/decay policy for trust tracking |
| extensions | object | Public descriptors for required or optional extensions |
| annotations | object | Non-normative annotations |
| fixtures | array | Embedded policy fixtures for expected outcomes |
The following fields are not part of Blueprint core and MUST NOT appear in a conformant Blueprint source artifact:
- name
- ctq
- performance_budget
- fallback_behavior
- metadata
- raw inherits
- tripwire_syntax_version
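The required-field table in §2.1 and the forbidden-field list above can be checked mechanically at load time. The following is a non-normative sketch, assuming the source artifact has already been parsed into a Python dict; the function name `validate_top_level` and the wording of the problem strings are hypothetical and not defined by this specification.

```python
# Field sets copied from §2.1 (required) and the forbidden-field list in §2.2.
REQUIRED_FIELDS = {
    "artifact_type", "schema_version", "id", "version",
    "title", "description", "checks", "intervention_policy",
}
FORBIDDEN_FIELDS = {
    "name", "ctq", "performance_budget", "fallback_behavior",
    "metadata", "inherits", "tripwire_syntax_version",
}


def validate_top_level(blueprint: dict) -> list:
    """Return a list of problems; an empty list means none were found."""
    present = set(blueprint)
    problems = [f"missing required field: {f}"
                for f in sorted(REQUIRED_FIELDS - present)]
    problems += [f"forbidden field present: {f}"
                 for f in sorted(FORBIDDEN_FIELDS & present)]
    # The discriminator value is fixed by §2.1.
    if "artifact_type" in present and blueprint["artifact_type"] != "acgp.blueprint":
        problems.append("artifact_type must be acgp.blueprint")
    return problems
```

A loader would typically reject the artifact when the returned list is non-empty; how problems are reported is left to the implementation.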
2.2.1 Extension Metadata [NORMATIVE]¶
Blueprints MAY declare portable extension requirements using an extensions block:
extensions:
  required:
    - id: "urn:acgp:ext:source-catalog-private@1"
      visibility: private
      enforcement_scope: remote
      fail_mode: reject_activation
      attestation:
        digest: "sha256:..."
        policy_pack_id: "pp:2026-03-10:7"
  optional:
    - id: "urn:acgp:ext:contracts@1"
      visibility: public
Normative rules:
- extensions.required[] descriptors MUST be preserved through validation, sync, and activation.
- Required extensions MUST declare enforcement_scope as local, remote, or both.
- The authoritative enforcer is the local runtime for local, the Governance Steward or remote authoritative runtime for remote, and both runtimes for both.
- local scope requires enforcement by the local runtime applying the bundle; if local enforcement is unavailable, the descriptor's fail_mode MUST apply.
- remote scope requires preservation and negotiation by the local runtime, including descriptor metadata and any attestation, but MUST NOT by itself block local activation when enforcement is performed only by the remote authoritative runtime.
- both scope requires support by both authoritative enforcers on the sync boundary; if either side lacks support, the descriptor's fail_mode MUST apply.
- Unsupported optional extensions MAY be preserved and ignored.
- private descriptors expose only public negotiation metadata plus optional opaque attestation.
- local extensions MUST remain deployment-local metadata and MUST NOT create hidden portability requirements in public blueprints.
The consolidated activation/evaluation outcomes are:
| Descriptor | enforcement_scope | Local support | Remote support | Result |
|---|---|---|---|---|
| required | local | Yes | N/A | Activate and enforce locally |
| required | local | No | N/A | Apply fail_mode (reject_activation or deny) |
| required | remote | No or partial | Yes | Preserve descriptor locally, activate, and rely on remote authoritative enforcement |
| required | remote | Any | No | Apply fail_mode at activation or decision time, depending on descriptor semantics |
| required | both | Yes | Yes | Activate and require enforcement on both sides |
| required | both | No on either side | No on either side | Apply fail_mode |
| optional | local, remote, or both | Supported | Supported or preserved | Activate and apply where supported |
| optional | local, remote, or both | Unsupported | Unsupported | Preserve if possible and ignore without failing activation |
This table is the single consolidated truth table for local / remote / both handling. ACGP-1 and ACGP-2 cross-reference this section rather than redefining the matrix.
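As a non-normative illustration, the truth table can be collapsed into a small decision function. The sketch below assumes support on each side has already been reduced to a boolean, and it treats an optional descriptor supported on either side as "activate", a simplification the table does not spell out. The function name `extension_outcome` and the three outcome labels are hypothetical.

```python
def extension_outcome(required: bool, scope: str,
                      local_ok: bool, remote_ok: bool) -> str:
    """Return 'activate', 'apply_fail_mode', or 'ignore' for one descriptor."""
    if scope not in ("local", "remote", "both"):
        raise ValueError(f"unknown enforcement_scope: {scope!r}")
    if not required:
        # Optional descriptors never fail activation.
        return "activate" if (local_ok or remote_ok) else "ignore"
    if scope == "local":
        return "activate" if local_ok else "apply_fail_mode"
    if scope == "remote":
        # Local support is irrelevant; the remote runtime is authoritative.
        return "activate" if remote_ok else "apply_fail_mode"
    # scope == "both": either side lacking support triggers fail_mode.
    return "activate" if (local_ok and remote_ok) else "apply_fail_mode"
```

Which concrete fail_mode applies (for example reject_activation or deny) still comes from the descriptor itself; the sketch only decides whether fail_mode is reached.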
2.3 Applicability Block¶
applicability MAY restrict blueprint activation by context:
applicability:
  governance_tiers: [GT-3, GT-4]   # Governance Tier values this blueprint applies to
  tools: ["execute_trade"]         # Tools governed by this blueprint
  domains: ["production"]          # Deployment domains
  out_of_scope_behavior: block
If omitted, the blueprint is globally applicable.
2.4 Base Resolution and Inheritance¶
If base is present, the child source artifact MUST be resolved against the parent source artifact before evaluation. Resolution produces a first-class resolved Blueprint artifact. The base object uses this minimum shape:
Child values are merged with parent values using these normative rules:
| Field Category | Merge Behavior |
|---|---|
| id, version, title, description | Always child-defined |
| base | Used only for lineage and resolution inputs |
| annotations | Child replaces parent entirely |
| applicability | Child overrides parent entirely when present |
| tripwires | Append; if child defines a tripwire with the same id as parent, child definition replaces that parent tripwire |
| checks | Append; if child defines a check with the same id as parent, child definition replaces that parent check |
| intervention_policy.thresholds | Child overrides parent per key |
| evidence_policy | Child overrides parent per key |
| trust_policy | Child overrides parent per key |
| extensions.required | Append; if child defines the same id, child replaces parent |
| extensions.optional | Append; if child defines the same id, child replaces parent |
Implementations MUST reject blueprints whose inheritance chain contains a cycle.
Cycle detection MUST occur before merge resolution begins.
The recommended error code is CircularBlueprintInheritance.
An implementation MUST NOT attempt partial merge resolution once a cycle has been detected.
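One way to satisfy the cycle rules is to walk the base references before any merge is attempted. The following is a non-normative sketch, assuming a hypothetical `parents` mapping from each blueprint ref to the ref named in its base block (or None for a root); the function name `resolve_lineage` is also hypothetical.

```python
from typing import Dict, List, Optional


def resolve_lineage(start_ref: str,
                    parents: Dict[str, Optional[str]]) -> List[str]:
    """Walk base references from start_ref to the root; raise on a cycle.

    Returns the chain root-first, so it can be merged parent-before-child.
    """
    chain: List[str] = []
    seen = set()
    ref: Optional[str] = start_ref
    while ref is not None:
        if ref in seen:
            # Detected before any merge work has started.
            raise ValueError(f"CircularBlueprintInheritance: {ref}")
        seen.add(ref)
        chain.append(ref)
        ref = parents.get(ref)
    chain.reverse()
    return chain
```

Because the walk completes (or fails) before merging begins, no partial merge state exists when a cycle is reported.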
Safety note: because tripwires append by default, adding tripwires in a child blueprint never removes parent safety boundaries. To modify a parent tripwire, the child MUST declare a tripwire with the same id, which replaces only that specific tripwire.
Resolved Blueprint artifacts MUST carry at least these additional fields:
source_blueprint:
  ref: finance/desk-a@2.0
lineage:
  - ref: finance/base@2.0
  - ref: finance/desk-a@2.0
resolved_at: 2026-03-18T10:00:00Z
effective:
  valid_from: 2026-03-18T10:00:00Z
resolution_metadata:
  resolver_version: 2.0.0
Inheritance merge example:
# parent: finance/base@2.0
tripwires:
  - id: max_trade
    condition: args.trade_value > 50000
    on_fail: { decision: "block", reason: "Trade cap exceeded" }
# child: finance/desk-a@2.0
base:
  ref: finance/base@2.0
tripwires:
  - id: max_trade
    condition: args.trade_value > 25000
    on_fail: { decision: "block", reason: "Desk-A stricter cap" }
  - id: sanctions_check
    condition: contains_entity(args.counterparty, "sanctioned_org")
    on_fail: { decision: "halt", reason: "Sanctioned counterparty" }
Result: max_trade is replaced by child definition; sanctions_check is appended.
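The append-with-replace-by-id rule used for tripwires, checks, and extension descriptors can be sketched as follows. This is a non-normative illustration, assuming each item is a dict with an id key; the helper name `merge_by_id` is hypothetical.

```python
def merge_by_id(parent_items: list, child_items: list) -> list:
    """Append child items to parent items; a child with the same id replaces
    that parent item in place, and new ids are appended in child order."""
    merged = {item["id"]: item for item in parent_items}  # keeps parent order
    for item in child_items:
        merged[item["id"]] = item  # existing key keeps its position
    return list(merged.values())


parent = [{"id": "max_trade", "condition": "args.trade_value > 50000"}]
child = [
    {"id": "max_trade", "condition": "args.trade_value > 25000"},
    {"id": "sanctions_check",
     "condition": 'contains_entity(args.counterparty, "sanctioned_org")'},
]
resolved = merge_by_id(parent, child)
```

Here `resolved` holds two tripwires: the child's stricter max_trade followed by sanctions_check, matching the result stated above.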
2.4.1 Operational Resource Limits [NORMATIVE]¶
Implementations MUST define and document operational limits for blueprint size, inheritance depth, and tripwire/check counts.
For v1.0 interoperability guidance, the following minimum rejection thresholds are RECOMMENDED:
- maximum inheritance depth: 16
- maximum tripwire count per blueprint: 256
- maximum check count per blueprint: 256
- maximum serialized blueprint size: 1 MiB
2.5 Blueprint Versioning¶
Blueprint versions MUST follow Semantic Versioning 2.0.0:
- MAJOR: Breaking changes (threshold changes, removed checks, semantic changes)
- MINOR: Backward-compatible additions (new optional checks, new tripwires)
- PATCH: Bug fixes, clarifications, documentation
2.6 Resolved Blueprint Consumption¶
Evaluation engines MUST consume resolved Blueprint artifacts rather than ad hoc merged source trees. A source Blueprint MAY be evaluated only after it has been compiled into the resolved form defined in §2.4.
3. Checks Block [NORMATIVE]¶
ACGP v1.0 uses three distinct evaluation constructs:
- Tripwires: top-level tripwires[] safety boundaries evaluated before checks[]. They may produce halt and short-circuit CTQ evaluation.
- Deterministic rule checks: checks[] entries with kind: rule. They produce direct pass/fail policy outcomes (ok, nudge, escalate, block) and MAY set flags.flagged.
- CTQ metric checks: checks[] entries with kind: metric. They contribute weighted evidence to CTQ aggregation and never produce halt directly.
Authoring decision tree:
- If the condition is a hard safety boundary that must fire regardless of CTQ score, use tripwires[].
- If the condition is a direct pass/fail policy rule with an explicit intervention, use a rule-based check.
- If the condition contributes to runtime quality scoring, use a metric-based check.
The checks array defines the latter two constructs.
3.1 Rule Checks¶
Pass/fail policy enforcement:
checks:
  - id: single_trade_volume_cap
    kind: rule
    when:
      hook: "tool_call"
      tool: "execute_trade"
    condition: "args.trade_value <= 50000"
    on_fail:
      decision: "block"
      reason: "Trade value exceeds $50,000 limit."
    flag: false
decision MUST be one of: ok, nudge, escalate, block. Rule checks MUST NOT declare on_fail.decision: "halt". Blueprints containing halt in a rule-check decision MUST be rejected at load time with error code InvalidBlueprintHaltInRule.
Rule checks MAY additionally declare flag: true | false (default false). When flag: true, a matching check sets flags.flagged: true on the intervention without altering the primary decision.
3.2 Metric Checks¶
Weighted scoring that contributes to the final CTQ score:
checks:
  - id: trade_rationale_quality
    kind: metric
    when:
      hook: "tool_call"
      tool: "execute_trade"
    metric:
      name: "reasoning_quality"
      weight: 0.6
      evaluator:
        kind: "cognitive-evaluator"
        args:
          prompt_template: "templates/check_risk_rationale.txt"
3.3 Check Validation Rules¶
The outer checks[] envelope is canonical. Implementations MUST reject any check item that mixes rule-only and metric-only fields. Specifically:
- a kind: rule check MUST declare condition and on_fail
- a kind: rule check MUST NOT declare metric
- a kind: metric check MUST declare metric
- a kind: metric check MUST NOT declare condition or on_fail
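The rules above, together with the halt restriction in §3.1, lend themselves to a load-time validator. The following is a non-normative sketch, assuming each check has been parsed into a dict. Only InvalidBlueprintHaltInRule is an error code named in this specification; the other messages and the function name `validate_check` are hypothetical.

```python
ALLOWED_RULE_DECISIONS = {"ok", "nudge", "escalate", "block"}


def validate_check(check: dict) -> None:
    """Raise ValueError if a checks[] item mixes rule-only and metric-only fields."""
    kind = check.get("kind")
    if kind == "rule":
        if "condition" not in check or "on_fail" not in check:
            raise ValueError("rule check must declare condition and on_fail")
        if "metric" in check:
            raise ValueError("rule check must not declare metric")
        decision = check["on_fail"].get("decision")
        if decision == "halt":
            # halt is reserved for tripwires (§3.1).
            raise ValueError("InvalidBlueprintHaltInRule")
        if decision not in ALLOWED_RULE_DECISIONS:
            raise ValueError(f"unknown rule decision: {decision!r}")
    elif kind == "metric":
        if "metric" not in check:
            raise ValueError("metric check must declare metric")
        if "condition" in check or "on_fail" in check:
            raise ValueError("metric check must not declare condition or on_fail")
    else:
        raise ValueError(f"unknown check kind: {kind!r}")
```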
4. Cognitive Trace Model [NORMATIVE]¶
A conformant evaluator MUST operate on a normalized cognitive trace object.
4.1 Minimum Trace Fields¶
| Field | Required | Description |
|---|---|---|
| trace_id | REQUIRED | Unique identifier |
| session_id | REQUIRED | Session context |
| hook | REQUIRED | Lifecycle hook that triggered the trace |
| agent_id | REQUIRED | Stable identifier of the governed agent principal within a deployment-defined trust domain |
| action | REQUIRED | Action being evaluated; MUST be an object with required name and optional parameters |
| context | REQUIRED | Contextual data |
| reasoning | RECOMMENDED | Agent's reasoning chain |
timestamp MUST come from the envelope timestamp (ACGP-2). TRACE payloads MUST NOT carry a top-level payload timestamp.
The model may supply reasoning and action content, but it does not self-authorize its governing identity. agent_id MUST be assigned by runtime or operator control. ACGP distinguishes the governed principal from mutable display names and from the envelope sender.
4.2 Tool Activity Fields¶
When tool activity exists, traces SHOULD include:
| Field | Description |
|---|---|
| tool | Tool name |
| args | Tool arguments |
| result | Tool execution result |
| tool_calls | Array of tool invocations |
ACGP-4 canonical tripwire field roots resolve against this trace model. For the authoritative semantic meaning of trace fields used by tripwire expressions, see ACGP-4 §4.1.
4.3 Evidence Fields¶
When evidence is required, traces SHOULD include source/citation metadata for grounding checks (see Section 8).
4.4 Principal Identity Example¶
The following trace fragment shows an opaque principal identifier with a human-readable label in meta:
{
  "trace_id": "01958249-4d55-7000-8000-000000000111",
  "agent_id": "urn:acgp:agent:financeops:prod:7f4c9d2a",
  "session_id": "01958249-4d55-7000-8000-000000000112",
  "hook": "tool_call",
  "context": {},
  "governance_tier": "GT-3",
  "action": {
    "name": "approve_invoice",
    "parameters": { "invoice_id": "INV-1042" }
  },
  "meta": {
    "agent_label": "FinanceOps-Agent",
    "model": "gpt-5-pro"
  }
}
5. Scorer Types and Interface [NORMATIVE]¶
Implementations MUST support five standard scorer families:
5.1 rule-based Scorer¶
Fast, deterministic evaluation using predefined rules.
scorer: "rule-based"
parameters:
  rules: [string]   # List of rule IDs
  mode: "all"       # Enum: "all" (AND) or "any" (OR). Default: "all"
# Output: 1.0 if rules pass, 0.0 if fail (binary)
- Latency: <10ms | Eval Tier: 0
5.2 cognitive-evaluator Scorer¶
A cognitive-evaluator is a non-deterministic evaluator that scores trace quality using an implementation-defined cognitive assessment backend. In alpha implementations, cognitive-evaluator is commonly backed by an LLM-based scorer, but the protocol intentionally standardizes the evaluator role rather than the model family.
scorer: "cognitive-evaluator"
parameters:
  model: string             # Model identifier
  prompt_template: string   # Template ID for evaluation prompt
  temperature: float        # Sampling temperature (default: 0.0)
  max_tokens: int           # Max response tokens (default: 256)
# Output: float 0.0-1.0 based on cognitive assessment
- Latency: 500ms-5000ms | Eval Tier: 2
The protocol standardizes the scorer input/output interface. Model selection, accuracy, and latency are implementation-defined.
5.3 source-match Scorer¶
Evaluates grounding against certified sources.
scorer: "source-match"
parameters:
  required_sources: [string]   # Registry categories required (public profile)
  policy_handle: string        # Opaque handle for private-source-backed policy
  min_citation_ratio: float    # Min % of claims with citations
  min_source_trust: float      # Minimum source trust score
# Output: float 0.0-1.0 based on citation coverage
- Latency: 50ms-200ms | Eval Tier: 1
Public blueprints MUST NOT be required to disclose private source identifiers, document locators, clause references, or derivation provenance for source-match policies. Deployments MAY instead publish an opaque policy_handle or extension attestation that signals the presence of a private source-backed control.
Optional source-match fallback [NORMATIVE]:
- If an optional source-match scorer cannot execute because Source Catalog capability is unavailable or the catalog query fails, that scorer MUST be marked unavailable.
- The unavailable scorer's weight MUST be redistributed proportionally across the remaining successfully executed scorers in the same aggregation set.
- If no scorers remain in that aggregation set after unavailability handling, the aggregation set MUST fail and the enclosing control's explicit fail semantics apply.
- This fallback does not apply when source evidence is encoded as a required control, a required extension, or a tripwire-backed fail path. Those cases follow their own explicit fail_mode or tripwire semantics.
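Proportional redistribution can be sketched as rescaling the surviving weights so that the aggregation set keeps its original total weight. This is a non-normative illustration; the function name `redistribute` and the use of RuntimeError to signal a failed aggregation set are hypothetical.

```python
def redistribute(weights: dict, unavailable: set) -> dict:
    """Drop unavailable scorers and rescale the rest proportionally.

    weights maps scorer id -> declared weight within one aggregation set.
    """
    remaining = {k: w for k, w in weights.items() if k not in unavailable}
    remaining_total = sum(remaining.values())
    if not remaining or remaining_total <= 0:
        # No scorer left: the aggregation set fails and the enclosing
        # control's explicit fail semantics take over.
        raise RuntimeError("aggregation set failed: no scorers remain")
    original_total = sum(weights.values())
    scale = original_total / remaining_total
    return {k: w * scale for k, w in remaining.items()}
```

For example, with declared weights 0.5, 0.3, and 0.2 and the 0.2 scorer unavailable, the surviving weights become 0.625 and 0.375, preserving their 5:3 ratio.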
5.4 pattern-match Scorer¶
Regex or pattern-based content scanning.
scorer: "pattern-match"
parameters:
  patterns:
    - pattern: "regex-pattern"
      score_on_match: float
      score_on_miss: float
  aggregation: "min"   # Enum: "min", "max", or "avg"
# Output: float 0.0-1.0 based on pattern matching
- Latency: <50ms | Eval Tier: 0
5.5 hybrid Scorer¶
Combines multiple scorer types with weighted aggregation.
scorer: "hybrid"
parameters:
  scorers:
    - type: "rule-based"
      weight: 0.4
      parameters: { ... }
    - type: "cognitive-evaluator"
      weight: 0.6
      parameters: { ... }
  aggregation: "weighted_average"   # Enum: "weighted_average", "min", or "max"
- Latency: Depends on components | Eval Tier: Highest of components
5.6 Scorer Interface [NORMATIVE]¶
Custom scorers MUST conform to this interface:
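from __future__ import annotations  # allows forward references in annotations

from dataclasses import dataclass
from typing import Protocol


class Scorer(Protocol):
    scorer_type: str
    eval_tier: int
    latency_budget_ms: int

    # CognitiveTrace is the normalized trace object defined in Section 4.
    def evaluate(self, trace: CognitiveTrace, parameters: dict) -> ScorerResult: ...


@dataclass
class ScorerResult:
    score: float        # 0.0 to 1.0
    confidence: float   # 0.0 to 1.0
    explanation: str    # Human-readable
    evidence: dict      # Supporting data
    latency_ms: int     # Actual evaluation time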
5.7 Eval Tier Definitions [NORMATIVE]¶
| Tier | Strategy | v1.0 Status | Latency Budget |
|---|---|---|---|
| 0 | In-memory / rule-based | Defined | < 100 ms |
| 1 | Stateful lookup (DB/cache) | Defined | < 300 ms |
| 2 | Model inference (e.g., cognitive-evaluator) | Reserved | Implementation-defined |
| 3 | Human-in-the-loop review | Reserved | Implementation-defined |
v1.0 constraint: Conformance testing validates Tiers 0 and 1 only. Tiers 2 and 3 are reserved, non-core extension space in v1.0. Implementations MAY support Tiers 2–3 as non-normative extensions, but MUST NOT require them for Standard or Safety-Critical conformance in v1.0. See also ACGP-4 §3 (tripwires MUST NOT declare eval_tier > 1 in v1.0 core).
6. CTQ Metrics and Aggregation [NORMATIVE]¶
6.1 Five Standard Dimensions¶
CTQ evaluates runtime quality across five dimensions:
| Dimension | Weight Range | Evaluates |
|---|---|---|
| Reasoning Quality | 0.20-0.30 | Logical consistency, completeness, soundness |
| Knowledge Grounding | 0.15-0.25 | Source reliability, factual correctness, recency |
| Ethical Alignment | 0.15-0.25 | Harm prevention, fairness, transparency, privacy |
| Tool Safety | 0.15-0.25 | Permission validation, input safety, error handling |
| Context Awareness | 0.10-0.20 | Relevance, appropriateness, completeness, timing |
Weights MUST sum to 1.0. Each weight MUST fall within its allowed range.
If declared weights do not sum to 1.0 (within tolerance ±0.001 to accommodate floating-point representation), the blueprint MUST be rejected at load time with error code INVALID_BLUEPRINT_WEIGHTS. Implementations MUST NOT silently normalize weights.
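A load-time weight check might look like the following non-normative sketch. It assumes weights have already been summed per canonical dimension and reports range violations with a plain message, since this section names an error code only for the sum check. The names `WEIGHT_RANGES` and `validate_weights` are hypothetical.

```python
# Allowed ranges copied from the table in §6.1.
WEIGHT_RANGES = {
    "reasoning_quality": (0.20, 0.30),
    "knowledge_grounding": (0.15, 0.25),
    "ethical_alignment": (0.15, 0.25),
    "tool_safety": (0.15, 0.25),
    "context_awareness": (0.10, 0.20),
}
SUM_TOLERANCE = 0.001


def validate_weights(dimension_weights: dict) -> None:
    """Reject weights that do not sum to 1.0 or fall outside their range."""
    total = sum(dimension_weights.values())
    if abs(total - 1.0) > SUM_TOLERANCE:
        # Never normalize silently; reject instead.
        raise ValueError("INVALID_BLUEPRINT_WEIGHTS")
    for name, weight in dimension_weights.items():
        low, high = WEIGHT_RANGES[name]
        if not (low <= weight <= high):
            raise ValueError(f"weight for {name} outside [{low}, {high}]")
```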
6.1.1 Metric-to-Dimension Mapping [NORMATIVE]¶
Each metric-based check MUST contribute to exactly one of the five standard CTQ dimensions.
Portable v1.0 blueprints SHOULD use the canonical dimension names directly as metric.name:
- reasoning_quality
- knowledge_grounding
- ethical_alignment
- tool_safety
- context_awareness
Deployments MAY use more specific local metric names during authoring or preprocessing, but those names MUST be mapped to exactly one canonical CTQ dimension before runtime validation and conformance testing.
When multiple metric-based checks map to the same dimension, their weighted contributions are additive within the global CTQ calculation. The dimension's effective weight is the sum of the weights of the checks mapped to it.
Worked example:
checks:
  - id: rationale_clarity
    metric:
      name: reasoning_quality
      weight: 0.15
  - id: plan_completeness
    metric:
      name: reasoning_quality
      weight: 0.10
  - id: citation_coverage
    metric:
      name: knowledge_grounding
      weight: 0.20
  - id: fairness_review
    metric:
      name: ethical_alignment
      weight: 0.20
  - id: permission_check
    metric:
      name: tool_safety
      weight: 0.20
  - id: situational_fit
    metric:
      name: context_awareness
      weight: 0.15
If rationale_clarity = 0.80 and plan_completeness = 0.90, the combined weighted contribution to the reasoning_quality dimension is (0.80 × 0.15) + (0.90 × 0.10) = 0.21, and the effective reasoning dimension weight is 0.25.
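The per-dimension arithmetic in this worked example can be reproduced with a short non-normative sketch. It assumes checks are parsed dicts shaped like the YAML above and that scores are keyed by check id; the function name `dimension_summary` is hypothetical.

```python
from collections import defaultdict


def dimension_summary(checks: list, scores: dict):
    """Return (effective weight, weighted contribution) per CTQ dimension."""
    weights = defaultdict(float)
    contributions = defaultdict(float)
    for check in checks:
        dimension = check["metric"]["name"]
        weight = check["metric"]["weight"]
        weights[dimension] += weight                      # additive within a dimension
        contributions[dimension] += scores[check["id"]] * weight
    return dict(weights), dict(contributions)


checks = [
    {"id": "rationale_clarity", "metric": {"name": "reasoning_quality", "weight": 0.15}},
    {"id": "plan_completeness", "metric": {"name": "reasoning_quality", "weight": 0.10}},
]
weights, contributions = dimension_summary(
    checks, {"rationale_clarity": 0.80, "plan_completeness": 0.90}
)
```

Up to floating-point rounding, `weights["reasoning_quality"]` is 0.25 and `contributions["reasoning_quality"]` is 0.21, matching the figures above.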
6.1.2 Dimension Result Status [NORMATIVE]¶
Public CTQ results MUST represent evaluator state separately from content quality.
Each CTQ dimension result MUST use one of these status values:
- evaluated: score produced through the normal intended scorer path.
- degraded: score produced through a declared fallback path.
- unavailable: intended scorer path could not evaluate the dimension because the capability or infrastructure was absent.
- error: the scorer was attempted and failed unexpectedly with no declared fallback.
- failed_evidence_policy: the dimension precondition was not met, so the dimension was gated before scorer execution.
Contributor visibility is part of the public contract:
- failed_evidence_policy MUST serialize contributors: [].
- unavailable MUST serialize contributors: [].
- error MUST keep attempted contributor identifiers visible.
- degraded MUST keep attempted contributor identifiers visible.
Implementations MUST distinguish evaluator state from content-quality score. A missing scorer, timeout, unavailable extension, evidence-system outage, or evaluator failure MUST NOT be silently serialized as an ordinary low-quality score unless the Blueprint or scorer policy explicitly defines that fallback behavior.
ACGP standardizes CTQ dimensions, weight rules, aggregation, scorer identifiers, public result fields, and decision semantics; it does not standardize internal prompts, learned heuristics, calibration datasets, source-ranking logic, or model-routing internals.
6.2 Score Interpretation¶
For each metric:
| Score | Quality | Description |
|---|---|---|
| 0.0-0.3 | Poor | Severe quality issues |
| 0.3-0.6 | Fair | Significant gaps |
| 0.6-0.8 | Good | Generally sound, minor issues |
| 0.8-1.0 | Excellent | Clear, complete, verified |
6.3 CTQ Calculation [NORMATIVE]¶
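CTQ is the weighted sum of the metric scores, and the risk score is derived as its complement:
CTQ = Σ (score_i × weight_i) over all metric checks
risk_score = 1.0 - CTQ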
When one or more dimensions are not fully evaluated, the public result artifact MUST preserve the actual dimension status.
The Blueprint or scorer policy MUST explicitly determine whether the runtime:
- redistributes weight across available dimensions,
- applies an explicit fallback score, or
- fails closed before producing a final canonical EVAL artifact.
The chosen behavior MUST be visible in the public result artifact and reproducible in conformance vectors.
Numeric Comparison Tolerance [NORMATIVE]: Floating-point results in test vectors MUST match expected values within ±1e-4. Implementations MUST serialize scores to exactly 4 decimal places (round half away from zero). Weight sums MUST match 1.0 within ±0.001 (ACGP-3 §6.1).
An exact threshold-boundary value falls into the less severe category.
Example:
reasoning: 0.90 × 0.25 = 0.225
grounding: 0.80 × 0.20 = 0.160
ethical: 0.85 × 0.20 = 0.170
tool_safety: 0.88 × 0.20 = 0.176
context: 0.82 × 0.15 = 0.123
─────────────────────────────────
CTQ = 0.854 Risk = 0.146
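The example above, together with the 4-decimal serialization rule, can be reproduced with the following non-normative sketch. The function names are hypothetical. One assumption is worth flagging: the sketch rounds the shortest decimal representation of the float (via repr) rather than its exact binary value, which this section does not prescribe.

```python
from decimal import Decimal, ROUND_HALF_UP


def ctq_and_risk(scored_metrics):
    """scored_metrics: iterable of (score, weight) pairs; weights sum to 1.0."""
    ctq = sum(score * weight for score, weight in scored_metrics)
    return ctq, 1.0 - ctq


def serialize_score(value: float) -> str:
    """Serialize to exactly 4 decimal places, rounding half away from zero."""
    # ROUND_HALF_UP in the decimal module sends ties away from zero.
    quantized = Decimal(repr(value)).quantize(Decimal("0.0001"),
                                              rounding=ROUND_HALF_UP)
    return str(quantized)


ctq, risk = ctq_and_risk([
    (0.90, 0.25),  # reasoning
    (0.80, 0.20),  # grounding
    (0.85, 0.20),  # ethical
    (0.88, 0.20),  # tool_safety
    (0.82, 0.15),  # context
])
print(serialize_score(ctq), serialize_score(risk))  # prints: 0.8540 0.1460
```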
6.4 Weight Declaration in Blueprint¶
Blueprint expresses metric weights only through checks[].metric.weight. There is no normative top-level scoring.weights block and there is no normative ctq shorthand block.
Public Blueprints MUST carry all runtime metric configuration in canonical checks[] entries before validation. Tooling MAY provide private authoring conveniences, but those conveniences MUST compile to the canonical checks[] model and MUST NOT appear in published Blueprint artifacts.
7. Threshold Mapping and Decisions [NORMATIVE]¶
7.1 Threshold Format¶
Threshold intervals are evaluated using closed-upper-bound semantics:
- ok if risk_score <= thresholds.ok
- nudge if thresholds.ok < risk_score <= thresholds.nudge
- escalate if thresholds.nudge < risk_score <= thresholds.escalate
- block if risk_score > thresholds.escalate
- See ACGP-4 for the tripwire-only halt invariant.
intervention_policy.thresholds maps risk score boundaries to interventions:
intervention_policy:
  thresholds:
    ok: 0.25        # Risk ≤ 0.25 → ok
    nudge: 0.40     # 0.25 < Risk ≤ 0.40 → nudge
    escalate: 0.55  # 0.40 < Risk ≤ 0.55 → escalate
    # Risk > 0.55 → block (implicit)
7.2 Governance-Tier-Specific Thresholds [NORMATIVE]¶
Default risk-score thresholds by Governance Tier:
| Governance Tier | OK | Nudge | Escalate | Block |
|---|---|---|---|---|
| GT-0 | risk ≤ 0.40 | 0.40 < risk ≤ 0.55 | 0.55 < risk ≤ 0.70 | risk > 0.70 (implicit) |
| GT-1 | risk ≤ 0.30 | 0.30 < risk ≤ 0.45 | 0.45 < risk ≤ 0.60 | risk > 0.60 (implicit) |
| GT-2 | risk ≤ 0.25 | 0.25 < risk ≤ 0.40 | 0.40 < risk ≤ 0.55 | risk > 0.55 (implicit) |
| GT-3 | risk ≤ 0.20 | 0.20 < risk ≤ 0.35 | 0.35 < risk ≤ 0.50 | risk > 0.50 (implicit) |
| GT-4 | risk ≤ 0.15 | 0.15 < risk ≤ 0.30 | 0.30 < risk ≤ 0.45 | risk > 0.45 (implicit) |
| GT-5 | risk ≤ 0.10 | 0.10 < risk ≤ 0.25 | 0.25 < risk ≤ 0.40 | risk > 0.40 (implicit) |
3-threshold model: The block column above is implicit: risk scores exceeding the escalate boundary always map to block. There is no separate block threshold field in the schema.
7.3 Threshold Override Rules¶
When both blueprint and Governance Tier default thresholds exist, the stricter (lower) boundary applies:
effective_ok = min(blueprint_thresholds.ok, governance_tier_defaults.ok)
effective_nudge = min(blueprint_thresholds.nudge, governance_tier_defaults.nudge)
effective_escalate = min(blueprint_thresholds.escalate, governance_tier_defaults.escalate)
# block = risk > effective_escalate (implicit; no separate field)
Worked example: An agent in Governance Tier GT-5 is evaluated with a permissive blueprint (ok: 0.40, nudge: 0.55, escalate: 0.70). Governance Tier GT-5 defaults are ok: 0.10, nudge: 0.25, escalate: 0.40. Effective thresholds:
| Threshold | Blueprint | Governance Tier GT-5 | Effective (min) |
|---|---|---|---|
| ok | 0.40 | 0.10 | 0.10 |
| nudge | 0.55 | 0.25 | 0.25 |
| escalate | 0.70 | 0.40 | 0.40 |
With risk score 0.30: blueprint alone → ok, but effective thresholds → escalate (0.30 > 0.10 so not ok; 0.30 > 0.25 so not nudge; 0.25 < 0.30 ≤ 0.40 → escalate). The Governance Tier floor prevents a high-risk agent from receiving permissive governance.
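The worked example can be checked with a short non-normative sketch of the min-based override and the closed-upper-bound mapping from §7.1. Thresholds are modeled as plain dicts, and the function names `effective_thresholds` and `map_risk` are hypothetical.

```python
def effective_thresholds(blueprint: dict, tier_defaults: dict) -> dict:
    """Take the stricter (lower) boundary for each of the three thresholds."""
    return {key: min(blueprint[key], tier_defaults[key])
            for key in ("ok", "nudge", "escalate")}


def map_risk(risk_score: float, thresholds: dict) -> str:
    """Closed-upper-bound mapping; block is implicit above escalate."""
    if risk_score <= thresholds["ok"]:
        return "ok"
    if risk_score <= thresholds["nudge"]:
        return "nudge"
    if risk_score <= thresholds["escalate"]:
        return "escalate"
    return "block"


blueprint = {"ok": 0.40, "nudge": 0.55, "escalate": 0.70}
gt5_defaults = {"ok": 0.10, "nudge": 0.25, "escalate": 0.40}
effective = effective_thresholds(blueprint, gt5_defaults)
print(map_risk(0.30, blueprint), map_risk(0.30, effective))  # prints: ok escalate
```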
7.4 Decision Algorithm [NORMATIVE]¶
def determine_intervention(ctq_score, governance_tier, blueprint, tripwires):
    """Return the intervention for one evaluation.

    `tripwires` holds only the tripwires that triggered for this trace.
    """
    # Tripwires ALWAYS take precedence; the most severe triggered decision wins.
    if tripwires:
        priority = ["halt", "block", "escalate", "nudge", "ok"]
        decisions = {t.on_fail.decision for t in tripwires}
        for decision in priority:
            if decision in decisions:
                return decision
    # CTQ-based threshold mapping, using the effective thresholds from §7.3
    risk_score = 1.0 - ctq_score
    thresholds = get_effective_thresholds(governance_tier, blueprint)
    if risk_score <= thresholds.ok:
        return "ok"
    elif risk_score <= thresholds.nudge:
        return "nudge"
    elif risk_score <= thresholds.escalate:
        return "escalate"
    else:
        return "block"
Runtime tripwire precedence is determined only by the triggered on_fail.decision values. severity remains advisory authoring metadata and MUST NOT alter the runtime intervention.
8. Evidence Policy [NORMATIVE]¶
The optional evidence_policy block defines source-grounding requirements:
| Control | Type | Description |
|---|---|---|
require_citations |
boolean | Whether citations are required in governed output |
certified_only |
boolean | Whether only certified/verified sources are acceptable |
min_sources |
integer | Minimum number of qualifying sources required |
If evidence controls are declared, they form a dimension-level admissibility gate for knowledge_grounding. Implementations MUST check the gate before any knowledge-grounding scorers run.
If the knowledge_grounding evidence gate fails:
- status MUST be failed_evidence_policy
- score MUST be 0.0
- contributors MUST be []
- knowledge-grounding weights MUST NOT be redistributed
If the gate passes, knowledge-grounding scorers MUST run normally.
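The gate-then-score ordering can be sketched as follows. This is a non-normative illustration with several assumptions that this section does not define: sources are dicts carrying a boolean certified key, certified_only narrows the set of qualifying sources rather than failing on any uncertified source, and `run_scorers` is a hypothetical callable returning a (score, contributors) pair.

```python
def evidence_gate_passes(policy: dict, sources: list) -> bool:
    """Evaluate the declared evidence controls against the trace's sources."""
    if policy.get("certified_only"):
        qualifying = [s for s in sources if s.get("certified")]
    else:
        qualifying = list(sources)
    if policy.get("require_citations") and not qualifying:
        return False
    return len(qualifying) >= policy.get("min_sources", 0)


def grounding_dimension(policy: dict, sources: list, run_scorers) -> dict:
    """Check the gate first; run knowledge-grounding scorers only if it passes."""
    if policy and not evidence_gate_passes(policy, sources):
        # Gated before scorer execution: fixed status, score, and contributors.
        return {"status": "failed_evidence_policy", "score": 0.0, "contributors": []}
    score, contributors = run_scorers()
    return {"status": "evaluated", "score": score, "contributors": contributors}
```

Note that a failed gate only zeroes the knowledge_grounding dimension; it does not by itself deny the governed action.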
Evidence policy alone MUST NOT deny the whole governed action. Whole-action denial on evidence failure requires an explicit tripwire outcome or a required extension / required control whose declared fail semantics deny or reject the action.
Evaluators MUST record evidence outcomes in evaluation artifacts. When evidence_policy is declared, evidence_summary SHOULD capture supplementary details such as whether the policy was declared, which controls were checked, and which controls passed or failed. evidence_summary is supplementary detail only and MUST NOT become a second verdict channel.
Richer source-category or trust-floor semantics MAY be implemented through extensions, internal policy compilation, or evaluator-specific private configuration, but they are not part of the canonical Blueprint core schema.
If Source Catalog capability is unavailable and the evidence model relies on optional source-match scoring, the optional scorer fallback in §5.3 applies. If source evidence is required through extensions.required[], fail_mode: deny, or tripwire-backed semantics, those explicit controls override the optional fallback and MUST fail closed as declared.
Advanced source-catalog semantics are defined in the Source Catalog extension.
9. Trust Policy Core (v1.0) [NORMATIVE]¶
Trust debt is a governance state concept exposed through the trust_policy Blueprint block and a provider model. The public observable semantics are normative for v1.0 interoperability. acgp.core.default@1 is the default deterministic provider. Other provider behavior MAY exist through extension boundaries, but implementations MUST preserve the deterministic public boundary defined here even when the underlying provider is private.
9.1 Observable Semantics [NORMATIVE]¶
All conformant implementations MUST preserve these observable semantics:
- Trust debt is updated after threshold mapping and orthogonal flag attachment.
- Trust debt threshold effects are deterministic and MUST follow §9.6.
- Trust debt MUST NOT relax an intervention's severity.
- Trust debt participates in the canonical evaluation order before Governance Tier review and audit emission.
- Public artifacts MAY expose current debt, debt deltas, provider identifiers, or opaque attestations, but MUST NOT be required to expose provider internals.
- When flags.flagged is true, the trust-debt delta for that evaluation is the sum of the primary decision accumulation weight plus the configured flag accumulation weight.
- Trust debt, intervention history, review posture, and other principal-scoped governance state MUST be keyed to agent_id, not to sender_id, agent_label, session_id, or other ephemeral identifiers.
9.2 Provider Configuration Schema¶
trust_policy:
  enabled: true
  provider:
    id: "acgp.core.default@1"
    visibility: public
  accumulation:
    ok: 0.0
    flag: 0.1
    nudge: 0.5
    escalate: 1.0
    block: 2.0
    halt: 5.0
  decay:
    decay_fraction: 0.05
    period_hours: 1
    min_debt: 0.0
  thresholds:
    elevated_monitoring: 3.0
    restricted_mode: 6.0
    re_tiering_review: 10.0
Note: decay_fraction = fraction of debt removed per period (e.g., 0.05 = 5% removed). acgp.core.default@1 names the default deterministic provider that reproduces the baseline semantics.
9.3 Default Provider Behavior [NORMATIVE]¶
If no provider is specified, implementations MUST behave as if provider.id = "acgp.core.default@1" were configured. acgp.core.default@1 is the default deterministic provider for v1.0 and preserves the accumulation/decay model used by v1.0 vectors:
def accumulate(current_debt: float, decision: str, flagged: bool, config) -> float:
    weight = config.accumulation.get(decision, 0.0)
    if flagged:
        weight += config.accumulation.get("flag", 0.0)
    return current_debt + weight
Note: ok interventions do NOT accrue debt by default (weight 0.0), but a flagged ok still accrues the configured flag weight. Omitting ok or flag from the accumulation map is treated as 0.0. Debt units are dimensionless and on a linear scale.
Rationale: halt remains part of trust-debt history even though it is terminal for the current governed action. Recording the halt accumulation preserves a monotonic governance history for post-incident review, operator scrutiny, and any subsequent session- or identity-level handling that reuses the same trust-debt provider state.
A halt intervention does not, by itself, reset accumulated trust debt unless an explicit reset or credit policy is applied.
Implementations MAY require administrative clearance, policy update, or manual reset before halted runtimes resume, but MUST preserve trust-debt observable semantics unless such a reset action is explicitly taken.
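The accumulation notes above can be exercised with a short usage sketch. The `SimpleNamespace` config is an assumption standing in for whatever configuration object an implementation uses, and `accumulate` is repeated only so the snippet runs on its own.

```python
from types import SimpleNamespace


def accumulate(current_debt: float, decision: str, flagged: bool, config) -> float:
    weight = config.accumulation.get(decision, 0.0)
    if flagged:
        weight += config.accumulation.get("flag", 0.0)
    return current_debt + weight


# Hypothetical config object mirroring the accumulation weights in §9.2.
config = SimpleNamespace(
    accumulation={"ok": 0.0, "flag": 0.1, "nudge": 0.5,
                  "escalate": 1.0, "block": 2.0, "halt": 5.0}
)

plain_ok = accumulate(1.0, "ok", False, config)         # unflagged ok adds nothing
flagged_ok = accumulate(1.0, "ok", True, config)        # flag weight still accrues
flagged_nudge = accumulate(1.0, "nudge", True, config)  # nudge weight plus flag weight

# Keys omitted from the accumulation map are treated as 0.0.
sparse = SimpleNamespace(accumulation={"block": 2.0})
sparse_flagged_ok = accumulate(1.0, "ok", True, sparse)
```

Nothing in `accumulate` subtracts debt, which is why a `halt` stays in the history until an explicit reset or credit policy is applied.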
9.4 Provider Boundary [NORMATIVE]¶
The protocol standardizes observable effects, provider identifiers, extension descriptors, and optional opaque attestations. The public boundary is limited to externally verifiable behavior: emitted debt values or deltas, threshold-triggered actions, declared provider identity, and any optional attestation material. The protocol does not require public disclosure of exact private accumulation formulas, proprietary decay math, compounding rules internal to a private provider, or provider-specific derivation logic.
Private providers MUST still honor the observable semantics in §9.1 and MUST support auditable outputs at the public boundary.
9.5 Example Default Decay Algorithm [NORMATIVE FOR acgp.core.default@1]¶
Decay is computed using evaluation time (the steward's wall-clock timestamp at evaluation start), in UTC.
def apply_decay(current_debt: float, last_eval_time, eval_time, config) -> float:
    if last_eval_time is None:
        return current_debt  # No decay on first evaluation
    elapsed_hours = (eval_time - last_eval_time).total_seconds() / 3600.0
    periods = elapsed_hours / config.decay.period_hours
    decayed = current_debt * ((1.0 - config.decay.decay_fraction) ** periods)
    return max(decayed, config.decay.min_debt)
Fractional hours are used as-is (no rounding). This eliminates drift from floor/ceil rounding differences between implementations.
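A short usage sketch illustrates the fractional-period behavior. It assumes timezone-aware UTC `datetime` values and a hypothetical `SimpleNamespace` config; `apply_decay` is repeated so the snippet is self-contained. Because periods are not rounded, two half-hour evaluations decay by the same amount as a single one-hour evaluation.

```python
from datetime import datetime, timedelta, timezone
from types import SimpleNamespace


def apply_decay(current_debt: float, last_eval_time, eval_time, config) -> float:
    if last_eval_time is None:
        return current_debt  # No decay on first evaluation
    elapsed_hours = (eval_time - last_eval_time).total_seconds() / 3600.0
    periods = elapsed_hours / config.decay.period_hours
    decayed = current_debt * ((1.0 - config.decay.decay_fraction) ** periods)
    return max(decayed, config.decay.min_debt)


# Hypothetical config object mirroring the decay block in §9.2.
config = SimpleNamespace(
    decay=SimpleNamespace(decay_fraction=0.05, period_hours=1, min_debt=0.0)
)

# The calendar date is arbitrary; only the elapsed time matters.
t0 = datetime(2026, 1, 1, 10, 0, tzinfo=timezone.utc)
half_hour = timedelta(minutes=30)

first_eval = apply_decay(2.0, None, t0, config)           # unchanged: no prior evaluation
after_30_min = apply_decay(2.0, t0, t0 + half_hour, config)

# Two half-hour steps versus one full-hour step.
two_steps = apply_decay(after_30_min, t0 + half_hour, t0 + 2 * half_hour, config)
one_step = apply_decay(2.0, t0, t0 + 2 * half_hour, config)
```

Here `after_30_min` is 2.0 × 0.95^0.5, and `two_steps` and `one_step` agree up to floating-point error.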
9.6 Trust Debt Threshold Effects [NORMATIVE]¶
Threshold handling is deterministic at the public boundary.
elevated_monitoring
- MUST NOT change the primary intervention by itself.
- MUST cause EVAL to include runtime_posture: "elevated_monitoring".
- MUST cause a threshold-crossing event to be recorded in the Governance Store.
restricted_mode
- MUST cause EVAL to include runtime_posture: "restricted_mode".
- MUST apply a post-decision intervention floor of escalate.
- ok and nudge MUST become escalate.
- escalate, block, and halt MUST remain unchanged.
- MUST cause a threshold-crossing event to be recorded in the Governance Store.
re_tiering_review
- posture MUST remain restricted_mode.
- MUST cause EVAL to include review_required: true.
- MUST cause a review-trigger event to be recorded in the Governance Store.
- actual review workflow remains deployment-defined.
Guardrails:
- trust debt MUST NOT produce halt.
- See ACGP-4 for the tripwire-only halt invariant.
thresholds_crossed in public EVAL artifacts MUST represent the cumulative current-state set of active threshold labels and MUST NOT be serialized as an incremental per-evaluation list.
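The threshold effects can be sketched as a single pure function. This is an illustration under stated assumptions: the function name and return shape are hypothetical, thresholds are compared against the post-accumulation debt, and a threshold is treated as active when debt is greater than or equal to it (whether equality counts is not stated here).

```python
from typing import Dict, List

SEVERITY: List[str] = ["ok", "nudge", "escalate", "block", "halt"]
THRESHOLD_LABELS: List[str] = ["elevated_monitoring", "restricted_mode", "re_tiering_review"]


def apply_threshold_effects(post_debt: float, intervention: str,
                            thresholds: Dict[str, float]) -> dict:
    # Cumulative current-state set: every label whose threshold is met now.
    crossed = [t for t in THRESHOLD_LABELS if post_debt >= thresholds[t]]

    # re_tiering_review keeps the posture at restricted_mode.
    restricted = "restricted_mode" in crossed or "re_tiering_review" in crossed
    if restricted:
        posture = "restricted_mode"
    elif "elevated_monitoring" in crossed:
        posture = "elevated_monitoring"
    else:
        posture = "normal"

    final = intervention
    if restricted and SEVERITY.index(intervention) < SEVERITY.index("escalate"):
        # The floor raises ok/nudge to escalate; it never relaxes a decision
        # and never produces halt.
        final = "escalate"

    return {
        "intervention": final,
        "runtime_posture": posture,
        "review_required": "re_tiering_review" in crossed,
        "thresholds_crossed": crossed,
    }
```

Recording the threshold-crossing and review-trigger events in the Governance Store is left out of the sketch.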
9.6.1 Operational Guidance [INFORMATIVE]¶
Deployments MAY increase logging verbosity, tighten internal evaluation heuristics, or notify operators when thresholds are crossed, but those behaviors are informative operator guidance rather than normative runtime semantics.
9.7 Provider Metadata and Attestation [NORMATIVE]¶
Blueprints and evaluation outputs MAY attach provider metadata using either a public descriptor or an opaque attestation:
trust_policy:
  provider:
    id: "urn:acgp:ext:trust-debt-private@1"
    visibility: private
    attestation:
      digest: "sha256:..."
      issued_by: "did:example:steward"
Private provider metadata MUST NOT require disclosure of formulas, source material, or provider-internal logic in public artifacts.
9.8 Example¶
Numeric note: This worked example is informative. Conformance is determined by the trust-debt vectors and the numeric tolerance rules defined in ACGP-6. Example values in this section are serialized to 4 decimal places unless otherwise stated.
Agent in Governance Tier GT-2, config: accumulation.nudge=0.5, accumulation.block=2.0, accumulation.halt=5.0, accumulation.flag=0.1, decay.decay_fraction=0.05/1h
10:00 block → debt = 0.0 + 2.0 = 2.0
10:30 block → decay 0.5h: 2.0 × 0.95^0.5 = 1.9494 → + 2.0 = 3.9494
    → elevated_monitoring threshold (3.0) crossed
11:00 nudge + flag → decay 0.5h: 3.9494 × 0.95^0.5 = 3.8494 → + (0.5 + 0.1) = 4.4494
12:00 halt → decay 1.0h: 4.4494 × 0.95^1.0 = 4.2269 → + 5.0 = 9.2269
    → restricted_mode threshold (6.0) crossed
12:10 block → decay 1/6 h (~0.17h): 9.2269 × 0.95^(1/6) = 9.1483 → + 2.0 = 11.1483
    → re_tiering_review threshold (10.0) crossed; Governance Tier review is triggered
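The timeline can be replayed with the default-provider functions from §9.3 and §9.5. The sketch below repeats both functions so it runs on its own; the `SimpleNamespace` config, the `at` helper, and the calendar date are illustrative assumptions.

```python
from datetime import datetime, timezone
from types import SimpleNamespace

# Hypothetical config object carrying the values from the §9.2 schema example.
config = SimpleNamespace(
    accumulation={"ok": 0.0, "flag": 0.1, "nudge": 0.5,
                  "escalate": 1.0, "block": 2.0, "halt": 5.0},
    decay=SimpleNamespace(decay_fraction=0.05, period_hours=1, min_debt=0.0),
)


def apply_decay(current_debt, last_eval_time, eval_time, config):
    if last_eval_time is None:
        return current_debt  # No decay on first evaluation
    elapsed_hours = (eval_time - last_eval_time).total_seconds() / 3600.0
    periods = elapsed_hours / config.decay.period_hours
    decayed = current_debt * ((1.0 - config.decay.decay_fraction) ** periods)
    return max(decayed, config.decay.min_debt)


def accumulate(current_debt, decision, flagged, config):
    weight = config.accumulation.get(decision, 0.0)
    if flagged:
        weight += config.accumulation.get("flag", 0.0)
    return current_debt + weight


def at(hour, minute):
    # The calendar date is arbitrary; only the gaps between events matter.
    return datetime(2026, 1, 1, hour, minute, tzinfo=timezone.utc)


# (evaluation time, final decision, flagged) for each step of the example.
events = [
    (at(10, 0), "block", False),
    (at(10, 30), "block", False),
    (at(11, 0), "nudge", True),
    (at(12, 0), "halt", False),
    (at(12, 10), "block", False),
]

debt, last_time, timeline = 0.0, None, []
for when, decision, flagged in events:
    debt = apply_decay(debt, last_time, when, config)    # decay first
    debt = accumulate(debt, decision, flagged, config)   # then accumulate
    last_time = when
    timeline.append(round(debt, 4))

print(timeline)
```

Rounded to 4 decimal places, the replayed debts match the post-accumulation values in the timeline above. Intermediate values are carried at full precision and rounded only for display.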
9.9 Extension Boundary (Advanced Trust Debt)¶
The following are outside the default provider: piecewise/exponential decay curves, severity multipliers, multi-threshold graduated actions, compounding, and private provider semantics. Implementations MAY provide a TrustDebtProvider hook while preserving the observable semantics in §9.1.
9.10 Trust Debt Threshold Guardrails [NORMATIVE]¶
Blueprint-defined trust debt thresholds MUST NOT exceed 2× the corresponding Clarity Baseline default. For example, if Clarity Baseline sets re_tiering_review: 10.0, a child blueprint MUST NOT set it above 20.0. Blueprints that exceed this limit MUST be rejected at load time with error code TRUST_DEBT_THRESHOLD_EXCEEDED.
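A load-time check for this guardrail might look like the sketch below. The exception class, function name, and the handling of thresholds that have no baseline entry are assumptions made for illustration; only the 2× limit and the error code come from this section.

```python
from typing import Dict

MAX_MULTIPLIER = 2.0


class BlueprintLoadError(ValueError):
    """Hypothetical load-time rejection carrying a machine-readable code."""

    def __init__(self, code: str, message: str):
        super().__init__(f"{code}: {message}")
        self.code = code


def check_trust_debt_thresholds(blueprint: Dict[str, float],
                                baseline: Dict[str, float]) -> None:
    for name, value in blueprint.items():
        if name not in baseline:
            # Assumption: labels with no baseline default are not checked here.
            continue
        limit = MAX_MULTIPLIER * baseline[name]
        if value > limit:
            raise BlueprintLoadError(
                "TRUST_DEBT_THRESHOLD_EXCEEDED",
                f"{name}={value} exceeds limit {limit}",
            )
```

With a baseline of `re_tiering_review: 10.0`, a child value of 20.0 loads and 20.5 is rejected.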
10. Evaluation Flow [NORMATIVE]¶
Implementations MUST apply this sequence: resolve source blueprint -> validate resolved blueprint -> evaluate tripwires -> evaluate deterministic checks -> compute CTQ -> map thresholds -> attach flag state -> update trust policy state -> evaluate Governance Tier review triggers -> emit decision/evidence -> persist audit artifacts. Trust-policy updates for the current evaluation MUST observe the final primary decision plus any orthogonal flag.
For each governed action, implementations MUST follow this ordering:
1. Resolve applicable source Blueprint into a resolved Blueprint artifact
↓
2. Validate resolved Blueprint configuration integrity
↓
3. Evaluate tripwires first (ACGP-4 precedence rules)
→ Apply each triggered tripwire's explicit `on_fail.decision`
→ If multiple tripwires fire, the strictest explicit decision wins (`halt` > `block` > `escalate` > `nudge` > `ok`)
→ `severity` is authoring metadata only and MUST NOT alter runtime decision selection
↓ (no tripwires fired)
3b. Evaluate rule-based checks from checks[] array
→ Rule checks producing `ok`/`nudge`/`escalate`/`block` are treated as direct policy outputs
→ The strictest decision between rule-check outputs and CTQ-derived threshold outputs MUST win
→ Rule checks MUST NOT produce halt (only tripwires may)
→ If a rule check declares on_fail.decision: halt, the Blueprint MUST be rejected at load time
→ Rule checks with flag: true set flags.flagged on the intervention
↓
4. Evaluate CTQ metrics (5 standard dimensions)
→ Apply scorer for each metric
→ Compute weighted average
↓
5. Convert to risk score: Risk = 1.0 - CTQ
→ Apply effective thresholds (stricter of blueprint vs Governance Tier defaults, serialized as `GT-*` values)
↓
6. Evaluate flag conditions:
(a) any rule-based check with flag: true that matched
Flag is orthogonal — it attaches to whatever primary decision was determined in steps 3–5.
↓
7. Update trust debt state
→ Key principal-scoped state to `agent_id`
→ Compute `pre` from the decayed trust-debt state at evaluation start
→ Compute `delta` from the final primary decision plus any orthogonal flag contribution
→ Compute `post = pre + delta`
→ Compute cumulative `thresholds_crossed`
→ Derive `runtime_posture` from the active threshold set
→ If `restricted_mode` is active, apply a post-decision intervention floor of `escalate`
→ If `re_tiering_review` is active, set `review_required: true` while keeping posture `restricted_mode`
↓
8. Emit canonical EVAL artifact + intervention decision
→ Top-level EVAL `intervention` is the final post-floor outcome
→ Any pre-posture intervention MAY appear only in subordinate metadata
↓
9. Log TRACE + EVAL + INTERVENTION to the Governance Store
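Steps 3 through 5 reduce to a strictest-decision selection, sketched below. All function names are hypothetical. The sketch assumes, following the "(no tripwires fired)" arrow in the flow above, that a fired tripwire determines the primary decision on its own, and it takes the CTQ-derived decision as an input because the threshold values are defined elsewhere in this specification.

```python
from typing import Iterable, List

# Strictness order from step 3: halt > block > escalate > nudge > ok.
DECISION_ORDER: List[str] = ["ok", "nudge", "escalate", "block", "halt"]


def strictest(decisions: Iterable[str]) -> str:
    """Return the strictest decision, or ok when there are none."""
    return max(decisions, key=DECISION_ORDER.index, default="ok")


def validate_rule_check(check: dict) -> None:
    """Load-time check: rule checks may not declare halt."""
    if check.get("on_fail", {}).get("decision") == "halt":
        raise ValueError("rule checks MUST NOT produce halt; reject the Blueprint")


def risk_score(ctq: float) -> float:
    return 1.0 - ctq


def primary_decision(tripwire_decisions: List[str],
                     rule_decisions: List[str],
                     ctq_decision: str) -> str:
    if tripwire_decisions:
        # Assumption: fired tripwires short-circuit the later steps.
        return strictest(tripwire_decisions)
    # Otherwise the stricter of rule-check outputs and the CTQ-derived output wins.
    return strictest(list(rule_decisions) + [ctq_decision])
```

The flag state is deliberately absent: it is orthogonal and attaches to whatever `primary_decision` returns.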
10.1 Failure Semantics Matrix [NORMATIVE]¶
Each row below defines the trigger condition, immediate behavior, public EVAL consequence, and authoritative source reference.
| Trigger condition | Immediate behavior | EVAL consequence | Authoritative source |
|---|---|---|---|
| Invalid envelope or schema | Reject immediately | No EVAL emitted | ACGP-2 §4 |
| Authentication or signature failure | Reject immediately | No EVAL emitted | ACGP-2 §4.1, §4.4, §9 |
| Blueprint validation error | Reject at load / activation time | No EVAL emitted | ACGP-3 §2, §3.3; ACGP-4 §9 |
| Required local extension unavailable with fail_mode: reject_activation | Reject activation | No EVAL emitted | ACGP-3 §2.2.1 |
| Duplicate message_id replay with identical bytes | Return idempotent result; do not reprocess | No new EVAL emitted | ACGP-2 §8.5 |
| Tripwire runtime evaluation failure | Fail closed per triggered tripwire on_fail | EVAL, if emitted, MUST reflect the fail-closed tripwire outcome | ACGP-4 §10 |
| Required scorer runtime error without fallback | Preserve evaluator failure explicitly | Dimension status: "error", score: 0.0, attempted contributors visible, no redistribution | ACGP-3 §6.1.2 |
| Required scorer runtime error with declared fallback | Use declared degraded path | Dimension status: "degraded", fallback score, attempted contributors visible, no redistribution | ACGP-3 §6.1.2 |
| Optional source-match unavailable | Mark scorer unavailable and redistribute only where allowed | Dimension status: "unavailable", contributors: [], redistribution only where explicitly allowed | ACGP-3 §5.3 |
| Evidence policy precondition not met | Gate knowledge_grounding before scorer execution | Dimension status: "failed_evidence_policy", score: 0.0, contributors: [], no redistribution | ACGP-3 §8 |
| Required extension with fail_mode: deny at decision time | Deny or block as declared by policy | Emitted deny / block outcome with failure recorded in EVAL metadata | ACGP-3 §2.2.1 |
| Steward or session path unavailable | Apply profile fallback | Use deny, allow_and_log, or cached_decision per profile fallback | ACGP-1 §6.6 |
| Per-evaluation timeout policy | Extension-defined preview behavior only | Out of scope for v1.0 Standard EVAL contract unless preview extension is negotiated | ACGP-1 §6.6; Runtime Governance Contracts preview |
10.2 Canonical EVAL Artifact [NORMATIVE]¶
EVAL is the single canonical governance outcome artifact for a governed action. Implementations MUST NOT introduce a separate peer core receipt artifact that competes with EVAL as the public outcome record.
The top-level intervention field in EVAL MUST represent the final emitted post-floor outcome. If an implementation wishes to preserve the pre-posture intervention for provenance, it MAY expose that value only in an optional subordinate field such as evaluation_metadata.pre_posture_intervention. Implementations MUST NOT emit two peer top-level intervention fields representing pre-floor and post-floor outcomes.
Top-level EVAL fields:
| Field | Requirement | Notes |
|---|---|---|
| trace_id | MUST | Canonical trace identifier |
| parent_trace_id | MUST when present | Immediate parent linkage only |
| blueprint_id | MUST | Resolved or effective blueprint identifier |
| governance_tier | MUST | Serialized as GT-* |
| ctq_dimensions | MUST | Five canonical public dimensions |
| ctq_score | MUST | Final aggregate score |
| risk_score | MUST | 1.0 - ctq_score |
| tripwires_triggered | MUST | Empty array if none |
| intervention | MUST | Final post-floor outcome |
| flagged | MUST | Orthogonal flag state |
| runtime_posture | MUST | normal, elevated_monitoring, or restricted_mode |
| review_required | MUST | Public review-trigger visibility |
| trust_debt | MUST when trust debt is enabled | Core observable trust-debt block |
| resolved_blueprint_digest | SHOULD | Digest of effective blueprint |
| evidence_summary | SHOULD when evidence_policy is declared | Supplementary evidence detail only |
| audit_ref | SHOULD | Audit linkage handle |
| trust_debt_extended | SHOULD when private-provider metadata exists | Optional extension / private-provider metadata |
When trust debt is enabled, the trust_debt object MUST include:
| Field | Requirement | Notes |
|---|---|---|
| provider_id | MUST | Public provider identifier |
| pre | MUST | Debt value after decay and before current accumulation |
| delta | MUST | Current evaluation contribution |
| post | MUST | Debt value after current accumulation |
| thresholds_crossed | MUST | Cumulative current-state threshold labels |
thresholds_crossed is cumulative current state, not an incremental per-evaluation event list.
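The MUST-level fields in the two tables above lend themselves to a structural check. The sketch below is illustrative only: `validate_eval` and its message strings are hypothetical, the `post = pre + delta` relation is taken from step 7 of the evaluation flow, and the numeric tolerance is an assumption (ACGP-6 defines the actual tolerance rules).

```python
from typing import List

REQUIRED_TOP_LEVEL = [
    "trace_id", "blueprint_id", "governance_tier", "ctq_dimensions",
    "ctq_score", "risk_score", "tripwires_triggered", "intervention",
    "flagged", "runtime_posture", "review_required",
]
REQUIRED_TRUST_DEBT = ["provider_id", "pre", "delta", "post", "thresholds_crossed"]
POSTURES = {"normal", "elevated_monitoring", "restricted_mode"}
TOLERANCE = 1e-6  # assumption, not a value taken from the specification


def validate_eval(artifact: dict, trust_debt_enabled: bool = True) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = [f"missing field: {name}"
                for name in REQUIRED_TOP_LEVEL if name not in artifact]

    if "runtime_posture" in artifact and artifact["runtime_posture"] not in POSTURES:
        problems.append("runtime_posture is not a known posture")

    if "ctq_score" in artifact and "risk_score" in artifact:
        if abs(artifact["risk_score"] - (1.0 - artifact["ctq_score"])) > TOLERANCE:
            problems.append("risk_score != 1.0 - ctq_score")

    if trust_debt_enabled:
        debt = artifact.get("trust_debt")
        if not isinstance(debt, dict):
            problems.append("missing field: trust_debt")
        else:
            problems += [f"missing field: trust_debt.{name}"
                         for name in REQUIRED_TRUST_DEBT if name not in debt]
            if all(k in debt for k in ("pre", "delta", "post")):
                if abs(debt["post"] - (debt["pre"] + debt["delta"])) > TOLERANCE:
                    problems.append("trust_debt.post != pre + delta")
    return problems
```

The check covers only presence and simple arithmetic relations; it does not verify that the scores or the intervention were derived correctly.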
Example EVAL with failed_evidence_policy and elevated monitoring:
{
"trace_id": "trace-eval-001",
"blueprint_id": "finance_qa@2.1",
"governance_tier": "GT-2",
"ctq_dimensions": {
"reasoning_quality": {
"score": 0.91,
"weight": 0.25,
"status": "evaluated",
"contributors": ["rationale_clarity", "plan_completeness"]
},
"knowledge_grounding": {
"score": 0.0,
"weight": 0.20,
"status": "failed_evidence_policy",
"contributors": []
},
"ethical_alignment": {
"score": 0.94,
"weight": 0.20,
"status": "evaluated",
"contributors": ["fairness_review"]
},
"tool_safety": {
"score": 0.89,
"weight": 0.20,
"status": "evaluated",
"contributors": ["permission_check"]
},
"context_awareness": {
"score": 0.87,
"weight": 0.15,
"status": "evaluated",
"contributors": ["situational_fit"]
}
},
"ctq_score": 0.8035,
"risk_score": 0.1965,
"tripwires_triggered": [],
"intervention": "ok",
"flagged": false,
"runtime_posture": "elevated_monitoring",
"review_required": false,
"trust_debt": {
"provider_id": "acgp.core.default@1",
"pre": 2.95,
"delta": 0.10,
"post": 3.05,
"thresholds_crossed": ["elevated_monitoring"]
},
"evidence_summary": {
"policy_declared": true,
"controls_checked": ["require_citations", "certified_only", "min_sources"],
"control_results": {
"require_citations": "passed",
"certified_only": "passed",
"min_sources": "failed"
}
}
}
Example EVAL with restricted mode, review trigger, and post-floor intervention:
{
"trace_id": "trace-eval-002",
"parent_trace_id": "trace-root-900",
"blueprint_id": "payments.standard@1.0",
"governance_tier": "GT-3",
"ctq_dimensions": {
"reasoning_quality": {
"score": 0.93,
"weight": 0.25,
"status": "evaluated",
"contributors": ["rationale_clarity", "plan_completeness"]
},
"knowledge_grounding": {
"score": 0.88,
"weight": 0.20,
"status": "degraded",
"contributors": ["citation_coverage"],
"notes": "Fallback path used after catalog latency breach"
},
"ethical_alignment": {
"score": 0.95,
"weight": 0.20,
"status": "evaluated",
"contributors": ["fairness_review"]
},
"tool_safety": {
"score": 0.92,
"weight": 0.20,
"status": "error",
"contributors": ["permission_check"],
"notes": "Primary sandbox evaluator failed without declared fallback"
},
"context_awareness": {
"score": 0.90,
"weight": 0.15,
"status": "unavailable",
"contributors": [],
"notes": "Optional context adapter unavailable; redistribution declared separately"
}
},
"ctq_score": 0.7360,
"risk_score": 0.2640,
"tripwires_triggered": [],
"intervention": "escalate",
"flagged": true,
"runtime_posture": "restricted_mode",
"review_required": true,
"trust_debt": {
"provider_id": "acgp.core.default@1",
"pre": 9.80,
"delta": 0.60,
"post": 10.40,
"thresholds_crossed": ["elevated_monitoring", "restricted_mode", "re_tiering_review"]
},
"evaluation_metadata": {
"pre_posture_intervention": "ok",
"fallback_policy": "mixed_declared_policies",
"review_event": "queued_for_governance_tier_review"
},
"audit_ref": "audit:trace-eval-002"
}
11. Authoring Guidance [INFORMATIVE]¶
Blueprint authors SHOULD:
- Use base references only when the parent artifact is stable and governed
- Place hard-stop safety constraints in tripwires (ACGP-4)
- Begin with conservative thresholds and calibrate with benchmarks
- Declare explicit evidence-policy requirements for regulated domains
- Avoid embedding workflow orchestration logic in policy blueprints
- Publish fixtures alongside reusable blueprints so expected outcomes remain testable and reviewable
12. Conformance Requirements¶
A conformant ACGP-3 implementation MUST:
- Parse and validate Blueprint source artifacts in YAML 1.2 and JSON
- Parse and validate resolved Blueprint artifacts when they are exchanged or stored directly
- Implement base resolution with the merge semantics defined in §2.4
- Support the scorer interface for all five standard scorer types (accept output structures and process results correctly). Actual model inference (Tier-2 cognitive-evaluator, Tier-3 hybrid) is implementation-defined and NOT required for v1.0 conformance. Conformance test vectors supply deterministic scorer outputs.
- Evaluate all five CTQ metrics with weighted aggregation (weights summing to 1.0)
- Enforce threshold mapping using risk score (\(1.0 - CTQ\))
- Apply the stricter of blueprint thresholds vs Governance Tier defaults (serialized GT-* values)
- Enforce declared evidence controls
- Preserve the trust debt observable semantics defined in §9.1 and the provider boundary defined in §9.4
- Support the default deterministic provider acgp.core.default@1 and reproduce its public accumulation/decay behavior when that provider is selected
- Preserve evaluation ordering: tripwires -> deterministic checks -> CTQ -> thresholds -> flag -> trust debt -> governance-tier review -> audit
- Emit EVAL as the canonical governance outcome artifact, with top-level intervention representing the final post-floor outcome when trust debt is enabled
13. Extension Boundaries¶
| Extension | Scope |
|---|---|
| Source Catalog | Source catalog integrations, advanced source trust |
| Advanced Trust Debt | Advanced trust debt: decay, recovery, compounding |
Normative References¶
- RFC 2119 — Key words for use in RFCs to Indicate Requirement Levels
- RFC 8174 — Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words
- RFC 3339 — Date and Time on the Internet: Timestamps
- RFC 8785 — JSON Canonicalization Scheme (JCS)
- ACGP-1 — Core Concepts & Terminology, v1.0, 2026
- ACGP-2 — Messages & Wire Protocol, v1.0, 2026
- ACGP-3 — Blueprints, Traces & Evaluation, v1.0, 2026
- ACGP-4 — Tripwires & Safety Semantics, v1.0, 2026
- ACGP-5 — Audit & Privacy Controls, v1.0, 2026
- ACGP-6 — Conformance, v1.0, 2026