Planner Autonomy: check catalog-alignment in truth harness
parent e58a9664e0
commit fe12967e2d
@@ -124,6 +124,7 @@ The following consolidation step added catalog-level chain-template scoring:
 - `domain_truth_harness` and `scenario_acceptance_policy` now carry the alignment status, top catalog match, and selected-matches-top flag into replay artifacts instead of leaving them buried in raw debug JSON.
 - truth-harness now raises a warning finding for `selected_lower_rank` and `selected_outside_match_set` alignment states unless the replay spec explicitly marks `allow_catalog_alignment_divergence`.
 - scenario acceptance now groups that warning under `catalog_alignment_ok`, and `final_status.md` prints the invariant alongside direct-answer, temporal, truth-gate, human-answer, meta-context, and selected-object gates.
+- truth-harness specs can now assert `expected_catalog_alignment_status`, `expected_catalog_chain_top_match`, and `expected_catalog_selected_matches_top` on each step.
 
 ## Why This Matters
 
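The soft divergence warning described in the notes above can be sketched as follows. This is a minimal illustration, not the harness code: the helper name `catalog_divergence_finding` is hypothetical, and reading `allow_catalog_alignment_divergence` from the step dict is an assumption (the notes only say the replay spec marks it). The state key, the two divergent states, the finding code, and the `warning` severity are taken from this commit.

```python
from typing import Any, Optional

# Alignment states that the notes above describe as divergent.
DIVERGENT_STATES = {"selected_lower_rank", "selected_outside_match_set"}


def catalog_divergence_finding(
    step: dict[str, Any], step_state: dict[str, Any]
) -> Optional[dict[str, Any]]:
    """Hypothetical helper: return a warning finding for a divergent state, else None."""
    status = str(step_state.get("mcp_discovery_catalog_chain_alignment_status") or "").strip()
    if status not in DIVERGENT_STATES:
        return None
    # Assumption: the opt-out flag is read per step and must be an explicit True.
    if step.get("allow_catalog_alignment_divergence") is True:
        return None
    return {
        "code": "catalog_alignment_divergence",
        "severity": "warning",
        "actual": status,
    }


state = {"mcp_discovery_catalog_chain_alignment_status": "selected_lower_rank"}
print(catalog_divergence_finding({}, state))
print(catalog_divergence_finding({"allow_catalog_alignment_divergence": True}, state))
```

Keeping the opt-out explicit means a pack has to acknowledge each intentional planner-vs-catalog divergence instead of silently inheriting it.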
@@ -269,9 +270,14 @@ Latest validation after catalog-alignment acceptance invariant:
 - Python replay-tooling tests: passed, `6 passed`
 - graphify rebuild: `5949 nodes`, `12923 edges`, `136 communities`
 
+Latest validation after catalog-alignment spec assertions:
+
+- Python replay-tooling tests: passed, `7 passed`
+- graphify rebuild: `5951 nodes`, `12926 edges`, `139 communities`
+
 ## Next Step
 
-The next safe step is still to re-run live replay once the 1C side is actively polling the proxy. In parallel, local-only consolidation can continue by using `alignment_status`, alignment reason-code telemetry, truth-harness artifact surfacing, the soft divergence warning, `catalog_alignment_ok`, and the representative guard to find remaining manual branches where selected chains diverge from reviewed catalog-fabric intent.
+The next safe step is still to re-run live replay once the 1C side is actively polling the proxy. In parallel, local-only consolidation can continue by using `alignment_status`, alignment reason-code telemetry, truth-harness artifact surfacing, the soft divergence warning, `catalog_alignment_ok`, step-level expected catalog-alignment assertions, and the representative guard to find remaining manual branches where selected chains diverge from reviewed catalog-fabric intent.
 
 Recommended order:
 

@@ -87,6 +87,7 @@ It now documents a turnaround that is already operational in code, already mater
 - truth-harness and scenario acceptance artifacts now preserve catalog-alignment status/top-match fields, so AGENT replay review can spot planner-vs-catalog divergence directly in `truth_review.md` and `scenario_acceptance_matrix.json`;
 - truth-harness now emits a warning finding when selected chains fall below or outside the reviewed catalog top match, unless a spec explicitly allows that divergence;
 - scenario acceptance now exposes `catalog_alignment_ok`, so planner-vs-catalog divergence is a first-class acceptance invariant instead of an ungrouped warning;
+- truth-harness specs can now assert expected catalog-alignment status/top-match/top-flag per step, so AGENT packs can validate the planner brain's selected chain against the reviewed catalog route fabric;
 - explicit-counterparty incoming-vs-outgoing data-need graphs now select the reviewed `value_flow_comparison` chain instead of falling back to generic `value_flow`;
 - live map sync: [20 - planner_autonomy_consolidation_2026-05-01.md](./20%20-%20planner_autonomy_consolidation_2026-05-01.md)
 
@@ -99,8 +100,8 @@ Current honest status:
 - open-world bounded-autonomy readiness: `~85%`
 - Post-F semantic integrity module progress: `~99%` operationally closed, with remaining risk now treated as next-slice discovery rather than an open blocker inside the closed slice
 - active inventory-stock breadth slice progress: `100%` for the declared scenario pack, not for arbitrary inventory questions
-- Planner Autonomy Consolidation progress: `~90%` for the declared module, with catalog-fabric, value-flow arbitration, lifecycle bounded inference, broad-evaluation bridge, inventory catalog templates, inventory runtime-boundary honesty, exact inventory recipe bridging, unambiguous metadata-surface lane inference, catalog chain-template scoring, structured chain-match contract exposure, runtime/debug propagation, subject-aware bidirectional comparison arbitration, structured catalog-alignment verdicts, representative alignment regression guard, catalog-alignment reason-code telemetry, explicit `alignment_status` propagation, truth-harness/acceptance-matrix surfacing, soft divergence warning, and `catalog_alignment_ok` acceptance invariant validated locally, but live replay for the new bridge is currently blocked by missing active 1C polling and broader unfamiliar 1C asks still need replay-backed growth
-- graph snapshot after latest rebuild: `5949 nodes`, `12923 edges`, `136 communities`
+- Planner Autonomy Consolidation progress: `~91%` for the declared module, with catalog-fabric, value-flow arbitration, lifecycle bounded inference, broad-evaluation bridge, inventory catalog templates, inventory runtime-boundary honesty, exact inventory recipe bridging, unambiguous metadata-surface lane inference, catalog chain-template scoring, structured chain-match contract exposure, runtime/debug propagation, subject-aware bidirectional comparison arbitration, structured catalog-alignment verdicts, representative alignment regression guard, catalog-alignment reason-code telemetry, explicit `alignment_status` propagation, truth-harness/acceptance-matrix surfacing, soft divergence warning, `catalog_alignment_ok` acceptance invariant, and step-level expected catalog-alignment assertions validated locally, but live replay for the new bridge is currently blocked by missing active 1C polling and broader unfamiliar 1C asks still need replay-backed growth
+- graph snapshot after latest rebuild: `5951 nodes`, `12926 edges`, `139 communities`
 - current breakpoint:
   - the validated hot paths are no longer structurally broken;
   - flagship continuity collapse is no longer the primary risk;
@@ -156,6 +157,7 @@ Latest live proof now includes:
 - catalog-alignment replay artifact surfacing accepted locally: Python truth-harness/acceptance tests passed `4/4`; graphify rebuilt to `5946 nodes`, `12918 edges`, `136 communities`
 - catalog-alignment divergence warning accepted locally: Python truth-harness/acceptance tests passed `5/5`; graphify rebuilt to `5947 nodes`, `12920 edges`, `138 communities`
 - catalog-alignment acceptance invariant accepted locally: Python truth-harness/acceptance tests passed `6/6`; graphify rebuilt to `5949 nodes`, `12923 edges`, `136 communities`
+- catalog-alignment spec assertions accepted locally: Python truth-harness/acceptance tests passed `7/7`; graphify rebuilt to `5951 nodes`, `12926 edges`, `139 communities`
 
 Current architectural reading:
 

@@ -24,6 +24,9 @@ TECHNICAL_QUESTION_FIELDS = (
     "expected_capability",
     "expected_recipe",
     "expected_result_mode",
+    "expected_catalog_alignment_status",
+    "expected_catalog_chain_top_match",
+    "expected_catalog_selected_matches_top",
     "required_filters",
     "forbidden_capabilities",
     "forbidden_recipes",
@@ -89,6 +92,13 @@ def normalize_step_spec(index: int, raw_step: Any) -> dict[str, Any]:
     normalized_step["allowed_limited_reason_categories"] = normalize_pattern_list(
         step.get("allowed_limited_reason_categories")
     )
+    normalized_step["expected_catalog_alignment_status"] = (
+        str(step.get("expected_catalog_alignment_status") or "").strip() or None
+    )
+    normalized_step["expected_catalog_chain_top_match"] = (
+        str(step.get("expected_catalog_chain_top_match") or "").strip() or None
+    )
+    normalized_step["expected_catalog_selected_matches_top"] = step.get("expected_catalog_selected_matches_top")
     normalized_step["required_answer_patterns_any"] = normalize_pattern_list(step.get("required_answer_patterns_any"))
     normalized_step["required_answer_patterns_all"] = normalize_pattern_list(step.get("required_answer_patterns_all"))
     normalized_step["required_direct_answer_patterns_any"] = normalize_pattern_list(
@@ -312,6 +322,17 @@ def normalize_actual_filter_value(filter_key: str, raw_value: Any) -> str:
     return str(raw_value or "").strip()
 
 
+def normalize_optional_bool(value: Any) -> bool | None:
+    if isinstance(value, bool):
+        return value
+    raw = str(value or "").strip().lower()
+    if raw in {"true", "1", "yes", "y"}:
+        return True
+    if raw in {"false", "0", "no", "n"}:
+        return False
+    return None
+
+
 def evaluate_truth_step(
     *,
     step: dict[str, Any],
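To make the tri-state behaviour of `normalize_optional_bool` concrete, the sketch below copies the function from the hunk above (with `Optional[bool]` in place of `bool | None` so it also runs on older interpreters) and feeds it a few sample values. The sample list and the `results` name are illustrative only.

```python
from typing import Any, Optional


def normalize_optional_bool(value: Any) -> Optional[bool]:
    # Same logic as the function added in the hunk above.
    if isinstance(value, bool):
        return value
    raw = str(value or "").strip().lower()
    if raw in {"true", "1", "yes", "y"}:
        return True
    if raw in {"false", "0", "no", "n"}:
        return False
    return None


# Illustrative inputs: real booleans, string spellings, integers, and junk.
samples = [True, "Yes", " false ", "1", 1, 0, None, "maybe"]
results = {repr(sample): normalize_optional_bool(sample) for sample in samples}
for label, value in results.items():
    print(f"{label:>10} -> {value}")
```

One detail worth knowing when writing specs: the integer `0` normalizes to `None`, not `False`, because `value or ""` collapses every falsy non-bool value to the empty string before parsing. The string `"0"` and the boolean `False` both yield `False`, so a spec that expects a false flag should use one of those spellings.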
@@ -328,6 +349,7 @@ def evaluate_truth_step(
     selected_recipe = str(step_state.get("selected_recipe") or "").strip()
     capability_id = str(step_state.get("capability_id") or "").strip()
     catalog_alignment_status = str(step_state.get("mcp_discovery_catalog_chain_alignment_status") or "").strip()
+    catalog_chain_top_match = str(step_state.get("mcp_discovery_catalog_chain_top_match") or "").strip()
     limited_reason_category = str(step_state.get("limited_reason_category") or "").strip()
     extracted_filters = (
         step_state.get("extracted_filters") if isinstance(step_state.get("extracted_filters"), dict) else {}
@@ -351,6 +373,64 @@ def evaluate_truth_step(
             severity="warning",
         )
 
+    expected_catalog_alignment_status = str(
+        resolve_nested_placeholders(
+            step.get("expected_catalog_alignment_status"),
+            step_results,
+            bindings,
+            runtime_bindings,
+        )
+        or ""
+    ).strip()
+    if expected_catalog_alignment_status and catalog_alignment_status != expected_catalog_alignment_status:
+        append_finding(
+            findings,
+            step,
+            "wrong_catalog_alignment_status",
+            "Catalog-chain alignment status does not match the expected planner/catalog verdict for this step.",
+            actual=catalog_alignment_status or None,
+            expected=expected_catalog_alignment_status,
+        )
+
+    expected_catalog_chain_top_match = str(
+        resolve_nested_placeholders(
+            step.get("expected_catalog_chain_top_match"),
+            step_results,
+            bindings,
+            runtime_bindings,
+        )
+        or ""
+    ).strip()
+    if expected_catalog_chain_top_match and catalog_chain_top_match != expected_catalog_chain_top_match:
+        append_finding(
+            findings,
+            step,
+            "wrong_catalog_chain_top_match",
+            "Top reviewed catalog-chain match does not match the expected chain for this step.",
+            actual=catalog_chain_top_match or None,
+            expected=expected_catalog_chain_top_match,
+        )
+
+    expected_catalog_selected_matches_top = normalize_optional_bool(
+        resolve_nested_placeholders(
+            step.get("expected_catalog_selected_matches_top"),
+            step_results,
+            bindings,
+            runtime_bindings,
+        )
+    )
+    if expected_catalog_selected_matches_top is not None:
+        actual_catalog_selected_matches_top = step_state.get("mcp_discovery_catalog_chain_selected_matches_top") is True
+        if actual_catalog_selected_matches_top != expected_catalog_selected_matches_top:
+            append_finding(
+                findings,
+                step,
+                "wrong_catalog_selected_matches_top",
+                "Selected chain top-match flag does not match the expected planner/catalog verdict for this step.",
+                actual=actual_catalog_selected_matches_top,
+                expected=expected_catalog_selected_matches_top,
+            )
+
     if step_state.get("question_resolved") != step["question_template"]:
         append_finding(
             findings,

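The three assertions added to `evaluate_truth_step` above follow one pattern: an expectation left unset is skipped, and a set expectation that differs from the recorded state yields a finding code. The sketch below condenses that pattern into a standalone function. It is a simplification under stated assumptions: `check_catalog_expectations` is a hypothetical name, placeholder resolution via `resolve_nested_placeholders` is omitted, and the flag is accepted only as a real `bool` instead of going through `normalize_optional_bool`. The field names and finding codes are the ones used in this commit.

```python
from typing import Any


def check_catalog_expectations(step: dict[str, Any], step_state: dict[str, Any]) -> list[str]:
    """Hypothetical helper: return finding codes for mismatched catalog expectations."""
    codes: list[str] = []

    expected_status = str(step.get("expected_catalog_alignment_status") or "").strip()
    actual_status = str(step_state.get("mcp_discovery_catalog_chain_alignment_status") or "").strip()
    if expected_status and actual_status != expected_status:
        codes.append("wrong_catalog_alignment_status")

    expected_top = str(step.get("expected_catalog_chain_top_match") or "").strip()
    actual_top = str(step_state.get("mcp_discovery_catalog_chain_top_match") or "").strip()
    if expected_top and actual_top != expected_top:
        codes.append("wrong_catalog_chain_top_match")

    # Simplification: only a literal bool counts as a set expectation here.
    expected_flag = step.get("expected_catalog_selected_matches_top")
    if isinstance(expected_flag, bool):
        actual_flag = step_state.get("mcp_discovery_catalog_chain_selected_matches_top") is True
        if actual_flag != expected_flag:
            codes.append("wrong_catalog_selected_matches_top")

    return codes


# Same values as the unit test in this commit: only the top match differs.
step = {
    "expected_catalog_alignment_status": "selected_matches_top",
    "expected_catalog_chain_top_match": "value_flow_comparison",
    "expected_catalog_selected_matches_top": True,
}
step_state = {
    "mcp_discovery_catalog_chain_alignment_status": "selected_matches_top",
    "mcp_discovery_catalog_chain_top_match": "value_flow",
    "mcp_discovery_catalog_chain_selected_matches_top": True,
}
codes = check_catalog_expectations(step, step_state)
print(codes)
```

Because each expectation is checked independently, a step can pin only the field it cares about, for example the top match, without also having to state the status and the flag.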
@@ -145,7 +145,12 @@ def _is_meta_context_code(code: str) -> bool:
 
 
 def _is_catalog_alignment_code(code: str) -> bool:
-    return code == "catalog_alignment_divergence"
+    return code in {
+        "catalog_alignment_divergence",
+        "wrong_catalog_alignment_status",
+        "wrong_catalog_chain_top_match",
+        "wrong_catalog_selected_matches_top",
+    }
 
 
 def _derive_step_invariant_failures(step: dict[str, Any], findings: list[dict[str, Any]]) -> dict[str, bool]:

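Widening `_is_catalog_alignment_code` means the three new assertion findings roll up into the same acceptance invariant as the divergence warning. The body of `_derive_step_invariant_failures` is not shown in this diff, so the sketch below is only an assumption about how such a roll-up could look: `catalog_alignment_ok` here is a hypothetical function that treats the invariant as holding when no finding carries one of the four codes.

```python
from typing import Any

# The four codes recognised by _is_catalog_alignment_code after this commit.
CATALOG_ALIGNMENT_CODES = {
    "catalog_alignment_divergence",
    "wrong_catalog_alignment_status",
    "wrong_catalog_chain_top_match",
    "wrong_catalog_selected_matches_top",
}


def catalog_alignment_ok(findings: list[dict[str, Any]]) -> bool:
    """Hypothetical roll-up: True when no finding has a catalog-alignment code."""
    return not any(
        str(finding.get("code") or "") in CATALOG_ALIGNMENT_CODES for finding in findings
    )


findings = [
    {"code": "wrong_catalog_chain_top_match", "severity": "critical"},
    {"code": "some_unrelated_code", "severity": "warning"},  # placeholder code for illustration
]
print(catalog_alignment_ok(findings))
print(catalog_alignment_ok(findings[1:]))
```

A set lookup keeps the grouping rule in one place, so adding a further catalog-alignment finding later only requires extending the set.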
@@ -81,6 +81,39 @@ class DomainCaseLoopStepStateTests(unittest.TestCase):
         self.assertEqual(reviewed["review_findings"][0]["code"], "catalog_alignment_divergence")
         self.assertEqual(reviewed["review_findings"][0]["severity"], "warning")
 
+    def test_truth_harness_checks_expected_catalog_alignment_fields(self) -> None:
+        reviewed = dth.evaluate_truth_step(
+            step={
+                "step_id": "step_01",
+                "question_template": "show planner alignment",
+                "criticality": "critical",
+                "allowed_reply_types": [],
+                "expected_catalog_alignment_status": "selected_matches_top",
+                "expected_catalog_chain_top_match": "value_flow_comparison",
+                "expected_catalog_selected_matches_top": True,
+            },
+            step_state={
+                "question_resolved": "show planner alignment",
+                "reply_type": "factual",
+                "assistant_text": "Confirmed answer",
+                "actual_direct_answer": "Confirmed answer",
+                "detected_intent": "counterparty_turnover",
+                "selected_recipe": "counterparty_turnover_by_period",
+                "capability_id": "confirmed_counterparty_turnover",
+                "mcp_discovery_catalog_chain_alignment_status": "selected_matches_top",
+                "mcp_discovery_catalog_chain_top_match": "value_flow",
+                "mcp_discovery_catalog_chain_selected_matches_top": True,
+                "extracted_filters": {},
+            },
+            step_results={},
+            bindings={},
+            runtime_bindings={},
+        )
+
+        self.assertEqual(reviewed["review_status"], "fail")
+        self.assertEqual(reviewed["critical_findings_count"], 1)
+        self.assertEqual(reviewed["review_findings"][0]["code"], "wrong_catalog_chain_top_match")
+
 
 if __name__ == "__main__":
     unittest.main()