diff --git a/docs/ARCH/11 - architecture_turnaround/20 - planner_autonomy_consolidation_2026-05-01.md b/docs/ARCH/11 - architecture_turnaround/20 - planner_autonomy_consolidation_2026-05-01.md
index 19c3aed..6170c2b 100644
--- a/docs/ARCH/11 - architecture_turnaround/20 - planner_autonomy_consolidation_2026-05-01.md
+++ b/docs/ARCH/11 - architecture_turnaround/20 - planner_autonomy_consolidation_2026-05-01.md
@@ -124,6 +124,7 @@ The following consolidation step added catalog-level chain-template scoring:
 - `domain_truth_harness` and `scenario_acceptance_policy` now carry the alignment status, top catalog match, and selected-matches-top flag into replay artifacts instead of leaving them buried in raw debug JSON.
 - truth-harness now raises a warning finding for `selected_lower_rank` and `selected_outside_match_set` alignment states unless the replay spec explicitly marks `allow_catalog_alignment_divergence`.
 - scenario acceptance now groups that warning under `catalog_alignment_ok`, and `final_status.md` prints the invariant alongside direct-answer, temporal, truth-gate, human-answer, meta-context, and selected-object gates.
+- truth-harness specs can now assert `expected_catalog_alignment_status`, `expected_catalog_chain_top_match`, and `expected_catalog_selected_matches_top` on each step.
 
 ## Why This Matters
 
@@ -269,9 +270,14 @@ Latest validation after catalog-alignment acceptance invariant:
 - Python replay-tooling tests: passed, `6 passed`
 - graphify rebuild: `5949 nodes`, `12923 edges`, `136 communities`
 
+Latest validation after catalog-alignment spec assertions:
+
+- Python replay-tooling tests: passed, `7 passed`
+- graphify rebuild: `5951 nodes`, `12926 edges`, `139 communities`
+
 ## Next Step
 
-The next safe step is still to re-run live replay once the 1C side is actively polling the proxy. In parallel, local-only consolidation can continue by using `alignment_status`, alignment reason-code telemetry, truth-harness artifact surfacing, the soft divergence warning, `catalog_alignment_ok`, and the representative guard to find remaining manual branches where selected chains diverge from reviewed catalog-fabric intent.
+The next safe step is still to re-run live replay once the 1C side is actively polling the proxy. In parallel, local-only consolidation can continue by using `alignment_status`, alignment reason-code telemetry, truth-harness artifact surfacing, the soft divergence warning, `catalog_alignment_ok`, step-level expected catalog-alignment assertions, and the representative guard to find remaining manual branches where selected chains diverge from reviewed catalog-fabric intent.
 
 Recommended order:
 
diff --git a/docs/ARCH/11 - architecture_turnaround/README.md b/docs/ARCH/11 - architecture_turnaround/README.md
index 6d24827..6fd175c 100644
--- a/docs/ARCH/11 - architecture_turnaround/README.md
+++ b/docs/ARCH/11 - architecture_turnaround/README.md
@@ -87,6 +87,7 @@ It now documents a turnaround that is already operational in code, already mater
   - truth-harness and scenario acceptance artifacts now preserve catalog-alignment status/top-match fields, so AGENT replay review can spot planner-vs-catalog divergence directly in `truth_review.md` and `scenario_acceptance_matrix.json`;
   - truth-harness now emits a warning finding when selected chains fall below or outside the reviewed catalog top match, unless a spec explicitly allows that divergence;
   - scenario acceptance now exposes `catalog_alignment_ok`, so planner-vs-catalog divergence is a first-class acceptance invariant instead of an ungrouped warning;
+  - truth-harness specs can now assert expected catalog-alignment status/top-match/top-flag per step, so AGENT packs can validate the planner brain's selected chain against the reviewed catalog route fabric;
   - explicit-counterparty incoming-vs-outgoing data-need graphs now select the reviewed `value_flow_comparison` chain instead of falling back to generic `value_flow`;
   - live map sync: [20 - planner_autonomy_consolidation_2026-05-01.md](./20%20-%20planner_autonomy_consolidation_2026-05-01.md)
 
@@ -99,8 +100,8 @@ Current honest status:
 - open-world bounded-autonomy readiness: `~85%`
 - Post-F semantic integrity module progress: `~99%` operationally closed, with remaining risk now treated as next-slice discovery rather than an open blocker inside the closed slice
 - active inventory-stock breadth slice progress: `100%` for the declared scenario pack, not for arbitrary inventory questions
-- Planner Autonomy Consolidation progress: `~90%` for the declared module, with catalog-fabric, value-flow arbitration, lifecycle bounded inference, broad-evaluation bridge, inventory catalog templates, inventory runtime-boundary honesty, exact inventory recipe bridging, unambiguous metadata-surface lane inference, catalog chain-template scoring, structured chain-match contract exposure, runtime/debug propagation, subject-aware bidirectional comparison arbitration, structured catalog-alignment verdicts, representative alignment regression guard, catalog-alignment reason-code telemetry, explicit `alignment_status` propagation, truth-harness/acceptance-matrix surfacing, soft divergence warning, and `catalog_alignment_ok` acceptance invariant validated locally, but live replay for the new bridge is currently blocked by missing active 1C polling and broader unfamiliar 1C asks still need replay-backed growth
-- graph snapshot after latest rebuild: `5949 nodes`, `12923 edges`, `136 communities`
+- Planner Autonomy Consolidation progress: `~91%` for the declared module, with catalog-fabric, value-flow arbitration, lifecycle bounded inference, broad-evaluation bridge, inventory catalog templates, inventory runtime-boundary honesty, exact inventory recipe bridging, unambiguous metadata-surface lane inference, catalog chain-template scoring, structured chain-match contract exposure, runtime/debug propagation, subject-aware bidirectional comparison arbitration, structured catalog-alignment verdicts, representative alignment regression guard, catalog-alignment reason-code telemetry, explicit `alignment_status` propagation, truth-harness/acceptance-matrix surfacing, soft divergence warning, `catalog_alignment_ok` acceptance invariant, and step-level expected catalog-alignment assertions validated locally, but live replay for the new bridge is currently blocked by missing active 1C polling and broader unfamiliar 1C asks still need replay-backed growth
+- graph snapshot after latest rebuild: `5951 nodes`, `12926 edges`, `139 communities`
 - current breakpoint:
   - the validated hot paths are no longer structurally broken;
   - flagship continuity collapse is no longer the primary risk;
@@ -156,6 +157,7 @@ Latest live proof now includes:
 - catalog-alignment replay artifact surfacing accepted locally: Python truth-harness/acceptance tests passed `4/4`; graphify rebuilt to `5946 nodes`, `12918 edges`, `136 communities`
 - catalog-alignment divergence warning accepted locally: Python truth-harness/acceptance tests passed `5/5`; graphify rebuilt to `5947 nodes`, `12920 edges`, `138 communities`
 - catalog-alignment acceptance invariant accepted locally: Python truth-harness/acceptance tests passed `6/6`; graphify rebuilt to `5949 nodes`, `12923 edges`, `136 communities`
+- catalog-alignment spec assertions accepted locally: Python truth-harness/acceptance tests passed `7/7`; graphify rebuilt to `5951 nodes`, `12926 edges`, `139 communities`
 
 Current architectural reading:
 
diff --git a/scripts/domain_truth_harness.py b/scripts/domain_truth_harness.py
index 7b29aa3..12a705e 100644
--- a/scripts/domain_truth_harness.py
+++ b/scripts/domain_truth_harness.py
@@ -24,6 +24,9 @@ TECHNICAL_QUESTION_FIELDS = (
     "expected_capability",
     "expected_recipe",
     "expected_result_mode",
+    "expected_catalog_alignment_status",
+    "expected_catalog_chain_top_match",
+    "expected_catalog_selected_matches_top",
     "required_filters",
     "forbidden_capabilities",
     "forbidden_recipes",
@@ -89,6 +92,13 @@ def normalize_step_spec(index: int, raw_step: Any) -> dict[str, Any]:
     normalized_step["allowed_limited_reason_categories"] = normalize_pattern_list(
         step.get("allowed_limited_reason_categories")
     )
+    normalized_step["expected_catalog_alignment_status"] = (
+        str(step.get("expected_catalog_alignment_status") or "").strip() or None
+    )
+    normalized_step["expected_catalog_chain_top_match"] = (
+        str(step.get("expected_catalog_chain_top_match") or "").strip() or None
+    )
+    normalized_step["expected_catalog_selected_matches_top"] = step.get("expected_catalog_selected_matches_top")
     normalized_step["required_answer_patterns_any"] = normalize_pattern_list(step.get("required_answer_patterns_any"))
     normalized_step["required_answer_patterns_all"] = normalize_pattern_list(step.get("required_answer_patterns_all"))
     normalized_step["required_direct_answer_patterns_any"] = normalize_pattern_list(
@@ -312,6 +322,17 @@ def normalize_actual_filter_value(filter_key: str, raw_value: Any) -> str:
     return str(raw_value or "").strip()
 
 
+def normalize_optional_bool(value: Any) -> bool | None:
+    if isinstance(value, bool):
+        return value
+    raw = str(value or "").strip().lower()
+    if raw in {"true", "1", "yes", "y"}:
+        return True
+    if raw in {"false", "0", "no", "n"}:
+        return False
+    return None
+
+
 def evaluate_truth_step(
     *,
     step: dict[str, Any],
@@ -328,6 +349,7 @@ def evaluate_truth_step(
     selected_recipe = str(step_state.get("selected_recipe") or "").strip()
     capability_id = str(step_state.get("capability_id") or "").strip()
     catalog_alignment_status = str(step_state.get("mcp_discovery_catalog_chain_alignment_status") or "").strip()
+    catalog_chain_top_match = str(step_state.get("mcp_discovery_catalog_chain_top_match") or "").strip()
     limited_reason_category = str(step_state.get("limited_reason_category") or "").strip()
     extracted_filters = (
         step_state.get("extracted_filters") if isinstance(step_state.get("extracted_filters"), dict) else {}
@@ -351,6 +373,64 @@ def evaluate_truth_step(
             severity="warning",
         )
 
+    expected_catalog_alignment_status = str(
+        resolve_nested_placeholders(
+            step.get("expected_catalog_alignment_status"),
+            step_results,
+            bindings,
+            runtime_bindings,
+        )
+        or ""
+    ).strip()
+    if expected_catalog_alignment_status and catalog_alignment_status != expected_catalog_alignment_status:
+        append_finding(
+            findings,
+            step,
+            "wrong_catalog_alignment_status",
+            "Catalog-chain alignment status does not match the expected planner/catalog verdict for this step.",
+            actual=catalog_alignment_status or None,
+            expected=expected_catalog_alignment_status,
+        )
+
+    expected_catalog_chain_top_match = str(
+        resolve_nested_placeholders(
+            step.get("expected_catalog_chain_top_match"),
+            step_results,
+            bindings,
+            runtime_bindings,
+        )
+        or ""
+    ).strip()
+    if expected_catalog_chain_top_match and catalog_chain_top_match != expected_catalog_chain_top_match:
+        append_finding(
+            findings,
+            step,
+            "wrong_catalog_chain_top_match",
+            "Top reviewed catalog-chain match does not match the expected chain for this step.",
+            actual=catalog_chain_top_match or None,
+            expected=expected_catalog_chain_top_match,
+        )
+
+    expected_catalog_selected_matches_top = normalize_optional_bool(
+        resolve_nested_placeholders(
+            step.get("expected_catalog_selected_matches_top"),
+            step_results,
+            bindings,
+            runtime_bindings,
+        )
+    )
+    if expected_catalog_selected_matches_top is not None:
+        actual_catalog_selected_matches_top = step_state.get("mcp_discovery_catalog_chain_selected_matches_top") is True
+        if actual_catalog_selected_matches_top != expected_catalog_selected_matches_top:
+            append_finding(
+                findings,
+                step,
+                "wrong_catalog_selected_matches_top",
+                "Selected chain top-match flag does not match the expected planner/catalog verdict for this step.",
+                actual=actual_catalog_selected_matches_top,
+                expected=expected_catalog_selected_matches_top,
+            )
+
     if step_state.get("question_resolved") != step["question_template"]:
         append_finding(
             findings,
diff --git a/scripts/scenario_acceptance_policy.py b/scripts/scenario_acceptance_policy.py
index 6ec6b81..2d263b6 100644
--- a/scripts/scenario_acceptance_policy.py
+++ b/scripts/scenario_acceptance_policy.py
@@ -145,7 +145,12 @@ def _is_meta_context_code(code: str) -> bool:
 
 
 def _is_catalog_alignment_code(code: str) -> bool:
-    return code == "catalog_alignment_divergence"
+    return code in {
+        "catalog_alignment_divergence",
+        "wrong_catalog_alignment_status",
+        "wrong_catalog_chain_top_match",
+        "wrong_catalog_selected_matches_top",
+    }
 
 
 def _derive_step_invariant_failures(step: dict[str, Any], findings: list[dict[str, Any]]) -> dict[str, bool]:
diff --git a/scripts/test_domain_case_loop_step_state.py b/scripts/test_domain_case_loop_step_state.py
index 725f03b..104fae6 100644
--- a/scripts/test_domain_case_loop_step_state.py
+++ b/scripts/test_domain_case_loop_step_state.py
@@ -81,6 +81,39 @@ class DomainCaseLoopStepStateTests(unittest.TestCase):
         self.assertEqual(reviewed["review_findings"][0]["code"], "catalog_alignment_divergence")
         self.assertEqual(reviewed["review_findings"][0]["severity"], "warning")
 
+    def test_truth_harness_checks_expected_catalog_alignment_fields(self) -> None:
+        reviewed = dth.evaluate_truth_step(
+            step={
+                "step_id": "step_01",
+                "question_template": "show planner alignment",
+                "criticality": "critical",
+                "allowed_reply_types": [],
+                "expected_catalog_alignment_status": "selected_matches_top",
+                "expected_catalog_chain_top_match": "value_flow_comparison",
+                "expected_catalog_selected_matches_top": True,
+            },
+            step_state={
+                "question_resolved": "show planner alignment",
+                "reply_type": "factual",
+                "assistant_text": "Confirmed answer",
+                "actual_direct_answer": "Confirmed answer",
+                "detected_intent": "counterparty_turnover",
+                "selected_recipe": "counterparty_turnover_by_period",
+                "capability_id": "confirmed_counterparty_turnover",
+                "mcp_discovery_catalog_chain_alignment_status": "selected_matches_top",
+                "mcp_discovery_catalog_chain_top_match": "value_flow",
+                "mcp_discovery_catalog_chain_selected_matches_top": True,
+                "extracted_filters": {},
+            },
+            step_results={},
+            bindings={},
+            runtime_bindings={},
+        )
+
+        self.assertEqual(reviewed["review_status"], "fail")
+        self.assertEqual(reviewed["critical_findings_count"], 1)
+        self.assertEqual(reviewed["review_findings"][0]["code"], "wrong_catalog_chain_top_match")
+
 
 if __name__ == "__main__":
     unittest.main()