diff --git a/docs/ARCH/11 - architecture_turnaround/20 - planner_autonomy_consolidation_2026-05-01.md b/docs/ARCH/11 - architecture_turnaround/20 - planner_autonomy_consolidation_2026-05-01.md
index d957983..d93eb92 100644
--- a/docs/ARCH/11 - architecture_turnaround/20 - planner_autonomy_consolidation_2026-05-01.md
+++ b/docs/ARCH/11 - architecture_turnaround/20 - planner_autonomy_consolidation_2026-05-01.md
@@ -122,6 +122,7 @@ The following consolidation step added catalog-level chain-template scoring:
 - planner reason codes now emit stable catalog-alignment telemetry for evaluated top-match, selected-equals-top, selected-lower-rank, selected-outside-match-set, and unscored selected-chain states.
 - `catalog_chain_template_alignment.alignment_status` now carries the same verdict as one enum-like field, and debug summary exposes it as `mcp_discovery_catalog_chain_alignment_status`.
 - `domain_truth_harness` and `scenario_acceptance_policy` now carry the alignment status, top catalog match, and selected-matches-top flag into replay artifacts instead of leaving them buried in raw debug JSON.
+- truth-harness now raises a warning finding for `selected_lower_rank` and `selected_outside_match_set` alignment states unless the replay spec explicitly marks `allow_catalog_alignment_divergence`.
 
 ## Why This Matters
 
@@ -257,9 +258,14 @@ Latest validation after truth-harness catalog-alignment artifact surfacing:
 - Python replay-tooling tests: passed, `4 passed`
 - graphify rebuild: `5946 nodes`, `12918 edges`, `136 communities`
 
+Latest validation after catalog-alignment divergence warning gate:
+
+- Python replay-tooling tests: passed, `5 passed`
+- graphify rebuild: `5947 nodes`, `12920 edges`, `138 communities`
+
 ## Next Step
 
-The next safe step is still to re-run live replay once the 1C side is actively polling the proxy. In parallel, local-only consolidation can continue by using `alignment_status`, alignment reason-code telemetry, truth-harness artifact surfacing, and the representative guard to find remaining manual branches where selected chains diverge from reviewed catalog-fabric intent.
+The next safe step is still to re-run live replay once the 1C side is actively polling the proxy. In parallel, local-only consolidation can continue by using `alignment_status`, alignment reason-code telemetry, truth-harness artifact surfacing, the soft divergence warning, and the representative guard to find remaining manual branches where selected chains diverge from reviewed catalog-fabric intent.
 
 Recommended order:
 
diff --git a/docs/ARCH/11 - architecture_turnaround/README.md b/docs/ARCH/11 - architecture_turnaround/README.md
index 47dbf0b..07e9bf6 100644
--- a/docs/ARCH/11 - architecture_turnaround/README.md
+++ b/docs/ARCH/11 - architecture_turnaround/README.md
@@ -85,6 +85,7 @@ It now documents a turnaround that is already operational in code, already mater
   - planner reason codes now also emit stable catalog-alignment telemetry, so automated replay review can filter top-match, lower-rank, outside-match, and unscored selected-chain states without hand-parsing debug JSON;
   - catalog-alignment now carries a single `alignment_status` verdict through planner/runtime/debug, making replay divergence detection explicit instead of reconstructing it from booleans;
   - truth-harness and scenario acceptance artifacts now preserve catalog-alignment status/top-match fields, so AGENT replay review can spot planner-vs-catalog divergence directly in `truth_review.md` and `scenario_acceptance_matrix.json`;
+  - truth-harness now emits a warning finding when selected chains fall below or outside the reviewed catalog top match, unless a spec explicitly allows that divergence;
   - explicit-counterparty incoming-vs-outgoing data-need graphs now select the reviewed `value_flow_comparison` chain instead of falling back to generic `value_flow`;
   - live map sync: [20 - planner_autonomy_consolidation_2026-05-01.md](./20%20-%20planner_autonomy_consolidation_2026-05-01.md)
 
@@ -97,8 +98,8 @@ Current honest status:
 - open-world bounded-autonomy readiness: `~85%`
 - Post-F semantic integrity module progress: `~99%` operationally closed, with remaining risk now treated as next-slice discovery rather than an open blocker inside the closed slice
 - active inventory-stock breadth slice progress: `100%` for the declared scenario pack, not for arbitrary inventory questions
-- Planner Autonomy Consolidation progress: `~88%` for the declared module, with catalog-fabric, value-flow arbitration, lifecycle bounded inference, broad-evaluation bridge, inventory catalog templates, inventory runtime-boundary honesty, exact inventory recipe bridging, unambiguous metadata-surface lane inference, catalog chain-template scoring, structured chain-match contract exposure, runtime/debug propagation, subject-aware bidirectional comparison arbitration, structured catalog-alignment verdicts, representative alignment regression guard, catalog-alignment reason-code telemetry, explicit `alignment_status` propagation, and truth-harness/acceptance-matrix surfacing validated locally, but live replay for the new bridge is currently blocked by missing active 1C polling and broader unfamiliar 1C asks still need replay-backed growth
-- graph snapshot after latest rebuild: `5946 nodes`, `12918 edges`, `136 communities`
+- Planner Autonomy Consolidation progress: `~89%` for the declared module, with catalog-fabric, value-flow arbitration, lifecycle bounded inference, broad-evaluation bridge, inventory catalog templates, inventory runtime-boundary honesty, exact inventory recipe bridging, unambiguous metadata-surface lane inference, catalog chain-template scoring, structured chain-match contract exposure, runtime/debug propagation, subject-aware bidirectional comparison arbitration, structured catalog-alignment verdicts, representative alignment regression guard, catalog-alignment reason-code telemetry, explicit `alignment_status` propagation, truth-harness/acceptance-matrix surfacing, and soft divergence warning validated locally, but live replay for the new bridge is currently blocked by missing active 1C polling and broader unfamiliar 1C asks still need replay-backed growth
+- graph snapshot after latest rebuild: `5947 nodes`, `12920 edges`, `138 communities`
 - current breakpoint:
   - the validated hot paths are no longer structurally broken;
   - flagship continuity collapse is no longer the primary risk;
@@ -152,6 +153,7 @@ Latest live proof now includes:
 - catalog-alignment reason-code telemetry accepted locally: planner/runtime slice passed `53/53`; full MCP-discovery suite passed `283/283` with `9` skipped; build passed; graphify rebuilt to `5943 nodes`, `12915 edges`, `136 communities`
 - catalog-alignment status verdict accepted locally: planner/runtime/debug slice passed `55/55`; full MCP-discovery suite passed `283/283` with `9` skipped; build passed; graphify rebuilt to `5943 nodes`, `12915 edges`, `136 communities`
 - catalog-alignment replay artifact surfacing accepted locally: Python truth-harness/acceptance tests passed `4/4`; graphify rebuilt to `5946 nodes`, `12918 edges`, `136 communities`
+- catalog-alignment divergence warning accepted locally: Python truth-harness/acceptance tests passed `5/5`; graphify rebuilt to `5947 nodes`, `12920 edges`, `138 communities`
 
 Current architectural reading:
 
diff --git a/scripts/domain_truth_harness.py b/scripts/domain_truth_harness.py
index 2315f0b..7b29aa3 100644
--- a/scripts/domain_truth_harness.py
+++ b/scripts/domain_truth_harness.py
@@ -285,11 +285,12 @@ def append_finding(
     *,
     actual: Any = None,
     expected: Any = None,
+    severity: str | None = None,
 ) -> None:
     findings.append(
         {
             "code": code,
-            "severity": step.get("criticality") or DEFAULT_CRITICALITY,
+            "severity": severity or step.get("criticality") or DEFAULT_CRITICALITY,
             "message": message,
             "actual": actual,
             "expected": expected,
@@ -326,11 +327,30 @@ def evaluate_truth_step(
     detected_intent = str(step_state.get("detected_intent") or "").strip()
     selected_recipe = str(step_state.get("selected_recipe") or "").strip()
     capability_id = str(step_state.get("capability_id") or "").strip()
+    catalog_alignment_status = str(step_state.get("mcp_discovery_catalog_chain_alignment_status") or "").strip()
     limited_reason_category = str(step_state.get("limited_reason_category") or "").strip()
     extracted_filters = (
         step_state.get("extracted_filters") if isinstance(step_state.get("extracted_filters"), dict) else {}
     )
 
+    if (
+        catalog_alignment_status in {"selected_lower_rank", "selected_outside_match_set"}
+        and not bool(step.get("allow_catalog_alignment_divergence"))
+    ):
+        append_finding(
+            findings,
+            step,
+            "catalog_alignment_divergence",
+            "Planner selected chain diverges from the top reviewed catalog-chain match and needs semantic review.",
+            actual={
+                "alignment_status": catalog_alignment_status,
+                "top_match": step_state.get("mcp_discovery_catalog_chain_top_match"),
+                "selected_matches_top": step_state.get("mcp_discovery_catalog_chain_selected_matches_top"),
+            },
+            expected="selected_matches_top or explicit allow_catalog_alignment_divergence",
+            severity="warning",
+        )
+
     if step_state.get("question_resolved") != step["question_template"]:
         append_finding(
             findings,
diff --git a/scripts/test_domain_case_loop_step_state.py b/scripts/test_domain_case_loop_step_state.py
index d2ed7a2..725f03b 100644
--- a/scripts/test_domain_case_loop_step_state.py
+++ b/scripts/test_domain_case_loop_step_state.py
@@ -8,6 +8,7 @@ from pathlib import Path
 sys.path.insert(0, str(Path(__file__).resolve().parent))
 
 import domain_case_loop as dcl
+import domain_truth_harness as dth
 
 
 class DomainCaseLoopStepStateTests(unittest.TestCase):
@@ -49,6 +50,37 @@ class DomainCaseLoopStepStateTests(unittest.TestCase):
         self.assertEqual(step_state["mcp_discovery_catalog_chain_top_match"], "value_flow")
         self.assertTrue(step_state["mcp_discovery_catalog_chain_selected_matches_top"])
 
+    def test_truth_harness_warns_on_catalog_alignment_divergence(self) -> None:
+        reviewed = dth.evaluate_truth_step(
+            step={
+                "step_id": "step_01",
+                "question_template": "show planner alignment",
+                "criticality": "critical",
+                "allowed_reply_types": [],
+            },
+            step_state={
+                "question_resolved": "show planner alignment",
+                "reply_type": "factual",
+                "assistant_text": "Confirmed answer",
+                "actual_direct_answer": "Confirmed answer",
+                "detected_intent": "counterparty_turnover",
+                "selected_recipe": "counterparty_turnover_by_period",
+                "capability_id": "confirmed_counterparty_turnover",
+                "mcp_discovery_catalog_chain_alignment_status": "selected_outside_match_set",
+                "mcp_discovery_catalog_chain_top_match": "value_flow_comparison",
+                "mcp_discovery_catalog_chain_selected_matches_top": False,
+                "extracted_filters": {},
+            },
+            step_results={},
+            bindings={},
+            runtime_bindings={},
+        )
+
+        self.assertEqual(reviewed["review_status"], "warning")
+        self.assertEqual(reviewed["warning_findings_count"], 1)
+        self.assertEqual(reviewed["review_findings"][0]["code"], "catalog_alignment_divergence")
+        self.assertEqual(reviewed["review_findings"][0]["severity"], "warning")
+
 
 if __name__ == "__main__":
     unittest.main()
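
Reviewer note: the divergence gate added in `scripts/domain_truth_harness.py` can be read in isolation as the minimal sketch below. It is not the harness itself. `check_catalog_alignment` and `DIVERGENT_STATUSES` are hypothetical names introduced only for illustration, the value of `DEFAULT_CRITICALITY` is assumed because the diff does not show it, and `append_finding` is reduced to the fields visible in the hunk.

```python
from typing import Any, Optional

# Assumption: the real DEFAULT_CRITICALITY value is not visible in the diff.
DEFAULT_CRITICALITY = "critical"

# The two alignment states the diff treats as divergence.
# Hypothetical constant; the diff inlines this set in the condition.
DIVERGENT_STATUSES = {"selected_lower_rank", "selected_outside_match_set"}


def append_finding(
    findings: list,
    step: dict,
    code: str,
    message: str,
    *,
    actual: Any = None,
    expected: Any = None,
    severity: Optional[str] = None,
) -> None:
    # An explicit severity wins; otherwise fall back to the step's
    # criticality, then to the default, as in the changed line of the diff.
    findings.append(
        {
            "code": code,
            "severity": severity or step.get("criticality") or DEFAULT_CRITICALITY,
            "message": message,
            "actual": actual,
            "expected": expected,
        }
    )


def check_catalog_alignment(step: dict, step_state: dict) -> list:
    """Hypothetical helper: return the findings the new gate would add."""
    findings: list = []
    status = str(
        step_state.get("mcp_discovery_catalog_chain_alignment_status") or ""
    ).strip()
    allowed = bool(step.get("allow_catalog_alignment_divergence"))
    if status in DIVERGENT_STATUSES and not allowed:
        append_finding(
            findings,
            step,
            "catalog_alignment_divergence",
            "Planner selected chain diverges from the top reviewed "
            "catalog-chain match and needs semantic review.",
            actual={
                "alignment_status": status,
                "top_match": step_state.get("mcp_discovery_catalog_chain_top_match"),
                "selected_matches_top": step_state.get(
                    "mcp_discovery_catalog_chain_selected_matches_top"
                ),
            },
            expected="selected_matches_top or explicit allow_catalog_alignment_divergence",
            severity="warning",
        )
    return findings


if __name__ == "__main__":
    state = {"mcp_discovery_catalog_chain_alignment_status": "selected_lower_rank"}
    # A critical step still yields a warning-level finding for divergence.
    print(check_catalog_alignment({"criticality": "critical"}, state))
    # The explicit opt-out in the replay spec suppresses the finding.
    print(check_catalog_alignment({"allow_catalog_alignment_divergence": True}, state))
```

The point worth reviewing is the `severity` override: without it, a divergence on a step marked `"criticality": "critical"` would be recorded as critical rather than as the soft warning the docs describe.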