Planner Autonomy: check catalog-alignment in truth harness

2026-05-01 16:25:22 +03:00 · fe12967e2d
parent e58a9664e0
commit fe12967e2d
5 changed files with 130 additions and 4 deletions
--- a/planner_autonomy_consolidation_2026-05-01.md
+++ b/planner_autonomy_consolidation_2026-05-01.md
@@ -124,6 +124,7 @@ The following consolidation step added catalog-level chain-template scoring:
 - `domain_truth_harness` and `scenario_acceptance_policy` now carry the alignment status, top catalog match, and selected-matches-top flag into replay artifacts instead of leaving them buried in raw debug JSON.
 - truth-harness now raises a warning finding for `selected_lower_rank` and `selected_outside_match_set` alignment states unless the replay spec explicitly marks `allow_catalog_alignment_divergence`.
 - scenario acceptance now groups that warning under `catalog_alignment_ok`, and `final_status.md` prints the invariant alongside direct-answer, temporal, truth-gate, human-answer, meta-context, and selected-object gates.
+- truth-harness specs can now assert `expected_catalog_alignment_status`, `expected_catalog_chain_top_match`, and `expected_catalog_selected_matches_top` on each step.

 ## Why This Matters

@@ -269,9 +270,14 @@ Latest validation after catalog-alignment acceptance invariant:
 - Python replay-tooling tests: passed, `6 passed`
 - graphify rebuild: `5949 nodes`, `12923 edges`, `136 communities`

+Latest validation after catalog-alignment spec assertions:
+
+- Python replay-tooling tests: passed, `7 passed`
+- graphify rebuild: `5951 nodes`, `12926 edges`, `139 communities`
+
 ## Next Step

-The next safe step is still to re-run live replay once the 1C side is actively polling the proxy. In parallel, local-only consolidation can continue by using `alignment_status`, alignment reason-code telemetry, truth-harness artifact surfacing, the soft divergence warning, `catalog_alignment_ok`, and the representative guard to find remaining manual branches where selected chains diverge from reviewed catalog-fabric intent.
+The next safe step is still to re-run live replay once the 1C side is actively polling the proxy. In parallel, local-only consolidation can continue by using `alignment_status`, alignment reason-code telemetry, truth-harness artifact surfacing, the soft divergence warning, `catalog_alignment_ok`, step-level expected catalog-alignment assertions, and the representative guard to find remaining manual branches where selected chains diverge from reviewed catalog-fabric intent.

 Recommended order:

--- a/architecture_turnaround/README.md
+++ b/architecture_turnaround/README.md
@@ -87,6 +87,7 @@ It now documents a turnaround that is already operational in code, already mater
  - truth-harness and scenario acceptance artifacts now preserve catalog-alignment status/top-match fields, so AGENT replay review can spot planner-vs-catalog divergence directly in `truth_review.md` and `scenario_acceptance_matrix.json`;
  - truth-harness now emits a warning finding when selected chains fall below or outside the reviewed catalog top match, unless a spec explicitly allows that divergence;
  - scenario acceptance now exposes `catalog_alignment_ok`, so planner-vs-catalog divergence is a first-class acceptance invariant instead of an ungrouped warning;
+  - truth-harness specs can now assert expected catalog-alignment status/top-match/top-flag per step, so AGENT packs can validate the planner brain's selected chain against the reviewed catalog route fabric;
  - explicit-counterparty incoming-vs-outgoing data-need graphs now select the reviewed `value_flow_comparison` chain instead of falling back to generic `value_flow`;
  - live map sync: [20 - planner_autonomy_consolidation_2026-05-01.md](./20%20-%20planner_autonomy_consolidation_2026-05-01.md)

@@ -99,8 +100,8 @@ Current honest status:
 - open-world bounded-autonomy readiness: `~85%`
 - Post-F semantic integrity module progress: `~99%` operationally closed, with remaining risk now treated as next-slice discovery rather than an open blocker inside the closed slice
 - active inventory-stock breadth slice progress: `100%` for the declared scenario pack, not for arbitrary inventory questions
-- Planner Autonomy Consolidation progress: `~90%` for the declared module, with catalog-fabric, value-flow arbitration, lifecycle bounded inference, broad-evaluation bridge, inventory catalog templates, inventory runtime-boundary honesty, exact inventory recipe bridging, unambiguous metadata-surface lane inference, catalog chain-template scoring, structured chain-match contract exposure, runtime/debug propagation, subject-aware bidirectional comparison arbitration, structured catalog-alignment verdicts, representative alignment regression guard, catalog-alignment reason-code telemetry, explicit `alignment_status` propagation, truth-harness/acceptance-matrix surfacing, soft divergence warning, and `catalog_alignment_ok` acceptance invariant validated locally, but live replay for the new bridge is currently blocked by missing active 1C polling and broader unfamiliar 1C asks still need replay-backed growth
-- graph snapshot after latest rebuild: `5949 nodes`, `12923 edges`, `136 communities`
+- Planner Autonomy Consolidation progress: `~91%` for the declared module, with catalog-fabric, value-flow arbitration, lifecycle bounded inference, broad-evaluation bridge, inventory catalog templates, inventory runtime-boundary honesty, exact inventory recipe bridging, unambiguous metadata-surface lane inference, catalog chain-template scoring, structured chain-match contract exposure, runtime/debug propagation, subject-aware bidirectional comparison arbitration, structured catalog-alignment verdicts, representative alignment regression guard, catalog-alignment reason-code telemetry, explicit `alignment_status` propagation, truth-harness/acceptance-matrix surfacing, soft divergence warning, `catalog_alignment_ok` acceptance invariant, and step-level expected catalog-alignment assertions validated locally, but live replay for the new bridge is currently blocked by missing active 1C polling and broader unfamiliar 1C asks still need replay-backed growth
+- graph snapshot after latest rebuild: `5951 nodes`, `12926 edges`, `139 communities`
 - current breakpoint:
  - the validated hot paths are no longer structurally broken;
  - flagship continuity collapse is no longer the primary risk;
@@ -156,6 +157,7 @@ Latest live proof now includes:
 - catalog-alignment replay artifact surfacing accepted locally: Python truth-harness/acceptance tests passed `4/4`; graphify rebuilt to `5946 nodes`, `12918 edges`, `136 communities`
 - catalog-alignment divergence warning accepted locally: Python truth-harness/acceptance tests passed `5/5`; graphify rebuilt to `5947 nodes`, `12920 edges`, `138 communities`
 - catalog-alignment acceptance invariant accepted locally: Python truth-harness/acceptance tests passed `6/6`; graphify rebuilt to `5949 nodes`, `12923 edges`, `136 communities`
+- catalog-alignment spec assertions accepted locally: Python truth-harness/acceptance tests passed `7/7`; graphify rebuilt to `5951 nodes`, `12926 edges`, `139 communities`

 Current architectural reading:

--- a/scripts/domain_truth_harness.py
+++ b/scripts/domain_truth_harness.py
@@ -24,6 +24,9 @@ TECHNICAL_QUESTION_FIELDS = (
    "expected_capability",
    "expected_recipe",
    "expected_result_mode",
+    "expected_catalog_alignment_status",
+    "expected_catalog_chain_top_match",
+    "expected_catalog_selected_matches_top",
    "required_filters",
    "forbidden_capabilities",
    "forbidden_recipes",
@@ -89,6 +92,13 @@ def normalize_step_spec(index: int, raw_step: Any) -> dict[str, Any]:
    normalized_step["allowed_limited_reason_categories"] = normalize_pattern_list(
        step.get("allowed_limited_reason_categories")
    )
+    normalized_step["expected_catalog_alignment_status"] = (
+        str(step.get("expected_catalog_alignment_status") or "").strip() or None
+    )
+    normalized_step["expected_catalog_chain_top_match"] = (
+        str(step.get("expected_catalog_chain_top_match") or "").strip() or None
+    )
+    normalized_step["expected_catalog_selected_matches_top"] = step.get("expected_catalog_selected_matches_top")
    normalized_step["required_answer_patterns_any"] = normalize_pattern_list(step.get("required_answer_patterns_any"))
    normalized_step["required_answer_patterns_all"] = normalize_pattern_list(step.get("required_answer_patterns_all"))
    normalized_step["required_direct_answer_patterns_any"] = normalize_pattern_list(
@@ -312,6 +322,17 @@ def normalize_actual_filter_value(filter_key: str, raw_value: Any) -> str:
    return str(raw_value or "").strip()


+def normalize_optional_bool(value: Any) -> bool | None:
+    if isinstance(value, bool):
+        return value
+    raw = "" if value is None else str(value).strip().lower()  # keeps integer 0 as "0" instead of collapsing it to ""
+    if raw in {"true", "1", "yes", "y"}:
+        return True
+    if raw in {"false", "0", "no", "n"}:
+        return False
+    return None
+
+
 def evaluate_truth_step(
    *,
    step: dict[str, Any],
@@ -328,6 +349,7 @@ def evaluate_truth_step(
    selected_recipe = str(step_state.get("selected_recipe") or "").strip()
    capability_id = str(step_state.get("capability_id") or "").strip()
    catalog_alignment_status = str(step_state.get("mcp_discovery_catalog_chain_alignment_status") or "").strip()
+    catalog_chain_top_match = str(step_state.get("mcp_discovery_catalog_chain_top_match") or "").strip()
    limited_reason_category = str(step_state.get("limited_reason_category") or "").strip()
    extracted_filters = (
        step_state.get("extracted_filters") if isinstance(step_state.get("extracted_filters"), dict) else {}
@@ -351,6 +373,64 @@ def evaluate_truth_step(
            severity="warning",
        )

+    expected_catalog_alignment_status = str(
+        resolve_nested_placeholders(
+            step.get("expected_catalog_alignment_status"),
+            step_results,
+            bindings,
+            runtime_bindings,
+        )
+        or ""
+    ).strip()
+    if expected_catalog_alignment_status and catalog_alignment_status != expected_catalog_alignment_status:
+        append_finding(
+            findings,
+            step,
+            "wrong_catalog_alignment_status",
+            "Catalog-chain alignment status does not match the expected planner/catalog verdict for this step.",
+            actual=catalog_alignment_status or None,
+            expected=expected_catalog_alignment_status,
+        )
+
+    expected_catalog_chain_top_match = str(
+        resolve_nested_placeholders(
+            step.get("expected_catalog_chain_top_match"),
+            step_results,
+            bindings,
+            runtime_bindings,
+        )
+        or ""
+    ).strip()
+    if expected_catalog_chain_top_match and catalog_chain_top_match != expected_catalog_chain_top_match:
+        append_finding(
+            findings,
+            step,
+            "wrong_catalog_chain_top_match",
+            "Top reviewed catalog-chain match does not match the expected chain for this step.",
+            actual=catalog_chain_top_match or None,
+            expected=expected_catalog_chain_top_match,
+        )
+
+    expected_catalog_selected_matches_top = normalize_optional_bool(
+        resolve_nested_placeholders(
+            step.get("expected_catalog_selected_matches_top"),
+            step_results,
+            bindings,
+            runtime_bindings,
+        )
+    )
+    if expected_catalog_selected_matches_top is not None:
+        actual_catalog_selected_matches_top = step_state.get("mcp_discovery_catalog_chain_selected_matches_top") is True
+        if actual_catalog_selected_matches_top != expected_catalog_selected_matches_top:
+            append_finding(
+                findings,
+                step,
+                "wrong_catalog_selected_matches_top",
+                "Selected chain top-match flag does not match the expected planner/catalog verdict for this step.",
+                actual=actual_catalog_selected_matches_top,
+                expected=expected_catalog_selected_matches_top,
+            )
+
    if step_state.get("question_resolved") != step["question_template"]:
        append_finding(
            findings,
--- a/scripts/scenario_acceptance_policy.py
+++ b/scripts/scenario_acceptance_policy.py
@@ -145,7 +145,12 @@ def _is_meta_context_code(code: str) -> bool:


 def _is_catalog_alignment_code(code: str) -> bool:
-    return code == "catalog_alignment_divergence"
+    return code in {
+        "catalog_alignment_divergence",
+        "wrong_catalog_alignment_status",
+        "wrong_catalog_chain_top_match",
+        "wrong_catalog_selected_matches_top",
+    }


 def _derive_step_invariant_failures(step: dict[str, Any], findings: list[dict[str, Any]]) -> dict[str, bool]:
--- a/scripts/test_domain_case_loop_step_state.py
+++ b/scripts/test_domain_case_loop_step_state.py
@@ -81,6 +81,39 @@ class DomainCaseLoopStepStateTests(unittest.TestCase):
        self.assertEqual(reviewed["review_findings"][0]["code"], "catalog_alignment_divergence")
        self.assertEqual(reviewed["review_findings"][0]["severity"], "warning")

+    def test_truth_harness_checks_expected_catalog_alignment_fields(self) -> None:
+        reviewed = dth.evaluate_truth_step(
+            step={
+                "step_id": "step_01",
+                "question_template": "show planner alignment",
+                "criticality": "critical",
+                "allowed_reply_types": [],
+                "expected_catalog_alignment_status": "selected_matches_top",
+                "expected_catalog_chain_top_match": "value_flow_comparison",
+                "expected_catalog_selected_matches_top": True,
+            },
+            step_state={
+                "question_resolved": "show planner alignment",
+                "reply_type": "factual",
+                "assistant_text": "Confirmed answer",
+                "actual_direct_answer": "Confirmed answer",
+                "detected_intent": "counterparty_turnover",
+                "selected_recipe": "counterparty_turnover_by_period",
+                "capability_id": "confirmed_counterparty_turnover",
+                "mcp_discovery_catalog_chain_alignment_status": "selected_matches_top",
+                "mcp_discovery_catalog_chain_top_match": "value_flow",
+                "mcp_discovery_catalog_chain_selected_matches_top": True,
+                "extracted_filters": {},
+            },
+            step_results={},
+            bindings={},
+            runtime_bindings={},
+        )
+
+        self.assertEqual(reviewed["review_status"], "fail")
+        self.assertEqual(reviewed["critical_findings_count"], 1)
+        self.assertEqual(reviewed["review_findings"][0]["code"], "wrong_catalog_chain_top_match")
+

 if __name__ == "__main__":
    unittest.main()
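
The step-level assertions added in this commit can be illustrated with a small standalone sketch. It is a simplified approximation, not the harness code itself: the helper names `check_catalog_alignment` and `catalog_alignment_ok` are hypothetical, placeholder resolution and `append_finding` are left out, and only boolean values are accepted for the top-match flag. The field names, the `mcp_discovery_catalog_chain_*` state keys, and the finding codes are taken from the diff above.

```python
from __future__ import annotations

from typing import Any

# Finding codes grouped under the catalog-alignment invariant (from the diff).
CATALOG_ALIGNMENT_CODES = {
    "catalog_alignment_divergence",
    "wrong_catalog_alignment_status",
    "wrong_catalog_chain_top_match",
    "wrong_catalog_selected_matches_top",
}


def check_catalog_alignment(step: dict[str, Any], step_state: dict[str, Any]) -> list[str]:
    """Hypothetical helper: return finding codes for the three step-level expectations."""
    codes: list[str] = []

    # An empty or missing expectation means "do not assert this field".
    expected_status = str(step.get("expected_catalog_alignment_status") or "").strip()
    actual_status = str(step_state.get("mcp_discovery_catalog_chain_alignment_status") or "").strip()
    if expected_status and actual_status != expected_status:
        codes.append("wrong_catalog_alignment_status")

    expected_top = str(step.get("expected_catalog_chain_top_match") or "").strip()
    actual_top = str(step_state.get("mcp_discovery_catalog_chain_top_match") or "").strip()
    if expected_top and actual_top != expected_top:
        codes.append("wrong_catalog_chain_top_match")

    # Simplification: only real booleans count as an expectation here.
    expected_flag = step.get("expected_catalog_selected_matches_top")
    if isinstance(expected_flag, bool):
        actual_flag = step_state.get("mcp_discovery_catalog_chain_selected_matches_top") is True
        if actual_flag != expected_flag:
            codes.append("wrong_catalog_selected_matches_top")

    return codes


def catalog_alignment_ok(codes: list[str]) -> bool:
    """Hypothetical helper: the invariant holds when no catalog-alignment code is present."""
    return not any(code in CATALOG_ALIGNMENT_CODES for code in codes)


# Same expectation/state pair as the new unit test: only the top match differs.
step = {
    "expected_catalog_alignment_status": "selected_matches_top",
    "expected_catalog_chain_top_match": "value_flow_comparison",
    "expected_catalog_selected_matches_top": True,
}
step_state = {
    "mcp_discovery_catalog_chain_alignment_status": "selected_matches_top",
    "mcp_discovery_catalog_chain_top_match": "value_flow",
    "mcp_discovery_catalog_chain_selected_matches_top": True,
}

codes = check_catalog_alignment(step, step_state)
print(codes)                        # ['wrong_catalog_chain_top_match']
print(catalog_alignment_ok(codes))  # False
```

A step that declares none of the three `expected_catalog_*` fields produces no findings in this sketch, which matches the opt-in behavior of the assertions in the diff.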