Planner Autonomy: surface catalog-alignment in replay artifacts
parent 8b5104a2c6
commit a63742f0d6

@@ -121,6 +121,7 @@ The following consolidation step added catalog-level chain-template scoring:
- `catalog_chain_template_alignment` now records whether the selected chain is the top catalog match, its rank, and whether it appeared in the catalog search results; runtime loop state and debug summary expose the same verdict.
- planner reason codes now emit stable catalog-alignment telemetry for evaluated top-match, selected-equals-top, selected-lower-rank, selected-outside-match-set, and unscored selected-chain states.
- `catalog_chain_template_alignment.alignment_status` now carries the same verdict as one enum-like field, and debug summary exposes it as `mcp_discovery_catalog_chain_alignment_status`.
- `domain_truth_harness` and `scenario_acceptance_policy` now carry the alignment status, top catalog match, and selected-matches-top flag into replay artifacts instead of leaving them buried in raw debug JSON.
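
The verdict described in the bullets above can be sketched as a small reduction from rank information to one status value. This is a minimal illustration under stated assumptions: only the status string `selected_matches_top` appears in this commit's tests; the function name `derive_alignment`, the other status strings, and the returned key names are hypothetical and are not the planner's actual contract.

```python
from __future__ import annotations

from typing import Any, Optional, Sequence


def derive_alignment(selected: Optional[str], ranked_matches: Sequence[str]) -> dict[str, Any]:
    """Reduce a selected chain and a ranked catalog match list to one verdict.

    Hypothetical helper: the status names other than "selected_matches_top"
    are assumptions made for this sketch.
    """
    if not selected or not ranked_matches:
        # Nothing to compare against: treat the selection as unscored.
        status, rank = "unscored", None
    elif selected == ranked_matches[0]:
        status, rank = "selected_matches_top", 1
    elif selected in ranked_matches:
        # Selected chain was found by catalog search, but not as the best match.
        status, rank = "selected_lower_rank", ranked_matches.index(selected) + 1
    else:
        status, rank = "selected_outside_match_set", None

    return {
        "alignment_status": status,
        "top_match": ranked_matches[0] if ranked_matches else None,
        "selected_rank": rank,
        "selected_in_match_set": rank is not None,
        "selected_matches_top": status == "selected_matches_top",
    }


verdict = derive_alignment("value_flow", ["value_flow_comparison", "value_flow"])
print(verdict["alignment_status"], verdict["selected_rank"])
```

Keeping the booleans next to the single status field mirrors the description above: the status is the one value to filter on, while rank and match-set membership stay available for closer review.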

## Why This Matters

@@ -251,9 +252,14 @@ Latest validation after explicit catalog-alignment status propagation:
- `npm.cmd run build`: passed
- graphify rebuild: `5943 nodes`, `12915 edges`, `136 communities`

Latest validation after truth-harness catalog-alignment artifact surfacing:

- Python replay-tooling tests: passed, `4 passed`
- graphify rebuild: `5946 nodes`, `12918 edges`, `136 communities`

## Next Step

The next safe step is still to re-run live replay once the 1C side is actively polling the proxy. In parallel, local-only consolidation can continue by using `alignment_status`, alignment reason-code telemetry, and the representative guard to find remaining manual branches where selected chains diverge from reviewed catalog-fabric intent.
The next safe step is still to re-run live replay once the 1C side is actively polling the proxy. In parallel, local-only consolidation can continue by using `alignment_status`, alignment reason-code telemetry, truth-harness artifact surfacing, and the representative guard to find remaining manual branches where selected chains diverge from reviewed catalog-fabric intent.
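
As a rough illustration of the local-only review described above, the sketch below scans acceptance-matrix rows for steps whose selected chain did not match the top catalog match. The `rows` list and the three `mcp_discovery_catalog_chain_*` keys come from this commit's tests; the function name `find_divergent_rows`, the sample data, and the `selected_lower_rank` status string are assumptions made for the example.

```python
from __future__ import annotations

from typing import Any

STATUS_KEY = "mcp_discovery_catalog_chain_alignment_status"
TOP_KEY = "mcp_discovery_catalog_chain_top_match"
FLAG_KEY = "mcp_discovery_catalog_chain_selected_matches_top"


def find_divergent_rows(acceptance_matrix: dict[str, Any]) -> list[dict[str, Any]]:
    """Return a compact record for every row that diverges from the top catalog match.

    Hypothetical helper: rows without any alignment status are skipped,
    because no catalog alignment was recorded for them.
    """
    divergent = []
    for row in acceptance_matrix.get("rows") or []:
        status = row.get(STATUS_KEY)
        if status is None:
            continue
        if row.get(FLAG_KEY) is True:
            continue
        divergent.append(
            {
                "capability_id": row.get("capability_id"),
                "status": status,
                "top_match": row.get(TOP_KEY),
            }
        )
    return divergent


# In-memory sample; a real run would load the generated acceptance matrix instead.
sample_matrix = {
    "rows": [
        {
            "capability_id": "confirmed_inventory_on_hand_as_of_date",
            STATUS_KEY: "selected_matches_top",
            TOP_KEY: "inventory_stock_snapshot",
            FLAG_KEY: True,
        },
        {
            "capability_id": "confirmed_counterparty_turnover",
            STATUS_KEY: "selected_lower_rank",  # assumed status name
            TOP_KEY: "value_flow_comparison",
            FLAG_KEY: False,
        },
        {"capability_id": "no_alignment_recorded"},
    ]
}

for item in find_divergent_rows(sample_matrix):
    print(item)
```

Filtering on the boolean flag rather than on specific status strings keeps the sketch independent of the exact enum values, which are not listed in this commit.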

Recommended order:

@@ -84,6 +84,7 @@ It now documents a turnaround that is already operational in code, already mater
- planner/runtime/debug surfaces now expose `catalog_chain_template_alignment`, so semantic replay can see whether selected chains match the catalog top match, fall back to a lower-ranked template, or bypass catalog search;
- planner reason codes now also emit stable catalog-alignment telemetry, so automated replay review can filter top-match, lower-rank, outside-match, and unscored selected-chain states without hand-parsing debug JSON;
- catalog-alignment now carries a single `alignment_status` verdict through planner/runtime/debug, making replay divergence detection explicit instead of reconstructing it from booleans;
- truth-harness and scenario acceptance artifacts now preserve catalog-alignment status/top-match fields, so AGENT replay review can spot planner-vs-catalog divergence directly in `truth_review.md` and `scenario_acceptance_matrix.json`;
- explicit-counterparty incoming-vs-outgoing data-need graphs now select the reviewed `value_flow_comparison` chain instead of falling back to generic `value_flow`;
- live map sync: [20 - planner_autonomy_consolidation_2026-05-01.md](./20%20-%20planner_autonomy_consolidation_2026-05-01.md)

@@ -96,8 +97,8 @@ Current honest status:
- open-world bounded-autonomy readiness: `~85%`
- Post-F semantic integrity module progress: `~99%` operationally closed, with remaining risk now treated as next-slice discovery rather than an open blocker inside the closed slice
- active inventory-stock breadth slice progress: `100%` for the declared scenario pack, not for arbitrary inventory questions
- Planner Autonomy Consolidation progress: `~87%` for the declared module, with catalog-fabric, value-flow arbitration, lifecycle bounded inference, broad-evaluation bridge, inventory catalog templates, inventory runtime-boundary honesty, exact inventory recipe bridging, unambiguous metadata-surface lane inference, catalog chain-template scoring, structured chain-match contract exposure, runtime/debug propagation, subject-aware bidirectional comparison arbitration, structured catalog-alignment verdicts, representative alignment regression guard, catalog-alignment reason-code telemetry, and explicit `alignment_status` propagation validated locally, but live replay for the new bridge is currently blocked by missing active 1C polling and broader unfamiliar 1C asks still need replay-backed growth
- graph snapshot after latest rebuild: `5943 nodes`, `12915 edges`, `136 communities`
- Planner Autonomy Consolidation progress: `~88%` for the declared module, with catalog-fabric, value-flow arbitration, lifecycle bounded inference, broad-evaluation bridge, inventory catalog templates, inventory runtime-boundary honesty, exact inventory recipe bridging, unambiguous metadata-surface lane inference, catalog chain-template scoring, structured chain-match contract exposure, runtime/debug propagation, subject-aware bidirectional comparison arbitration, structured catalog-alignment verdicts, representative alignment regression guard, catalog-alignment reason-code telemetry, explicit `alignment_status` propagation, and truth-harness/acceptance-matrix surfacing validated locally, but live replay for the new bridge is currently blocked by missing active 1C polling and broader unfamiliar 1C asks still need replay-backed growth
- graph snapshot after latest rebuild: `5946 nodes`, `12918 edges`, `136 communities`
- current breakpoint:
  - the validated hot paths are no longer structurally broken;
  - flagship continuity collapse is no longer the primary risk;

@@ -150,6 +151,7 @@ Latest live proof now includes:
- representative catalog-alignment regression guard accepted locally: planner slice passed `37/37`; full MCP-discovery slice passed `283/283` with `9` skipped; build passed; graphify rebuilt to `5942 nodes`, `12912 edges`, `140 communities`
- catalog-alignment reason-code telemetry accepted locally: planner/runtime slice passed `53/53`; full MCP-discovery suite passed `283/283` with `9` skipped; build passed; graphify rebuilt to `5943 nodes`, `12915 edges`, `136 communities`
- catalog-alignment status verdict accepted locally: planner/runtime/debug slice passed `55/55`; full MCP-discovery suite passed `283/283` with `9` skipped; build passed; graphify rebuilt to `5943 nodes`, `12915 edges`, `136 communities`
- catalog-alignment replay artifact surfacing accepted locally: Python truth-harness/acceptance tests passed `4/4`; graphify rebuilt to `5946 nodes`, `12918 edges`, `136 communities`

Current architectural reading:

@@ -1727,6 +1727,9 @@ def build_scenario_step_state(
"selected_recipe": debug.get("selected_recipe"),
"capability_id": debug.get("capability_id"),
"capability_route_mode": debug.get("capability_route_mode"),
"mcp_discovery_catalog_chain_alignment_status": debug.get("mcp_discovery_catalog_chain_alignment_status"),
"mcp_discovery_catalog_chain_top_match": debug.get("mcp_discovery_catalog_chain_top_match"),
"mcp_discovery_catalog_chain_selected_matches_top": debug.get("mcp_discovery_catalog_chain_selected_matches_top"),
"route_expectation_status": debug.get("route_expectation_status"),
"result_mode": debug.get("result_mode"),
"response_type": debug.get("response_type"),

@@ -679,6 +679,9 @@ def build_truth_review_markdown(spec: dict[str, Any], scenario_state: dict[str,
f"intent: `{step_state.get('detected_intent') or 'n/a'}`",
f"recipe: `{step_state.get('selected_recipe') or 'n/a'}`",
f"capability: `{step_state.get('capability_id') or 'n/a'}`",
f"catalog_alignment_status: `{step_state.get('mcp_discovery_catalog_chain_alignment_status') or 'n/a'}`",
f"catalog_top_match: `{step_state.get('mcp_discovery_catalog_chain_top_match') or 'n/a'}`",
f"catalog_selected_matches_top: `{step_state.get('mcp_discovery_catalog_chain_selected_matches_top')}`",
f"limited_reason_category: `{step_state.get('limited_reason_category') or 'n/a'}`",
f"filters: `{dump_json(step_state.get('extracted_filters') or {})}`",
f"direct_answer: {step_state.get('actual_direct_answer') or 'n/a'}",

@@ -198,6 +198,9 @@ def build_scenario_acceptance_matrix(
"reply_type": step_state.get("reply_type"),
"detected_intent": step_state.get("detected_intent"),
"capability_id": step_state.get("capability_id"),
"mcp_discovery_catalog_chain_alignment_status": step_state.get("mcp_discovery_catalog_chain_alignment_status"),
"mcp_discovery_catalog_chain_top_match": step_state.get("mcp_discovery_catalog_chain_top_match"),
"mcp_discovery_catalog_chain_selected_matches_top": step_state.get("mcp_discovery_catalog_chain_selected_matches_top"),
"selected_object_step": _has_selected_object_signal(step),
"meta_context_step": _has_meta_context_signal(step),
"highest_unresolved_priority": highest_priority,
@@ -330,6 +333,9 @@ def build_scenario_acceptance_matrix_markdown(acceptance_matrix: dict[str, Any])
f" review_status: `{row.get('review_status')}`",
f" criticality: `{row.get('criticality')}`",
f" semantic_tags: {', '.join(row.get('semantic_tags') or []) or 'none'}",
f" catalog_alignment_status: `{row.get('mcp_discovery_catalog_chain_alignment_status') or 'n/a'}`",
f" catalog_top_match: `{row.get('mcp_discovery_catalog_chain_top_match') or 'n/a'}`",
f" catalog_selected_matches_top: `{row.get('mcp_discovery_catalog_chain_selected_matches_top')}`",
f" highest_unresolved_priority: `{row.get('highest_unresolved_priority')}`",
f" selected_object_step: `{row.get('selected_object_step')}`",
f" meta_context_step: `{row.get('meta_context_step')}`",
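
Both markdown renderers in the hunks above append `or 'n/a'` to the status and top-match strings but print the `selected_matches_top` flag without a fallback. A plausible reading, sketched below, is that a fallback on a boolean would turn a real `False` into `n/a` and hide exactly the divergence the artifact is meant to show. The helper names `render_text` and `render_flag` are hypothetical and exist only for this illustration.

```python
from typing import Any, Optional


def render_text(value: Optional[str]) -> str:
    """Render a string field; a missing or empty value becomes n/a."""
    return f"`{value or 'n/a'}`"


def render_flag(value: Any) -> str:
    """Render a boolean field as-is so that False stays visible."""
    return f"`{value}`"


row = {
    "mcp_discovery_catalog_chain_alignment_status": None,
    "mcp_discovery_catalog_chain_selected_matches_top": False,
}

print(render_text(row["mcp_discovery_catalog_chain_alignment_status"]))
print(render_flag(row["mcp_discovery_catalog_chain_selected_matches_top"]))
# What the fallback form would do to the same boolean:
print(render_text(row["mcp_discovery_catalog_chain_selected_matches_top"]))
```

With the raw form, a step that never reached catalog scoring renders as `None`, which stays distinguishable from an explicit `False`.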
@@ -0,0 +1,54 @@
from __future__ import annotations

import sys
import unittest
from pathlib import Path


sys.path.insert(0, str(Path(__file__).resolve().parent))

import domain_case_loop as dcl


class DomainCaseLoopStepStateTests(unittest.TestCase):
    def test_preserves_mcp_catalog_alignment_debug_fields(self) -> None:
        step_state = dcl.build_scenario_step_state(
            scenario_id="planner_alignment_demo",
            domain="planner_autonomy",
            step={
                "step_id": "step_01",
                "title": "Alignment visibility",
                "depends_on": [],
                "question_template": "show planner alignment",
            },
            step_index=1,
            question_resolved="show planner alignment",
            analysis_context={},
            turn_artifact={
                "assistant_message": {
                    "reply_type": "factual",
                    "text": "Confirmed answer",
                    "message_id": "msg-1",
                    "trace_id": "trace-1",
                },
                "technical_debug_payload": {
                    "detected_mode": "address_query",
                    "detected_intent": "counterparty_turnover",
                    "selected_recipe": "counterparty_turnover_by_period",
                    "capability_id": "confirmed_counterparty_turnover",
                    "mcp_discovery_catalog_chain_alignment_status": "selected_matches_top",
                    "mcp_discovery_catalog_chain_top_match": "value_flow",
                    "mcp_discovery_catalog_chain_selected_matches_top": True,
                },
                "session_summary": {},
            },
            entries=[],
        )

        self.assertEqual(step_state["mcp_discovery_catalog_chain_alignment_status"], "selected_matches_top")
        self.assertEqual(step_state["mcp_discovery_catalog_chain_top_match"], "value_flow")
        self.assertTrue(step_state["mcp_discovery_catalog_chain_selected_matches_top"])


if __name__ == "__main__":
    unittest.main()

@@ -84,6 +84,9 @@ class ScenarioAcceptancePolicyTests(unittest.TestCase):
"reply_type": "factual",
"detected_intent": "inventory_on_hand_as_of_date",
"capability_id": "confirmed_inventory_on_hand_as_of_date",
"mcp_discovery_catalog_chain_alignment_status": "selected_matches_top",
"mcp_discovery_catalog_chain_top_match": "inventory_stock_snapshot",
"mcp_discovery_catalog_chain_selected_matches_top": True,
"review_findings": [],
}
},
@@ -104,6 +107,15 @@ class ScenarioAcceptancePolicyTests(unittest.TestCase):
        self.assertTrue(pack_state["acceptance_gate_passed"])
        self.assertTrue(pack_state["critical_path_green"])
        self.assertTrue(all(pack_state["invariants"].values()))
        self.assertEqual(
            acceptance_matrix["rows"][0]["mcp_discovery_catalog_chain_alignment_status"],
            "selected_matches_top",
        )
        self.assertEqual(
            acceptance_matrix["rows"][0]["mcp_discovery_catalog_chain_top_match"],
            "inventory_stock_snapshot",
        )
        self.assertTrue(acceptance_matrix["rows"][0]["mcp_discovery_catalog_chain_selected_matches_top"])

    def test_flags_meta_context_integrity_when_meta_step_leaks_technical_answer_shape(self) -> None:
        spec = {