Автоматизировать агентную проверку GUI-прогонов и stage-loop
This commit is contained in:
parent
7c77db2c8d
commit
931251d1eb
|
|
@ -53,6 +53,78 @@ Pack artifacts live under:
|
|||
- `final_status.md`
|
||||
- `scenarios/<scenario_id>/...`
|
||||
|
||||
## AGENT autorun save gate
|
||||
|
||||
`scripts/save_agent_semantic_run.py` is a post-validation persistence tool, not a replay executor.
|
||||
The normal path is:
|
||||
|
||||
1. build/update the truth-harness spec;
|
||||
2. run `python scripts/domain_truth_harness.py run-live --spec ... --output-dir artifacts/domain_runs/<run_id>`;
|
||||
3. inspect `truth_review.md`, `business_review.md`, `pack_state.json`, and `final_status.md`;
|
||||
4. save to GUI autoruns only with `python scripts/save_agent_semantic_run.py --spec ... --validated-run-dir artifacts/domain_runs/<run_id>`.
|
||||
|
||||
The save gate requires:
|
||||
- `pack_state.final_status = accepted`;
|
||||
- `pack_state.acceptance_gate_passed = true`;
|
||||
- `truth_review.summary.overall_status = pass`;
|
||||
- `business_review.overall_business_status = pass`;
|
||||
- zero unresolved P0 and zero business-answer failures.
|
||||
|
||||
If a pack must be saved as a deliberate manual draft before live acceptance, use
|
||||
`--allow-unvalidated --unvalidated-reason "<why this is intentionally not accepted>"`.
|
||||
That path is explicitly marked as unvalidated and must not be treated as semantic proof.
|
||||
|
||||
## Stage-level AGENT loop
|
||||
|
||||
`scripts/stage_agent_loop.py` wraps the domain pack loop into the development-stage workflow:
|
||||
|
||||
1. take the current global/local stage manifest;
|
||||
2. run `scripts/domain_case_loop.py run-pack-loop` for that stage pack;
|
||||
3. let the loop iterate through pack replay, business-first analyst verdict, coder patch, and rerun until the objective gate is accepted, blocked, or a real user decision is required;
|
||||
4. if accepted, persist the validated AGENT pack into GUI autoruns through `scripts/save_agent_semantic_run.py --validated-run-dir`;
|
||||
5. write `stage_loop_summary.json` and `stage_loop_handoff.md` for the final human visual confirmation.
|
||||
|
||||
The stage manifest schema is `docs/orchestration/schemas/stage_agent_loop_manifest.schema.json`.
|
||||
The default stage gate is intentionally stricter than a narrow case gate: `target_score = 88`, no unresolved P0/P1 repair targets, accepted analyst verdict, clean business usefulness, direct-answer, temporal-honesty, field-truth, and answer-layering flags.
|
||||
|
||||
Canonical commands:
|
||||
|
||||
```powershell
|
||||
python scripts/stage_agent_loop.py plan --manifest docs/orchestration/<stage_loop>.json
|
||||
python scripts/stage_agent_loop.py run --manifest docs/orchestration/<stage_loop>.json
|
||||
python scripts/stage_agent_loop.py summarize --manifest docs/orchestration/<stage_loop>.json
|
||||
```
|
||||
|
||||
This is the intended path for “implement the stage, generate/check stage questions, analyze business answers, patch code, rerun, then ask the user for final visual confirmation”.
|
||||
|
||||
## GUI run review bridge
|
||||
|
||||
When a manual or GUI autorun already exists, `scripts/review_assistant_stage1_run.py` turns the run id into the same machine-readable review surface.
|
||||
|
||||
Canonical command:
|
||||
|
||||
```powershell
|
||||
python scripts/review_assistant_stage1_run.py assistant-stage1-<id> --print-summary
|
||||
```
|
||||
|
||||
The script resolves:
|
||||
- `llm_normalizer/reports/assistant-stage1-<id>.md`;
|
||||
- `llm_normalizer/data/assistant_sessions/assistant-stage1-<id>-*.json`.
|
||||
|
||||
It writes:
|
||||
- `artifacts/domain_runs/gui_run_reviews/assistant-stage1-<id>/run_review.json`;
|
||||
- `artifacts/domain_runs/gui_run_reviews/assistant-stage1-<id>/run_review.md`;
|
||||
- `conversation_pairs.json`;
|
||||
- `question_quality_review.json`;
|
||||
- `repair_targets.json`.
|
||||
|
||||
This bridge is intentionally business-first:
|
||||
- the user's question and visible assistant answer are reviewed before route ids and debug fields;
|
||||
- noisy direct answers, missing first-line answers, technical garbage, and over-broad business answers become findings;
|
||||
- generated question packs get a deterministic quality review for follow-up density, direct questions, report-style analysis, domain diversity, duplicates, and weak business anchors.
|
||||
|
||||
Use this bridge when the operator would otherwise say “чекни прогон `assistant-stage1-...`”. The expected next step is no longer manual eyeballing first; it is: review by id, inspect `run_review.md`, map `repair_targets.json` into the current stage loop, patch, and rerun.
|
||||
|
||||
## Placeholder contract
|
||||
|
||||
Scenario questions can reference earlier step outputs with placeholders such as:
|
||||
|
|
|
|||
|
|
@ -0,0 +1,72 @@
|
|||
{
|
||||
"$schema": "http://json-schema.org/draft-07/schema#",
|
||||
"title": "Stage Agent Loop Manifest",
|
||||
"type": "object",
|
||||
"additionalProperties": true,
|
||||
"required": ["stage_id", "module_name", "title", "pack_manifest"],
|
||||
"properties": {
|
||||
"schema_version": {
|
||||
"type": "string",
|
||||
"enum": ["stage_agent_loop_manifest_v1"]
|
||||
},
|
||||
"stage_id": {
|
||||
"type": "string",
|
||||
"minLength": 1
|
||||
},
|
||||
"module_name": {
|
||||
"type": "string",
|
||||
"minLength": 1
|
||||
},
|
||||
"title": {
|
||||
"type": "string",
|
||||
"minLength": 1
|
||||
},
|
||||
"architecture_phase": {
|
||||
"type": "string"
|
||||
},
|
||||
"agent_focus": {
|
||||
"type": "string"
|
||||
},
|
||||
"current_stage_status": {
|
||||
"type": "string"
|
||||
},
|
||||
"global_plan_refs": {
|
||||
"type": "array",
|
||||
"items": {
|
||||
"type": "string"
|
||||
}
|
||||
},
|
||||
"pack_manifest": {
|
||||
"type": "string",
|
||||
"description": "Path to a domain_case_loop run-pack manifest with scenarios for the stage gate."
|
||||
},
|
||||
"loop_id": {
|
||||
"type": "string"
|
||||
},
|
||||
"target_score": {
|
||||
"type": "integer",
|
||||
"minimum": 0,
|
||||
"maximum": 100,
|
||||
"default": 88
|
||||
},
|
||||
"max_iterations": {
|
||||
"type": "integer",
|
||||
"minimum": 1,
|
||||
"default": 6
|
||||
},
|
||||
"acceptance_invariants": {
|
||||
"type": "array",
|
||||
"items": {
|
||||
"type": "string"
|
||||
}
|
||||
},
|
||||
"save_autorun_on_accept": {
|
||||
"type": "boolean",
|
||||
"default": true
|
||||
},
|
||||
"manual_confirmation_required_after_accept": {
|
||||
"type": "boolean",
|
||||
"default": true
|
||||
}
|
||||
}
|
||||
}
|
||||
|
|
@ -88,6 +88,48 @@ TOP_LEVEL_NOISE_PATTERNS = (
|
|||
re.compile(r"^(?:подтверждение|опорные документы|сервисно)\b", re.IGNORECASE),
|
||||
)
|
||||
|
||||
BUSINESS_DIRECT_QUESTION_MARKERS = (
|
||||
"\u0441\u043a\u043e\u043b\u044c\u043a\u043e",
|
||||
"\u0441\u043a\u043e\u043a",
|
||||
"\u043a\u0430\u043a\u043e\u0439",
|
||||
"\u043a\u0430\u043a\u0430\u044f",
|
||||
"\u043a\u0430\u043a\u0438\u0435",
|
||||
"\u043a\u0442\u043e",
|
||||
"\u043a\u043e\u043c\u0443",
|
||||
"\u043a\u043e\u0433\u0434\u0430",
|
||||
"\u0433\u0434\u0435",
|
||||
"\u043a\u0443\u0434\u0430",
|
||||
"\u043f\u043e\u0447\u0435\u043c\u0443",
|
||||
"\u0437\u0430\u0447\u0435\u043c",
|
||||
"\u043a\u0430\u043a\u0438\u043c \u0434\u043e\u043a\u0443\u043c\u0435\u043d\u0442\u043e\u043c",
|
||||
"\u043f\u043e\u043a\u0430\u0436\u0438",
|
||||
)
|
||||
BUSINESS_REPORT_REQUEST_MARKERS = (
|
||||
"\u043e\u0431\u0437\u043e\u0440",
|
||||
"\u0430\u043d\u0430\u043b\u0438\u0437",
|
||||
"\u043f\u043e\u0434\u0440\u043e\u0431",
|
||||
"\u0440\u0430\u0437\u0432\u0435\u0440\u043d",
|
||||
"\u043e\u0446\u0435\u043d",
|
||||
"\u0430\u0443\u0434\u0438\u0442",
|
||||
)
|
||||
BUSINESS_TOP_LINE_SCAFFOLD_MARKERS = (
|
||||
"\u043e\u0433\u0440\u0430\u043d\u0438\u0447\u0435\u043d\u043d\u044b\u0439 \u0431\u0438\u0437\u043d\u0435\u0441-\u043e\u0431\u0437\u043e\u0440",
|
||||
"\u0447\u0442\u043e \u043f\u043e\u0434\u0442\u0432\u0435\u0440\u0436\u0434\u0435\u043d\u043e",
|
||||
"\u043f\u0440\u043e\u0432\u0435\u0440\u0435\u043d\u043d\u044b\u0435 \u043a\u043e\u043d\u0442\u0443\u0440\u044b",
|
||||
"\u0431\u043b\u043e\u043a 1",
|
||||
"\u0441\u0442\u0430\u0442\u0443\u0441",
|
||||
)
|
||||
BUSINESS_TECHNICAL_GARBAGE_MARKERS = (
|
||||
"mcp_discovery",
|
||||
"runtime_",
|
||||
"capability_id",
|
||||
"selected_chain_id",
|
||||
"business_overview_route_template_v1",
|
||||
"query_movements",
|
||||
"query_documents",
|
||||
)
|
||||
BUSINESS_DIRECT_ANSWER_SOFT_LIMIT = 1800
|
||||
|
||||
DEFAULT_INVARIANT_SEVERITY: dict[str, str] = {
|
||||
"wrong_intent": "P0",
|
||||
"wrong_capability": "P0",
|
||||
|
|
@ -104,6 +146,10 @@ DEFAULT_INVARIANT_SEVERITY: dict[str, str] = {
|
|||
"wrong_date_scope_state": "P0",
|
||||
"direct_answer_missing": "P0",
|
||||
"top_level_noise_present": "P0",
|
||||
"business_direct_answer_missing": "P0",
|
||||
"technical_garbage_in_answer": "P0",
|
||||
"answer_layering_noise": "P1",
|
||||
"business_answer_too_verbose": "P1",
|
||||
}
|
||||
|
||||
REPAIR_TARGET_SEVERITY_ORDER = {"P0": 0, "P1": 1, "P2": 2}
|
||||
|
|
@ -114,11 +160,12 @@ REPAIR_TARGET_PROBLEM_ORDER = {
|
|||
"object_memory_gap": 3,
|
||||
"route_gap": 4,
|
||||
"answer_shape_mismatch": 5,
|
||||
"presentation_gap": 6,
|
||||
"domain_anchor_gap": 7,
|
||||
"capability_gap": 8,
|
||||
"evidence_gap": 9,
|
||||
"other": 10,
|
||||
"business_utility_gap": 6,
|
||||
"presentation_gap": 7,
|
||||
"domain_anchor_gap": 8,
|
||||
"capability_gap": 9,
|
||||
"evidence_gap": 10,
|
||||
"other": 11,
|
||||
}
|
||||
|
||||
REPAIR_TARGET_FILE_HINTS: dict[str, list[str]] = {
|
||||
|
|
@ -157,6 +204,16 @@ REPAIR_TARGET_FILE_HINTS: dict[str, list[str]] = {
|
|||
"llm_normalizer/backend/src/services/address_runtime/composeStage.ts",
|
||||
"llm_normalizer/backend/src/services/assistantService.ts",
|
||||
],
|
||||
"answer_shape_mismatch": [
|
||||
"llm_normalizer/backend/src/services/address_runtime/composeStage.ts",
|
||||
"llm_normalizer/backend/src/services/assistantMcpDiscoveryResponseCandidate.ts",
|
||||
"llm_normalizer/backend/src/services/assistantService.ts",
|
||||
],
|
||||
"business_utility_gap": [
|
||||
"llm_normalizer/backend/src/services/address_runtime/composeStage.ts",
|
||||
"llm_normalizer/backend/src/services/assistantMcpDiscoveryResponseCandidate.ts",
|
||||
"llm_normalizer/backend/src/services/assistantService.ts",
|
||||
],
|
||||
"evidence_gap": [
|
||||
"llm_normalizer/backend/src/services/addressQueryService.ts",
|
||||
"llm_normalizer/backend/src/services/addressRecipeCatalog.ts",
|
||||
|
|
@ -1526,6 +1583,79 @@ def should_require_direct_answer(step_state: dict[str, Any]) -> bool:
|
|||
return str(step_state.get("node_role") or "").strip() in {"root", "critical_child"}
|
||||
|
||||
|
||||
def _review_text(value: Any) -> str:
|
||||
return str(value or "").strip().lower()
|
||||
|
||||
|
||||
def _marker_hits(text: str, markers: tuple[str, ...]) -> list[str]:
|
||||
lowered = _review_text(text)
|
||||
return [marker for marker in markers if marker and marker in lowered]
|
||||
|
||||
|
||||
def is_report_style_business_question(question: str) -> bool:
|
||||
return bool(_marker_hits(question, BUSINESS_REPORT_REQUEST_MARKERS))
|
||||
|
||||
|
||||
def is_direct_style_business_question(question: str) -> bool:
|
||||
if is_report_style_business_question(question):
|
||||
return False
|
||||
return bool(_marker_hits(question, BUSINESS_DIRECT_QUESTION_MARKERS))
|
||||
|
||||
|
||||
def build_business_first_review(step_state: dict[str, Any]) -> dict[str, Any]:
|
||||
question = str(step_state.get("question_resolved") or step_state.get("question_template") or "").strip()
|
||||
assistant_text = str(step_state.get("assistant_text") or "")
|
||||
top_lines = step_state.get("top_non_empty_lines") if isinstance(step_state.get("top_non_empty_lines"), list) else []
|
||||
first_line = str(top_lines[0] if top_lines else step_state.get("actual_direct_answer") or "").strip()
|
||||
direct_answer_required = should_require_direct_answer(step_state) or is_direct_style_business_question(question)
|
||||
report_style_question = is_report_style_business_question(question)
|
||||
technical_hits = _marker_hits(assistant_text, BUSINESS_TECHNICAL_GARBAGE_MARKERS)
|
||||
first_line_technical_hits = _marker_hits(first_line, BUSINESS_TECHNICAL_GARBAGE_MARKERS)
|
||||
scaffold_hits = _marker_hits(first_line, BUSINESS_TOP_LINE_SCAFFOLD_MARKERS)
|
||||
top_noise = bool(first_line and is_top_level_noise_line(first_line))
|
||||
direct_answer_first_ok = bool(first_line) and not top_noise and not scaffold_hits and not first_line_technical_hits
|
||||
too_verbose_for_direct = bool(
|
||||
direct_answer_required
|
||||
and not report_style_question
|
||||
and len(assistant_text) > BUSINESS_DIRECT_ANSWER_SOFT_LIMIT
|
||||
)
|
||||
issue_codes: list[str] = []
|
||||
if technical_hits:
|
||||
issue_codes.append("technical_garbage_in_answer")
|
||||
if direct_answer_required and not direct_answer_first_ok:
|
||||
issue_codes.append("business_direct_answer_missing")
|
||||
if scaffold_hits or top_noise:
|
||||
issue_codes.append("answer_layering_noise")
|
||||
if too_verbose_for_direct:
|
||||
issue_codes.append("business_answer_too_verbose")
|
||||
|
||||
root_cause_layers: list[str] = []
|
||||
if "business_direct_answer_missing" in issue_codes or "answer_layering_noise" in issue_codes:
|
||||
root_cause_layers.append("answer_shape_mismatch")
|
||||
if "business_answer_too_verbose" in issue_codes or "technical_garbage_in_answer" in issue_codes:
|
||||
root_cause_layers.append("business_utility_gap")
|
||||
|
||||
return {
|
||||
"schema_version": "business_first_step_review_v1",
|
||||
"question": question,
|
||||
"direct_answer_required": direct_answer_required,
|
||||
"report_style_question": report_style_question,
|
||||
"answer_length_chars": len(assistant_text),
|
||||
"answer_line_count": len([line for line in assistant_text.splitlines() if line.strip()]),
|
||||
"actual_direct_answer": first_line or None,
|
||||
"direct_answer_first_ok": (not direct_answer_required) or direct_answer_first_ok,
|
||||
"answer_layering_ok": not scaffold_hits and not top_noise,
|
||||
"technical_garbage_present": bool(technical_hits),
|
||||
"technical_garbage_hits": technical_hits,
|
||||
"top_line_scaffold_present": bool(scaffold_hits or top_noise),
|
||||
"top_line_scaffold_hits": scaffold_hits,
|
||||
"too_verbose_for_direct_question": too_verbose_for_direct,
|
||||
"business_usefulness_ok": not issue_codes,
|
||||
"issue_codes": issue_codes,
|
||||
"suggested_root_cause_layers": list(dict.fromkeys(root_cause_layers)),
|
||||
}
|
||||
|
||||
|
||||
def is_top_level_noise_line(line: str) -> bool:
|
||||
cleaned = str(line or "").strip()
|
||||
if not cleaned:
|
||||
|
|
@ -1563,6 +1693,8 @@ def validate_step_contract(step_state: dict[str, Any]) -> dict[str, Any]:
|
|||
date_scope = state.get("date_scope") if isinstance(state.get("date_scope"), dict) else {}
|
||||
violated_invariants: list[str] = []
|
||||
warnings: list[str] = []
|
||||
business_review = build_business_first_review(state)
|
||||
state["business_first_review"] = business_review
|
||||
|
||||
expected_intents = normalize_string_list(state.get("expected_intents"))
|
||||
if expected_intents and not identifier_in_list(state.get("detected_intent"), expected_intents):
|
||||
|
|
@ -1645,6 +1777,13 @@ def validate_step_contract(step_state: dict[str, Any]) -> dict[str, Any]:
|
|||
if first_top_line and is_top_level_noise_line(first_top_line):
|
||||
violated_invariants.append("top_level_noise_present")
|
||||
|
||||
for issue_code in normalize_string_list(business_review.get("issue_codes")):
|
||||
if issue_code == "business_answer_too_verbose":
|
||||
warnings.append(issue_code)
|
||||
violated_invariants.append(issue_code)
|
||||
continue
|
||||
violated_invariants.append(issue_code)
|
||||
|
||||
forbidden_answer_patterns = normalize_string_list(state.get("forbidden_answer_patterns"))
|
||||
if forbidden_answer_patterns and top_non_empty_lines:
|
||||
joined_top_block = "\n".join(str(line) for line in top_non_empty_lines)
|
||||
|
|
@ -2697,6 +2836,7 @@ def compact_step_output_for_review(step_output: Any) -> dict[str, Any]:
|
|||
"result_mode": step_output.get("result_mode"),
|
||||
"answer_shape": step_output.get("answer_shape"),
|
||||
"actual_direct_answer": step_output.get("actual_direct_answer"),
|
||||
"business_first_review": step_output.get("business_first_review"),
|
||||
"violated_invariants": step_output.get("violated_invariants"),
|
||||
"warnings": step_output.get("warnings"),
|
||||
"fallback_type": step_output.get("fallback_type"),
|
||||
|
|
@ -2742,6 +2882,9 @@ def derive_repair_target_severity(step_output: dict[str, Any]) -> str:
|
|||
return "P1"
|
||||
if execution_status in {"partial", "needs_exact_capability"} or reply_type == "partial_coverage":
|
||||
return "P1"
|
||||
violated_invariants = normalize_string_list(step_output.get("violated_invariants"))
|
||||
if any(derive_invariant_severity(step_output, code) == "P1" for code in violated_invariants):
|
||||
return "P1"
|
||||
if normalize_string_list(step_output.get("warnings")):
|
||||
return "P2"
|
||||
return "P2"
|
||||
|
|
@ -2772,6 +2915,10 @@ def derive_repair_problem_type(step_output: dict[str, Any]) -> str:
|
|||
"forbidden_recipe_selected",
|
||||
} & violated:
|
||||
return "route_gap"
|
||||
if {"business_direct_answer_missing", "answer_layering_noise"} & violated:
|
||||
return "answer_shape_mismatch"
|
||||
if {"business_answer_too_verbose", "technical_garbage_in_answer"} & violated:
|
||||
return "business_utility_gap"
|
||||
if {"direct_answer_missing", "top_level_noise_present"} & violated:
|
||||
return "presentation_gap"
|
||||
if mcp_call_status == "materialized_but_not_anchor_matched":
|
||||
|
|
@ -2808,6 +2955,13 @@ def derive_repair_root_cause_layers(step_output: dict[str, Any], problem_type: s
|
|||
layers.append("business_utility_gap")
|
||||
if str(step_output.get("required_answer_shape") or "").strip():
|
||||
layers.append("answer_shape_mismatch")
|
||||
elif problem_type == "answer_shape_mismatch":
|
||||
layers.append("answer_shape_mismatch")
|
||||
layers.append("business_utility_gap")
|
||||
elif problem_type == "business_utility_gap":
|
||||
layers.append("business_utility_gap")
|
||||
if "answer_layering_noise" in violated:
|
||||
layers.append("answer_shape_mismatch")
|
||||
elif problem_type == "evidence_gap":
|
||||
layers.append("runtime_capability_gap")
|
||||
elif problem_type == "domain_anchor_gap":
|
||||
|
|
@ -2833,6 +2987,10 @@ def build_repair_fix_goal(step_output: dict[str, Any], problem_type: str) -> str
|
|||
return f"Enable an exact route for `{question}` so the loop no longer falls back to partial or unsupported behavior."
|
||||
if problem_type == "presentation_gap":
|
||||
return f"Make `{question}` answer-first: direct business answer in the first line, proof second, service notes last."
|
||||
if problem_type == "answer_shape_mismatch":
|
||||
return f"Make `{question}` start with the exact business answer requested, then put proof and caveats after it."
|
||||
if problem_type == "business_utility_gap":
|
||||
return f"Make `{question}` useful for a business reader: remove technical/scaffold noise and keep direct answers compact."
|
||||
if problem_type == "evidence_gap":
|
||||
return f"Return grounded evidence for `{question}` instead of a limited empty response when the correct route already fires."
|
||||
if problem_type == "domain_anchor_gap":
|
||||
|
|
|
|||
|
|
@ -309,6 +309,46 @@ def append_finding(
|
|||
)
|
||||
|
||||
|
||||
BUSINESS_REVIEW_FINDING_MESSAGES = {
|
||||
"technical_garbage_in_answer": "User-facing answer leaked internal runtime or MCP identifiers.",
|
||||
"business_direct_answer_missing": "The answer did not put the direct business answer first.",
|
||||
"answer_layering_noise": "The answer opened with scaffolding or report framing instead of a clean business result.",
|
||||
"business_answer_too_verbose": "The answer is too verbose for a direct business question.",
|
||||
}
|
||||
|
||||
BUSINESS_REVIEW_FINDING_SEVERITY = {
|
||||
"technical_garbage_in_answer": "critical",
|
||||
"business_direct_answer_missing": "critical",
|
||||
"answer_layering_noise": "critical",
|
||||
"business_answer_too_verbose": "warning",
|
||||
}
|
||||
|
||||
|
||||
def append_business_review_findings(findings: list[dict[str, Any]], step: dict[str, Any], step_state: dict[str, Any]) -> None:
|
||||
business_review = step_state.get("business_first_review")
|
||||
if not isinstance(business_review, dict):
|
||||
return
|
||||
for issue_code in dcl.normalize_string_list(business_review.get("issue_codes")):
|
||||
append_finding(
|
||||
findings,
|
||||
step,
|
||||
f"business_review:{issue_code}",
|
||||
BUSINESS_REVIEW_FINDING_MESSAGES.get(issue_code, "Business-first answer review detected a semantic quality issue."),
|
||||
actual={
|
||||
"direct_answer": business_review.get("actual_direct_answer"),
|
||||
"answer_length_chars": business_review.get("answer_length_chars"),
|
||||
"technical_garbage_hits": business_review.get("technical_garbage_hits"),
|
||||
"top_line_scaffold_hits": business_review.get("top_line_scaffold_hits"),
|
||||
},
|
||||
expected={
|
||||
"direct_answer_first_ok": True,
|
||||
"business_usefulness_ok": True,
|
||||
"answer_layering_ok": True,
|
||||
},
|
||||
severity=BUSINESS_REVIEW_FINDING_SEVERITY.get(issue_code, step.get("criticality") or DEFAULT_CRITICALITY),
|
||||
)
|
||||
|
||||
|
||||
def matches_any_pattern(text: str, patterns: list[str]) -> bool:
|
||||
return any(re.search(pattern, text, flags=re.IGNORECASE) for pattern in patterns if pattern)
|
||||
|
||||
|
|
@ -355,6 +395,7 @@ def evaluate_truth_step(
|
|||
extracted_filters = (
|
||||
step_state.get("extracted_filters") if isinstance(step_state.get("extracted_filters"), dict) else {}
|
||||
)
|
||||
append_business_review_findings(findings, step, step_state)
|
||||
|
||||
if (
|
||||
catalog_alignment_status in {"selected_lower_rank", "selected_outside_match_set"}
|
||||
|
|
@ -751,6 +792,101 @@ def build_truth_review_summary(spec: dict[str, Any], scenario_state: dict[str, A
|
|||
}
|
||||
|
||||
|
||||
def build_business_review_summary(spec: dict[str, Any], scenario_state: dict[str, Any]) -> dict[str, Any]:
|
||||
step_outputs = scenario_state.get("step_outputs") if isinstance(scenario_state.get("step_outputs"), dict) else {}
|
||||
steps: list[dict[str, Any]] = []
|
||||
issue_counts: dict[str, int] = {}
|
||||
for index, step in enumerate(spec["steps"], start=1):
|
||||
step_state = step_outputs.get(step["step_id"], {})
|
||||
business_review = (
|
||||
step_state.get("business_first_review")
|
||||
if isinstance(step_state, dict) and isinstance(step_state.get("business_first_review"), dict)
|
||||
else {}
|
||||
)
|
||||
issue_codes = dcl.normalize_string_list(business_review.get("issue_codes"))
|
||||
for issue_code in issue_codes:
|
||||
issue_counts[issue_code] = issue_counts.get(issue_code, 0) + 1
|
||||
steps.append(
|
||||
{
|
||||
"index": index,
|
||||
"step_id": step["step_id"],
|
||||
"question": step["question_template"],
|
||||
"review_status": step_state.get("review_status") if isinstance(step_state, dict) else None,
|
||||
"direct_answer": business_review.get("actual_direct_answer"),
|
||||
"answer_length_chars": business_review.get("answer_length_chars"),
|
||||
"direct_answer_required": business_review.get("direct_answer_required"),
|
||||
"direct_answer_first_ok": business_review.get("direct_answer_first_ok"),
|
||||
"business_usefulness_ok": business_review.get("business_usefulness_ok"),
|
||||
"answer_layering_ok": business_review.get("answer_layering_ok"),
|
||||
"technical_garbage_present": business_review.get("technical_garbage_present"),
|
||||
"too_verbose_for_direct_question": business_review.get("too_verbose_for_direct_question"),
|
||||
"issue_codes": issue_codes,
|
||||
"suggested_root_cause_layers": business_review.get("suggested_root_cause_layers") or [],
|
||||
}
|
||||
)
|
||||
failed = sum(
|
||||
1
|
||||
for step in steps
|
||||
if any(
|
||||
issue in {"technical_garbage_in_answer", "business_direct_answer_missing", "answer_layering_noise"}
|
||||
for issue in step["issue_codes"]
|
||||
)
|
||||
)
|
||||
warnings = sum(1 for step in steps if "business_answer_too_verbose" in step["issue_codes"])
|
||||
return {
|
||||
"schema_version": "business_first_run_review_v1",
|
||||
"scenario_id": spec["scenario_id"],
|
||||
"domain": spec["domain"],
|
||||
"title": spec["title"],
|
||||
"session_id": scenario_state.get("session_id"),
|
||||
"steps_total": len(steps),
|
||||
"steps_with_business_failures": failed,
|
||||
"steps_with_business_warnings": warnings,
|
||||
"issue_counts": issue_counts,
|
||||
"overall_business_status": "fail" if failed else ("warning" if warnings else "pass"),
|
||||
"steps": steps,
|
||||
}
|
||||
|
||||
|
||||
def build_business_review_markdown(business_review: dict[str, Any]) -> str:
|
||||
lines = [
|
||||
"# Business-first review",
|
||||
"",
|
||||
f"- scenario_id: `{business_review.get('scenario_id') or 'n/a'}`",
|
||||
f"- domain: `{business_review.get('domain') or 'n/a'}`",
|
||||
f"- title: {business_review.get('title') or 'n/a'}",
|
||||
f"- session_id: `{business_review.get('session_id') or 'n/a'}`",
|
||||
f"- overall_business_status: `{business_review.get('overall_business_status') or 'n/a'}`",
|
||||
f"- steps_total: `{business_review.get('steps_total')}`",
|
||||
f"- steps_with_business_failures: `{business_review.get('steps_with_business_failures')}`",
|
||||
f"- steps_with_business_warnings: `{business_review.get('steps_with_business_warnings')}`",
|
||||
f"- issue_counts: `{dump_json(business_review.get('issue_counts') or {})}`",
|
||||
"",
|
||||
"## Human Answer Surface",
|
||||
]
|
||||
for step in business_review.get("steps") or []:
|
||||
if not isinstance(step, dict):
|
||||
continue
|
||||
lines.extend(
|
||||
[
|
||||
f"{step.get('index')}. `{step.get('step_id')}` - {step.get('question')}",
|
||||
f"review_status: `{step.get('review_status') or 'n/a'}`",
|
||||
f"direct_answer: {step.get('direct_answer') or 'n/a'}",
|
||||
f"answer_length_chars: `{step.get('answer_length_chars')}`",
|
||||
f"direct_answer_required: `{step.get('direct_answer_required')}`",
|
||||
f"direct_answer_first_ok: `{step.get('direct_answer_first_ok')}`",
|
||||
f"business_usefulness_ok: `{step.get('business_usefulness_ok')}`",
|
||||
f"answer_layering_ok: `{step.get('answer_layering_ok')}`",
|
||||
f"technical_garbage_present: `{step.get('technical_garbage_present')}`",
|
||||
f"too_verbose_for_direct_question: `{step.get('too_verbose_for_direct_question')}`",
|
||||
f"issue_codes: `{', '.join(step.get('issue_codes') or []) or 'none'}`",
|
||||
f"suggested_root_cause_layers: `{', '.join(step.get('suggested_root_cause_layers') or []) or 'none'}`",
|
||||
"",
|
||||
]
|
||||
)
|
||||
return "\n".join(lines).strip() + "\n"
|
||||
|
||||
|
||||
def build_truth_review_markdown(spec: dict[str, Any], scenario_state: dict[str, Any], review_summary: dict[str, Any]) -> str:
|
||||
lines = [
|
||||
"# Truth harness review",
|
||||
|
|
@ -772,6 +908,11 @@ def build_truth_review_markdown(spec: dict[str, Any], scenario_state: dict[str,
|
|||
for index, step in enumerate(spec["steps"], start=1):
|
||||
step_state = step_outputs.get(step["step_id"], {})
|
||||
findings = step_state.get("review_findings") if isinstance(step_state.get("review_findings"), list) else []
|
||||
business_review = (
|
||||
step_state.get("business_first_review")
|
||||
if isinstance(step_state, dict) and isinstance(step_state.get("business_first_review"), dict)
|
||||
else {}
|
||||
)
|
||||
lines.extend(
|
||||
[
|
||||
f"{index}. `{step['step_id']}` - {step['question_template']}",
|
||||
|
|
@ -786,6 +927,11 @@ def build_truth_review_markdown(spec: dict[str, Any], scenario_state: dict[str,
|
|||
f"limited_reason_category: `{step_state.get('limited_reason_category') or 'n/a'}`",
|
||||
f"filters: `{dump_json(step_state.get('extracted_filters') or {})}`",
|
||||
f"direct_answer: {step_state.get('actual_direct_answer') or 'n/a'}",
|
||||
f"business_first: status=`{business_review.get('business_usefulness_ok')}`, "
|
||||
f"direct_first=`{business_review.get('direct_answer_first_ok')}`, "
|
||||
f"layering=`{business_review.get('answer_layering_ok')}`, "
|
||||
f"length=`{business_review.get('answer_length_chars')}`, "
|
||||
f"issues=`{', '.join(business_review.get('issue_codes') or []) or 'none'}`",
|
||||
]
|
||||
)
|
||||
if step.get("notes"):
|
||||
|
|
@ -964,9 +1110,12 @@ def review_export(spec: dict[str, Any], export_path: Path, output_dir: Path) ->
|
|||
scenario_state["updated_at"] = datetime.now(timezone.utc).replace(microsecond=0).isoformat()
|
||||
review_summary = build_truth_review_summary(spec, scenario_state, f"export:{export_path}")
|
||||
review_markdown = build_truth_review_markdown(spec, scenario_state, review_summary)
|
||||
business_review = build_business_review_summary(spec, scenario_state)
|
||||
write_json(output_dir / "scenario_state.json", scenario_state)
|
||||
write_json(output_dir / "truth_review.json", {"summary": review_summary, "steps": scenario_state["step_outputs"]})
|
||||
write_text(output_dir / "truth_review.md", review_markdown)
|
||||
write_json(output_dir / "business_review.json", business_review)
|
||||
write_text(output_dir / "business_review.md", build_business_review_markdown(business_review))
|
||||
acceptance_bundle = write_acceptance_artifacts(output_dir, spec, scenario_state, review_summary)
|
||||
return {
|
||||
"scenario_state": scenario_state,
|
||||
|
|
@ -1056,10 +1205,13 @@ def run_live(spec: dict[str, Any], output_dir: Path, args: argparse.Namespace) -
|
|||
|
||||
review_summary = build_truth_review_summary(spec, scenario_state, "live_strict_replay")
|
||||
review_markdown = build_truth_review_markdown(spec, scenario_state, review_summary)
|
||||
business_review = build_business_review_summary(spec, scenario_state)
|
||||
write_text(output_dir / "session_id.txt", f"{scenario_state.get('session_id') or ''}\n")
|
||||
write_json(output_dir / "scenario_state.json", scenario_state)
|
||||
write_json(output_dir / "truth_review.json", {"summary": review_summary, "steps": scenario_state["step_outputs"]})
|
||||
write_text(output_dir / "truth_review.md", review_markdown)
|
||||
write_json(output_dir / "business_review.json", business_review)
|
||||
write_text(output_dir / "business_review.md", build_business_review_markdown(business_review))
|
||||
acceptance_bundle = write_acceptance_artifacts(output_dir, spec, scenario_state, review_summary)
|
||||
print(f"[truth-harness] saved artifacts to {output_dir}")
|
||||
print(f"[truth-harness] overall_status={review_summary['overall_status']}")
|
||||
|
|
|
|||
|
|
@ -0,0 +1,634 @@
|
|||
#!/usr/bin/env python3
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import re
|
||||
import sys
|
||||
from collections import Counter
|
||||
from datetime import datetime, timezone
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
import domain_case_loop as dcl
|
||||
|
||||
|
||||
REPO_ROOT = Path(__file__).resolve().parents[1]
|
||||
DEFAULT_SESSIONS_DIR = REPO_ROOT / "llm_normalizer" / "data" / "assistant_sessions"
|
||||
DEFAULT_REPORTS_DIR = REPO_ROOT / "llm_normalizer" / "reports"
|
||||
DEFAULT_OUTPUT_ROOT = REPO_ROOT / "artifacts" / "domain_runs" / "gui_run_reviews"
|
||||
RUN_REVIEW_SCHEMA_VERSION = "assistant_stage1_run_review_v1"
|
||||
QUESTION_QUALITY_SCHEMA_VERSION = "assistant_stage1_question_quality_v1"
|
||||
|
||||
DOMAIN_MARKERS: dict[str, tuple[str, ...]] = {
|
||||
"vat": ("ндс", "налог", "вычет", "счет-фактур"),
|
||||
"money": ("деньг", "заработ", "доход", "выруч", "поступлен", "оплат", "оборот"),
|
||||
"counterparty": ("контрагент", "клиент", "покупател", "поставщик", "группа свк", "свк", "чепурнов", "альтернатива"),
|
||||
"inventory": ("склад", "товар", "остат", "закуп", "продаж", "номенклатур"),
|
||||
"debt": ("долг", "должен", "должны", "должн", "дебитор", "кредитор", "счет 60", "счет 62", "хвост"),
|
||||
"documents": ("документ", "доки", "накладн", "акт", "платеж", "реализац", "поступлени"),
|
||||
}
|
||||
SMALLTALK_MARKERS = ("привет", "как дела", "что умеешь", "что можешь", "расскажи что можешь")
|
||||
FOLLOWUP_MARKERS = (
|
||||
"по ней",
|
||||
"по нему",
|
||||
"по этой",
|
||||
"по этому",
|
||||
"по выбран",
|
||||
"теперь",
|
||||
"тогда",
|
||||
"давай на",
|
||||
"а еще",
|
||||
"еще",
|
||||
"эту",
|
||||
"его",
|
||||
"ее",
|
||||
"этот",
|
||||
"эта",
|
||||
"сравни",
|
||||
"а если",
|
||||
"а нам",
|
||||
"почему",
|
||||
"а кому",
|
||||
"кому ",
|
||||
)
|
||||
DATE_ONLY_FOLLOWUP_PATTERN = re.compile(
|
||||
r"^\s*(?:давай\s+)?(?:на\s+)?(?:январ[ьяе]|феврал[ьяе]|март[ае]?|апрел[ьяе]|ма[йяе]|июн[ьяе]|июл[ьяе]|август[ае]?|сентябр[ьяе]|октябр[ьяе]|ноябр[ьяе]|декабр[ьяе])\s+\d{4}\s*$",
|
||||
re.IGNORECASE,
|
||||
)
|
||||
FALSE_CATASTROPHE_MARKERS = (
|
||||
"все сломалось",
|
||||
"разъехалось",
|
||||
"разъеб",
|
||||
"пиздец",
|
||||
"хуйня",
|
||||
"косяк",
|
||||
"неправильно",
|
||||
)
|
||||
BUSINESS_NOUN_MARKERS = tuple(sorted({item for values in DOMAIN_MARKERS.values() for item in values}))
|
||||
|
||||
|
||||
def now_iso() -> str:
|
||||
return datetime.now(timezone.utc).replace(microsecond=0).isoformat()
|
||||
|
||||
|
||||
def load_json(path: Path) -> Any:
|
||||
return json.loads(path.read_text(encoding="utf-8-sig"))
|
||||
|
||||
|
||||
def write_text(path: Path, text: str) -> None:
|
||||
path.parent.mkdir(parents=True, exist_ok=True)
|
||||
path.write_text(text, encoding="utf-8")
|
||||
|
||||
|
||||
def write_json(path: Path, payload: Any) -> None:
|
||||
write_text(path, json.dumps(payload, ensure_ascii=False, indent=2) + "\n")
|
||||
|
||||
|
||||
def repo_relative(path: Path) -> str:
|
||||
try:
|
||||
return str(path.resolve().relative_to(REPO_ROOT))
|
||||
except ValueError:
|
||||
return str(path.resolve())
|
||||
|
||||
|
||||
def normalize_text(value: Any) -> str:
|
||||
return re.sub(r"\s+", " ", str(value or "").strip().lower())
|
||||
|
||||
|
||||
def compact_preview(value: Any, limit: int = 260) -> str:
|
||||
text = re.sub(r"\s+", " ", str(value or "").strip())
|
||||
if len(text) <= limit:
|
||||
return text
|
||||
return text[: limit - 1].rstrip() + "..."
|
||||
|
||||
|
||||
def has_any(text: str, markers: tuple[str, ...]) -> bool:
|
||||
lowered = normalize_text(text)
|
||||
return any(marker in lowered for marker in markers)
|
||||
|
||||
|
||||
def run_id_from_value(value: str) -> str:
|
||||
text = str(value or "").strip()
|
||||
match = re.search(r"(assistant-stage1-[A-Za-z0-9_-]+)", text)
|
||||
if not match:
|
||||
raise RuntimeError(f"Cannot parse assistant-stage1 run id from: {value}")
|
||||
return match.group(1)
|
||||
|
||||
|
||||
def parse_report_metadata(report_path: Path) -> dict[str, Any]:
|
||||
if not report_path.exists():
|
||||
return {}
|
||||
metadata: dict[str, Any] = {"report_path": repo_relative(report_path)}
|
||||
for line in report_path.read_text(encoding="utf-8-sig").splitlines()[:80]:
|
||||
match = re.match(r"^-\s*([^:]+):\s*(.*)$", line.strip())
|
||||
if match:
|
||||
metadata[match.group(1).strip()] = match.group(2).strip()
|
||||
return metadata
|
||||
|
||||
|
||||
def resolve_session_files(
|
||||
*,
|
||||
run_id: str,
|
||||
sessions_dir: Path,
|
||||
explicit_session_file: Path | None = None,
|
||||
) -> list[Path]:
|
||||
if explicit_session_file is not None:
|
||||
if not explicit_session_file.exists():
|
||||
raise RuntimeError(f"Session file not found: {explicit_session_file}")
|
||||
return [explicit_session_file]
|
||||
candidates = sorted(sessions_dir.glob(f"{run_id}-*.json"))
|
||||
if not candidates:
|
||||
raise RuntimeError(f"No assistant session files found for {run_id} in {sessions_dir}")
|
||||
return candidates
|
||||
|
||||
|
||||
def load_session(path: Path) -> dict[str, Any]:
|
||||
payload = load_json(path)
|
||||
if not isinstance(payload, dict):
|
||||
raise RuntimeError(f"Assistant session must be a JSON object: {path}")
|
||||
conversation = payload.get("conversation")
|
||||
if not isinstance(conversation, list):
|
||||
raise RuntimeError(f"Assistant session has no conversation[]: {path}")
|
||||
return payload
|
||||
|
||||
|
||||
def build_conversation_pairs(conversation: list[dict[str, Any]]) -> list[dict[str, Any]]:
|
||||
pairs: list[dict[str, Any]] = []
|
||||
for index, item in enumerate(conversation):
|
||||
if not isinstance(item, dict) or item.get("role") != "user":
|
||||
continue
|
||||
assistant_item: dict[str, Any] | None = None
|
||||
for candidate in conversation[index + 1 :]:
|
||||
if isinstance(candidate, dict) and candidate.get("role") == "assistant":
|
||||
assistant_item = candidate
|
||||
break
|
||||
if isinstance(candidate, dict) and candidate.get("role") == "user":
|
||||
break
|
||||
pairs.append(
|
||||
{
|
||||
"pair_index": len(pairs) + 1,
|
||||
"user": item,
|
||||
"assistant": assistant_item,
|
||||
}
|
||||
)
|
||||
return pairs
|
||||
|
||||
|
||||
def classify_question(question: str, pair_index: int) -> dict[str, Any]:
|
||||
normalized = normalize_text(question)
|
||||
tags: list[str] = []
|
||||
domains: list[str] = []
|
||||
if has_any(normalized, SMALLTALK_MARKERS):
|
||||
tags.append("smalltalk_or_meta")
|
||||
if dcl.is_direct_style_business_question(normalized):
|
||||
tags.append("direct_business_question")
|
||||
if dcl.is_report_style_business_question(normalized):
|
||||
tags.append("report_or_analysis_request")
|
||||
date_only_followup = pair_index > 1 and bool(DATE_ONLY_FOLLOWUP_PATTERN.match(question))
|
||||
if has_any(normalized, FOLLOWUP_MARKERS) or date_only_followup:
|
||||
tags.append("contextual_followup")
|
||||
if has_any(normalized, FALSE_CATASTROPHE_MARKERS):
|
||||
tags.append("false_catastrophe_or_negative_pressure")
|
||||
for domain, markers in DOMAIN_MARKERS.items():
|
||||
if has_any(normalized, markers):
|
||||
domains.append(domain)
|
||||
if domains:
|
||||
tags.append("domain_grounded")
|
||||
if not tags:
|
||||
tags.append("unclassified")
|
||||
|
||||
weak_flags: list[str] = []
|
||||
if pair_index == 1 and "contextual_followup" in tags and "smalltalk_or_meta" not in tags:
|
||||
weak_flags.append("root_question_requires_missing_context")
|
||||
if len(question) > 500:
|
||||
weak_flags.append("question_too_long")
|
||||
if (
|
||||
"smalltalk_or_meta" not in tags
|
||||
and "contextual_followup" not in tags
|
||||
and "report_or_analysis_request" not in tags
|
||||
and not domains
|
||||
and not has_any(normalized, BUSINESS_NOUN_MARKERS)
|
||||
):
|
||||
weak_flags.append("low_business_anchor")
|
||||
|
||||
return {
|
||||
"question": question,
|
||||
"tags": tags,
|
||||
"domains": domains,
|
||||
"weak_flags": weak_flags,
|
||||
"length_chars": len(question),
|
||||
}
|
||||
|
||||
|
||||
def build_question_quality_review(pairs: list[dict[str, Any]]) -> dict[str, Any]:
|
||||
question_reviews: list[dict[str, Any]] = []
|
||||
question_counter: Counter[str] = Counter()
|
||||
for pair in pairs:
|
||||
question = str(pair.get("user", {}).get("text") or "")
|
||||
normalized = normalize_text(question)
|
||||
if normalized:
|
||||
question_counter[normalized] += 1
|
||||
question_reviews.append(classify_question(question, int(pair["pair_index"])))
|
||||
|
||||
tag_counts = Counter(tag for item in question_reviews for tag in item["tags"])
|
||||
domain_counts = Counter(domain for item in question_reviews for domain in item["domains"])
|
||||
weak_flag_counts = Counter(flag for item in question_reviews for flag in item["weak_flags"])
|
||||
duplicate_questions = [question for question, count in question_counter.items() if count > 1]
|
||||
if duplicate_questions:
|
||||
weak_flag_counts["duplicate_questions"] += len(duplicate_questions)
|
||||
if tag_counts["contextual_followup"] < 2 and len(question_reviews) >= 8:
|
||||
weak_flag_counts["too_few_contextual_followups"] += 1
|
||||
if tag_counts["direct_business_question"] < 3 and len(question_reviews) >= 8:
|
||||
weak_flag_counts["too_few_direct_business_questions"] += 1
|
||||
if tag_counts["report_or_analysis_request"] < 1 and len(question_reviews) >= 8:
|
||||
weak_flag_counts["missing_report_or_analysis_request"] += 1
|
||||
if len(domain_counts) < 3 and len(question_reviews) >= 8:
|
||||
weak_flag_counts["low_domain_diversity"] += 1
|
||||
|
||||
score = 100
|
||||
score -= min(30, weak_flag_counts["low_business_anchor"] * 6)
|
||||
score -= min(20, weak_flag_counts["question_too_long"] * 5)
|
||||
score -= min(20, weak_flag_counts["duplicate_questions"] * 5)
|
||||
score -= 12 if weak_flag_counts["too_few_contextual_followups"] else 0
|
||||
score -= 12 if weak_flag_counts["too_few_direct_business_questions"] else 0
|
||||
score -= 10 if weak_flag_counts["missing_report_or_analysis_request"] else 0
|
||||
score -= 10 if weak_flag_counts["low_domain_diversity"] else 0
|
||||
score -= 20 if weak_flag_counts["root_question_requires_missing_context"] else 0
|
||||
score = max(0, min(100, score))
|
||||
|
||||
if score >= 85:
|
||||
status = "strong"
|
||||
elif score >= 70:
|
||||
status = "usable_with_gaps"
|
||||
else:
|
||||
status = "weak"
|
||||
|
||||
return {
|
||||
"schema_version": QUESTION_QUALITY_SCHEMA_VERSION,
|
||||
"status": status,
|
||||
"score": score,
|
||||
"turns_total": len(question_reviews),
|
||||
"tag_counts": dict(sorted(tag_counts.items())),
|
||||
"domain_counts": dict(sorted(domain_counts.items())),
|
||||
"weak_flag_counts": dict(sorted(weak_flag_counts.items())),
|
||||
"duplicate_questions": duplicate_questions[:20],
|
||||
"questions": question_reviews,
|
||||
}
|
||||
|
||||
|
||||
def build_step_for_pair(pair: dict[str, Any]) -> dict[str, Any]:
|
||||
pair_index = int(pair["pair_index"])
|
||||
question = str(pair.get("user", {}).get("text") or "").strip()
|
||||
title = compact_preview(question, limit=80) or f"Turn {pair_index}"
|
||||
return {
|
||||
"step_id": f"turn_{pair_index:03d}",
|
||||
"title": title,
|
||||
"depends_on": [],
|
||||
"question_template": question,
|
||||
"invariant_severity": {
|
||||
"answer_layering_noise": "P1",
|
||||
"business_answer_too_verbose": "P1",
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
def build_step_state_for_pair(
|
||||
*,
|
||||
run_id: str,
|
||||
session: dict[str, Any],
|
||||
pair: dict[str, Any],
|
||||
) -> dict[str, Any]:
|
||||
pair_index = int(pair["pair_index"])
|
||||
question = str(pair.get("user", {}).get("text") or "").strip()
|
||||
assistant_item = pair.get("assistant") if isinstance(pair.get("assistant"), dict) else {}
|
||||
assistant_text = str(assistant_item.get("text") or "")
|
||||
debug = assistant_item.get("debug") if isinstance(assistant_item.get("debug"), dict) else {}
|
||||
turn_artifact = {
|
||||
"schema_version": "assistant_stage1_gui_turn_artifact_v1",
|
||||
"run_id": run_id,
|
||||
"session_id": session.get("session_id"),
|
||||
"pair_index": pair_index,
|
||||
"user_message": pair.get("user"),
|
||||
"assistant_message": assistant_item,
|
||||
"technical_debug_payload": debug,
|
||||
"session_summary": {
|
||||
"session_id": session.get("session_id"),
|
||||
"started_at": session.get("started_at"),
|
||||
"updated_at": session.get("updated_at"),
|
||||
"address_navigation_state": session.get("address_navigation_state"),
|
||||
"investigation_state": session.get("investigation_state"),
|
||||
"counters": session.get("counters"),
|
||||
"reply_types": session.get("reply_types"),
|
||||
},
|
||||
}
|
||||
entries = dcl.extract_structured_entries(assistant_text)
|
||||
return dcl.build_scenario_step_state(
|
||||
scenario_id=run_id,
|
||||
domain="assistant_stage1_gui_run",
|
||||
step=build_step_for_pair(pair),
|
||||
step_index=pair_index,
|
||||
question_resolved=question,
|
||||
analysis_context={},
|
||||
turn_artifact=turn_artifact,
|
||||
entries=entries,
|
||||
)
|
||||
|
||||
|
||||
def severity_rank(severity: str) -> int:
|
||||
return {"P0": 0, "P1": 1, "P2": 2, "WARNING": 3}.get(str(severity or "").upper(), 4)
|
||||
|
||||
|
||||
def max_issue_severity(step_state: dict[str, Any], issue_codes: list[str]) -> str:
|
||||
if not issue_codes:
|
||||
return "none"
|
||||
severities = [dcl.derive_invariant_severity(step_state, code) for code in issue_codes]
|
||||
return sorted(severities, key=severity_rank)[0]
|
||||
|
||||
|
||||
def build_finding(step_state: dict[str, Any], session_id: str | None) -> dict[str, Any] | None:
|
||||
review = step_state.get("business_first_review") if isinstance(step_state.get("business_first_review"), dict) else {}
|
||||
issue_codes = [str(item) for item in review.get("issue_codes", []) if str(item).strip()]
|
||||
if not issue_codes:
|
||||
return None
|
||||
issue_severities = {code: dcl.derive_invariant_severity(step_state, code) for code in issue_codes}
|
||||
severity = max_issue_severity(step_state, issue_codes)
|
||||
return {
|
||||
"finding_type": "business_answer_quality",
|
||||
"severity": severity,
|
||||
"issue_severities": issue_severities,
|
||||
"session_id": session_id,
|
||||
"turn_index": step_state.get("step_index"),
|
||||
"step_id": step_state.get("step_id"),
|
||||
"question": step_state.get("question_resolved"),
|
||||
"assistant_first_line": review.get("actual_direct_answer"),
|
||||
"issue_codes": issue_codes,
|
||||
"suggested_root_cause_layers": review.get("suggested_root_cause_layers") or [],
|
||||
"answer_length_chars": review.get("answer_length_chars"),
|
||||
"reply_type": step_state.get("reply_type"),
|
||||
"trace_id": step_state.get("trace_id"),
|
||||
"capability_id": step_state.get("capability_id"),
|
||||
"selected_recipe": step_state.get("selected_recipe"),
|
||||
}
|
||||
|
||||
|
||||
def build_repair_targets(findings: list[dict[str, Any]]) -> list[dict[str, Any]]:
|
||||
grouped: dict[tuple[str, str], dict[str, Any]] = {}
|
||||
for finding in findings:
|
||||
issue_codes = [str(item) for item in finding.get("issue_codes", []) if str(item).strip()]
|
||||
layers = [str(item) for item in finding.get("suggested_root_cause_layers", []) if str(item).strip()]
|
||||
if not layers:
|
||||
layers = ["business_answer_quality_gap"]
|
||||
for issue_code in issue_codes:
|
||||
issue_severity = (
|
||||
finding.get("issue_severities", {}).get(issue_code)
|
||||
if isinstance(finding.get("issue_severities"), dict)
|
||||
else finding.get("severity")
|
||||
)
|
||||
for layer in layers:
|
||||
key = (layer, issue_code)
|
||||
target = grouped.setdefault(
|
||||
key,
|
||||
{
|
||||
"problem_layer": layer,
|
||||
"issue_code": issue_code,
|
||||
"severity": issue_severity,
|
||||
"occurrences": 0,
|
||||
"sample_turns": [],
|
||||
},
|
||||
)
|
||||
target["occurrences"] += 1
|
||||
if severity_rank(str(issue_severity)) < severity_rank(str(target.get("severity"))):
|
||||
target["severity"] = issue_severity
|
||||
if len(target["sample_turns"]) < 5:
|
||||
target["sample_turns"].append(
|
||||
{
|
||||
"session_id": finding.get("session_id"),
|
||||
"turn_index": finding.get("turn_index"),
|
||||
"question": finding.get("question"),
|
||||
"assistant_first_line": finding.get("assistant_first_line"),
|
||||
}
|
||||
)
|
||||
return sorted(
|
||||
grouped.values(),
|
||||
key=lambda item: (severity_rank(str(item.get("severity"))), -int(item.get("occurrences") or 0), str(item.get("issue_code"))),
|
||||
)
|
||||
|
||||
|
||||
def build_run_review(
|
||||
*,
|
||||
run_id: str,
|
||||
session_files: list[Path],
|
||||
report_path: Path,
|
||||
) -> dict[str, Any]:
|
||||
sessions_review: list[dict[str, Any]] = []
|
||||
all_pairs: list[dict[str, Any]] = []
|
||||
all_step_states: list[dict[str, Any]] = []
|
||||
findings: list[dict[str, Any]] = []
|
||||
for session_file in session_files:
|
||||
session = load_session(session_file)
|
||||
conversation = [item for item in session.get("conversation", []) if isinstance(item, dict)]
|
||||
pairs = build_conversation_pairs(conversation)
|
||||
session_step_states: list[dict[str, Any]] = []
|
||||
for pair in pairs:
|
||||
step_state = build_step_state_for_pair(run_id=run_id, session=session, pair=pair)
|
||||
session_step_states.append(step_state)
|
||||
all_step_states.append(step_state)
|
||||
pair_record = {
|
||||
"session_id": session.get("session_id"),
|
||||
"pair_index": pair["pair_index"],
|
||||
"user_text": pair.get("user", {}).get("text"),
|
||||
"assistant_text": (pair.get("assistant") or {}).get("text") if isinstance(pair.get("assistant"), dict) else None,
|
||||
"assistant_reply_type": (pair.get("assistant") or {}).get("reply_type") if isinstance(pair.get("assistant"), dict) else None,
|
||||
"assistant_trace_id": (pair.get("assistant") or {}).get("trace_id") if isinstance(pair.get("assistant"), dict) else None,
|
||||
}
|
||||
all_pairs.append(pair_record)
|
||||
finding = build_finding(step_state, str(session.get("session_id") or ""))
|
||||
if finding is not None:
|
||||
findings.append(finding)
|
||||
sessions_review.append(
|
||||
{
|
||||
"session_file": repo_relative(session_file),
|
||||
"session_id": session.get("session_id"),
|
||||
"conversation_items": len(conversation),
|
||||
"pairs_total": len(pairs),
|
||||
"business_issue_turns": sum(
|
||||
1
|
||||
for item in session_step_states
|
||||
if (item.get("business_first_review") or {}).get("issue_codes")
|
||||
),
|
||||
}
|
||||
)
|
||||
|
||||
issue_counter = Counter(code for finding in findings for code in finding.get("issue_codes", []))
|
||||
severity_counter = Counter(str(finding.get("severity") or "none") for finding in findings)
|
||||
runtime_status_counts = Counter(str(item.get("execution_status") or "unknown") for item in all_step_states)
|
||||
p0_findings = [item for item in findings if item.get("severity") == "P0"]
|
||||
p1_findings = [item for item in findings if item.get("severity") == "P1"]
|
||||
if p0_findings:
|
||||
overall_status = "fail"
|
||||
elif p1_findings:
|
||||
overall_status = "warning"
|
||||
else:
|
||||
overall_status = "pass"
|
||||
|
||||
question_quality = build_question_quality_review(
|
||||
[
|
||||
{
|
||||
"pair_index": item["pair_index"],
|
||||
"user": {"text": item.get("user_text")},
|
||||
}
|
||||
for item in all_pairs
|
||||
]
|
||||
)
|
||||
repair_targets = build_repair_targets(findings)
|
||||
report_metadata = parse_report_metadata(report_path)
|
||||
|
||||
return {
|
||||
"schema_version": RUN_REVIEW_SCHEMA_VERSION,
|
||||
"run_id": run_id,
|
||||
"reviewed_at": now_iso(),
|
||||
"source": {
|
||||
"report_path": repo_relative(report_path) if report_path.exists() else None,
|
||||
"report_metadata": report_metadata,
|
||||
"session_files": [repo_relative(path) for path in session_files],
|
||||
},
|
||||
"summary": {
|
||||
"overall_business_status": overall_status,
|
||||
"sessions_total": len(session_files),
|
||||
"turn_pairs_total": len(all_pairs),
|
||||
"business_issue_turns": len(findings),
|
||||
"p0_findings": len(p0_findings),
|
||||
"p1_findings": len(p1_findings),
|
||||
"issue_counts": dict(sorted(issue_counter.items())),
|
||||
"severity_counts": dict(sorted(severity_counter.items())),
|
||||
"runtime_status_counts": dict(sorted(runtime_status_counts.items())),
|
||||
"question_quality_status": question_quality["status"],
|
||||
"question_quality_score": question_quality["score"],
|
||||
},
|
||||
"sessions": sessions_review,
|
||||
"question_quality_review": question_quality,
|
||||
"findings": findings,
|
||||
"repair_targets": repair_targets,
|
||||
"conversation_pairs": all_pairs,
|
||||
"step_states": all_step_states,
|
||||
}
|
||||
|
||||
|
||||
def build_review_markdown(review: dict[str, Any]) -> str:
|
||||
summary = review.get("summary") if isinstance(review.get("summary"), dict) else {}
|
||||
question_quality = (
|
||||
review.get("question_quality_review")
|
||||
if isinstance(review.get("question_quality_review"), dict)
|
||||
else {}
|
||||
)
|
||||
lines = [
|
||||
"# Assistant Stage 1 GUI Run Review",
|
||||
"",
|
||||
f"- run_id: `{review.get('run_id')}`",
|
||||
f"- overall_business_status: `{summary.get('overall_business_status')}`",
|
||||
f"- turn_pairs_total: `{summary.get('turn_pairs_total')}`",
|
||||
f"- business_issue_turns: `{summary.get('business_issue_turns')}`",
|
||||
f"- p0_findings: `{summary.get('p0_findings')}`",
|
||||
f"- p1_findings: `{summary.get('p1_findings')}`",
|
||||
f"- question_quality: `{summary.get('question_quality_status')}` / `{summary.get('question_quality_score')}`",
|
||||
"",
|
||||
"## Question Quality",
|
||||
"",
|
||||
f"- status: `{question_quality.get('status')}`",
|
||||
f"- score: `{question_quality.get('score')}`",
|
||||
f"- tag_counts: `{json.dumps(question_quality.get('tag_counts') or {}, ensure_ascii=False, sort_keys=True)}`",
|
||||
f"- domain_counts: `{json.dumps(question_quality.get('domain_counts') or {}, ensure_ascii=False, sort_keys=True)}`",
|
||||
f"- weak_flag_counts: `{json.dumps(question_quality.get('weak_flag_counts') or {}, ensure_ascii=False, sort_keys=True)}`",
|
||||
"",
|
||||
"## Business Findings",
|
||||
]
|
||||
findings = review.get("findings") if isinstance(review.get("findings"), list) else []
|
||||
if not findings:
|
||||
lines.append("")
|
||||
lines.append("- no business-first answer quality findings")
|
||||
else:
|
||||
for finding in findings[:80]:
|
||||
lines.extend(
|
||||
[
|
||||
"",
|
||||
f"### Turn {finding.get('turn_index')} - {finding.get('severity')}",
|
||||
"",
|
||||
f"- issue_codes: `{', '.join(str(item) for item in finding.get('issue_codes') or [])}`",
|
||||
f"- root_cause_layers: `{', '.join(str(item) for item in finding.get('suggested_root_cause_layers') or []) or 'n/a'}`",
|
||||
f"- reply_type: `{finding.get('reply_type') or 'n/a'}`",
|
||||
f"- capability_id: `{finding.get('capability_id') or 'n/a'}`",
|
||||
f"- selected_recipe: `{finding.get('selected_recipe') or 'n/a'}`",
|
||||
f"- question: {compact_preview(finding.get('question'), 500)}",
|
||||
f"- assistant_first_line: {compact_preview(finding.get('assistant_first_line'), 500) or 'n/a'}",
|
||||
]
|
||||
)
|
||||
lines.extend(["", "## Repair Targets"])
|
||||
repair_targets = review.get("repair_targets") if isinstance(review.get("repair_targets"), list) else []
|
||||
if not repair_targets:
|
||||
lines.append("")
|
||||
lines.append("- no repair targets")
|
||||
else:
|
||||
for target in repair_targets[:30]:
|
||||
lines.append(
|
||||
f"- `{target.get('severity')}` `{target.get('problem_layer')}` / `{target.get('issue_code')}`: "
|
||||
f"{target.get('occurrences')} occurrence(s)"
|
||||
)
|
||||
return "\n".join(lines).strip() + "\n"
|
||||
|
||||
|
||||
def save_run_review(review: dict[str, Any], output_dir: Path) -> None:
|
||||
write_json(output_dir / "run_review.json", review)
|
||||
write_text(output_dir / "run_review.md", build_review_markdown(review))
|
||||
write_json(output_dir / "conversation_pairs.json", review.get("conversation_pairs") or [])
|
||||
write_json(output_dir / "question_quality_review.json", review.get("question_quality_review") or {})
|
||||
write_json(output_dir / "repair_targets.json", review.get("repair_targets") or [])
|
||||
|
||||
|
||||
def build_parser() -> argparse.ArgumentParser:
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Review a GUI assistant_stage1 saved-session run by assistant-stage1-* id."
|
||||
)
|
||||
parser.add_argument("run_id", help="Run id or text containing assistant-stage1-...")
|
||||
parser.add_argument("--session-file", type=Path, default=None, help="Explicit assistant session JSON file.")
|
||||
parser.add_argument("--sessions-dir", type=Path, default=DEFAULT_SESSIONS_DIR)
|
||||
parser.add_argument("--reports-dir", type=Path, default=DEFAULT_REPORTS_DIR)
|
||||
parser.add_argument("--output-root", type=Path, default=DEFAULT_OUTPUT_ROOT)
|
||||
parser.add_argument("--output-dir", type=Path, default=None)
|
||||
parser.add_argument("--print-summary", action="store_true")
|
||||
return parser
|
||||
|
||||
|
||||
def main(argv: list[str] | None = None) -> int:
|
||||
args = build_parser().parse_args(argv)
|
||||
run_id = run_id_from_value(args.run_id)
|
||||
report_path = args.reports_dir / f"{run_id}.md"
|
||||
session_files = resolve_session_files(
|
||||
run_id=run_id,
|
||||
sessions_dir=args.sessions_dir,
|
||||
explicit_session_file=args.session_file,
|
||||
)
|
||||
review = build_run_review(run_id=run_id, session_files=session_files, report_path=report_path)
|
||||
output_dir = args.output_dir or (args.output_root / run_id)
|
||||
save_run_review(review, output_dir)
|
||||
if args.print_summary:
|
||||
summary = review["summary"]
|
||||
print(
|
||||
json.dumps(
|
||||
{
|
||||
"run_id": run_id,
|
||||
"output_dir": repo_relative(output_dir),
|
||||
"overall_business_status": summary["overall_business_status"],
|
||||
"turn_pairs_total": summary["turn_pairs_total"],
|
||||
"business_issue_turns": summary["business_issue_turns"],
|
||||
"question_quality_score": summary["question_quality_score"],
|
||||
},
|
||||
ensure_ascii=False,
|
||||
indent=2,
|
||||
)
|
||||
)
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main(sys.argv[1:]))
|
||||
|
|
@ -14,6 +14,7 @@ REPO_ROOT = Path(__file__).resolve().parents[1]
|
|||
HISTORY_FILE = REPO_ROOT / "llm_normalizer" / "data" / "autorun_generators" / "history.json"
|
||||
SAVED_SESSIONS_DIR = REPO_ROOT / "llm_normalizer" / "data" / "autorun_generators" / "saved_sessions"
|
||||
EVAL_CASES_DIR = REPO_ROOT / "llm_normalizer" / "data" / "eval_cases"
|
||||
VALIDATED_AGENT_SAVE_SCHEMA_VERSION = "agent_semantic_save_gate_v1"
|
||||
|
||||
|
||||
def now_utc() -> datetime:
|
||||
|
|
@ -54,6 +55,188 @@ def write_json(path: Path, payload: Any) -> None:
|
|||
path.write_text(json.dumps(payload, ensure_ascii=False, indent=2) + "\n", encoding="utf-8")
|
||||
|
||||
|
||||
def resolve_repo_path(raw_path: str | Path) -> Path:
|
||||
path = Path(raw_path)
|
||||
return path if path.is_absolute() else (REPO_ROOT / path).resolve()
|
||||
|
||||
|
||||
def repo_relative(path: Path) -> str:
|
||||
try:
|
||||
return str(path.resolve().relative_to(REPO_ROOT))
|
||||
except ValueError:
|
||||
return str(path.resolve())
|
||||
|
||||
|
||||
def load_json_object(path: Path, label: str) -> dict[str, Any]:
|
||||
if not path.exists():
|
||||
raise RuntimeError(f"{label} not found: {path}")
|
||||
parsed = load_json(path)
|
||||
if not isinstance(parsed, dict):
|
||||
raise RuntimeError(f"{label} must be a JSON object: {path}")
|
||||
return parsed
|
||||
|
||||
|
||||
def assert_status(value: Any, expected: str, label: str, problems: list[str]) -> None:
|
||||
actual = str(value or "").strip().lower()
|
||||
if actual != expected:
|
||||
problems.append(f"{label}={actual or 'missing'}")
|
||||
|
||||
|
||||
def validate_truth_harness_run_dir(run_dir: Path) -> dict[str, Any]:
|
||||
run_dir = run_dir.resolve()
|
||||
pack_state = load_json_object(run_dir / "pack_state.json", "Validated run pack_state.json")
|
||||
truth_review = load_json_object(run_dir / "truth_review.json", "Validated run truth_review.json")
|
||||
business_review = load_json_object(run_dir / "business_review.json", "Validated run business_review.json")
|
||||
truth_summary = truth_review.get("summary") if isinstance(truth_review.get("summary"), dict) else {}
|
||||
|
||||
problems: list[str] = []
|
||||
assert_status(pack_state.get("final_status"), "accepted", "pack_state.final_status", problems)
|
||||
assert_status(pack_state.get("review_overall_status"), "pass", "pack_state.review_overall_status", problems)
|
||||
assert_status(truth_summary.get("overall_status"), "pass", "truth_review.summary.overall_status", problems)
|
||||
assert_status(business_review.get("overall_business_status"), "pass", "business_review.overall_business_status", problems)
|
||||
if pack_state.get("acceptance_gate_passed") is not True:
|
||||
problems.append("pack_state.acceptance_gate_passed=false")
|
||||
if pack_state.get("no_unresolved_p0") is not True:
|
||||
problems.append("pack_state.no_unresolved_p0=false")
|
||||
if int(pack_state.get("unresolved_p0_count") or 0) != 0:
|
||||
problems.append(f"pack_state.unresolved_p0_count={pack_state.get('unresolved_p0_count')}")
|
||||
if int(business_review.get("steps_with_business_failures") or 0) != 0:
|
||||
problems.append(f"business_review.steps_with_business_failures={business_review.get('steps_with_business_failures')}")
|
||||
|
||||
if problems:
|
||||
raise RuntimeError(
|
||||
"Refusing to save AGENT autorun because the validated run is not clean: "
|
||||
+ ", ".join(problems)
|
||||
)
|
||||
|
||||
return {
|
||||
"schema_version": VALIDATED_AGENT_SAVE_SCHEMA_VERSION,
|
||||
"validation_status": "accepted_live_replay",
|
||||
"validated_run_dir": repo_relative(run_dir),
|
||||
"final_status": pack_state.get("final_status"),
|
||||
"review_overall_status": pack_state.get("review_overall_status"),
|
||||
"business_overall_status": business_review.get("overall_business_status"),
|
||||
"steps_total": pack_state.get("steps_total"),
|
||||
"steps_passed": pack_state.get("steps_passed"),
|
||||
"steps_failed": pack_state.get("steps_failed"),
|
||||
"steps_with_business_failures": business_review.get("steps_with_business_failures"),
|
||||
"steps_with_business_warnings": business_review.get("steps_with_business_warnings"),
|
||||
"acceptance_gate_passed": pack_state.get("acceptance_gate_passed"),
|
||||
"saved_after_validated_replay": True,
|
||||
}
|
||||
|
||||
|
||||
def validate_domain_pack_loop_dir(loop_dir: Path) -> dict[str, Any]:
|
||||
loop_dir = loop_dir.resolve()
|
||||
loop_state = load_json_object(loop_dir / "loop_state.json", "Validated loop_state.json")
|
||||
iterations = loop_state.get("iterations")
|
||||
if not isinstance(iterations, list) or not iterations:
|
||||
raise RuntimeError("Refusing to save AGENT autorun because the validated loop has no iterations")
|
||||
accepted_iterations = [
|
||||
item for item in iterations if isinstance(item, dict) and bool(item.get("accepted_gate"))
|
||||
]
|
||||
last_iteration = accepted_iterations[-1] if accepted_iterations else iterations[-1]
|
||||
if not isinstance(last_iteration, dict):
|
||||
raise RuntimeError("Refusing to save AGENT autorun because the validated loop iteration is invalid")
|
||||
|
||||
analyst_path_raw = str(last_iteration.get("analyst_verdict_path") or "").strip()
|
||||
repair_targets_path_raw = str(last_iteration.get("repair_targets_path") or "").strip()
|
||||
analyst_verdict = load_json_object(resolve_repo_path(analyst_path_raw), "Validated loop analyst_verdict.json")
|
||||
repair_targets = load_json_object(resolve_repo_path(repair_targets_path_raw), "Validated loop repair_targets.json")
|
||||
severity_counts = repair_targets.get("severity_counts") if isinstance(repair_targets.get("severity_counts"), dict) else {}
|
||||
|
||||
problems: list[str] = []
|
||||
assert_status(loop_state.get("final_status"), "accepted", "loop_state.final_status", problems)
|
||||
if last_iteration.get("accepted_gate") is not True:
|
||||
problems.append("last_iteration.accepted_gate=false")
|
||||
if last_iteration.get("analyst_accepted_gate") is not True:
|
||||
problems.append("last_iteration.analyst_accepted_gate=false")
|
||||
if last_iteration.get("deterministic_gate_ok") is not True:
|
||||
problems.append("last_iteration.deterministic_gate_ok=false")
|
||||
if int(last_iteration.get("quality_score") or 0) < int(loop_state.get("target_score") or 80):
|
||||
problems.append(
|
||||
f"last_iteration.quality_score={last_iteration.get('quality_score')}<target_score={loop_state.get('target_score')}"
|
||||
)
|
||||
assert_status(analyst_verdict.get("loop_decision"), "accepted", "analyst_verdict.loop_decision", problems)
|
||||
if int(analyst_verdict.get("unresolved_p0_count") or 0) != 0:
|
||||
problems.append(f"analyst_verdict.unresolved_p0_count={analyst_verdict.get('unresolved_p0_count')}")
|
||||
if bool(analyst_verdict.get("regression_detected")):
|
||||
problems.append("analyst_verdict.regression_detected=true")
|
||||
for field_name in (
|
||||
"direct_answer_ok",
|
||||
"business_usefulness_ok",
|
||||
"temporal_honesty_ok",
|
||||
"field_truth_ok",
|
||||
"answer_layering_ok",
|
||||
):
|
||||
if analyst_verdict.get(field_name) is not True:
|
||||
problems.append(f"analyst_verdict.{field_name}=false")
|
||||
if int(severity_counts.get("P0") or 0) != 0 or int(severity_counts.get("P1") or 0) != 0:
|
||||
problems.append(
|
||||
f"repair_targets.severity_counts=P0:{severity_counts.get('P0') or 0},P1:{severity_counts.get('P1') or 0}"
|
||||
)
|
||||
|
||||
if problems:
|
||||
raise RuntimeError(
|
||||
"Refusing to save AGENT autorun because the validated stage/domain loop is not clean: "
|
||||
+ ", ".join(problems)
|
||||
)
|
||||
|
||||
return {
|
||||
"schema_version": VALIDATED_AGENT_SAVE_SCHEMA_VERSION,
|
||||
"validation_status": "accepted_domain_pack_loop",
|
||||
"validated_run_dir": repo_relative(loop_dir),
|
||||
"final_status": loop_state.get("final_status"),
|
||||
"loop_id": loop_state.get("loop_id"),
|
||||
"target_score": loop_state.get("target_score"),
|
||||
"iterations_ran": len(iterations),
|
||||
"quality_score": last_iteration.get("quality_score"),
|
||||
"repair_target_count": last_iteration.get("repair_target_count"),
|
||||
"repair_target_severity_counts": last_iteration.get("repair_target_severity_counts"),
|
||||
"accepted_gate": last_iteration.get("accepted_gate"),
|
||||
"saved_after_validated_replay": True,
|
||||
}
|
||||
|
||||
|
||||
def validate_accepted_run_dir(run_dir: Path) -> dict[str, Any]:
|
||||
run_dir = run_dir.resolve()
|
||||
if (run_dir / "loop_state.json").exists():
|
||||
return validate_domain_pack_loop_dir(run_dir)
|
||||
return validate_truth_harness_run_dir(run_dir)
|
||||
|
||||
|
||||
def build_save_gate_metadata(args: argparse.Namespace, spec: dict[str, Any], spec_path: Path) -> dict[str, Any]:
|
||||
raw_run_dir = args.validated_run_dir or spec.get("validated_run_dir") or spec.get("validated_artifact_dir")
|
||||
if raw_run_dir:
|
||||
return validate_accepted_run_dir(resolve_repo_path(str(raw_run_dir)))
|
||||
|
||||
if args.dry_run:
|
||||
return {
|
||||
"schema_version": VALIDATED_AGENT_SAVE_SCHEMA_VERSION,
|
||||
"validation_status": "dry_run_unvalidated",
|
||||
"source_spec_file": repo_relative(spec_path),
|
||||
"saved_after_validated_replay": False,
|
||||
}
|
||||
|
||||
if args.allow_unvalidated:
|
||||
reason = str(args.unvalidated_reason or "").strip()
|
||||
if not reason:
|
||||
raise RuntimeError("--unvalidated-reason is required when --allow-unvalidated is used")
|
||||
return {
|
||||
"schema_version": VALIDATED_AGENT_SAVE_SCHEMA_VERSION,
|
||||
"validation_status": "explicitly_unvalidated",
|
||||
"source_spec_file": repo_relative(spec_path),
|
||||
"unvalidated_reason": reason,
|
||||
"saved_after_validated_replay": False,
|
||||
}
|
||||
|
||||
raise RuntimeError(
|
||||
"Refusing to save AGENT autorun before a reviewed live replay. "
|
||||
"Pass --validated-run-dir artifacts/domain_runs/<run_id> after run-live/review-export is accepted, "
|
||||
"or use --allow-unvalidated --unvalidated-reason only for an explicit draft."
|
||||
)
|
||||
|
||||
|
||||
def normalize_questions(raw_questions: list[Any]) -> list[str]:
|
||||
result: list[str] = []
|
||||
seen: set[str] = set()
|
||||
|
|
@ -90,9 +273,31 @@ def extract_questions_from_spec(spec: dict[str, Any]) -> list[str]:
|
|||
steps = spec.get("steps")
|
||||
if isinstance(steps, list):
|
||||
return normalize_questions(
|
||||
[step.get("question") for step in steps if isinstance(step, dict) and step.get("question")]
|
||||
[
|
||||
step.get("question") or step.get("question_template")
|
||||
for step in steps
|
||||
if isinstance(step, dict) and (step.get("question") or step.get("question_template"))
|
||||
]
|
||||
)
|
||||
raise RuntimeError("Spec must define either `questions[]` or `steps[].question`")
|
||||
scenarios = spec.get("scenarios")
|
||||
if isinstance(scenarios, list):
|
||||
raw_questions: list[Any] = []
|
||||
for scenario in scenarios:
|
||||
if not isinstance(scenario, dict):
|
||||
continue
|
||||
scenario_steps = scenario.get("steps")
|
||||
if not isinstance(scenario_steps, list):
|
||||
continue
|
||||
raw_questions.extend(
|
||||
step.get("question") or step.get("question_template")
|
||||
for step in scenario_steps
|
||||
if isinstance(step, dict) and (step.get("question") or step.get("question_template"))
|
||||
)
|
||||
return normalize_questions(raw_questions)
|
||||
raise RuntimeError(
|
||||
"Spec must define `questions[]`, `steps[].question`, `steps[].question_template`, "
|
||||
"or `scenarios[].steps[]` questions"
|
||||
)
|
||||
|
||||
|
||||
def build_case_set_payload(
|
||||
|
|
@ -203,6 +408,9 @@ def build_history_record(
|
|||
"source_spec_file": metadata.get("source_spec_file"),
|
||||
"scenario_id": metadata.get("scenario_id"),
|
||||
"semantic_tags": metadata.get("semantic_tags"),
|
||||
"validation_status": metadata.get("validation_status"),
|
||||
"validated_run_dir": metadata.get("validated_run_dir"),
|
||||
"saved_after_validated_replay": metadata.get("saved_after_validated_replay"),
|
||||
}
|
||||
return {
|
||||
"generation_id": generation_id,
|
||||
|
|
@ -218,7 +426,12 @@ def build_history_record(
|
|||
}
|
||||
|
||||
|
||||
def build_metadata(args: argparse.Namespace, spec: dict[str, Any], spec_path: Path | None) -> dict[str, Any]:
|
||||
def build_metadata(
|
||||
args: argparse.Namespace,
|
||||
spec: dict[str, Any],
|
||||
spec_path: Path | None,
|
||||
save_gate: dict[str, Any],
|
||||
) -> dict[str, Any]:
|
||||
semantic_tags = extract_semantic_tags(spec)
|
||||
return {
|
||||
"assistant_prompt_version": args.assistant_prompt_version,
|
||||
|
|
@ -229,6 +442,10 @@ def build_metadata(args: argparse.Namespace, spec: dict[str, Any], spec_path: Pa
|
|||
"source_spec_file": str(spec_path.resolve()) if spec_path else None,
|
||||
"scenario_id": str(spec.get("scenario_id") or "").strip() or None,
|
||||
"semantic_tags": semantic_tags,
|
||||
"validation_status": save_gate.get("validation_status"),
|
||||
"validated_run_dir": save_gate.get("validated_run_dir"),
|
||||
"saved_after_validated_replay": save_gate.get("saved_after_validated_replay"),
|
||||
"save_gate": save_gate,
|
||||
}
|
||||
|
||||
|
||||
|
|
@ -242,6 +459,19 @@ def parse_args() -> argparse.Namespace:
|
|||
parser.add_argument("--assistant-prompt-version", help="Optional assistant prompt version metadata.")
|
||||
parser.add_argument("--decomposition-prompt-version", help="Optional decomposition prompt version metadata.")
|
||||
parser.add_argument("--prompt-fingerprint", help="Optional prompt fingerprint metadata.")
|
||||
parser.add_argument(
|
||||
"--validated-run-dir",
|
||||
help="Accepted truth-harness artifact directory containing pack_state.json, truth_review.json, and business_review.json.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--allow-unvalidated",
|
||||
action="store_true",
|
||||
help="Explicitly save a draft AGENT run without accepted replay artifacts. This is not an acceptance proof.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--unvalidated-reason",
|
||||
help="Required explanation when --allow-unvalidated is used.",
|
||||
)
|
||||
parser.add_argument("--dry-run", action="store_true", help="Print resulting record metadata without writing files.")
|
||||
return parser.parse_args()
|
||||
|
||||
|
|
@ -262,10 +492,11 @@ def main() -> int:
|
|||
if not questions:
|
||||
raise RuntimeError("Agent semantic run must contain at least one question")
|
||||
|
||||
save_gate = build_save_gate_metadata(args, spec_raw, spec_path)
|
||||
domain = str(spec_raw.get("domain") or "").strip() or None
|
||||
source_title = str(args.title or spec_raw.get("title") or spec_path.stem).strip()
|
||||
title = ensure_agent_title(source_title)
|
||||
metadata = build_metadata(args, spec_raw, spec_path)
|
||||
metadata = build_metadata(args, spec_raw, spec_path, save_gate)
|
||||
|
||||
timestamp = now_utc()
|
||||
generation_id = generate_id(timestamp)
|
||||
|
|
@ -307,6 +538,8 @@ def main() -> int:
|
|||
"case_set_file": case_set_file,
|
||||
"saved_session_file": saved_session_file,
|
||||
"domain": domain,
|
||||
"validation_status": save_gate.get("validation_status"),
|
||||
"validated_run_dir": save_gate.get("validated_run_dir"),
|
||||
},
|
||||
ensure_ascii=False,
|
||||
indent=2,
|
||||
|
|
@ -329,6 +562,8 @@ def main() -> int:
|
|||
"questions_total": len(questions),
|
||||
"case_set_file": case_set_file,
|
||||
"saved_session_file": saved_session_file,
|
||||
"validation_status": save_gate.get("validation_status"),
|
||||
"validated_run_dir": save_gate.get("validated_run_dir"),
|
||||
},
|
||||
ensure_ascii=False,
|
||||
indent=2,
|
||||
|
|
|
|||
|
|
@ -0,0 +1,406 @@
|
|||
#!/usr/bin/env python3
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import re
|
||||
import subprocess
|
||||
import sys
|
||||
from datetime import datetime, timezone
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
|
||||
REPO_ROOT = Path(__file__).resolve().parents[1]
|
||||
DEFAULT_STAGE_OUTPUT_ROOT = REPO_ROOT / "artifacts" / "domain_runs" / "stage_agent_loops"
|
||||
STAGE_LOOP_SCHEMA_VERSION = "stage_agent_loop_manifest_v1"
|
||||
STAGE_SUMMARY_SCHEMA_VERSION = "stage_agent_loop_summary_v1"
|
||||
|
||||
|
||||
def now_iso() -> str:
|
||||
return datetime.now(timezone.utc).replace(microsecond=0).isoformat()
|
||||
|
||||
|
||||
def slugify(value: str, fallback: str = "stage_agent_loop") -> str:
|
||||
normalized = re.sub(r"[^a-zA-Z0-9_.-]+", "_", str(value or "").strip()).strip("_.-")
|
||||
return normalized or fallback
|
||||
|
||||
|
||||
def load_json(path: Path) -> Any:
|
||||
return json.loads(path.read_text(encoding="utf-8"))
|
||||
|
||||
|
||||
def load_json_object(path: Path, label: str) -> dict[str, Any]:
|
||||
if not path.exists():
|
||||
raise RuntimeError(f"{label} not found: {path}")
|
||||
parsed = load_json(path)
|
||||
if not isinstance(parsed, dict):
|
||||
raise RuntimeError(f"{label} must be a JSON object: {path}")
|
||||
return parsed
|
||||
|
||||
|
||||
def write_text(path: Path, text: str) -> None:
|
||||
path.parent.mkdir(parents=True, exist_ok=True)
|
||||
path.write_text(text, encoding="utf-8")
|
||||
|
||||
|
||||
def write_json(path: Path, payload: Any) -> None:
|
||||
write_text(path, json.dumps(payload, ensure_ascii=False, indent=2) + "\n")
|
||||
|
||||
|
||||
def repo_path(raw_path: str | Path) -> Path:
|
||||
path = Path(raw_path)
|
||||
return path if path.is_absolute() else (REPO_ROOT / path).resolve()
|
||||
|
||||
|
||||
def repo_relative(path: Path) -> str:
|
||||
try:
|
||||
return str(path.resolve().relative_to(REPO_ROOT))
|
||||
except ValueError:
|
||||
return str(path.resolve())
|
||||
|
||||
|
||||
def string_list(value: Any) -> list[str]:
|
||||
if not isinstance(value, list):
|
||||
return []
|
||||
result: list[str] = []
|
||||
for item in value:
|
||||
text = str(item or "").strip()
|
||||
if text:
|
||||
result.append(text)
|
||||
return result
|
||||
|
||||
|
||||
def load_stage_manifest(path: Path) -> dict[str, Any]:
|
||||
raw = load_json_object(path, "Stage agent loop manifest")
|
||||
stage_id = slugify(str(raw.get("stage_id") or path.stem), path.stem)
|
||||
pack_manifest = str(raw.get("pack_manifest") or "").strip()
|
||||
if not pack_manifest:
|
||||
raise RuntimeError("Stage manifest must define `pack_manifest` for the autonomous stage loop")
|
||||
target_score = int(raw.get("target_score") or 88)
|
||||
max_iterations = int(raw.get("max_iterations") or 6)
|
||||
if target_score < 0 or target_score > 100:
|
||||
raise RuntimeError("Stage manifest `target_score` must be between 0 and 100")
|
||||
if max_iterations < 1:
|
||||
raise RuntimeError("Stage manifest `max_iterations` must be >= 1")
|
||||
return {
|
||||
**raw,
|
||||
"schema_version": str(raw.get("schema_version") or STAGE_LOOP_SCHEMA_VERSION),
|
||||
"stage_id": stage_id,
|
||||
"module_name": str(raw.get("module_name") or raw.get("domain") or "unknown_module").strip(),
|
||||
"title": str(raw.get("title") or stage_id).strip(),
|
||||
"pack_manifest": pack_manifest,
|
||||
"target_score": target_score,
|
||||
"max_iterations": max_iterations,
|
||||
"global_plan_refs": string_list(raw.get("global_plan_refs")),
|
||||
"acceptance_invariants": string_list(raw.get("acceptance_invariants")),
|
||||
"save_autorun_on_accept": bool(raw.get("save_autorun_on_accept", True)),
|
||||
"manual_confirmation_required_after_accept": bool(raw.get("manual_confirmation_required_after_accept", True)),
|
||||
}
|
||||
|
||||
|
||||
def stage_dir_for(output_root: Path, stage_id: str) -> Path:
|
||||
return output_root.resolve() / slugify(stage_id)
|
||||
|
||||
|
||||
def stage_loop_dir(stage_dir: Path, stage_manifest: dict[str, Any]) -> Path:
|
||||
loop_id = str(stage_manifest.get("loop_id") or stage_manifest["stage_id"]).strip()
|
||||
return stage_dir / "domain_loops" / slugify(loop_id)
|
||||
|
||||
|
||||
def build_domain_pack_loop_command(args: argparse.Namespace, stage_manifest: dict[str, Any], stage_dir: Path) -> list[str]:
|
||||
loop_id = str(stage_manifest.get("loop_id") or stage_manifest["stage_id"]).strip()
|
||||
command = [
|
||||
sys.executable,
|
||||
str(REPO_ROOT / "scripts" / "domain_case_loop.py"),
|
||||
"run-pack-loop",
|
||||
"--manifest",
|
||||
str(repo_path(stage_manifest["pack_manifest"])),
|
||||
"--loop-id",
|
||||
loop_id,
|
||||
"--output-root",
|
||||
str(stage_dir / "domain_loops"),
|
||||
"--target-score",
|
||||
str(int(stage_manifest["target_score"])),
|
||||
"--max-iterations",
|
||||
str(int(stage_manifest["max_iterations"])),
|
||||
"--backend-url",
|
||||
str(args.backend_url),
|
||||
"--prompt-version",
|
||||
str(args.prompt_version),
|
||||
"--llm-provider",
|
||||
str(args.llm_provider),
|
||||
"--llm-model",
|
||||
str(args.llm_model),
|
||||
"--llm-base-url",
|
||||
str(args.llm_base_url),
|
||||
"--llm-api-key",
|
||||
str(args.llm_api_key),
|
||||
"--temperature",
|
||||
str(args.temperature),
|
||||
"--max-output-tokens",
|
||||
str(args.max_output_tokens),
|
||||
"--timeout-seconds",
|
||||
str(args.timeout_seconds),
|
||||
"--codex-binary",
|
||||
str(args.codex_binary),
|
||||
"--analyst-codex-model",
|
||||
str(args.analyst_codex_model),
|
||||
"--coder-codex-model",
|
||||
str(args.coder_codex_model),
|
||||
"--analyst-reasoning-effort",
|
||||
str(args.analyst_reasoning_effort),
|
||||
"--coder-reasoning-effort",
|
||||
str(args.coder_reasoning_effort),
|
||||
"--codex-timeout-seconds",
|
||||
str(args.codex_timeout_seconds),
|
||||
]
|
||||
if args.codex_profile:
|
||||
command.extend(["--codex-profile", str(args.codex_profile)])
|
||||
if args.codex_model:
|
||||
command.extend(["--codex-model", str(args.codex_model)])
|
||||
if args.analysis_date:
|
||||
command.extend(["--analysis-date", str(args.analysis_date)])
|
||||
if args.max_scenarios is not None:
|
||||
command.extend(["--max-scenarios", str(int(args.max_scenarios))])
|
||||
if args.use_mock:
|
||||
command.append("--use-mock")
|
||||
return command
|
||||
|
||||
|
||||
def run_command(command: list[str], cwd: Path, stdout_path: Path, stderr_path: Path, timeout_seconds: int) -> None:
|
||||
result = subprocess.run(
|
||||
command,
|
||||
cwd=str(cwd),
|
||||
text=True,
|
||||
encoding="utf-8",
|
||||
errors="replace",
|
||||
capture_output=True,
|
||||
timeout=timeout_seconds,
|
||||
check=False,
|
||||
)
|
||||
write_text(stdout_path, result.stdout)
|
||||
write_text(stderr_path, result.stderr)
|
||||
if result.returncode != 0:
|
||||
raise RuntimeError(f"Command failed with exit code {result.returncode}: {' '.join(command)}")
|
||||
|
||||
|
||||
def build_stage_summary(stage_manifest: dict[str, Any], loop_dir: Path) -> dict[str, Any]:
|
||||
loop_state = load_json_object(loop_dir / "loop_state.json", "Stage domain loop_state.json")
|
||||
iterations = loop_state.get("iterations") if isinstance(loop_state.get("iterations"), list) else []
|
||||
last_iteration = iterations[-1] if iterations and isinstance(iterations[-1], dict) else {}
|
||||
final_status = str(loop_state.get("final_status") or "unknown").strip()
|
||||
accepted = final_status == "accepted" and bool(last_iteration.get("accepted_gate"))
|
||||
manual_confirmation_required = bool(stage_manifest.get("manual_confirmation_required_after_accept", True)) and accepted
|
||||
if accepted and manual_confirmation_required:
|
||||
next_action = "manual_gui_confirmation"
|
||||
elif accepted:
|
||||
next_action = "stage_closed_without_manual_confirmation"
|
||||
elif bool(loop_state.get("last_user_decision_prompt")):
|
||||
next_action = "user_decision_required"
|
||||
else:
|
||||
next_action = "continue_autonomous_or_fix_blocker"
|
||||
return {
|
||||
"schema_version": STAGE_SUMMARY_SCHEMA_VERSION,
|
||||
"stage_id": stage_manifest["stage_id"],
|
||||
"module_name": stage_manifest.get("module_name"),
|
||||
"title": stage_manifest.get("title"),
|
||||
"global_plan_refs": stage_manifest.get("global_plan_refs") or [],
|
||||
"target_score": stage_manifest.get("target_score"),
|
||||
"acceptance_invariants": stage_manifest.get("acceptance_invariants") or [],
|
||||
"loop_dir": repo_relative(loop_dir),
|
||||
"loop_final_status": final_status,
|
||||
"stop_reason": loop_state.get("stop_reason"),
|
||||
"iterations_ran": len(iterations),
|
||||
"last_quality_score": last_iteration.get("quality_score"),
|
||||
"last_analyst_decision": last_iteration.get("loop_decision") or loop_state.get("last_analyst_decision"),
|
||||
"last_deterministic_gate_ok": last_iteration.get("deterministic_gate_ok"),
|
||||
"last_deterministic_gate_reason": last_iteration.get("deterministic_gate_reason"),
|
||||
"accepted_gate": bool(last_iteration.get("accepted_gate")),
|
||||
"manual_confirmation_required": manual_confirmation_required,
|
||||
"next_action": next_action,
|
||||
"save_autorun_on_accept": bool(stage_manifest.get("save_autorun_on_accept", True)),
|
||||
"updated_at": now_iso(),
|
||||
}
|
||||
|
||||
|
||||
def build_stage_handoff_markdown(summary: dict[str, Any]) -> str:
|
||||
lines = [
|
||||
"# Stage agent loop handoff",
|
||||
"",
|
||||
f"- stage_id: `{summary.get('stage_id')}`",
|
||||
f"- module_name: `{summary.get('module_name')}`",
|
||||
f"- title: {summary.get('title')}",
|
||||
f"- loop_final_status: `{summary.get('loop_final_status')}`",
|
||||
f"- target_score: `{summary.get('target_score')}`",
|
||||
f"- iterations_ran: `{summary.get('iterations_ran')}`",
|
||||
f"- last_quality_score: `{summary.get('last_quality_score')}`",
|
||||
f"- accepted_gate: `{summary.get('accepted_gate')}`",
|
||||
f"- deterministic_gate_ok: `{summary.get('last_deterministic_gate_ok')}`",
|
||||
f"- deterministic_gate_reason: `{summary.get('last_deterministic_gate_reason') or 'n/a'}`",
|
||||
f"- manual_confirmation_required: `{summary.get('manual_confirmation_required')}`",
|
||||
f"- next_action: `{summary.get('next_action')}`",
|
||||
f"- loop_dir: `{summary.get('loop_dir')}`",
|
||||
f"- stop_reason: {summary.get('stop_reason') or 'n/a'}",
|
||||
"",
|
||||
"## Plan refs",
|
||||
]
|
||||
refs = summary.get("global_plan_refs") or []
|
||||
lines.extend([f"- {item}" for item in refs] if refs else ["- none"])
|
||||
lines.extend(["", "## Acceptance invariants"])
|
||||
invariants = summary.get("acceptance_invariants") or []
|
||||
lines.extend([f"- {item}" for item in invariants] if invariants else ["- domain loop gate + analyst verdict"])
|
||||
return "\n".join(lines).strip() + "\n"
|
||||
|
||||
|
||||
def save_stage_summary(stage_dir: Path, summary: dict[str, Any]) -> None:
|
||||
write_json(stage_dir / "stage_loop_summary.json", summary)
|
||||
write_text(stage_dir / "stage_loop_handoff.md", build_stage_handoff_markdown(summary))
|
||||
|
||||
|
||||
def build_save_autorun_command(args: argparse.Namespace, stage_manifest: dict[str, Any], loop_dir: Path) -> list[str]:
|
||||
return [
|
||||
sys.executable,
|
||||
str(REPO_ROOT / "scripts" / "save_agent_semantic_run.py"),
|
||||
"--spec",
|
||||
str(repo_path(stage_manifest["pack_manifest"])),
|
||||
"--validated-run-dir",
|
||||
str(loop_dir),
|
||||
"--title",
|
||||
f"AGENT | {stage_manifest.get('title') or stage_manifest['stage_id']}",
|
||||
"--architecture-phase",
|
||||
str(stage_manifest.get("architecture_phase") or stage_manifest.get("module_name") or "stage_agent_loop"),
|
||||
"--agent-focus",
|
||||
str(stage_manifest.get("agent_focus") or stage_manifest.get("title") or stage_manifest["stage_id"]),
|
||||
]
|
||||
|
||||
|
||||
def handle_plan(args: argparse.Namespace) -> int:
|
||||
stage_manifest_path = repo_path(args.manifest)
|
||||
stage_manifest = load_stage_manifest(stage_manifest_path)
|
||||
stage_dir = stage_dir_for(repo_path(args.output_root), stage_manifest["stage_id"])
|
||||
command = build_domain_pack_loop_command(args, stage_manifest, stage_dir)
|
||||
payload = {
|
||||
"schema_version": STAGE_SUMMARY_SCHEMA_VERSION,
|
||||
"stage_manifest": repo_relative(stage_manifest_path),
|
||||
"stage_id": stage_manifest["stage_id"],
|
||||
"stage_dir": repo_relative(stage_dir),
|
||||
"loop_dir": repo_relative(stage_loop_dir(stage_dir, stage_manifest)),
|
||||
"domain_pack_loop_command": command,
|
||||
}
|
||||
print(json.dumps(payload, ensure_ascii=False, indent=2))
|
||||
return 0
|
||||
|
||||
|
||||
def handle_summarize(args: argparse.Namespace) -> int:
|
||||
stage_manifest_path = repo_path(args.manifest)
|
||||
stage_manifest = load_stage_manifest(stage_manifest_path)
|
||||
stage_dir = stage_dir_for(repo_path(args.output_root), stage_manifest["stage_id"])
|
||||
loop_dir = repo_path(args.loop_dir) if args.loop_dir else stage_loop_dir(stage_dir, stage_manifest)
|
||||
summary = build_stage_summary(stage_manifest, loop_dir)
|
||||
save_stage_summary(stage_dir, summary)
|
||||
print(json.dumps(summary, ensure_ascii=False, indent=2))
|
||||
return 0
|
||||
|
||||
|
||||
def handle_run(args: argparse.Namespace) -> int:
|
||||
stage_manifest_path = repo_path(args.manifest)
|
||||
stage_manifest = load_stage_manifest(stage_manifest_path)
|
||||
stage_dir = stage_dir_for(repo_path(args.output_root), stage_manifest["stage_id"])
|
||||
stage_dir.mkdir(parents=True, exist_ok=True)
|
||||
write_json(stage_dir / "stage_manifest.json", stage_manifest)
|
||||
write_text(stage_dir / "stage_manifest_source.txt", repo_relative(stage_manifest_path) + "\n")
|
||||
|
||||
command = build_domain_pack_loop_command(args, stage_manifest, stage_dir)
|
||||
write_text(stage_dir / "domain_pack_loop.command.txt", " ".join(command) + "\n")
|
||||
if args.dry_run:
|
||||
print(json.dumps({"dry_run": True, "command": command}, ensure_ascii=False, indent=2))
|
||||
return 0
|
||||
|
||||
run_command(
|
||||
command,
|
||||
cwd=REPO_ROOT,
|
||||
stdout_path=stage_dir / "domain_pack_loop.stdout.log",
|
||||
stderr_path=stage_dir / "domain_pack_loop.stderr.log",
|
||||
timeout_seconds=max(3600, int(args.codex_timeout_seconds) * max(1, int(stage_manifest["max_iterations"]))),
|
||||
)
|
||||
loop_dir = stage_loop_dir(stage_dir, stage_manifest)
|
||||
summary = build_stage_summary(stage_manifest, loop_dir)
|
||||
save_stage_summary(stage_dir, summary)
|
||||
|
||||
if (
|
||||
summary["loop_final_status"] == "accepted"
|
||||
and bool(stage_manifest.get("save_autorun_on_accept", True))
|
||||
and not args.no_save_autorun
|
||||
):
|
||||
save_command = build_save_autorun_command(args, stage_manifest, loop_dir)
|
||||
write_text(stage_dir / "save_agent_semantic_run.command.txt", " ".join(save_command) + "\n")
|
||||
run_command(
|
||||
save_command,
|
||||
cwd=REPO_ROOT,
|
||||
stdout_path=stage_dir / "save_agent_semantic_run.stdout.log",
|
||||
stderr_path=stage_dir / "save_agent_semantic_run.stderr.log",
|
||||
timeout_seconds=120,
|
||||
)
|
||||
print(json.dumps(summary, ensure_ascii=False, indent=2))
|
||||
return 0
|
||||
|
||||
|
||||
def add_common_args(parser: argparse.ArgumentParser) -> None:
|
||||
parser.add_argument("--manifest", required=True)
|
||||
parser.add_argument("--output-root", default=str(DEFAULT_STAGE_OUTPUT_ROOT))
|
||||
parser.add_argument("--analysis-date")
|
||||
parser.add_argument("--max-scenarios", type=int)
|
||||
parser.add_argument("--backend-url", default="http://127.0.0.1:8787")
|
||||
parser.add_argument("--prompt-version", default="address_query_runtime_v1")
|
||||
parser.add_argument("--llm-provider", default="local", choices=["openai", "local"])
|
||||
parser.add_argument("--llm-model", default="qwen2.5-14b-instruct-1m")
|
||||
parser.add_argument("--llm-base-url", default="http://127.0.0.1:1234/v1")
|
||||
parser.add_argument("--llm-api-key", default="")
|
||||
parser.add_argument("--temperature", type=float, default=0.0)
|
||||
parser.add_argument("--max-output-tokens", type=int, default=2048)
|
||||
parser.add_argument("--timeout-seconds", type=int, default=180)
|
||||
parser.add_argument("--use-mock", action="store_true")
|
||||
parser.add_argument("--codex-binary", default="codex")
|
||||
parser.add_argument("--codex-profile")
|
||||
parser.add_argument("--codex-model")
|
||||
parser.add_argument("--analyst-codex-model", default="gpt-5.4")
|
||||
parser.add_argument("--coder-codex-model", default="gpt-5.4-mini")
|
||||
parser.add_argument("--analyst-reasoning-effort", default="medium")
|
||||
parser.add_argument("--coder-reasoning-effort", default="low")
|
||||
parser.add_argument("--codex-timeout-seconds", type=int, default=1800)
|
||||
|
||||
|
||||
def build_parser() -> argparse.ArgumentParser:
|
||||
parser = argparse.ArgumentParser(description="Stage-level AGENT loop wrapper for NDC_1C development phases.")
|
||||
subparsers = parser.add_subparsers(dest="command", required=True)
|
||||
|
||||
plan_parser = subparsers.add_parser("plan", help="Print the domain pack-loop command for a stage manifest.")
|
||||
add_common_args(plan_parser)
|
||||
plan_parser.set_defaults(func=handle_plan)
|
||||
|
||||
run_parser = subparsers.add_parser("run", help="Run stage pack-loop, summarize, and optionally save accepted autorun.")
|
||||
add_common_args(run_parser)
|
||||
run_parser.add_argument("--dry-run", action="store_true")
|
||||
run_parser.add_argument("--no-save-autorun", action="store_true")
|
||||
run_parser.set_defaults(func=handle_run)
|
||||
|
||||
summarize_parser = subparsers.add_parser("summarize", help="Build stage handoff from an existing loop_dir.")
|
||||
add_common_args(summarize_parser)
|
||||
summarize_parser.add_argument("--loop-dir")
|
||||
summarize_parser.set_defaults(func=handle_summarize)
|
||||
return parser
|
||||
|
||||
|
||||
def main() -> int:
|
||||
parser = build_parser()
|
||||
args = parser.parse_args()
|
||||
try:
|
||||
return int(args.func(args))
|
||||
except Exception as error: # noqa: BLE001
|
||||
print(f"[stage-agent-loop] error: {error}", file=sys.stderr)
|
||||
return 1
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
|
||||
|
|
@ -114,6 +114,147 @@ class DomainCaseLoopStepStateTests(unittest.TestCase):
|
|||
self.assertEqual(reviewed["critical_findings_count"], 1)
|
||||
self.assertEqual(reviewed["review_findings"][0]["code"], "wrong_catalog_chain_top_match")
|
||||
|
||||
def test_business_first_review_flags_dirty_direct_answer_surface(self) -> None:
|
||||
step_state = dcl.build_scenario_step_state(
|
||||
scenario_id="business_surface_demo",
|
||||
domain="business_overview",
|
||||
step={
|
||||
"step_id": "step_01",
|
||||
"title": "Top year",
|
||||
"depends_on": [],
|
||||
"question_template": "какой у нас самый доходный год",
|
||||
},
|
||||
step_index=1,
|
||||
question_resolved="какой у нас самый доходный год",
|
||||
analysis_context={},
|
||||
turn_artifact={
|
||||
"assistant_message": {
|
||||
"reply_type": "partial_coverage",
|
||||
"text": "Коротко: Ограниченный бизнес-обзор по подтвержденным строкам 1С. " + ("лишний текст " * 220),
|
||||
"message_id": "msg-1",
|
||||
"trace_id": "trace-1",
|
||||
},
|
||||
"technical_debug_payload": {},
|
||||
"session_summary": {},
|
||||
},
|
||||
entries=[],
|
||||
)
|
||||
|
||||
review = step_state["business_first_review"]
|
||||
self.assertFalse(review["direct_answer_first_ok"])
|
||||
self.assertFalse(review["business_usefulness_ok"])
|
||||
self.assertIn("business_direct_answer_missing", review["issue_codes"])
|
||||
self.assertIn("answer_layering_noise", review["issue_codes"])
|
||||
self.assertIn("business_answer_too_verbose", review["issue_codes"])
|
||||
self.assertIn("business_direct_answer_missing", step_state["violated_invariants"])
|
||||
|
||||
def test_business_first_review_accepts_compact_direct_answer_surface(self) -> None:
|
||||
step_state = dcl.build_scenario_step_state(
|
||||
scenario_id="business_surface_demo",
|
||||
domain="business_overview",
|
||||
step={
|
||||
"step_id": "step_01",
|
||||
"title": "Top year",
|
||||
"depends_on": [],
|
||||
"question_template": "какой у нас самый доходный год",
|
||||
},
|
||||
step_index=1,
|
||||
question_resolved="какой у нас самый доходный год",
|
||||
analysis_context={},
|
||||
turn_artifact={
|
||||
"assistant_message": {
|
||||
"reply_type": "partial_coverage",
|
||||
"text": "Коротко: самый доходный год в доступном денежном контуре 1С — 2015: 136 723 459,73 руб.\nМетод: считаю по подтвержденным входящим поступлениям.",
|
||||
"message_id": "msg-1",
|
||||
"trace_id": "trace-1",
|
||||
},
|
||||
"technical_debug_payload": {},
|
||||
"session_summary": {},
|
||||
},
|
||||
entries=[],
|
||||
)
|
||||
|
||||
review = step_state["business_first_review"]
|
||||
self.assertTrue(review["direct_answer_first_ok"])
|
||||
self.assertTrue(review["business_usefulness_ok"])
|
||||
self.assertEqual(review["issue_codes"], [])
|
||||
|
||||
def test_business_first_review_separates_direct_answer_from_later_technical_leak(self) -> None:
|
||||
question = "\u043a\u0430\u043a\u043e\u0439 \u0443 \u043d\u0430\u0441 \u0441\u0430\u043c\u044b\u0439 \u0434\u043e\u0445\u043e\u0434\u043d\u044b\u0439 \u0433\u043e\u0434"
|
||||
step_state = dcl.build_scenario_step_state(
|
||||
scenario_id="business_surface_demo",
|
||||
domain="business_overview",
|
||||
step={
|
||||
"step_id": "step_01",
|
||||
"title": "Top year",
|
||||
"depends_on": [],
|
||||
"question_template": question,
|
||||
},
|
||||
step_index=1,
|
||||
question_resolved=question,
|
||||
analysis_context={},
|
||||
turn_artifact={
|
||||
"assistant_message": {
|
||||
"reply_type": "partial_coverage",
|
||||
"text": "2015 \u2014 \u0441\u0430\u043c\u044b\u0439 \u0434\u043e\u0445\u043e\u0434\u043d\u044b\u0439 \u0433\u043e\u0434 \u043f\u043e \u043f\u043e\u0434\u0442\u0432\u0435\u0440\u0436\u0434\u0435\u043d\u043d\u044b\u043c \u0432\u0445\u043e\u0434\u044f\u0449\u0438\u043c \u0434\u0435\u043d\u044c\u0433\u0430\u043c.\nservice: capability_id=business_overview_route_template_v1",
|
||||
"message_id": "msg-1",
|
||||
"trace_id": "trace-1",
|
||||
},
|
||||
"technical_debug_payload": {},
|
||||
"session_summary": {},
|
||||
},
|
||||
entries=[],
|
||||
)
|
||||
|
||||
review = step_state["business_first_review"]
|
||||
self.assertTrue(review["direct_answer_first_ok"])
|
||||
self.assertTrue(review["technical_garbage_present"])
|
||||
self.assertIn("technical_garbage_in_answer", review["issue_codes"])
|
||||
self.assertNotIn("business_direct_answer_missing", review["issue_codes"])
|
||||
|
||||
def test_truth_harness_promotes_business_review_issues_to_findings(self) -> None:
|
||||
step_state = dcl.build_scenario_step_state(
|
||||
scenario_id="business_surface_demo",
|
||||
domain="business_overview",
|
||||
step={
|
||||
"step_id": "step_01",
|
||||
"title": "Top year",
|
||||
"depends_on": [],
|
||||
"question_template": "какой у нас самый доходный год",
|
||||
},
|
||||
step_index=1,
|
||||
question_resolved="какой у нас самый доходный год",
|
||||
analysis_context={},
|
||||
turn_artifact={
|
||||
"assistant_message": {
|
||||
"reply_type": "partial_coverage",
|
||||
"text": "Коротко: Ограниченный бизнес-обзор по подтвержденным строкам 1С. " + ("лишний текст " * 220),
|
||||
"message_id": "msg-1",
|
||||
"trace_id": "trace-1",
|
||||
},
|
||||
"technical_debug_payload": {},
|
||||
"session_summary": {},
|
||||
},
|
||||
entries=[],
|
||||
)
|
||||
reviewed = dth.evaluate_truth_step(
|
||||
step={
|
||||
"step_id": "step_01",
|
||||
"question_template": "какой у нас самый доходный год",
|
||||
"criticality": "critical",
|
||||
"allowed_reply_types": [],
|
||||
},
|
||||
step_state=step_state,
|
||||
step_results={},
|
||||
bindings={},
|
||||
runtime_bindings={},
|
||||
)
|
||||
|
||||
codes = [item["code"] for item in reviewed["review_findings"]]
|
||||
self.assertIn("business_review:business_direct_answer_missing", codes)
|
||||
self.assertIn("business_review:answer_layering_noise", codes)
|
||||
self.assertEqual(reviewed["review_status"], "fail")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
unittest.main()
|
||||
|
|
|
|||
|
|
@ -0,0 +1,155 @@
|
|||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import sys
|
||||
import tempfile
|
||||
import unittest
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).resolve().parent))
|
||||
|
||||
import review_assistant_stage1_run as reviewer
|
||||
|
||||
|
||||
def write_json(path: Path, payload: object) -> None:
|
||||
path.parent.mkdir(parents=True, exist_ok=True)
|
||||
path.write_text(json.dumps(payload, ensure_ascii=False, indent=2) + "\n", encoding="utf-8")
|
||||
|
||||
|
||||
def session_payload(conversation: list[dict[str, object]]) -> dict[str, object]:
|
||||
return {
|
||||
"schema_version": "assistant_session_v1",
|
||||
"session_id": "assistant-stage1-test-SAVED-001",
|
||||
"started_at": "2026-05-09T00:00:00Z",
|
||||
"updated_at": "2026-05-09T00:01:00Z",
|
||||
"conversation": conversation,
|
||||
"address_navigation_state": {"session_context": {}},
|
||||
"investigation_state": {},
|
||||
"counters": {},
|
||||
"reply_types": {},
|
||||
}
|
||||
|
||||
|
||||
class AssistantStage1RunReviewTests(unittest.TestCase):
|
||||
def test_builds_conversation_pairs_without_crossing_next_user_turn(self) -> None:
|
||||
conversation = [
|
||||
{"role": "user", "text": "первый вопрос"},
|
||||
{"role": "assistant", "text": "первый ответ"},
|
||||
{"role": "user", "text": "второй вопрос"},
|
||||
]
|
||||
|
||||
pairs = reviewer.build_conversation_pairs(conversation)
|
||||
|
||||
self.assertEqual(len(pairs), 2)
|
||||
self.assertEqual(pairs[0]["assistant"]["text"], "первый ответ")
|
||||
self.assertIsNone(pairs[1]["assistant"])
|
||||
|
||||
def test_review_flags_dirty_business_answer_and_writes_repair_targets(self) -> None:
|
||||
with tempfile.TemporaryDirectory() as tmp:
|
||||
root = Path(tmp)
|
||||
sessions_dir = root / "sessions"
|
||||
reports_dir = root / "reports"
|
||||
run_id = "assistant-stage1-test123"
|
||||
session_file = sessions_dir / f"{run_id}-SAVED-001.json"
|
||||
report_file = reports_dir / f"{run_id}.md"
|
||||
write_json(
|
||||
session_file,
|
||||
session_payload(
|
||||
[
|
||||
{"role": "user", "text": "какой у нас самый доходный год"},
|
||||
{
|
||||
"role": "assistant",
|
||||
"text": "Коротко: Ограниченный бизнес-обзор по подтвержденным строкам 1С. "
|
||||
+ ("лишний текст " * 220),
|
||||
"reply_type": "partial_coverage",
|
||||
"message_id": "a-1",
|
||||
"trace_id": "trace-1",
|
||||
"debug": {"capability_id": "business_overview_route_template_v1"},
|
||||
},
|
||||
{"role": "user", "text": "по нему покажи документы"},
|
||||
{
|
||||
"role": "assistant",
|
||||
"text": "Документы по выбранному году не найдены в подтвержденном контуре.",
|
||||
"reply_type": "factual_with_explanation",
|
||||
"message_id": "a-2",
|
||||
"trace_id": "trace-2",
|
||||
"debug": {},
|
||||
},
|
||||
]
|
||||
),
|
||||
)
|
||||
report_file.parent.mkdir(parents=True, exist_ok=True)
|
||||
report_file.write_text(
|
||||
"# Assistant Stage 1 Eval Run\n\n"
|
||||
f"- run_id: {run_id}\n"
|
||||
"- suite_id: assistant_saved_session_runtime_job-test\n",
|
||||
encoding="utf-8",
|
||||
)
|
||||
|
||||
review = reviewer.build_run_review(
|
||||
run_id=run_id,
|
||||
session_files=[session_file],
|
||||
report_path=report_file,
|
||||
)
|
||||
|
||||
self.assertEqual(review["summary"]["overall_business_status"], "fail")
|
||||
self.assertEqual(review["summary"]["turn_pairs_total"], 2)
|
||||
self.assertGreaterEqual(review["summary"]["p0_findings"], 1)
|
||||
self.assertIn("business_direct_answer_missing", review["summary"]["issue_counts"])
|
||||
self.assertTrue(review["repair_targets"])
|
||||
target_by_issue = {item["issue_code"]: item for item in review["repair_targets"]}
|
||||
self.assertEqual(target_by_issue["business_direct_answer_missing"]["severity"], "P0")
|
||||
self.assertEqual(target_by_issue["business_answer_too_verbose"]["severity"], "P1")
|
||||
self.assertEqual(review["question_quality_review"]["turns_total"], 2)
|
||||
self.assertIn("contextual_followup", review["question_quality_review"]["tag_counts"])
|
||||
|
||||
def test_save_run_review_materializes_machine_and_markdown_artifacts(self) -> None:
|
||||
with tempfile.TemporaryDirectory() as tmp:
|
||||
output_dir = Path(tmp) / "review"
|
||||
review = {
|
||||
"run_id": "assistant-stage1-test123",
|
||||
"summary": {
|
||||
"overall_business_status": "pass",
|
||||
"turn_pairs_total": 1,
|
||||
"business_issue_turns": 0,
|
||||
"p0_findings": 0,
|
||||
"p1_findings": 0,
|
||||
"question_quality_status": "strong",
|
||||
"question_quality_score": 95,
|
||||
},
|
||||
"question_quality_review": {"status": "strong", "score": 95},
|
||||
"findings": [],
|
||||
"repair_targets": [],
|
||||
"conversation_pairs": [],
|
||||
}
|
||||
|
||||
reviewer.save_run_review(review, output_dir)
|
||||
|
||||
self.assertTrue((output_dir / "run_review.json").exists())
|
||||
self.assertTrue((output_dir / "run_review.md").exists())
|
||||
markdown = (output_dir / "run_review.md").read_text(encoding="utf-8")
|
||||
self.assertIn("overall_business_status", markdown)
|
||||
self.assertIn("Question Quality", markdown)
|
||||
|
||||
def test_question_quality_treats_short_natural_followups_as_contextual(self) -> None:
|
||||
pairs = [
|
||||
{"pair_index": 1, "user": {"text": "приветик - че как там дела"}},
|
||||
{"pair_index": 2, "user": {"text": "какие остатки на складе"}},
|
||||
{"pair_index": 3, "user": {"text": "давай на июль 2017"}},
|
||||
{"pair_index": 4, "user": {"text": "март 2016"}},
|
||||
{"pair_index": 5, "user": {"text": "а кому продали?"}},
|
||||
{"pair_index": 6, "user": {"text": "кто нам должен денег на май 2017"}},
|
||||
{"pair_index": 7, "user": {"text": "а по свк"}},
|
||||
]
|
||||
|
||||
review = reviewer.build_question_quality_review(pairs)
|
||||
|
||||
self.assertNotIn("root_question_requires_missing_context", review["weak_flag_counts"])
|
||||
self.assertNotIn("low_business_anchor", review["weak_flag_counts"])
|
||||
self.assertGreaterEqual(review["tag_counts"]["contextual_followup"], 3)
|
||||
self.assertGreaterEqual(review["tag_counts"]["direct_business_question"], 2)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
unittest.main()
|
||||
|
|
@ -0,0 +1,241 @@
|
|||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import sys
|
||||
import tempfile
|
||||
import unittest
|
||||
from pathlib import Path
|
||||
from types import SimpleNamespace
|
||||
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).resolve().parent))
|
||||
|
||||
import save_agent_semantic_run as saver
|
||||
|
||||
|
||||
def write_json(path: Path, payload: object) -> None:
|
||||
path.parent.mkdir(parents=True, exist_ok=True)
|
||||
path.write_text(json.dumps(payload, ensure_ascii=False, indent=2) + "\n", encoding="utf-8")
|
||||
|
||||
|
||||
class SaveAgentSemanticRunTests(unittest.TestCase):
|
||||
def test_extract_questions_accepts_truth_harness_question_template(self) -> None:
|
||||
questions = saver.extract_questions_from_spec(
|
||||
{
|
||||
"steps": [
|
||||
{"step_id": "step_01", "question_template": "first question"},
|
||||
{"step_id": "step_02", "question": "second question"},
|
||||
]
|
||||
}
|
||||
)
|
||||
|
||||
self.assertEqual(questions, ["first question", "second question"])
|
||||
|
||||
def test_extract_questions_accepts_domain_pack_scenarios(self) -> None:
|
||||
questions = saver.extract_questions_from_spec(
|
||||
{
|
||||
"pack_id": "demo_pack",
|
||||
"scenarios": [
|
||||
{
|
||||
"scenario_id": "scenario_01",
|
||||
"steps": [
|
||||
{"step_id": "step_01", "question_template": "first question"},
|
||||
{"step_id": "step_02", "question": "second question"},
|
||||
],
|
||||
},
|
||||
{
|
||||
"scenario_id": "scenario_02",
|
||||
"steps": [
|
||||
{"step_id": "step_01", "question": "first question"},
|
||||
{"step_id": "step_02", "question": "third question"},
|
||||
],
|
||||
},
|
||||
],
|
||||
}
|
||||
)
|
||||
|
||||
self.assertEqual(questions, ["first question", "second question", "third question"])
|
||||
|
||||
def test_validate_accepted_run_dir_accepts_clean_business_review(self) -> None:
|
||||
with tempfile.TemporaryDirectory() as tmp:
|
||||
run_dir = Path(tmp)
|
||||
write_json(
|
||||
run_dir / "pack_state.json",
|
||||
{
|
||||
"final_status": "accepted",
|
||||
"review_overall_status": "pass",
|
||||
"acceptance_gate_passed": True,
|
||||
"no_unresolved_p0": True,
|
||||
"unresolved_p0_count": 0,
|
||||
"steps_total": 1,
|
||||
"steps_passed": 1,
|
||||
"steps_failed": 0,
|
||||
},
|
||||
)
|
||||
write_json(run_dir / "truth_review.json", {"summary": {"overall_status": "pass"}})
|
||||
write_json(
|
||||
run_dir / "business_review.json",
|
||||
{
|
||||
"overall_business_status": "pass",
|
||||
"steps_with_business_failures": 0,
|
||||
"steps_with_business_warnings": 0,
|
||||
},
|
||||
)
|
||||
|
||||
metadata = saver.validate_accepted_run_dir(run_dir)
|
||||
|
||||
self.assertEqual(metadata["validation_status"], "accepted_live_replay")
|
||||
self.assertTrue(metadata["saved_after_validated_replay"])
|
||||
|
||||
def test_validate_accepted_run_dir_rejects_business_review_failures(self) -> None:
|
||||
with tempfile.TemporaryDirectory() as tmp:
|
||||
run_dir = Path(tmp)
|
||||
write_json(
|
||||
run_dir / "pack_state.json",
|
||||
{
|
||||
"final_status": "accepted",
|
||||
"review_overall_status": "pass",
|
||||
"acceptance_gate_passed": True,
|
||||
"no_unresolved_p0": True,
|
||||
"unresolved_p0_count": 0,
|
||||
},
|
||||
)
|
||||
write_json(run_dir / "truth_review.json", {"summary": {"overall_status": "pass"}})
|
||||
write_json(
|
||||
run_dir / "business_review.json",
|
||||
{
|
||||
"overall_business_status": "fail",
|
||||
"steps_with_business_failures": 1,
|
||||
},
|
||||
)
|
||||
|
||||
with self.assertRaisesRegex(RuntimeError, "business_review"):
|
||||
saver.validate_accepted_run_dir(run_dir)
|
||||
|
||||
def test_validate_accepted_run_dir_accepts_clean_domain_pack_loop(self) -> None:
|
||||
with tempfile.TemporaryDirectory() as tmp:
|
||||
loop_dir = Path(tmp)
|
||||
iteration_dir = loop_dir / "iterations" / "iteration_00"
|
||||
analyst_path = iteration_dir / "analyst_verdict.json"
|
||||
repair_targets_path = iteration_dir / "pack_output" / "pack_run" / "repair_targets.json"
|
||||
write_json(
|
||||
loop_dir / "loop_state.json",
|
||||
{
|
||||
"loop_id": "stage_demo",
|
||||
"target_score": 88,
|
||||
"final_status": "accepted",
|
||||
"iterations": [
|
||||
{
|
||||
"iteration_id": "iteration_00",
|
||||
"quality_score": 91,
|
||||
"accepted_gate": True,
|
||||
"analyst_accepted_gate": True,
|
||||
"deterministic_gate_ok": True,
|
||||
"repair_target_count": 0,
|
||||
"repair_target_severity_counts": {"P0": 0, "P1": 0, "P2": 0},
|
||||
"analyst_verdict_path": str(analyst_path),
|
||||
"repair_targets_path": str(repair_targets_path),
|
||||
}
|
||||
],
|
||||
},
|
||||
)
|
||||
write_json(
|
||||
analyst_path,
|
||||
{
|
||||
"loop_decision": "accepted",
|
||||
"unresolved_p0_count": 0,
|
||||
"regression_detected": False,
|
||||
"direct_answer_ok": True,
|
||||
"business_usefulness_ok": True,
|
||||
"temporal_honesty_ok": True,
|
||||
"field_truth_ok": True,
|
||||
"answer_layering_ok": True,
|
||||
},
|
||||
)
|
||||
write_json(repair_targets_path, {"severity_counts": {"P0": 0, "P1": 0, "P2": 0}})
|
||||
|
||||
metadata = saver.validate_accepted_run_dir(loop_dir)
|
||||
|
||||
self.assertEqual(metadata["validation_status"], "accepted_domain_pack_loop")
|
||||
self.assertEqual(metadata["quality_score"], 91)
|
||||
|
||||
def test_validate_accepted_run_dir_rejects_domain_pack_loop_with_p1_targets(self) -> None:
|
||||
with tempfile.TemporaryDirectory() as tmp:
|
||||
loop_dir = Path(tmp)
|
||||
iteration_dir = loop_dir / "iterations" / "iteration_00"
|
||||
analyst_path = iteration_dir / "analyst_verdict.json"
|
||||
repair_targets_path = iteration_dir / "pack_output" / "pack_run" / "repair_targets.json"
|
||||
write_json(
|
||||
loop_dir / "loop_state.json",
|
||||
{
|
||||
"loop_id": "stage_demo",
|
||||
"target_score": 88,
|
||||
"final_status": "accepted",
|
||||
"iterations": [
|
||||
{
|
||||
"quality_score": 91,
|
||||
"accepted_gate": True,
|
||||
"analyst_accepted_gate": True,
|
||||
"deterministic_gate_ok": True,
|
||||
"analyst_verdict_path": str(analyst_path),
|
||||
"repair_targets_path": str(repair_targets_path),
|
||||
}
|
||||
],
|
||||
},
|
||||
)
|
||||
write_json(
|
||||
analyst_path,
|
||||
{
|
||||
"loop_decision": "accepted",
|
||||
"unresolved_p0_count": 0,
|
||||
"regression_detected": False,
|
||||
"direct_answer_ok": True,
|
||||
"business_usefulness_ok": True,
|
||||
"temporal_honesty_ok": True,
|
||||
"field_truth_ok": True,
|
||||
"answer_layering_ok": True,
|
||||
},
|
||||
)
|
||||
write_json(repair_targets_path, {"severity_counts": {"P0": 0, "P1": 1, "P2": 0}})
|
||||
|
||||
with self.assertRaisesRegex(RuntimeError, "repair_targets"):
|
||||
saver.validate_accepted_run_dir(loop_dir)
|
||||
|
||||
def test_save_gate_refuses_real_write_without_validation(self) -> None:
|
||||
args = SimpleNamespace(
|
||||
validated_run_dir=None,
|
||||
dry_run=False,
|
||||
allow_unvalidated=False,
|
||||
unvalidated_reason=None,
|
||||
)
|
||||
|
||||
with self.assertRaisesRegex(RuntimeError, "Refusing to save AGENT autorun"):
|
||||
saver.build_save_gate_metadata(args, {}, Path("demo.json"))
|
||||
|
||||
def test_save_gate_requires_reason_for_unvalidated_draft(self) -> None:
|
||||
args = SimpleNamespace(
|
||||
validated_run_dir=None,
|
||||
dry_run=False,
|
||||
allow_unvalidated=True,
|
||||
unvalidated_reason="",
|
||||
)
|
||||
|
||||
with self.assertRaisesRegex(RuntimeError, "--unvalidated-reason"):
|
||||
saver.build_save_gate_metadata(args, {}, Path("demo.json"))
|
||||
|
||||
def test_save_gate_marks_explicit_unvalidated_draft(self) -> None:
|
||||
args = SimpleNamespace(
|
||||
validated_run_dir=None,
|
||||
dry_run=False,
|
||||
allow_unvalidated=True,
|
||||
unvalidated_reason="manual GUI canary before live replay",
|
||||
)
|
||||
|
||||
metadata = saver.build_save_gate_metadata(args, {}, Path("demo.json"))
|
||||
|
||||
self.assertEqual(metadata["validation_status"], "explicitly_unvalidated")
|
||||
self.assertFalse(metadata["saved_after_validated_replay"])
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
unittest.main()
|
||||
|
|
@ -0,0 +1,153 @@
|
|||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import sys
|
||||
import tempfile
|
||||
import unittest
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).resolve().parent))
|
||||
|
||||
import stage_agent_loop as stage_loop
|
||||
|
||||
|
||||
def write_json(path: Path, payload: object) -> None:
|
||||
path.parent.mkdir(parents=True, exist_ok=True)
|
||||
path.write_text(json.dumps(payload, ensure_ascii=False, indent=2) + "\n", encoding="utf-8")
|
||||
|
||||
|
||||
def args() -> argparse.Namespace:
|
||||
return argparse.Namespace(
|
||||
backend_url="http://127.0.0.1:8787",
|
||||
prompt_version="address_query_runtime_v1",
|
||||
llm_provider="local",
|
||||
llm_model="qwen2.5-14b-instruct-1m",
|
||||
llm_base_url="http://127.0.0.1:1234/v1",
|
||||
llm_api_key="",
|
||||
temperature=0.0,
|
||||
max_output_tokens=2048,
|
||||
timeout_seconds=180,
|
||||
codex_binary="codex",
|
||||
codex_profile=None,
|
||||
codex_model=None,
|
||||
analyst_codex_model="gpt-5.4",
|
||||
coder_codex_model="gpt-5.4-mini",
|
||||
analyst_reasoning_effort="medium",
|
||||
coder_reasoning_effort="low",
|
||||
codex_timeout_seconds=1800,
|
||||
analysis_date=None,
|
||||
max_scenarios=None,
|
||||
use_mock=False,
|
||||
)
|
||||
|
||||
|
||||
class StageAgentLoopTests(unittest.TestCase):
|
||||
def test_load_stage_manifest_defaults_gate_fields(self) -> None:
|
||||
with tempfile.TemporaryDirectory() as tmp:
|
||||
manifest_path = Path(tmp) / "stage.json"
|
||||
write_json(
|
||||
manifest_path,
|
||||
{
|
||||
"stage_id": "open_world_control_gate",
|
||||
"module_name": "Open-World Bounded Autonomy Breadth",
|
||||
"title": "Open-world semantic control gate",
|
||||
"pack_manifest": "docs/orchestration/demo_pack.json",
|
||||
},
|
||||
)
|
||||
|
||||
manifest = stage_loop.load_stage_manifest(manifest_path)
|
||||
|
||||
self.assertEqual(manifest["target_score"], 88)
|
||||
self.assertEqual(manifest["max_iterations"], 6)
|
||||
self.assertTrue(manifest["save_autorun_on_accept"])
|
||||
self.assertTrue(manifest["manual_confirmation_required_after_accept"])
|
||||
|
||||
def test_build_domain_pack_loop_command_uses_stage_gate(self) -> None:
|
||||
manifest = {
|
||||
"stage_id": "open_world_control_gate",
|
||||
"pack_manifest": "docs/orchestration/demo_pack.json",
|
||||
"target_score": 91,
|
||||
"max_iterations": 4,
|
||||
}
|
||||
command = stage_loop.build_domain_pack_loop_command(args(), manifest, Path("X:/repo/stage"))
|
||||
|
||||
self.assertIn("run-pack-loop", command)
|
||||
self.assertIn("--target-score", command)
|
||||
self.assertIn("91", command)
|
||||
self.assertIn("--max-iterations", command)
|
||||
self.assertIn("4", command)
|
||||
self.assertIn("--output-root", command)
|
||||
|
||||
def test_build_stage_summary_requests_manual_confirmation_after_accept(self) -> None:
|
||||
with tempfile.TemporaryDirectory() as tmp:
|
||||
loop_dir = Path(tmp)
|
||||
write_json(
|
||||
loop_dir / "loop_state.json",
|
||||
{
|
||||
"final_status": "accepted",
|
||||
"target_score": 88,
|
||||
"stop_reason": "analyst accepted + deterministic gate passed",
|
||||
"iterations": [
|
||||
{
|
||||
"quality_score": 93,
|
||||
"loop_decision": "accepted",
|
||||
"accepted_gate": True,
|
||||
"deterministic_gate_ok": True,
|
||||
}
|
||||
],
|
||||
},
|
||||
)
|
||||
|
||||
summary = stage_loop.build_stage_summary(
|
||||
{
|
||||
"stage_id": "open_world_control_gate",
|
||||
"module_name": "Open-World Bounded Autonomy Breadth",
|
||||
"title": "Open-world semantic control gate",
|
||||
"target_score": 88,
|
||||
"manual_confirmation_required_after_accept": True,
|
||||
},
|
||||
loop_dir,
|
||||
)
|
||||
|
||||
self.assertEqual(summary["loop_final_status"], "accepted")
|
||||
self.assertTrue(summary["manual_confirmation_required"])
|
||||
self.assertEqual(summary["next_action"], "manual_gui_confirmation")
|
||||
|
||||
def test_build_stage_summary_continues_when_loop_is_partial(self) -> None:
|
||||
with tempfile.TemporaryDirectory() as tmp:
|
||||
loop_dir = Path(tmp)
|
||||
write_json(
|
||||
loop_dir / "loop_state.json",
|
||||
{
|
||||
"final_status": "partial",
|
||||
"target_score": 88,
|
||||
"iterations": [
|
||||
{
|
||||
"quality_score": 76,
|
||||
"loop_decision": "continue",
|
||||
"accepted_gate": False,
|
||||
"deterministic_gate_ok": False,
|
||||
"deterministic_gate_reason": "repair_targets_remaining=P1:1",
|
||||
}
|
||||
],
|
||||
},
|
||||
)
|
||||
|
||||
summary = stage_loop.build_stage_summary(
|
||||
{
|
||||
"stage_id": "open_world_control_gate",
|
||||
"module_name": "Open-World Bounded Autonomy Breadth",
|
||||
"title": "Open-world semantic control gate",
|
||||
"target_score": 88,
|
||||
},
|
||||
loop_dir,
|
||||
)
|
||||
|
||||
self.assertFalse(summary["manual_confirmation_required"])
|
||||
self.assertEqual(summary["next_action"], "continue_autonomous_or_fix_blocker")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
unittest.main()
|
||||
Loading…
Reference in New Issue