Domain Scenario Loop - Repo Adapter
Purpose
This repository now supports four outer-loop capture modes:
- run-case for one concrete domain question;
- run-scenario for a linked multi-step domain chain that should reuse one assistant session;
- run-pack for a whole domain question pool grouped into several scenarios;
- run-pack-loop for a strong analyst review loop over a whole domain pack, with Lead Codex repair handoff by default.
run-scenario is the preferred capture mode for domains where the user's next question depends on the previous result set.
run-pack is the preferred capture mode when the user brings a full domain pool that should be kept in one aggregate backlog.
Runtime contract
The scenario runner does not introduce a new product runtime.
It reuses:
- POST /api/assistant/message
- GET /api/assistant/session/:session_id
- current backend LLM/profile configuration
- current address/deep routing inside the product
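As a minimal sketch of how a runner could reuse those two endpoints, the snippet below only builds the HTTP requests and does not send them. The base URL, the payload field names (session_id, message), and the helper names are assumptions for illustration; this document does not specify the request schema.

```python
import json
from urllib.request import Request

# Assumption: the backend address; the real host and port are not given here.
BASE_URL = "http://localhost:8000"


def build_message_request(session_id: str, text: str) -> Request:
    """Build (but do not send) a POST /api/assistant/message request."""
    # Assumption: these payload field names are hypothetical.
    body = json.dumps({"session_id": session_id, "message": text}).encode("utf-8")
    return Request(
        f"{BASE_URL}/api/assistant/message",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_session_request(session_id: str) -> Request:
    """Build (but do not send) a GET /api/assistant/session/:session_id request."""
    return Request(f"{BASE_URL}/api/assistant/session/{session_id}", method="GET")
```

Passing either object to urllib.request.urlopen would perform the call; keeping construction separate makes the sketch easy to inspect without a running backend.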
Artifact contract
Scenario artifacts live under:
artifacts/domain_runs/<scenario_id>/
Top-level artifacts:
- scenario_brief.md
- scenario_manifest.json
- scenario_state.json
- scenario_summary.md
- scenario_output.md
- final_status.md
Per-step artifacts:
- steps/<step_id>/output.md
- steps/<step_id>/debug.json
- steps/<step_id>/turn.json
- steps/<step_id>/session.json
- steps/<step_id>/assistant_response.json
- steps/<step_id>/step_state.json
Pack artifacts live under:
artifacts/domain_runs/<pack_id>/
- pack_manifest.json
- pack_state.json
- pack_summary.md
- final_status.md
- scenarios/<scenario_id>/...
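The scenario layout above can be checked mechanically. The following is a hedged sketch, not part of the repository: the function name and the idea of passing step ids explicitly are assumptions, while the file names come from the artifact contract.

```python
from pathlib import Path

# File names taken from the artifact contract above.
TOP_LEVEL = [
    "scenario_brief.md",
    "scenario_manifest.json",
    "scenario_state.json",
    "scenario_summary.md",
    "scenario_output.md",
    "final_status.md",
]
PER_STEP = [
    "output.md",
    "debug.json",
    "turn.json",
    "session.json",
    "assistant_response.json",
    "step_state.json",
]


def missing_scenario_artifacts(scenario_dir: Path, step_ids: list[str]) -> list[str]:
    """Return relative paths of expected artifacts that are absent on disk."""
    expected = list(TOP_LEVEL)
    for step_id in step_ids:
        expected += [f"steps/{step_id}/{name}" for name in PER_STEP]
    return [rel for rel in expected if not (scenario_dir / rel).is_file()]
```

An empty result means the directory under artifacts/domain_runs/<scenario_id>/ holds every file the contract names for the given steps.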
AGENT autorun save gate
scripts/save_agent_semantic_run.py is a post-validation persistence tool, not a replay executor.
The normal path is:
- build/update the truth-harness spec;
- run python scripts/domain_truth_harness.py run-live --spec ... --output-dir artifacts/domain_runs/<run_id>;
- inspect truth_review.md, business_review.md, pack_state.json, and final_status.md;
- save to GUI autoruns only with python scripts/save_agent_semantic_run.py --spec ... --validated-run-dir artifacts/domain_runs/<run_id>.
The save gate requires:
- pack_state.final_status = accepted;
- pack_state.acceptance_gate_passed = true;
- truth_review.summary.overall_status = pass;
- business_review.overall_business_status = pass;
- zero unresolved P0 and zero business-answer failures.
If a pack must be saved as a deliberate manual draft before live acceptance, use
--allow-unvalidated --unvalidated-reason "<why this is intentionally not accepted>".
That path is explicitly marked as unvalidated and must not be treated as semantic proof.
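The save-gate conditions above can be sketched as a single predicate. This is an illustration under assumptions, not the logic of save_agent_semantic_run.py: it assumes the pack state and both reviews are already loaded as dictionaries, and it takes the P0 and business-answer failure counts as plain integers because the document does not name the fields that hold them.

```python
def save_gate_passes(
    pack_state: dict,
    truth_review: dict,
    business_review: dict,
    unresolved_p0: int,             # hypothetical parameter: count source is not documented
    business_answer_failures: int,  # hypothetical parameter: count source is not documented
) -> bool:
    """Return True only when every documented save-gate condition holds."""
    return (
        pack_state.get("final_status") == "accepted"
        and pack_state.get("acceptance_gate_passed") is True
        and truth_review.get("summary", {}).get("overall_status") == "pass"
        and business_review.get("overall_business_status") == "pass"
        and unresolved_p0 == 0
        and business_answer_failures == 0
    )
```

Any single failing condition keeps the pack out of GUI autoruns, which matches the rule that the unvalidated path is a deliberate, explicitly marked exception.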
Stage-level AGENT loop
scripts/stage_agent_loop.py wraps the domain pack loop into the development-stage workflow:
- take the current global/local stage manifest;
- run scripts/domain_case_loop.py run-pack-loop for that stage pack;
- let the loop run pack replay and a business-first analyst verdict; if the gate is not accepted, write business_audit.md and lead_coder_handoff.md instead of launching a weak coder by default;
- if accepted, persist the validated AGENT pack into GUI autoruns through scripts/save_agent_semantic_run.py --validated-run-dir;
- write stage_loop_summary.json and stage_loop_handoff.md for the final human visual confirmation.
The stage manifest schema is docs/orchestration/schemas/stage_agent_loop_manifest.schema.json.
The default stage gate is intentionally stricter than a narrow case gate: target_score = 88, no unresolved P0/P1 repair targets, accepted analyst verdict, clean business usefulness, direct-answer, temporal-honesty, field-truth, and answer-layering flags.
Canonical commands:
python scripts/stage_agent_loop.py plan --manifest docs/orchestration/<stage_loop>.json
python scripts/stage_agent_loop.py run --manifest docs/orchestration/<stage_loop>.json
python scripts/stage_agent_loop.py review-questions --manifest docs/orchestration/<stage_loop>.json
python scripts/stage_agent_loop.py ingest-gui-run --manifest docs/orchestration/<stage_loop>.json --run-id assistant-stage1-<id>
python scripts/stage_agent_loop.py prepare-repair --manifest docs/orchestration/<stage_loop>.json
python scripts/stage_agent_loop.py run-repair --manifest docs/orchestration/<stage_loop>.json --dry-run
python scripts/stage_agent_loop.py status --manifest docs/orchestration/<stage_loop>.json
python scripts/stage_agent_loop.py continue --manifest docs/orchestration/<stage_loop>.json
python scripts/stage_agent_loop.py summarize --manifest docs/orchestration/<stage_loop>.json
This is the intended path for "implement the stage, generate/check stage questions, analyze business answers, patch code, rerun, then ask the user for final visual confirmation".
The default repair mode is lead-handoff. In this mode the expensive replay still runs live and the independent analyst still produces the strict business verdict, but code repair stays with the main Lead Codex context. The loop stops with next_action = lead_coder_repair_required, plus:
- business_audit.md for the user-facing semantic/business verdict;
- lead_coder_handoff.md/json for the concrete repair target, candidate files, and validation path;
- stage_context_capsule.md/json for the current stage contract, question quality, loop status, and operating model.
auto-coder remains available only as an explicit opt-in experiment:
python scripts/stage_agent_loop.py run --manifest docs/orchestration/<stage_loop>.json --repair-mode auto-coder
That path must not be treated as the normal high-trust repair mode for this project.
Before launching an expensive live replay, run review-questions. It reads the stage pack, resolves {{bindings.*}} placeholders, checks scenario/follow-up density, direct-answer shape declarations, domain coverage, stale-scope canaries, dependency order, duplicates, mojibake in generated Russian questions, and estimated Windows artifact path length. It writes:
- question_generation_review.json;
- question_generation_review.md.
A strong question review is not semantic proof that the assistant answers correctly. It is the pre-flight gate that says the generated questions are worth spending a live replay on.
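To make the pre-flight idea concrete, here is a small sketch of two of the checks named above: duplicate questions and {{bindings.*}} placeholders that were left unresolved. It is only an illustration; the function name, the normalization rule, and the finding messages are assumptions, and the real review-questions command covers many more checks.

```python
import re

# Matches any leftover {{bindings.something}} placeholder.
BINDING_RE = re.compile(r"\{\{\s*bindings\.[^}]*\}\}")


def preflight_findings(questions: list[str]) -> list[str]:
    """Flag duplicate questions and unresolved binding placeholders."""
    findings = []
    seen = set()
    for index, question in enumerate(questions):
        # Assumption: duplicates are compared case-insensitively with collapsed whitespace.
        normalized = " ".join(question.lower().split())
        if normalized in seen:
            findings.append(f"duplicate question at index {index}")
        seen.add(normalized)
        if BINDING_RE.search(question):
            findings.append(f"unresolved binding at index {index}")
    return findings
```

An empty list from a check like this says nothing about answer quality; it only means the pack is not obviously wasteful to replay.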
GUI run review bridge
When a manual or GUI autorun already exists, scripts/review_assistant_stage1_run.py turns the run id into the same machine-readable review surface.
Canonical command:
python scripts/review_assistant_stage1_run.py assistant-stage1-<id> --print-summary
The script resolves:
- llm_normalizer/reports/assistant-stage1-<id>.md;
- llm_normalizer/data/assistant_sessions/assistant-stage1-<id>-*.json.
It writes:
- artifacts/domain_runs/gui_run_reviews/assistant-stage1-<id>/run_review.json;
- artifacts/domain_runs/gui_run_reviews/assistant-stage1-<id>/run_review.md;
- conversation_pairs.json;
- question_quality_review.json;
- repair_targets.json.
This bridge is intentionally business-first:
- the user's question and visible assistant answer are reviewed before route ids and debug fields;
- noisy direct answers, missing first-line answers, technical garbage, and over-broad business answers become findings;
- generated question packs get a deterministic quality review for follow-up density, direct questions, report-style analysis, domain diversity, duplicates, and weak business anchors.
Use this bridge when the operator would otherwise say "check the assistant-stage1-... run". The expected next step is no longer manual eyeballing first; it is: review by id, inspect run_review.md, map repair_targets.json into the current stage loop, patch, and rerun.
For stage work, prefer the integrated command:
python scripts/stage_agent_loop.py ingest-gui-run --manifest docs/orchestration/<stage_loop>.json --run-id assistant-stage1-<id>
It stores the GUI review under artifacts/domain_runs/stage_agent_loops/<stage_id>/gui_run_reviews/<run_id>/, updates stage_loop_summary.json, and writes the next stage action:
- continue_repair_from_gui_review_p0 when the GUI run exposes business-wrong or missing direct-answer defects;
- continue_repair_from_gui_review_p1 when the run is semantically usable but still noisy, over-broad, or poorly layered;
- manual_gui_confirmation_or_stage_close when the GUI run is clean enough for final human confirmation.
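The three outcomes above amount to a severity-ordered decision. A minimal sketch, assuming the GUI review has already been reduced to counts of P0 and P1 findings (the function name and the count-based inputs are hypothetical; the action names are the ones listed above):

```python
def next_stage_action(p0_findings: int, p1_findings: int) -> str:
    """Pick the next stage action from GUI review finding counts."""
    if p0_findings > 0:
        # Business-wrong or missing direct-answer defects take priority.
        return "continue_repair_from_gui_review_p0"
    if p1_findings > 0:
        # Usable but noisy, over-broad, or poorly layered answers.
        return "continue_repair_from_gui_review_p1"
    return "manual_gui_confirmation_or_stage_close"
```

Checking P0 first means a run with both kinds of defect is always routed to the more serious repair path.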
stage_loop_summary.json also includes next_step_guidance.command_templates, so the next operator or agent pass can continue from machine-readable commands instead of re-inferring the workflow from prose.
Use python scripts/stage_agent_loop.py status --manifest docs/orchestration/<stage_loop>.json as the cheap read-only checkpoint before continuing a stage. It prints the current next action, closing gate, latest GUI run, latest repair coder status, latest repair validation status, and cold-start continuation artifacts such as domain_pack_loop.command.txt without modifying artifacts.
Use python scripts/stage_agent_loop.py continue --manifest docs/orchestration/<stage_loop>.json as the safe one-command continuation layer. From a cold start it materializes domain_pack_loop.command.txt without launching the long live loop. After a GUI review it can prepare a repair iteration and materialize run-repair --dry-run automatically. It will not run the real coder pass unless --execute-repair is passed, and it waits for a --run-id assistant-stage1-<id> when the next required step is post-repair rerun/ingest validation.
It also writes stage_repair_handoff.md/json next to the stage summary. That handoff is the preferred input for the next coder pass: it lists primary repair targets and sample user-facing failures without forcing the coder to reread the entire GUI conversation first.
For live stage-pack failures, prefer lead_coder_handoff.md over immediately preparing a coder pass. The intent is: strong business audit first, Lead Codex code repair second, same replay/GUI validation third.
To prepare the next repair iteration from that handoff, run:
python scripts/stage_agent_loop.py prepare-repair --manifest docs/orchestration/<stage_loop>.json
This writes repair_iterations/<iteration_id>/repair_iteration_plan.json, repair_prompt.md, and repair_checklist.md. The plan enriches GUI repair targets with candidate runtime files and rerun instructions, so the next coder pass can start from a bounded business defect instead of a full transcript archaeology dig.
To materialize or execute the coder command for that repair iteration, run:
python scripts/stage_agent_loop.py run-repair --manifest docs/orchestration/<stage_loop>.json --dry-run
--dry-run writes repair_coder.command.txt, records repair_execution_summary.json, updates stage_loop_summary.json, and prints the exact non-interactive Codex command without changing code. Without --dry-run, it executes the coder command with the prepared repair_prompt.md, writes repair_coder_result.json, captures stdout/stderr, records repair_execution_summary.json, and updates the stage next action to rerun/ingest, inspect, or stop for a decision depending on the coder status. After a real coder patch, rerun the same semantic pack or GUI session and ingest the new assistant-stage1-<id>.
When the coder result is patched, the next ingest-gui-run is treated as post-repair validation for that repair iteration. stage_loop_summary.json records latest_repair_validation and repair_validation_history, including the validation run id, remaining P0/P1 findings, and whether the repair was actually accepted after replay. A patch without this rerun/ingest evidence is not a closed stage.
The stage closing gate enforces that rule even when the inner pack loop reports accepted: loop_accepted_gate preserves the raw loop verdict, but stage-level accepted_gate stays false with stage_closing_gate.status = blocked_pending_repair_validation until the latest patched repair has a matching successful validation run.
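The closing-gate rule can be sketched as follows. This is a hedged illustration rather than the script's actual code: the dictionary keys status, iteration_id, and accepted on the repair and validation records are assumptions, and None stands in for the unblocked status values, which this document does not list.

```python
from typing import Optional


def stage_closing_gate(
    loop_accepted_gate: bool,
    latest_repair: Optional[dict] = None,
    latest_validation: Optional[dict] = None,
) -> dict:
    """Combine the raw loop verdict with post-repair validation evidence."""
    # Assumption: a repair record carries "status" and "iteration_id".
    patched = latest_repair is not None and latest_repair.get("status") == "patched"
    # Assumption: a validation record names the iteration it validates.
    validated = (
        patched
        and latest_validation is not None
        and latest_validation.get("iteration_id") == latest_repair.get("iteration_id")
        and latest_validation.get("accepted") is True
    )
    blocked = patched and not validated
    return {
        # The raw loop verdict is preserved as-is.
        "loop_accepted_gate": loop_accepted_gate,
        # The stage-level gate stays false while a patch lacks validation.
        "accepted_gate": loop_accepted_gate and not blocked,
        "stage_closing_gate": {
            "status": "blocked_pending_repair_validation" if blocked else None
        },
    }
```

Keeping both flags in the output preserves the distinction the text draws: the inner loop may have accepted the pack while the stage itself remains open.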
Placeholder contract
Scenario questions can reference earlier step outputs with placeholders such as:
- {{step_01_inventory.entries[0].item}}
- {{semantic_memory.active_result_set_id}}
This keeps carryover explicit and machine-readable.
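A minimal sketch of how such placeholders could be resolved against earlier step outputs is shown below. The resolver, its function names, and the shape of the context dictionary are assumptions for illustration; only the placeholder syntax comes from this document.

```python
import re

# Captures the dotted path inside {{ ... }}.
PLACEHOLDER_RE = re.compile(r"\{\{\s*([^{}]+?)\s*\}\}")
# Splits a path into attribute names and [index] parts; dots are skipped.
TOKEN_RE = re.compile(r"([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")


def lookup(path: str, context: dict):
    """Walk a path such as 'step.entries[0].item' through nested dicts and lists."""
    value = context
    for name, index in TOKEN_RE.findall(path):
        value = value[name] if name else value[int(index)]
    return value


def resolve_placeholders(question: str, context: dict) -> str:
    """Replace every {{path}} in the question with its value from the context."""
    return PLACEHOLDER_RE.sub(lambda m: str(lookup(m.group(1), context)), question)
```

In this sketch a missing key or index raises KeyError or IndexError instead of leaving the placeholder in place, so a broken carryover fails loudly before a question is sent.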
Status contract
Scenario capture uses four operational statuses:
- accepted
- partial
- blocked
- needs_exact_capability
partial means the scenario executed, but one or more steps still need route hardening, evidence hardening, or presentation hardening.
needs_exact_capability means the scenario is valid for the project, but the current contour still lacks the exact route or capability needed to answer it.
In autonomous pack-loop mode, partial and needs_exact_capability are non-terminal by default. The loop should continue domain enablement work until one of these happens:
- analyst quality reaches the configured acceptance gate, normally >= 80;
- the analyst marks requires_user_decision = true because the next step would otherwise require guessing a missing required observation, making an architecture-risky change, accepting a hacky/brittle workaround, or choosing a business-critical tradeoff without enough evidence;
- the runtime is truly blocked;
- the loop reaches max_iterations.
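The stop conditions above can be sketched as one function that returns a reason to stop, or None to keep iterating. The function name, the order in which the conditions are checked, and the reason labels requires_user_decision and max_iterations are assumptions; the default gate of 80 and the accepted and blocked statuses come from this section.

```python
from typing import Optional


def stop_reason(
    analyst_score: float,
    requires_user_decision: bool,
    runtime_blocked: bool,
    iteration: int,
    max_iterations: int,
    acceptance_gate: float = 80,
) -> Optional[str]:
    """Return why the pack loop should stop, or None to continue enablement work."""
    if analyst_score >= acceptance_gate:
        return "accepted"
    if requires_user_decision:
        return "requires_user_decision"
    if runtime_blocked:
        return "blocked"
    if iteration >= max_iterations:
        return "max_iterations"
    # partial and needs_exact_capability are non-terminal by default.
    return None
```

A None result corresponds to the non-terminal partial and needs_exact_capability cases, where the loop continues domain enablement work.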