
Domain Scenario Loop - Repo Adapter

Purpose

This repository now supports four outer-loop capture modes:

run-case for one concrete domain question;
run-scenario for a linked multi-step domain chain that should reuse one assistant session;
run-pack for a whole domain question pool grouped into several scenarios;
run-pack-loop for a strong analyst review loop over a whole domain pack, with Lead Codex repair handoff by default.

run-scenario is the preferred capture mode for domains where the user's next question depends on the previous result set. run-pack is the preferred capture mode when the user brings a full domain pool that should be kept in one aggregate backlog.

Runtime contract

The scenario runner does not introduce a new product runtime.

It reuses:

POST /api/assistant/message
GET /api/assistant/session/:session_id
current backend LLM/profile configuration
current address/deep routing inside the product
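
As a rough illustration of how a runner might address the two reused endpoints, the sketch below only builds the HTTP requests and does not send them. The base URL and the JSON field names (`message`, `session_id`) are assumptions made for the example, not the documented payload shape.

```python
import json
import urllib.request

# Assumption: the backend address is illustrative; use the real one.
BASE_URL = "http://localhost:8000"


def build_message_request(text, session_id=None):
    """Build (but do not send) a POST /api/assistant/message request."""
    # Assumption: the payload field names are hypothetical.
    payload = {"message": text}
    if session_id is not None:
        payload["session_id"] = session_id
    return urllib.request.Request(
        f"{BASE_URL}/api/assistant/message",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_session_request(session_id):
    """Build (but do not send) a GET /api/assistant/session/:session_id request."""
    return urllib.request.Request(
        f"{BASE_URL}/api/assistant/session/{session_id}",
        method="GET",
    )
```

Passing the same session id to every step of a scenario is what lets a linked chain reuse one assistant session.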

Artifact contract

Scenario artifacts live under:

artifacts/domain_runs/<scenario_id>/

Top-level artifacts:

scenario_brief.md
scenario_manifest.json
scenario_state.json
scenario_summary.md
scenario_output.md
final_status.md

Per-step artifacts:

steps/<step_id>/output.md
steps/<step_id>/debug.json
steps/<step_id>/turn.json
steps/<step_id>/session.json
steps/<step_id>/assistant_response.json
steps/<step_id>/step_state.json

Pack artifacts live under:

artifacts/domain_runs/<pack_id>/

pack_manifest.json
pack_state.json
pack_summary.md
final_status.md
scenarios/<scenario_id>/...
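
The layout above can be expressed as a small path helper. This is a minimal sketch: the function name is hypothetical, and only the directory and file names come from the contract.

```python
from pathlib import PurePosixPath

ARTIFACT_ROOT = PurePosixPath("artifacts/domain_runs")

# Per-step file names, as listed in the artifact contract.
STEP_FILES = (
    "output.md",
    "debug.json",
    "turn.json",
    "session.json",
    "assistant_response.json",
    "step_state.json",
)


def step_artifact_paths(scenario_id, step_id):
    """Return a mapping of file name -> expected path for one scenario step."""
    step_dir = ARTIFACT_ROOT / scenario_id / "steps" / step_id
    return {name: step_dir / name for name in STEP_FILES}
```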

AGENT autorun save gate

scripts/save_agent_semantic_run.py is a post-validation persistence tool, not a replay executor. The normal path is:

build/update the truth-harness spec;
run python scripts/domain_truth_harness.py run-live --spec ... --output-dir artifacts/domain_runs/<run_id>;
inspect truth_review.md, business_review.md, pack_state.json, and final_status.md;
save to GUI autoruns only with python scripts/save_agent_semantic_run.py --spec ... --validated-run-dir artifacts/domain_runs/<run_id>.

The save gate requires:

pack_state.final_status = accepted;
pack_state.acceptance_gate_passed = true;
truth_review.summary.overall_status = pass;
business_review.overall_business_status = pass;
zero unresolved P0 and zero business-answer failures.
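
The gate conditions can be read as one conjunctive check. The sketch below is not the code of save_agent_semantic_run.py. It assumes the three review files are already loaded as dictionaries and takes the two counters as explicit arguments, because their field names are not stated here.

```python
def save_gate_failures(pack_state, truth_review, business_review,
                       unresolved_p0, business_answer_failures):
    """Return the list of unmet save-gate conditions (empty means the gate passes)."""
    failures = []
    if pack_state.get("final_status") != "accepted":
        failures.append("pack_state.final_status")
    if pack_state.get("acceptance_gate_passed") is not True:
        failures.append("pack_state.acceptance_gate_passed")
    if truth_review.get("summary", {}).get("overall_status") != "pass":
        failures.append("truth_review.summary.overall_status")
    if business_review.get("overall_business_status") != "pass":
        failures.append("business_review.overall_business_status")
    # Assumption: the counters are supplied by the caller; where they are
    # stored in the artifacts is not specified in this document.
    if unresolved_p0 != 0:
        failures.append("unresolved P0")
    if business_answer_failures != 0:
        failures.append("business-answer failures")
    return failures
```

Returning the failed conditions rather than a bare boolean makes it easier to see why a pack was rejected.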

If a pack must be saved as a deliberate manual draft before live acceptance, use --allow-unvalidated --unvalidated-reason "<why this is intentionally not accepted>". That path is explicitly marked as unvalidated and must not be treated as semantic proof.

Stage-level AGENT loop

scripts/stage_agent_loop.py wraps the domain pack loop into the development-stage workflow:

take the current global/local stage manifest;
run scripts/domain_case_loop.py run-pack-loop for that stage pack;
let the loop run pack replay and a business-first analyst verdict; if the gate is not accepted, write business_audit.md and lead_coder_handoff.md instead of launching a weak coder by default;
if accepted, persist the validated AGENT pack into GUI autoruns through scripts/save_agent_semantic_run.py --validated-run-dir;
write stage_loop_summary.json and stage_loop_handoff.md for the final human visual confirmation.

The stage manifest schema is docs/orchestration/schemas/stage_agent_loop_manifest.schema.json. The default stage gate is intentionally stricter than a narrow case gate: target_score = 88, no unresolved P0/P1 repair targets, an accepted analyst verdict, and clean business-usefulness, direct-answer, temporal-honesty, field-truth, and answer-layering flags.

Canonical commands:

python scripts/stage_agent_loop.py plan --manifest docs/orchestration/<stage_loop>.json
python scripts/stage_agent_loop.py run --manifest docs/orchestration/<stage_loop>.json
python scripts/stage_agent_loop.py review-questions --manifest docs/orchestration/<stage_loop>.json
python scripts/stage_agent_loop.py ingest-gui-run --manifest docs/orchestration/<stage_loop>.json --run-id assistant-stage1-<id>
python scripts/stage_agent_loop.py prepare-repair --manifest docs/orchestration/<stage_loop>.json
python scripts/stage_agent_loop.py run-repair --manifest docs/orchestration/<stage_loop>.json --dry-run
python scripts/stage_agent_loop.py status --manifest docs/orchestration/<stage_loop>.json
python scripts/stage_agent_loop.py continue --manifest docs/orchestration/<stage_loop>.json
python scripts/stage_agent_loop.py summarize --manifest docs/orchestration/<stage_loop>.json

This is the intended path for "implement the stage, generate/check stage questions, analyze business answers, patch code, rerun, then ask the user for final visual confirmation".

The default repair mode is lead-handoff. In this mode the expensive replay still runs live and the independent analyst still produces the strict business verdict, but code repair stays with the main Lead Codex context. The loop stops with next_action = lead_coder_repair_required, plus:

business_audit.md for the user-facing semantic/business verdict;
lead_coder_handoff.md/json for the concrete repair target, candidate files, and validation path;
stage_context_capsule.md/json for the current stage contract, question quality, loop status, and operating model.

auto-coder remains available only as an explicit opt-in experiment:

python scripts/stage_agent_loop.py run --manifest docs/orchestration/<stage_loop>.json --repair-mode auto-coder

That path must not be treated as the normal high-trust repair mode for this project.

Before launching an expensive live replay, run review-questions. It reads the stage pack, resolves {{bindings.*}} placeholders, checks scenario/follow-up density, direct-answer shape declarations, domain coverage, stale-scope canaries, dependency order, duplicates, mojibake in generated Russian questions, and estimated Windows artifact path length. It writes:

question_generation_review.json;
question_generation_review.md.

A strong question review is not semantic proof that the assistant answers correctly. It is the pre-flight gate that says the generated questions are worth spending a live replay on.

GUI run review bridge

When a manual or GUI autorun already exists, scripts/review_assistant_stage1_run.py turns the run id into the same machine-readable review surface.

Canonical command:

python scripts/review_assistant_stage1_run.py assistant-stage1-<id> --print-summary

The script resolves:

llm_normalizer/reports/assistant-stage1-<id>.md;
llm_normalizer/data/assistant_sessions/assistant-stage1-<id>-*.json.

It writes:

artifacts/domain_runs/gui_run_reviews/assistant-stage1-<id>/run_review.json;
artifacts/domain_runs/gui_run_reviews/assistant-stage1-<id>/run_review.md;
conversation_pairs.json;
question_quality_review.json;
repair_targets.json.

This bridge is intentionally business-first:

the user's question and visible assistant answer are reviewed before route ids and debug fields;
noisy direct answers, missing first-line answers, technical garbage, and over-broad business answers become findings;
generated question packs get a deterministic quality review for follow-up density, direct questions, report-style analysis, domain diversity, duplicates, and weak business anchors.

Use this bridge when the operator would otherwise say "check run assistant-stage1-...". The expected next step is no longer manual eyeballing first; it is: review by id, inspect run_review.md, map repair_targets.json into the current stage loop, patch, and rerun.

For stage work, prefer the integrated command:

python scripts/stage_agent_loop.py ingest-gui-run --manifest docs/orchestration/<stage_loop>.json --run-id assistant-stage1-<id>

It stores the GUI review under artifacts/domain_runs/stage_agent_loops/<stage_id>/gui_run_reviews/<run_id>/, updates stage_loop_summary.json, and writes the next stage action:

continue_repair_from_gui_review_p0 when the GUI run exposes business-wrong or missing direct-answer defects;
continue_repair_from_gui_review_p1 when the run is semantically usable but still noisy, over-broad, or poorly layered;
manual_gui_confirmation_or_stage_close when the GUI run is clean enough for final human confirmation.
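
One way to picture the three outcomes is a priority mapping from finding counts to the next action. This is a simplified sketch: the function name is hypothetical, and reducing the review to P0/P1 counts is an assumption, since the real classification happens inside the review.

```python
def next_stage_action(p0_findings, p1_findings):
    """Map GUI review finding counts to the next stage action."""
    # P0: business-wrong or missing direct-answer defects take priority.
    if p0_findings > 0:
        return "continue_repair_from_gui_review_p0"
    # P1: usable, but still noisy, over-broad, or poorly layered.
    if p1_findings > 0:
        return "continue_repair_from_gui_review_p1"
    return "manual_gui_confirmation_or_stage_close"
```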

stage_loop_summary.json also includes next_step_guidance.command_templates, so the next operator or agent pass can continue from machine-readable commands instead of re-inferring the workflow from prose.

Use python scripts/stage_agent_loop.py status --manifest docs/orchestration/<stage_loop>.json as the cheap read-only checkpoint before continuing a stage. It prints the current next action, closing gate, latest GUI run, latest repair coder status, latest repair validation status, and cold-start continuation artifacts such as domain_pack_loop.command.txt without modifying artifacts.

Use python scripts/stage_agent_loop.py continue --manifest docs/orchestration/<stage_loop>.json as the safe one-command continuation layer. From a cold start it materializes domain_pack_loop.command.txt without launching the long live loop. After a GUI review it can prepare a repair iteration and materialize run-repair --dry-run automatically. It will not run the real coder pass unless --execute-repair is passed, and it waits for a --run-id assistant-stage1-<id> when the next required step is post-repair rerun/ingest validation.

It also writes stage_repair_handoff.md/json next to the stage summary. That handoff is the preferred input for the next coder pass: it lists primary repair targets and sample user-facing failures without forcing the coder to reread the entire GUI conversation first.

For live stage-pack failures, prefer lead_coder_handoff.md over immediately preparing a coder pass. The intent is: strong business audit first, Lead Codex code repair second, same replay/GUI validation third.

To prepare the next repair iteration from that handoff, run:

python scripts/stage_agent_loop.py prepare-repair --manifest docs/orchestration/<stage_loop>.json

This writes repair_iterations/<iteration_id>/repair_iteration_plan.json, repair_prompt.md, and repair_checklist.md. The plan enriches GUI repair targets with candidate runtime files and rerun instructions, so the next coder pass can start from a bounded business defect instead of a full transcript archaeology dig.

To materialize or execute the coder command for that repair iteration, run:

python scripts/stage_agent_loop.py run-repair --manifest docs/orchestration/<stage_loop>.json --dry-run

--dry-run writes repair_coder.command.txt, records repair_execution_summary.json, updates stage_loop_summary.json, and prints the exact non-interactive Codex command without changing code. Without --dry-run, it executes the coder command with the prepared repair_prompt.md, writes repair_coder_result.json, captures stdout/stderr, records repair_execution_summary.json, and updates the stage next action to rerun/ingest, inspect, or stop for a decision depending on the coder status. After a real coder patch, rerun the same semantic pack or GUI session and ingest the new assistant-stage1-<id>.

When the coder result is patched, the next ingest-gui-run is treated as post-repair validation for that repair iteration. stage_loop_summary.json records latest_repair_validation and repair_validation_history, including the validation run id, remaining P0/P1 findings, and whether the repair was actually accepted after replay. A patch without this rerun/ingest evidence is not a closed stage.

The stage closing gate enforces that rule even when the inner pack loop reports accepted: loop_accepted_gate preserves the raw loop verdict, but stage-level accepted_gate stays false with stage_closing_gate.status = blocked_pending_repair_validation until the latest patched repair has a matching successful validation run.
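
The closing rule can be sketched as a small function over three booleans. The field names loop_accepted_gate, accepted_gate, and stage_closing_gate.status come from the text above. The function name, its arguments, and the "not_blocked" status value are hypothetical placeholders.

```python
def stage_closing_summary(loop_accepted, latest_repair_patched, validation_passed):
    """Derive stage-level gate fields from the raw loop verdict and repair state."""
    summary = {"loop_accepted_gate": loop_accepted}  # raw loop verdict is preserved
    if latest_repair_patched and not validation_passed:
        # A patched repair without a successful validation run blocks closing,
        # even if the inner pack loop reported accepted.
        summary["accepted_gate"] = False
        summary["stage_closing_gate"] = {"status": "blocked_pending_repair_validation"}
    else:
        summary["accepted_gate"] = loop_accepted
        # Assumption: "not_blocked" is an illustrative value, not a documented status.
        summary["stage_closing_gate"] = {"status": "not_blocked"}
    return summary
```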

Placeholder contract

Scenario questions can reference earlier step outputs with placeholders such as:

{{step_01_inventory.entries[0].item}}
{{semantic_memory.active_result_set_id}}

This keeps carryover explicit and machine-readable.
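
A minimal resolver for this placeholder syntax might look like the sketch below. It assumes earlier step outputs are available as a nested dictionary keyed by step id. The function names and the sample values in the usage note are illustrative, not the runner's actual implementation.

```python
import re

# Matches {{ dotted.path[0].with_indexes }} placeholders.
PLACEHOLDER = re.compile(r"\{\{\s*([^{}]+?)\s*\}\}")
# Splits a path into attribute names and [n] list indexes.
TOKEN = re.compile(r"([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")


def resolve_path(context, path):
    """Walk a path such as 'step_01_inventory.entries[0].item' through nested data."""
    value = context
    for name, index in TOKEN.findall(path):
        value = value[name] if name else value[int(index)]
    return value


def render_question(question, context):
    """Replace every placeholder in the question with its resolved value."""
    return PLACEHOLDER.sub(
        lambda match: str(resolve_path(context, match.group(1))), question
    )
```

For example, with a context of `{"step_01_inventory": {"entries": [{"item": "Widget A"}]}}`, the question `Show movements for {{step_01_inventory.entries[0].item}}` renders as `Show movements for Widget A`. A missing key raises `KeyError`, which surfaces a broken carryover instead of hiding it.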

Status contract

Scenario capture uses four operational statuses:

accepted
partial
blocked
needs_exact_capability

partial means the scenario executed, but one or more steps still need route hardening, evidence hardening, or presentation hardening. needs_exact_capability means the scenario is valid for the project, but the current contour still lacks the exact route or capability needed to answer it.

In autonomous pack-loop mode, partial and needs_exact_capability are non-terminal by default. The loop should continue domain enablement work until one of these happens:

analyst quality reaches the configured acceptance gate, normally >= 80;
the analyst marks requires_user_decision = true because the next step would otherwise require guessing a missing required observation, making an architecture-risky change, accepting a hacky/brittle workaround, or choosing a business-critical tradeoff without enough evidence;
the runtime is truly blocked;
the loop reaches max_iterations.
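
The four stop conditions can be summarized as a single check evaluated after each iteration. This is a sketch under stated assumptions: the function name, its arguments, and the returned reason strings are hypothetical. Only requires_user_decision, max_iterations, and the usual threshold of 80 come from the text.

```python
def stop_reason(score, requires_user_decision, runtime_blocked,
                iteration, max_iterations, acceptance_gate=80):
    """Return why the pack loop should stop, or None if it should continue."""
    if score >= acceptance_gate:
        return "accepted"
    if requires_user_decision:
        return "requires_user_decision"
    if runtime_blocked:
        return "blocked"
    if iteration >= max_iterations:
        return "max_iterations"
    # partial and needs_exact_capability are non-terminal: keep iterating.
    return None
```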
