Domain Scenario Loop - Repo Adapter

Purpose

This repository now supports four outer-loop capture modes:

run-case for one concrete domain question;
run-scenario for a linked multi-step domain chain that should reuse one assistant session;
run-pack for a whole domain question pool grouped into several scenarios;
run-pack-loop for an autonomous analyst/coder loop over a whole domain pack.

run-scenario is the preferred capture mode for domains where the user's next question depends on the previous result set. run-pack is the preferred capture mode when the user brings a full domain pool that should be kept in one aggregate backlog.

Runtime contract

The scenario runner does not introduce a new product runtime.

It reuses:

POST /api/assistant/message
GET /api/assistant/session/:session_id
current backend LLM/profile configuration
current address/deep routing inside the product
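
A minimal sketch of how a runner could address the two reused endpoints, using only the Python standard library. The base URL and the payload passed to the message endpoint are assumptions for illustration; this document does not specify the host, the request body, or the response shape. The requests are only built here, not sent.

```python
import json
from urllib.parse import quote
from urllib.request import Request

# Assumption: the backend address is not stated in this document.
BASE_URL = "http://localhost:8000"


def build_message_request(payload: dict) -> Request:
    """Build (but do not send) a POST /api/assistant/message request.

    The payload keys are up to the caller; their real names are not
    documented here.
    """
    body = json.dumps(payload).encode("utf-8")
    return Request(
        f"{BASE_URL}/api/assistant/message",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_session_request(session_id: str) -> Request:
    """Build (but do not send) a GET /api/assistant/session/:session_id request."""
    safe_id = quote(session_id, safe="")  # escape characters unsafe in a path segment
    return Request(f"{BASE_URL}/api/assistant/session/{safe_id}", method="GET")
```

Either request can then be passed to urllib.request.urlopen, or the same paths can be used with whichever HTTP client the repository already depends on.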

Artifact contract

Scenario artifacts live under:

artifacts/domain_runs/<scenario_id>/

Top-level artifacts:

scenario_brief.md
scenario_manifest.json
scenario_state.json
scenario_summary.md
scenario_output.md
final_status.md

Per-step artifacts:

steps/<step_id>/output.md
steps/<step_id>/debug.json
steps/<step_id>/turn.json
steps/<step_id>/session.json
steps/<step_id>/assistant_response.json
steps/<step_id>/step_state.json
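
The per-step layout above can be expressed as a small path helper. This is a sketch that assumes per-step artifacts sit inside the scenario directory; the helper names are hypothetical and are not taken from the repository's scripts.

```python
from pathlib import Path
from typing import Dict, List

# File names copied from the per-step artifact list above.
STEP_ARTIFACTS = (
    "output.md",
    "debug.json",
    "turn.json",
    "session.json",
    "assistant_response.json",
    "step_state.json",
)

DEFAULT_ROOT = Path("artifacts/domain_runs")


def step_artifact_paths(
    scenario_id: str, step_id: str, root: Path = DEFAULT_ROOT
) -> Dict[str, Path]:
    """Map each per-step artifact name to its expected path."""
    step_dir = root / scenario_id / "steps" / step_id
    return {name: step_dir / name for name in STEP_ARTIFACTS}


def missing_step_artifacts(
    scenario_id: str, step_id: str, root: Path = DEFAULT_ROOT
) -> List[str]:
    """Return the artifact names that do not exist on disk yet."""
    paths = step_artifact_paths(scenario_id, step_id, root)
    return [name for name, path in paths.items() if not path.is_file()]
```

A check like missing_step_artifacts is one way to notice an incomplete step before reviewing a scenario.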

Pack artifacts live under:

artifacts/domain_runs/<pack_id>/

pack_manifest.json
pack_state.json
pack_summary.md
final_status.md
scenarios/<scenario_id>/...

AGENT autorun save gate

scripts/save_agent_semantic_run.py is a post-validation persistence tool, not a replay executor. The normal path is:

build/update the truth-harness spec;
run python scripts/domain_truth_harness.py run-live --spec ... --output-dir artifacts/domain_runs/<run_id>;
inspect truth_review.md, business_review.md, pack_state.json, and final_status.md;
save to GUI autoruns only with python scripts/save_agent_semantic_run.py --spec ... --validated-run-dir artifacts/domain_runs/<run_id>.

The save gate requires:

pack_state.final_status = accepted;
pack_state.acceptance_gate_passed = true;
truth_review.summary.overall_status = pass;
business_review.overall_business_status = pass;
zero unresolved P0 and zero business-answer failures.
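
The gate conditions above can be read as one predicate over the review files. The following is a sketch, not the logic of scripts/save_agent_semantic_run.py: the dictionary keys follow the dotted names in the list, while the two counters are passed in as plain integers because this document does not say where those counts are stored.

```python
from typing import Any, List, Mapping


def save_gate_failures(
    pack_state: Mapping[str, Any],
    truth_review: Mapping[str, Any],
    business_review: Mapping[str, Any],
    unresolved_p0: int,             # assumption: counted by the caller
    business_answer_failures: int,  # assumption: counted by the caller
) -> List[str]:
    """Return the unmet save-gate conditions; an empty list means the gate passes."""
    failures = []
    if pack_state.get("final_status") != "accepted":
        failures.append("pack_state.final_status is not accepted")
    if pack_state.get("acceptance_gate_passed") is not True:
        failures.append("pack_state.acceptance_gate_passed is not true")
    if truth_review.get("summary", {}).get("overall_status") != "pass":
        failures.append("truth_review.summary.overall_status is not pass")
    if business_review.get("overall_business_status") != "pass":
        failures.append("business_review.overall_business_status is not pass")
    if unresolved_p0 != 0:
        failures.append("unresolved P0 items remain")
    if business_answer_failures != 0:
        failures.append("business-answer failures remain")
    return failures
```

Returning the list of unmet conditions rather than a bare boolean makes it easier to report why a pack was refused.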

If a pack must be saved as a deliberate manual draft before live acceptance, use --allow-unvalidated --unvalidated-reason "<why this is intentionally not accepted>". That path is explicitly marked as unvalidated and must not be treated as semantic proof.

Stage-level AGENT loop

scripts/stage_agent_loop.py wraps the domain pack loop into the development-stage workflow:

take the current global/local stage manifest;
run scripts/domain_case_loop.py run-pack-loop for that stage pack;
let the loop iterate through pack replay, business-first analyst verdict, coder patch, and rerun until the objective gate is accepted, blocked, or a real user decision is required;
if accepted, persist the validated AGENT pack into GUI autoruns through scripts/save_agent_semantic_run.py --validated-run-dir;
write stage_loop_summary.json and stage_loop_handoff.md for the final human visual confirmation.

The stage manifest schema is docs/orchestration/schemas/stage_agent_loop_manifest.schema.json. The default stage gate is intentionally stricter than a narrow case gate: target_score = 88, no unresolved P0/P1 repair targets, an accepted analyst verdict, and clean flags for business usefulness, direct answer, temporal honesty, field truth, and answer layering.

Canonical commands:

python scripts/stage_agent_loop.py plan --manifest docs/orchestration/<stage_loop>.json
python scripts/stage_agent_loop.py run --manifest docs/orchestration/<stage_loop>.json
python scripts/stage_agent_loop.py ingest-gui-run --manifest docs/orchestration/<stage_loop>.json --run-id assistant-stage1-<id>
python scripts/stage_agent_loop.py prepare-repair --manifest docs/orchestration/<stage_loop>.json
python scripts/stage_agent_loop.py summarize --manifest docs/orchestration/<stage_loop>.json

This is the intended path for “implement the stage, generate/check stage questions, analyze business answers, patch code, rerun, then ask the user for final visual confirmation”.

GUI run review bridge

When a manual or GUI autorun already exists, scripts/review_assistant_stage1_run.py turns the run id into the same machine-readable review surface.

Canonical command:

python scripts/review_assistant_stage1_run.py assistant-stage1-<id> --print-summary

The script resolves:

llm_normalizer/reports/assistant-stage1-<id>.md;
llm_normalizer/data/assistant_sessions/assistant-stage1-<id>-*.json.

It writes:

artifacts/domain_runs/gui_run_reviews/assistant-stage1-<id>/run_review.json;
artifacts/domain_runs/gui_run_reviews/assistant-stage1-<id>/run_review.md;
conversation_pairs.json;
question_quality_review.json;
repair_targets.json.

This bridge is intentionally business-first:

the user's question and visible assistant answer are reviewed before route ids and debug fields;
noisy direct answers, missing first-line answers, technical garbage, and over-broad business answers become findings;
generated question packs get a deterministic quality review for follow-up density, direct questions, report-style analysis, domain diversity, duplicates, and weak business anchors.

Use this bridge when the operator would otherwise say “check the run assistant-stage1-...”. The expected first step is no longer manual eyeballing. Instead: review by id, inspect run_review.md, map repair_targets.json into the current stage loop, patch, and rerun.

For stage work, prefer the integrated command:

python scripts/stage_agent_loop.py ingest-gui-run --manifest docs/orchestration/<stage_loop>.json --run-id assistant-stage1-<id>

It stores the GUI review under artifacts/domain_runs/stage_agent_loops/<stage_id>/gui_run_reviews/<run_id>/, updates stage_loop_summary.json, and writes the next stage action:

continue_repair_from_gui_review_p0 when the GUI run exposes business-wrong or missing direct-answer defects;
continue_repair_from_gui_review_p1 when the run is semantically usable but still noisy, over-broad, or poorly layered;
manual_gui_confirmation_or_stage_close when the GUI run is clean enough for final human confirmation.
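
The three outcomes above amount to a priority rule. The sketch below assumes the GUI review findings have already been counted by severity, with P0 covering business-wrong or missing direct-answer defects and P1 covering noisy, over-broad, or poorly layered answers; the function name and its inputs are hypothetical.

```python
def next_stage_action(p0_findings: int, p1_findings: int) -> str:
    """Pick the next stage action from GUI review finding counts."""
    if p0_findings > 0:
        # Business-wrong or missing direct-answer defects take priority.
        return "continue_repair_from_gui_review_p0"
    if p1_findings > 0:
        # Usable, but still noisy, over-broad, or poorly layered.
        return "continue_repair_from_gui_review_p1"
    # Clean enough for final human confirmation.
    return "manual_gui_confirmation_or_stage_close"
```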

It also writes stage_repair_handoff.md/json next to the stage summary. That handoff is the preferred input for the next coder pass: it lists primary repair targets and sample user-facing failures without forcing the coder to reread the entire GUI conversation first.

To prepare the next repair iteration from that handoff, run:

python scripts/stage_agent_loop.py prepare-repair --manifest docs/orchestration/<stage_loop>.json

This writes repair_iterations/<iteration_id>/repair_iteration_plan.json, repair_prompt.md, and repair_checklist.md. The plan enriches GUI repair targets with candidate runtime files and rerun instructions, so the next coder pass can start from a bounded business defect instead of a full transcript archaeology dig.

Placeholder contract

Scenario questions can reference earlier step outputs with placeholders such as:

{{step_01_inventory.entries[0].item}}
{{semantic_memory.active_result_set_id}}
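
A minimal sketch of how such placeholders could be resolved against earlier step outputs. It assumes the carryover context is a nested structure of dicts and lists keyed by the first path segment; the actual resolver in the repository may differ, and the function names and sample data here are hypothetical.

```python
import re
from typing import Any, Mapping

PLACEHOLDER = re.compile(r"\{\{\s*([^{}]+?)\s*\}\}")
VALID_PATH = re.compile(r"[A-Za-z_]\w*(?:\.[A-Za-z_]\w*|\[\d+\])*")
TOKEN = re.compile(r"([A-Za-z_]\w*)|\[(\d+)\]")


def lookup(context: Mapping[str, Any], path: str) -> Any:
    """Follow a dotted path with optional [index] parts through the context."""
    if not VALID_PATH.fullmatch(path):
        raise ValueError(f"unsupported placeholder path: {path!r}")
    value: Any = context
    for match in TOKEN.finditer(path):
        name, index = match.group(1), match.group(2)
        # A name token is a mapping key; an [n] token is a list index.
        value = value[name] if name is not None else value[int(index)]
    return value


def render_question(template: str, context: Mapping[str, Any]) -> str:
    """Replace every {{...}} placeholder with its value from the context."""
    return PLACEHOLDER.sub(lambda m: str(lookup(context, m.group(1))), template)


# Hypothetical carryover data, for illustration only.
context = {
    "step_01_inventory": {"entries": [{"item": "bolt M8"}]},
    "semantic_memory": {"active_result_set_id": "rs-7"},
}
question = render_question(
    "Show movements for {{step_01_inventory.entries[0].item}}", context
)
```

In this sketch a missing key or index raises KeyError or IndexError instead of leaving the placeholder in the question, so a broken carryover fails loudly.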

This keeps carryover explicit and machine-readable.

Status contract

Scenario capture uses four operational statuses:

accepted
partial
blocked
needs_exact_capability

partial means the scenario executed, but one or more steps still need route hardening, evidence hardening, or presentation hardening. needs_exact_capability means the scenario is valid for the project, but the current contour still lacks the exact route or capability needed to answer it.

In autonomous pack-loop mode, partial and needs_exact_capability are non-terminal by default. The loop should continue domain enablement work until one of these happens:

analyst quality reaches the configured acceptance gate, normally >= 80;
the analyst marks requires_user_decision = true because the next step would otherwise require guessing a missing required observation, making an architecture-risky change, accepting a hacky/brittle workaround, or choosing a business-critical tradeoff without enough evidence;
the runtime is truly blocked;
the loop reaches max_iterations.
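
The stop conditions above can be sketched as a single check run after each iteration. Only requires_user_decision, max_iterations, and the status names come from this document; the function name, the other parameter names, and the order in which the conditions are tested are assumptions.

```python
from typing import Optional


def stop_reason(
    status: str,
    analyst_score: float,
    requires_user_decision: bool,
    iteration: int,
    max_iterations: int,
    acceptance_gate: float = 80.0,  # "normally >= 80" per the list above
) -> Optional[str]:
    """Return why the pack loop should stop, or None to keep iterating."""
    if analyst_score >= acceptance_gate:
        return "accepted"
    if requires_user_decision:
        return "requires_user_decision"
    if status == "blocked":
        return "blocked"
    if iteration >= max_iterations:
        return "max_iterations"
    # partial and needs_exact_capability are non-terminal by default.
    return None
```

For example, stop_reason("partial", 72, False, 2, 6) returns None, so the loop would continue with another analyst/coder pass.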
