Документация: поднять semantic robustness в высокий приоритет readiness перед расширением доменов
This commit is contained in:
parent
82cc5cfd1a
commit
1b9dbebe47
|
|
@ -8,6 +8,10 @@ The goal is not to patch individual failing prompts.
|
|||
|
||||
The goal is to finish the missing runtime authority that should govern mixed live sessions after the turnaround `11` owner extractions.
|
||||
|
||||
From the current stage onward, one more rule is explicit:
|
||||
|
||||
- do not treat "route passed" as sufficient acceptance if a human user would still feel that the assistant did not understand the question or answered the wrong business intent.
|
||||
|
||||
## Current Reading
|
||||
|
||||
The strongest current evidence is:
|
||||
|
|
@ -37,6 +41,17 @@ That object should become the shared authority for:
|
|||
- active answer object / reusable bundle
|
||||
- recap source of truth
|
||||
|
||||
This plan now also treats one product-facing invariant as architecture-critical:
|
||||
|
||||
- `semantic_robustness_for_live_user_wording_v1`
|
||||
|
||||
That means the repaired runtime must stay meaningful for:
|
||||
|
||||
- live colloquial wording;
|
||||
- short follow-up wording;
|
||||
- typo/noisy wording that still clearly points to an already-supported business question;
|
||||
- human-readable answers that match the user's asked intent rather than merely matching an internal route family.
|
||||
|
||||
## Target Runtime Rule
|
||||
|
||||
Before any of the following decisions are made:
|
||||
|
|
@ -107,6 +122,20 @@ Exit condition:
|
|||
- the core mixed replay is green on direct answer, selected-object continuity, same-date carryover, recap truthfulness, and technical cleanliness;
|
||||
- no unresolved `P0` remains on the primary user path.
|
||||
|
||||
### Pass E. Raise semantic robustness to first-class acceptance
|
||||
|
||||
Scope:
|
||||
|
||||
- evaluate repaired contours through the human user's intent and the human-readable answer before reading techchat;
|
||||
- treat typo tolerance, short follow-up retarget, and intent-faithful answer selection as architecture work rather than optional UX polish;
|
||||
- stop accepting a replay as "good enough" when the route is technically valid but the human answer still feels glitchy, stale, or semantically off-target.
|
||||
|
||||
Exit condition:
|
||||
|
||||
- already-supported business questions no longer fail only because of light live-user noise such as obvious typos or compressed colloquial wording;
|
||||
- short follow-up turns can switch entity and intent without silently replaying the previous answer;
|
||||
- new-user perception is no longer "the system glitches", but "the system understood what I meant".
|
||||
|
||||
## Anti-Goals
|
||||
|
||||
This stabilization pass is not:
|
||||
|
|
@ -121,8 +150,9 @@ This stabilization pass is not:
|
|||
1. Finish the continuity snapshot and wire it into the hot route / recap path.
|
||||
2. Rework clarification precedence so it becomes a last meaningful step.
|
||||
3. Harden recap and boundary presentation against ungrounded and technical output.
|
||||
4. Rerun the mixed AGENT replay until the critical continuity edges are green.
|
||||
5. Only then continue deeper intent extraction and wider domain expansion.
|
||||
4. Raise semantic robustness on repaired contours until live wording no longer feels broken to a human user.
|
||||
5. Rerun the mixed AGENT replay until the critical continuity edges are green and semantically clean.
|
||||
6. Only then continue deeper intent extraction and wider domain expansion.
|
||||
|
||||
## Current Pass Status
|
||||
|
||||
|
|
@ -526,6 +556,12 @@ Latest phase17 clarification-resume evidence after the current replay hardening:
|
|||
- this matters because selected-item continuity, purchase-date VAT bridge, and navigation-driven focus-object hints now read one shared navigation-state interpretation instead of three local parsers inside the transition hot path;
|
||||
- targeted `assistantContinuityPolicy` and `assistantTransitionPolicy` suites are green after the move (`37/37`), backend build is green, and graphify was rebuilt on the updated codebase;
|
||||
- phase15 replay has also been made time-stable for the current snapshot step, and live replay `address_truth_harness_phase15_answer_inspection_followup_live_20260419_rerun11` is accepted `9/9`, which is the semantic proof that the selected-item / answer-inspection / VAT bridge contour still survives after the focus-authority convergence pass.
|
||||
- the next continuity-authority pass now centralizes organization clarification continuation state itself instead of rebuilding it separately inside route and transition:
|
||||
- `assistantContinuityPolicy` now owns `resolveOrganizationClarificationContinuation(...)`, which resolves clarification candidates, explicit company selection from the current message set, fallback selection from scope authority, and the final `hasContinuation` verdict through one shared helper;
|
||||
- both `assistantTransitionPolicy` and `assistantRoutePolicy` now consume that helper instead of maintaining separate local chains for `explicitOrganizationClarificationSelection -> organizationClarificationSelection -> hasOrganizationClarificationContinuation`;
|
||||
- this matters because clarification-resume behavior is one of the highest-risk seams for future domain expansion: if route and transition diverge here, the same company choice can resume one contour but be treated as a standalone scope switch by another owner;
|
||||
- targeted `assistantContinuityPolicy`, `assistantTransitionPolicy`, and `assistantRoutePolicy` suites are green after the move (`53/53`), backend build is green, and graphify was rebuilt on the updated codebase;
|
||||
- live replay `address_truth_harness_phase17_clarification_resume_and_counterparty_tail_live_20260419_rerun6` is accepted `10/10`, which is the semantic proof that clarification-resume and short counterparty-tail retarget still survive after moving clarification continuation into the shared continuity layer.
|
||||
|
||||
## Next Execution Slice (2026-04-19)
|
||||
|
||||
|
|
|
|||
|
|
@ -28,6 +28,11 @@ Current confidence snapshot:
|
|||
- pre-multidomain readiness: `~78%`
|
||||
- latest graph snapshot: `5372 nodes`, `11525 edges`, `135 communities`
|
||||
|
||||
One more explicit reading is now required:
|
||||
|
||||
- pre-multidomain readiness must be judged by semantic user trust as well as by route truth;
|
||||
- if a new user would still conclude "the system glitches and does not understand me" on already-supported contours, readiness is overstated even when replay metrics look green.
|
||||
|
||||
## What Is Already True
|
||||
|
||||
The following claims are now supported by code plus live replay evidence:
|
||||
|
|
@ -112,6 +117,23 @@ It is not yet broad enough to justify saying:
|
|||
|
||||
- "the orchestration layer is already platform-grade for many new domains at once."
|
||||
|
||||
### 5. Semantic robustness is still below the target product threshold
|
||||
|
||||
The next readiness blocker is not only structural continuity.
|
||||
|
||||
It is also whether the assistant feels alive and meaning-aware to a new human user.
|
||||
|
||||
Current risk pattern:
|
||||
|
||||
- some already-supported questions still fail under light typo/noise;
|
||||
- short follow-up wording can still lose the intended entity or intent switch;
|
||||
- a technically routed answer can still feel wrong because it answers the previous context instead of the user's actual meaning.
|
||||
|
||||
This matters architecturally because, at this stage of the project:
|
||||
|
||||
- semantic robustness is not "nice UX polish";
|
||||
- it is part of the runtime contract that decides whether the assistant is trustworthy enough for domain expansion.
|
||||
|
||||
## Readiness Assessment
|
||||
|
||||
### Safe right now
|
||||
|
|
@ -136,6 +158,7 @@ The system should not be considered ready for the next level until all of the fo
|
|||
3. Capability/meta handling is less service-special-case and more contract-driven.
|
||||
4. `assistantService` pressure is reduced enough that new domains do not have to negotiate multiple partially overlapping owners.
|
||||
5. Long exact answers feel human and business-first, not merely technically correct.
|
||||
6. Repaired contours stay semantically robust under live user wording, including short follow-ups, noisy/typo variants, and intent-switch questions that should still be understood.
|
||||
|
||||
## Recommended Next Execution Sequence
|
||||
|
||||
|
|
@ -145,6 +168,10 @@ The current planning assumption is:
|
|||
- the project is currently around `78%`;
|
||||
- therefore the remaining gap should be treated as roughly four large iterations rather than as a handful of cosmetic fixes.
|
||||
|
||||
The following priority rule now applies across all four iterations:
|
||||
|
||||
- semantic robustness of human questions and human-readable answers is a high-priority acceptance dimension, not a post-architecture polish lane.
|
||||
|
||||
### Iteration 1 / Pass 18. Continuity authority completion
|
||||
|
||||
Goal:
|
||||
|
|
@ -173,6 +200,11 @@ Expected readiness gain:
|
|||
|
||||
- `~+4%` toward pre-multidomain readiness.
|
||||
|
||||
High-priority emphasis inside this iteration:
|
||||
|
||||
- replay review must start from semantic human analysis of what the user asked and what the model answered;
|
||||
- green route ids are not enough if the assistant still feels semantically wrong or glitchy on the replay.
|
||||
|
||||
### Iteration 3 / Pass 20. Coordinator pressure reduction
|
||||
|
||||
Goal:
|
||||
|
|
@ -201,6 +233,10 @@ Expected readiness gain:
|
|||
|
||||
- `~+1-2%` toward pre-multidomain readiness.
|
||||
|
||||
High-priority emphasis inside this iteration:
|
||||
|
||||
- close the remaining "already supported but semantically broken to a human" gaps before calling the system ready for domain expansion.
|
||||
|
||||
Secondary cleanup after those four iterations:
|
||||
|
||||
- human answer shaping cleanup where it materially affects acceptance;
|
||||
|
|
|
|||
|
|
@ -53,13 +53,15 @@ Current honest status:
|
|||
- the validated hot paths are no longer structurally broken;
|
||||
- flagship continuity collapse is no longer the primary risk;
|
||||
- the main remaining risk is no longer clarification-resume collapse, but the unfinished final convergence toward one true runtime authority plus replay breadth still below the intended multi-domain blast radius;
|
||||
- product shaping is now secondary debt, not the primary blocker.
|
||||
- pure wording polish is now secondary debt, but semantic robustness of user-facing answers is now a first-class blocker;
|
||||
- the practical product risk is no longer "the route collapsed", but "a new user can still feel that the assistant is glitchy, misses intent, or answers the wrong thing on short live wording".
|
||||
- main remaining architectural pressure:
|
||||
- no single fully authoritative continuity contract consumed by every hot runtime owner
|
||||
- residual coordinator/legacy pressure inside `assistantService.ts`
|
||||
- central domain-intent pressure inside `resolveAddressIntent()`
|
||||
- replay breadth still narrower than the intended multi-domain rollout surface beyond the flagship and late-switch families
|
||||
- remaining answer-semantics pressure inside `composeStage.ts` / `answerComposer.ts`
|
||||
- insufficient semantic robustness on live user wording, especially short follow-up retarget, typo tolerance, and intent-faithful human answers
|
||||
|
||||
Latest live proof now includes:
|
||||
|
||||
|
|
@ -75,6 +77,7 @@ Current architectural reading:
|
|||
- it is now safe for continued architecture hardening and controlled domain-by-domain enablement under replay gates;
|
||||
- it is now materially closer to pre-multidomain stability, but still not safe to declare broad low-risk multi-domain expansion.
|
||||
- the practical next target is now `90%+ pre-multidomain readiness`, and the remaining gap should be treated as four large architecture iterations rather than as cosmetic cleanup.
|
||||
- from this point onward, readiness must be judged not only by route truth and replay pass rate, but also by whether a new human user would feel that the assistant understands the intent and responds meaningfully in live wording.
|
||||
|
||||
For the detailed audit, current percentages, and remaining debt, read:
|
||||
|
||||
|
|
@ -144,3 +147,4 @@ The biggest remaining blockers are:
|
|||
- residual `assistantService` overload;
|
||||
- central intent pressure in `resolveAddressIntent()`;
|
||||
- remaining answer-semantics pressure in `composeStage.ts` and `answerComposer.ts`.
|
||||
- semantic robustness gaps where already-supported questions can still look broken to a human user because of typo sensitivity, short follow-up retarget loss, or human-answer mismatch.
|
||||
|
|
|
|||
Loading…
Reference in New Issue