02 — Methodology
preatorlabs' scientific workflow, step by step, from inputs to verdict.
Overview
Raw prompt
│
▼
[1] Automatic segmentation
│
▼
[2] Scenario + rule configuration (manual + auto)
│
▼
[3] Baseline generation (T=0)
│
▼
[4] Ablation loop (N × M calls)
│
▼
[5] Per-axis delta computation
│
▼
[6] Aggregation: active axes + impact + variance + activation
│
▼
[7] Classification: per-segment verdict
│
▼
[8] Visual rendering
[1] Segmentation
Goal
Split the raw prompt into coherent logical units, granular enough to allow attributing an effect, but not so granular as to produce noise (a single-word segment has no measurable effect).
V1 algorithm (heuristic)
import re

def auto_segment(text):
    # Step A: split on double line break (paragraphs)
    blocks = re.split(r"\n\s*\n+", text)
    # Step B: detect ALL-CAPS titles on their own line
    # A title becomes the start of a new segment
    refined = []
    for block in blocks:
        current = []  # lines of the segment being built
        for line in block.split("\n"):
            if is_title(line):  # uppercase, 3-80 chars
                if current:
                    refined.append(current)
                current = [line]
            else:
                current.append(line)
        if current:
            refined.append(current)
    # Step C: merge fragments that are too short with their neighbour
    # (is_title and merge_short are helpers not shown here)
    return merge_short(refined, min_chars=20)
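The heuristic relies on two helpers, is_title and merge_short, whose bodies are not given. A minimal sketch follows, under stated assumptions: a title is taken to be a line of 3 to 80 characters containing at least one letter and no lowercase letter, and a fragment that is too short is carried forward into the next segment (a short trailing fragment is appended to the previous one). Both the letter check and the merge direction are illustrative choices, not the project's confirmed behaviour.

```python
def is_title(line, min_len=3, max_len=80):
    """Heuristic: an ALL-CAPS line of 3 to 80 characters (assumed definition)."""
    stripped = line.strip()
    has_letter = any(ch.isalpha() for ch in stripped)
    return (
        min_len <= len(stripped) <= max_len
        and has_letter
        and stripped == stripped.upper()
    )


def merge_short(segments, min_chars=20):
    """Merge segments shorter than min_chars with a neighbour.

    segments is a list of lists of lines. Assumed policy: a short fragment is
    prepended to the next segment, so a lone title stays attached to its body;
    a short trailing fragment is appended to the previous segment.
    """
    merged = []
    carry = []  # short fragment waiting for the next segment
    for lines in segments:
        lines = carry + lines
        carry = []
        if len("\n".join(lines).strip()) >= min_chars:
            merged.append(lines)
        else:
            carry = lines
    if carry:
        if merged:
            merged[-1].extend(carry)
        else:
            merged.append(carry)  # everything was short: keep one segment
    return ["\n".join(lines).strip() for lines in merged]
```

Carrying short fragments forward keeps a title such as ROLE in the same segment as the paragraph it introduces, which is usually what attribution needs.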
V2 algorithm (considered)
Embedding-clustering segmentation: split sentence by sentence, embed, and group semantically adjacent sentences. More robust on poorly formatted prompts. Out of scope for V1.
User control
Automatic segmentation is proposed, not imposed. The user sees the split and can edit each segment, merge some, delete some. The contract: preatorlabs proposes a reasonable split, the user remains in control of the final granularity.
[2] Configuration
Test scenarios (M)
Typical user inputs. Recommendation: 5 to 8 scenarios covering the prompt's main use cases. Too few → variance is poorly estimated. Too many → cost explodes without proportional gain.
Rules for the 3 axes
Structural — verifiable and traceable rules:
- length ≤ N words or ≤ N lines
- absence of a character/pattern (e.g. *, Markdown lists, emoji)
- presence of an expected structure (valid JSON, required keys)
- presence of an imposed sentence (end with "...")
- numeric threshold extracted from the prompt (no more than 20€)
Robust auto-extraction (V0.4, Batch B)
Structural auto-extraction has been hardened to stop missing constraints that were nonetheless explicit:
- Length: the number is now detected regardless of word order in the line, as soon as a limiting word is present (max, maximum, under, no more than, at most, up to, does not exceed, fewer than, ≤, <=). Thus "Keep your answers under 200 words", "6 words max" or "15 words maximum" produce a max_words rule. The same logic applies to lines (max_lines).
- Concrete prohibitions: a prohibitive phrasing (forbidden, never, avoid, no, without, none, do not, …) bearing on a measurable object generates a checkable rule: no_asterisk, no_list, no_emoji. Safeguard: a non-prohibitive phrasing triggers nothing ("Use bullet lists" does not create no_list).
Acknowledged measurement change: this hardening now triggers structural/behavioural rules that previously went undetected. Direct consequence: on some runs, the struct/behav scores (and therefore the aggregated impact and the verdicts) may differ from those of earlier versions. This is the intended effect (reducing the dominance of the semantic axis), not a regression.
Acknowledged limit: an abstract prohibition ("no hollow superlatives", "stay non-judgemental") is not measurable by lexical matching; it remains carried by the semantic axis. Only concrete constraints (asterisks, lists, emoji, exact sentences, length) become structural.
User contract preserved: auto-extracted rules remain proposed, not imposed. They are visible and editable in the criteria preview (renderCriteriaPreview) before launching.
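The auto-extraction described above can be illustrated with a few regular expressions. This is a sketch under assumptions, not the project's extractor: the limiting and prohibitive words come from the text, but the regexes, the OBJECTS table and the extract_rules function are hypothetical names introduced for illustration.

```python
import re

# Limiting and prohibitive words listed in the text; the regexes are illustrative.
LIMIT_WORDS = (
    r"\b(?:maximum|max|under|no more than|at most|up to|"
    r"does not exceed|fewer than)\b|≤|<="
)
PROHIBITIVE = r"\b(?:forbidden|never|avoid|no|without|none|do not)\b"
# Measurable objects mapped to the rule names from the text (patterns assumed).
OBJECTS = {
    "no_asterisk": r"asterisks?|\*",
    "no_list": r"lists?|bullets?",
    "no_emoji": r"emojis?",
}


def extract_rules(line):
    """Return a dict of structural rules found in one prompt line."""
    rules = {}
    lowered = line.lower()
    if re.search(LIMIT_WORDS, lowered):
        # The number may sit before or after the limiting word.
        for unit, rule in (("words", "max_words"), ("lines", "max_lines")):
            match = re.search(rf"(\d+)\s*{unit}\b", lowered)
            if match:
                rules[rule] = int(match.group(1))
    if re.search(PROHIBITIVE, lowered):
        for rule, pattern in OBJECTS.items():
            if re.search(pattern, lowered):
                rules[rule] = True
    return rules
```

The safeguard falls out of the structure: a measurable object only yields a rule when a prohibitive word is present on the same line.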
Behavioural — lexical detection:
- forbidden terms (list of strings)
- expected terms (list of strings)
- business regex patterns
- informal/formal address explicitly requested
Semantic — cosine distance:
- tfidf_local mode (free): direct comparison of full output vs ablated output
- voyage_api mode (paid, optional): Voyage embeddings + cosine
[3] Baseline generation
For each scenario Tj (1 ≤ j ≤ M), the output is generated with the full prompt:
O(full, Tj) = LLM(system=full_prompt, user=Tj, temperature=0)
This is the reference output. Its conformance to the 3 axes defines the baseline score B(Tj) ∈ [0, 1]^3.
[4] Ablation loop
For each segment Si (1 ≤ i ≤ N) and each scenario Tj:
prompt_without_Si = concatenate(segments \ {Si})
O(¬Si, Tj) = LLM(system=prompt_without_Si, user=Tj, temperature=0)
Total cost: N × M + M API calls (N × M ablated generations plus M baseline generations).
Example: 12 segments × 6 scenarios = 12 × 6 + 6 = 78 calls.
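The baseline and ablation steps can be sketched as a single function that also counts calls. The llm argument is a hypothetical callable (system, user) -> str standing in for the real client, assumed to run at temperature 0, and joining segments with a blank line is an assumption about how the prompt is reassembled.

```python
def run_ablation(segments, scenarios, llm):
    """Generate the baseline and every ablated output, counting calls.

    llm is any callable (system, user) -> str; it stands in for the real
    client, which is assumed to be called with temperature=0.
    """
    calls = 0
    baseline = {}
    full_prompt = "\n\n".join(segments)
    for scenario in scenarios:
        baseline[scenario] = llm(full_prompt, scenario)
        calls += 1
    ablated = {}
    for i in range(len(segments)):
        # Rebuild the prompt with every segment except Si.
        prompt_without_si = "\n\n".join(
            s for k, s in enumerate(segments) if k != i
        )
        for scenario in scenarios:
            ablated[(i, scenario)] = llm(prompt_without_si, scenario)
            calls += 1
    return baseline, ablated, calls
```

With 12 segments and 6 scenarios the counter ends at 78, matching the N × M + M formula.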
[5] Per-axis delta computation
For each triplet (segment Si, scenario Tj, axis a ∈ {struct, behav, sem}):
delta(i, j, a) = |score_a(O(full, Tj)) - score_a(O(¬Si, Tj))|
A zero delta means that removing Si had no measurable effect on axis a for scenario Tj. When an axis is not computable on a scenario, it is marked not applicable (and excluded from that scenario's aggregation).
Detail per axis
Structural axis — boolean diff:
score_struct(output) = sum(criterion(output) for criterion in struct_criteria) / num_criteria
Behavioural axis — proportion of rules respected:
score_behav(output) = matched_behav(output) / total_behav(output)
# if total_behav = 0 → not applicable
Semantic axis — cosine:
score_sem(output, baseline) = cosine_similarity(embed(output), embed(baseline))
In mode B: score_sem is computed on the difference between full and ablated output. The impact is therefore 1 - cos(O_full, O_¬Si).
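The three scores and the delta can be sketched as follows. Assumptions are marked: None encodes "not applicable", behavioural rules are reduced to expected and forbidden terms, and the cosine uses plain term counts rather than TF-IDF weighting or Voyage embeddings, purely to keep the example dependency-free.

```python
import math
from collections import Counter


def score_struct(output, criteria):
    """Share of boolean structural criteria satisfied; None if none configured."""
    if not criteria:
        return None
    return sum(1 for check in criteria if check(output)) / len(criteria)


def score_behav(output, expected=(), forbidden=()):
    """Share of lexical rules respected; None (not applicable) if no rule."""
    total = len(expected) + len(forbidden)
    if total == 0:
        return None
    text = output.lower()
    matched = sum(term.lower() in text for term in expected)
    matched += sum(term.lower() not in text for term in forbidden)
    return matched / total


def cosine_similarity(a, b):
    """Cosine between two bag-of-words vectors (raw counts, not TF-IDF)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def delta(score_full, score_ablated):
    """Absolute per-axis delta; None when the axis is not applicable."""
    if score_full is None or score_ablated is None:
        return None
    return abs(score_full - score_ablated)
```

For the semantic axis the baseline compared with itself scores 1, so the delta reduces to 1 - cosine_similarity(full_output, ablated_output).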
[6] Aggregation
For each segment Si:
impact(i, j) = average of deltas over applicable axes only
total_impact(i) = mean_j(impact(i, j))
variance(i) = std_j(impact(i, j))
activation(i) = ratio_j(impact(i, j) >= threshold)
V0.3 avoids dilution by dormant axes: no fixed average over 3 axes when an axis is not applicable.
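A minimal sketch of this aggregation for one segment, under assumptions: None marks a non-applicable axis, the spread is a population standard deviation (the text names it variance but defines it with std), and the activation threshold is taken to be the AXIS_ACTIVE_THRESHOLD = 0.30 constant quoted in the classification section.

```python
from statistics import mean, pstdev

AXIS_ACTIVE_THRESHOLD = 0.30  # constant quoted in the classification section


def aggregate(deltas_by_scenario, threshold=AXIS_ACTIVE_THRESHOLD):
    """Aggregate one segment over its scenarios.

    deltas_by_scenario: list of dicts {"struct": x, "behav": y, "sem": z},
    with None for an axis that is not applicable on that scenario.
    """
    impacts = []
    for deltas in deltas_by_scenario:
        applicable = [d for d in deltas.values() if d is not None]
        if applicable:  # skip scenarios where no axis could be measured
            impacts.append(mean(applicable))
    if not impacts:
        return None
    return {
        "impact": mean(impacts),
        "variance": pstdev(impacts),  # standard deviation, as in the text
        "activation": sum(x >= threshold for x in impacts) / len(impacts),
    }
```

Averaging only over applicable axes is what prevents a dormant axis from pulling a strong structural delta toward zero.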
Reading layer (z, S/N, carrier axis, direction) — V0.4, Batch A
A purely additive layer re-presents the already-computed deltas to make the discrimination readable. It produces no new measurement: each figure is recomputable by hand (falsifiable), and alters neither impact, nor variance, nor the verdict. It is computed by enrichResults(results) (results[i].stats field) after aggregation.
| Indicator | Definition | Reading | Pitfall to avoid |
|---|---|---|---|
| zImpact (z) | deviation from the run mean, in σ | z ≥ +1 = clearly above the other segments of this prompt | relative to the run: not comparable between two different prompts |
| carrierAxis / carrierImpact (carrier axis) | the strongest axis (struct/behav/sem), undiluted by the average | tells where the segment acts | a strong semantic carrier may be a mere rephrasing → confirm via outputs |
| snr (S/N) | impact / (variance + 0.05) | large = real and stable effect | a low S/N ≠ "useless", but "unstable / scenario-dependent" |
| rankImpact | impact rank (1 = the strongest) | sort the segments to look at first | a rank is only a relative order, not an amplitude |
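The indicators in the table are simple enough to recompute by hand. The sketch below is a Python illustration, not the project's enrichResults function: the input shape (impact, variance and a per-axis axes dict) is assumed, σ is taken as the population standard deviation of the run, and z falls back to 0 when all impacts are equal.

```python
from statistics import mean, pstdev


def enrich(results):
    """Add z, S/N, rank and carrier axis to already-aggregated results.

    results: list of dicts with "impact", "variance" and "axes"
    (mean delta per axis, None when not applicable). Returns new dicts.
    """
    impacts = [r["impact"] for r in results]
    mu, sigma = mean(impacts), pstdev(impacts)
    # Rank 1 goes to the strongest impact.
    order = sorted(range(len(results)), key=lambda i: -impacts[i])
    rank = {i: position + 1 for position, i in enumerate(order)}
    enriched = []
    for i, r in enumerate(results):
        axes = {a: v for a, v in r["axes"].items() if v is not None}
        carrier = max(axes, key=axes.get) if axes else None
        stats = {
            "zImpact": (r["impact"] - mu) / sigma if sigma else 0.0,
            "snr": r["impact"] / (r["variance"] + 0.05),
            "rankImpact": rank[i],
            "carrierAxis": carrier,
            "carrierImpact": axes[carrier] if carrier else None,
        }
        enriched.append({**r, "stats": stats})  # original fields untouched
    return enriched
```

The function only reads impact and variance and writes under a separate stats key, which mirrors the additive nature of the layer.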
Direction (signed delta): directionOf(structSignedMean, behavSignedMean) exploits the fact that the structural and behavioural axes are bounded and oriented (the semantic axis, a distance with no inherent direction, stays unsigned). Convention:
- sign > 0 ⇒ removing the segment lowers conformance ⇒ it carries;
- sign < 0 ⇒ removing the segment improves conformance ⇒ it harms (direct signal);
- close to 0 ⇒ neutral;
- no struct/behav criterion configured ⇒ non-measurable (not to be confused with neutral).
impact remains the unchanged absolute value: the signed information is additional, it replaces nothing.
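The sign convention can be sketched as below. Two points are assumptions, since the text does not specify them: the two signed means are combined by a simple average, and "close to 0" is modelled by a hypothetical dead zone epsilon. The signed delta is taken as score(full) - score(ablated), so a positive value means conformance drops when the segment is removed.

```python
def direction_of(struct_signed_mean, behav_signed_mean, epsilon=0.05):
    """Classify the direction of a segment from its signed deltas.

    Signed delta = score(full) - score(ablated). None means the axis has no
    configured criterion. epsilon is a hypothetical dead zone for "close to 0".
    """
    signed = [v for v in (struct_signed_mean, behav_signed_mean) if v is not None]
    if not signed:
        return "non-measurable"  # distinct from neutral: nothing was measured
    value = sum(signed) / len(signed)  # assumed way of combining both axes
    if value > epsilon:
        return "carries"
    if value < -epsilon:
        return "harms"
    return "neutral"
```

Returning a separate non-measurable label keeps an unconfigured segment from being mistaken for one that was measured and found neutral.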
Usage rule: these indicators serve to locate which segments to inspect; any decision (especially a removal) must be confirmed by reading the outputs baseline vs ablated. Displaying this layer in the interface is optional and off by default, behind an explanatory note, to avoid over-interpretation.
[7] Classification: per-segment verdict (5 levels, V0.3)
Order of evaluation in classifyVerdict(impact, variance, activationRate):
| Verdict | Condition (summary) | Interpretation |
|---|---|---|
| placebo | impact < 0.10 | Not taken into account by the LLM |
| critical | strong impact/activation + low variance | Fundamental, active everywhere |
| high | solid impact + sufficient activation + contained variance | Important, modify with caution |
| context | impact ≥ 0.15 and (variance ≥ 0.25 or activation < 0.50) | One-off safety net or partial activation |
| low | impact < 0.20, stable | Low impact, check for redundancies |
The moderate verdict (mid) was removed: cases with moderate impact but high variance or partial activation are now classified as context.
Code-aligned constant: AXIS_ACTIVE_THRESHOLD = 0.30 for the activation calculation.
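The evaluation order can be sketched as a chain of early returns. Only the placebo and context conditions are fully specified in the table; the critical and high thresholds below are placeholders chosen for illustration, not the project's values, and treating low as the final fallback is also an assumption. This is a Python illustration, not the actual classifyVerdict.

```python
def classify_verdict(impact, variance, activation_rate,
                     critical=(0.50, 0.80, 0.15), high=(0.30, 0.50, 0.25)):
    """Evaluate the five verdicts in the documented order.

    critical and high are (min impact, min activation, max variance) triples;
    their default values are placeholders, not the project's thresholds.
    """
    if impact < 0.10:
        return "placebo"
    c_impact, c_activation, c_variance = critical
    if (impact >= c_impact and activation_rate >= c_activation
            and variance <= c_variance):
        return "critical"
    h_impact, h_activation, h_variance = high
    if (impact >= h_impact and activation_rate >= h_activation
            and variance < h_variance):
        return "high"
    # Documented condition for the context verdict.
    if impact >= 0.15 and (variance >= 0.25 or activation_rate < 0.50):
        return "context"
    return "low"  # assumed fallback; the table describes it as impact < 0.20, stable
```

Because placebo is tested first, a segment with negligible impact never reaches the variance and activation checks.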
Interpretation protocol
- Reliable mean impact when variance is low and activation ≥ 50% → critical/high/low depending on amplitude.
- Variance or activation take priority when the mean impact is misleading:
  - Medical disclaimer: imposed ending → structural axis active on few scenarios → context despite a modest mean impact.
  - Price segment: € threshold rule → partial activation depending on scenarios → context.
  - Young-audience segment: without manual markers, axes are often not applicable; add expected/forbidden terms to measure.
- Do not remove a context segment based solely on a low mean impact.
These thresholds are default heuristics, empirically calibrated (Reachy prompt). Cross-model re-evaluation is planned in V0.4.
[8] Visual rendering
Four elements:
- Variance chart: bars per segment, height = mean impact, vertical bar = ±variance, colour = verdict.
- 3-axis cards: for each segment, breakdown of structural / behavioural / semantic impact + verdict + explanatory sentence.
- Global synthesis: three lists — to keep, to watch, candidates for removal.
- Output drill-down (V0.2): in each segment card, a collapsible panel showing, scenario by scenario, the baseline vs ablated output (segment removed) comparison with a reminder of the axis deltas. The panel is hidden by default and rendered on demand (lazy render) to preserve performance and mobile readability.
Reproducibility
A preatorlabs analysis is reproducible if:
- same prompt (segmentation included)
- same scenarios
- same LLM and same version
- T=0
The residual non-determinism of LLMs at T=0 (negligible on Claude) can be absorbed by n=3 repetitions in V2.
Falsifiability
A preatorlabs verdict is falsifiable by independent test. To falsify a "placebo" verdict on a segment S:
- Build two prompts: P_with_S and P_without_S.
- Manually compare the outputs over the M scenarios.
- If a systematic difference consistent with the intent of S is observed, the verdict is wrong.
This property is non-trivial: it distinguishes preatorlabs from an opaque score produced by a judge LLM.