04 — Interpretation guide

How to read a preatorlabs report and draw prompt-editing decisions from it.

Reminder: this guide does not redefine the calculations. Protocol and formulas → 02-METHODOLOGY.md §3–7; site presentation → sections #method (aggregation), #howto (operations), #reading (decision).

The 30-second reading table

You see…	The segment is…	What to do
High impact + low variance + strong activation	critical	do not touch
Medium-high impact + low variance + solid activation	high impact	modify with caution
High variance or activation < 50% (impact ≥ 15%)	contextual (safety net)	keep — do not confuse low mean impact with uselessness
Low impact + low variance + stable activation	low	check for redundancies, maybe remove
Near-zero impact	placebo	safe to remove, unless intentional decoration

Rules and activation: see the config preview (auto / manual rules) and the activationRate metric in 02-METHODOLOGY.md §6–7. Do not interpret a low mean impact without looking at variance and activation.
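As a rough illustration of how the table combines the three signals, here is a minimal Python sketch. Only the 15% impact floor and the 50% activation cut-off come from the table; the other thresholds, the `SegmentStats` class and the `read_verdict` function are hypothetical names and values chosen for the example. The actual rules are the ones described in 02-METHODOLOGY.md.

```python
from dataclasses import dataclass

# Illustrative thresholds only. The 15% impact floor and the 50% activation
# cut-off come from the reading table; every other value is an assumption.
PLACEBO_MAX_IMPACT = 0.02      # assumed: what counts as "near-zero impact"
CONTEXTUAL_MIN_IMPACT = 0.15   # from the table: impact >= 15%
LOW_ACTIVATION = 0.50          # from the table: activation < 50%
HIGH_VARIANCE = 0.30           # assumed: what counts as "high variance"
CRITICAL_MIN_IMPACT = 0.60     # assumed: what counts as "high impact"
HIGH_MIN_IMPACT = 0.30         # assumed: what counts as "medium-high impact"


@dataclass
class SegmentStats:
    impact: float      # mean impact, between 0 and 1
    variance: float    # spread of the impact across scenarios
    activation: float  # activation rate, between 0 and 1


def read_verdict(s: SegmentStats) -> str:
    """Map one segment's statistics to a row of the reading table."""
    if s.impact <= PLACEBO_MAX_IMPACT:
        return "placebo"
    # Variance and activation are checked before the mean impact is trusted.
    unstable = s.variance >= HIGH_VARIANCE or s.activation < LOW_ACTIVATION
    if unstable and s.impact >= CONTEXTUAL_MIN_IMPACT:
        return "contextual"
    if s.impact >= CRITICAL_MIN_IMPACT:
        return "critical"
    if s.impact >= HIGH_MIN_IMPACT:
        return "high impact"
    return "low"


print(read_verdict(SegmentStats(impact=0.18, variance=0.40, activation=1.0)))
```

The order of the checks is the point of the sketch: a segment is tested for instability before its mean impact is used to rank it.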

The classic pitfall: the one-off segment

The most frequent pitfall is removing a segment that looks useless but in fact helps in only one scenario out of six.

Example: a narrative prompt contains "Avoid genre clichés: no vampires, no zombies." Over 4 tested themes (library, gas station, child, clock), removing this segment:

changes nothing on 3 themes
makes a classic ghost appear on the 4th

The mean impact score is low, but the variance is very high. The verdict is contextual, and the correct reading is: "this segment is a one-off safety net. It does not help often, but when it does, it avoids a catastrophe."

This is exactly what the error bar in the variance chart reveals. Variance is the most valuable information the tool provides.
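The one-off pattern can be reproduced with a few lines of Python. The per-theme impact values below are invented for illustration, as is the choice of which theme triggers the effect; the `summarize` helper is hypothetical, and it uses a plain population standard deviation, which may differ from the variance the tool actually reports (see 02-METHODOLOGY.md).

```python
from statistics import mean, pstdev


def summarize(impacts):
    """Return (mean impact, spread, share of themes where removal had an effect)."""
    values = list(impacts.values())
    hit_rate = sum(1 for v in values if v > 0) / len(values)
    return mean(values), pstdev(values), hit_rate


# Hypothetical scores: no effect on three themes, a strong effect on one.
theme_impacts = {"library": 0.0, "gas station": 0.0, "child": 0.0, "clock": 0.9}

avg, spread, hit_rate = summarize(theme_impacts)
print(f"mean={avg:.3f} spread={spread:.3f} hit_rate={hit_rate:.2f}")
```

With these invented numbers the mean is 0.225 while the spread is about 0.39: the spread exceeds the mean, which is the signature of a safety net rather than of a useless segment.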

Reading the 3-axis breakdown

When a segment has a total impact of 50%, the question is not only "how much does it matter" but also "where does it matter". The per-axis breakdown answers the second question.

Typical profiles

Purely structural profile (struct = 80%, behav = 0%, sem = 10%) → It is a format rule (length, syntax, JSON). Modifying it has an immediate effect on parsability.

Purely behavioural profile (struct = 0%, behav = 60%, sem = 20%) → It is a business rule (list of forbidden terms, trigger conditions). Modifying it affects rule conformance.

Purely semantic profile (struct = 0%, behav = 0%, sem = 55%) → It is a style or persona rule. Modifying it affects the tone without touching substantive conformance.

Hybrid profile (struct = 30%, behav = 40%, sem = 30%) → A versatile segment carrying several intentions. Rewriting it requires preserving the three roles.
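The four profiles above can be told apart with a simple rule of thumb, sketched here in Python. The 25% floor and the `axis_profile` function are assumptions made for this example, chosen so that the four sample breakdowns land on the labels given above; the tool itself may label profiles differently.

```python
AXIS_FLOOR = 0.25  # assumed: below this share, an axis is treated as negligible


def axis_profile(struct, behav, sem, floor=AXIS_FLOOR):
    """Name the profile from the per-axis breakdown (values between 0 and 1)."""
    axes = {"structural": struct, "behavioural": behav, "semantic": sem}
    active = [name for name, value in axes.items() if value >= floor]
    if not active:
        return "no dominant axis"
    if len(active) == 1:
        return active[0]
    return "hybrid"


for breakdown in [(0.80, 0.0, 0.10), (0.0, 0.60, 0.20), (0.0, 0.0, 0.55), (0.30, 0.40, 0.30)]:
    print(breakdown, "->", axis_profile(*breakdown))
```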

The redistribution rule

If you remove a segment with a purely semantic profile, the other semantic-profile segments will probably take over part of its role, because LLMs tend to be robust to the loss of a single stylistic cue. The opposite holds for purely structural profiles: removing them often breaks the format immediately.

Reading the "placebo" verdict

The "placebo" verdict is the most unexpected and the most revealing. It means that the LLM does not take this segment into account, even though it is stated explicitly.

Typical causes:

Introspection sentence ("Before each answer, run this 4-step mental procedure"): an LLM does not run an internal procedure; it generates token by token. Such sentences reassure the writer but are decoration.
Restatement of something already covered: the rule is already carried by other, more salient segments.
Overly abstract sentence: "be authentic", "be empathetic" without an operational definition.

Recommended action on a placebo: either remove it, or rephrase it as a verifiable operational rule.

Reading the "counterproductive" verdict (V2+)

In V1, the main metric is |delta| (absolute value). In V2, the sign will be kept: a segment whose removal improves the score is counterproductive. This is rare but real. It is typically a sentence that produces the opposite of the intended effect: for example, explicitly asking "do not mention X" can make X appear in some contexts (the "pink elephant" effect).
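The difference between the two readings can be sketched as follows. This is a minimal illustration under stated assumptions: a higher score is taken to be better, delta is taken as the score without the segment minus the score with it, and the `read_delta` function and its `tolerance` parameter are hypothetical. The sign convention actually used by the tool is the one defined in 02-METHODOLOGY.md.

```python
def read_delta(baseline_score, ablated_score, tolerance=0.01):
    """Compare the V1 (absolute) and V2 (signed) readings of one ablation.

    Assumptions: higher scores are better, and
    delta = score without the segment - score with the segment.
    """
    delta = ablated_score - baseline_score
    if abs(delta) <= tolerance:
        reading = "neutral"
    elif delta > 0:
        # Removing the segment improved the score.
        reading = "counterproductive"
    else:
        reading = "useful"
    return {"v1_magnitude": abs(delta), "v2_signed": delta, "reading": reading}


# Hypothetical scores: the prompt does better once the segment is removed.
print(read_delta(baseline_score=0.70, ablated_score=0.82))
```

In the V1 reading, both a useful and a counterproductive segment with the same magnitude look identical; only the signed value separates them.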

Cross-LLM comparison (V3)

When the tool supports several LLMs, the same prompt will produce different reports depending on the target model. This is expected and useful:

universal segments: critical on all LLMs
model-specific segments: critical on Claude, placebo on GPT-4 (or vice versa)

The comparison makes it possible to rewrite the prompt so that it is more portable, by converting model-specific segments into universal phrasings.
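Since cross-LLM comparison is planned for V3, no report format exists for it yet. The Python sketch below only illustrates the sorting logic described above, on an invented data layout: a dictionary of verdicts per segment and per model. The `compare_models` function, the segment names and the verdict values are all hypothetical.

```python
def compare_models(verdicts):
    """Sort segments into universal / model-specific / other.

    `verdicts` maps a segment name to a {model name: verdict} dictionary.
    """
    result = {}
    for segment, by_model in verdicts.items():
        seen = set(by_model.values())
        if seen == {"critical"}:
            result[segment] = "universal"
        elif "critical" in seen and "placebo" in seen:
            result[segment] = "model-specific"
        else:
            result[segment] = "other"
    return result


# Invented verdicts for three segments on two models.
example = {
    "format rule": {"Claude": "critical", "GPT-4": "critical"},
    "tone rule": {"Claude": "critical", "GPT-4": "placebo"},
    "greeting": {"Claude": "low", "GPT-4": "placebo"},
}
print(compare_models(example))
```

The segments flagged as model-specific are the candidates for rephrasing when portability matters.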

Interpretation anti-patterns

Readings to avoid:

❌ "Segment X has a score of 0.18 so it is useless." → check the variance first. A 0.18 with variance 0.40 is contextual, not useless.

❌ "Segment X is critical on Claude so it is critical everywhere." → V1 only measures a single LLM. Generalisation is a hypothesis, not a fact.

❌ "Segment X and segment Y each have a low impact, so we can remove both." → single-segment ablation does not detect coalitions. Two redundant segments each have a low impact, but removing them together can break the prompt. Check this with a manual combined ablation.

❌ "The report says this segment is placebo, so I remove it." → check that the chosen scenarios actually cover the cases where this segment was supposed to act. An apparent placebo may be a vital segment in a scenario absent from the test set.
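The coalition anti-pattern above (two redundant segments, X and Y) can be checked by hand with a combined ablation, sketched here in Python. The `combined_ablation` helper is hypothetical, and `toy_score` is a stand-in for "run the prompt with these segments and score the output"; it is built so that X and Y are redundant, and it does not reflect how the tool scores anything.

```python
def combined_ablation(score, segments, a, b):
    """Compare removing a and b one at a time with removing both together."""
    def without(removed):
        return [s for s in segments if s not in removed]

    base = score(segments)
    impact_a = abs(base - score(without({a})))
    impact_b = abs(base - score(without({b})))
    impact_both = abs(base - score(without({a, b})))
    return {
        "a_alone": impact_a,
        "b_alone": impact_b,
        "both": impact_both,
        # A large positive excess suggests the two segments cover for each other.
        "excess": impact_both - (impact_a + impact_b),
    }


def toy_score(segments):
    # Toy scorer: the prompt works as long as at least one of X or Y is present.
    return 1.0 if "X" in segments or "Y" in segments else 0.2


print(combined_ablation(toy_score, ["intro", "X", "Y"], "X", "Y"))
```

In this toy case each segment has zero impact on its own, yet removing both drops the score sharply, which is exactly what a single-segment report cannot show.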