Measuring instead of guessing
The standard prompt-engineering cycle is empirical. You write a first prompt, feed it to the model, observe the result, then refine it. This approach works — but it has a structural limitation: it is very hard, even impossible, to analyze the effectiveness of each individual adjustment. A seemingly decorative sentence can genuinely affect the model's behavior; a seemingly essential one can be ignored.
So I built an instrument for this: preatorlabs. It segments the prompt automatically, ablates each segment across user-defined scenarios, and scores how much each segment changes the model's output.
The thesis, in one sentence
What determines an instruction's impact on an LLM's behavior is not its form — negative or positive, vivid, repeated, nicely phrased — but its ability to make the model deviate from its default behavior.
The next two sections present the ablation method and the analysis framework; the seven experiments follow, each with its setup, a figure and its results. The closing "Six rules" section sums up the practical implications.
The essentials — at a glance
Model mechanics
- The model is never neutral. For any given task, it has a default behavior toward which its outputs spontaneously converge.
- Impact is measured as a distance. An instruction only truly modifies the output if it forces the model to deviate from this default behavior.
- Deviation has its limits. When an instruction directly clashes with its deep alignment, the model refuses to obey and maintains its standard.
- Attention is a zero-sum budget. Over-weighting one instruction captures the model's focus and mechanically degrades compliance with other constraints.
- Segments interact with each other. Some words, seemingly useless on their own, act as dampers to prevent a strong example from hijacking the output.
Prompting laws
- Concrete examples beat labels. Providing a reference text dictates the desired style far more effectively than assigning an abstract role.
- Precise actions beat adjectives. Describing a specific behavior to adopt systematically works better than a simple list of tone adjectives.
- Precision beats polarity. A quantified formal constraint is respected whether it is phrased as a prohibition or as a positive request.
- Repetition is often harmful. Repeating an instruction multiple times does not reinforce it, but risks unbalancing the entire prompt.
- Volume is not load. A prompt can be trimmed by 40% with no loss of quality, provided you exclusively target and remove placebos.
How it is measured
Ablation is the basic move: one segment of the prompt is removed, the model is re-run under identical conditions, and the gap between the "with" answer and the "without" answer is that segment's impact score. Repeated for each segment and across several scenarios, this move produces an impact map of the prompt.
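The basic move can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the preatorlabs implementation: `ablation_map`, `run_model` and `distance` are hypothetical names, and the toy model and toy distance below exist only so that the loop runs end to end.

```python
from typing import Callable, Dict, List


def ablation_map(
    segments: List[str],
    scenarios: List[str],
    run_model: Callable[[str, str], str],
    distance: Callable[[str, str], float],
) -> Dict[int, float]:
    """Return the mean impact of each segment, averaged over scenarios."""
    full = "\n".join(segments)
    impacts: Dict[int, float] = {}
    for i in range(len(segments)):
        # Prompt with segment i removed; everything else is unchanged.
        ablated = "\n".join(s for j, s in enumerate(segments) if j != i)
        gaps = [
            distance(run_model(full, sc), run_model(ablated, sc))
            for sc in scenarios
        ]
        impacts[i] = sum(gaps) / len(gaps)
    return impacts


# Toy stand-ins (assumptions, for illustration only): a fake "model" that
# reacts to a single instruction, and a crude word-overlap distance.
def toy_model(prompt: str, scenario: str) -> str:
    greeting = "Salut" if "informal" in prompt else "Bonjour"
    return f"{greeting} {scenario}, un petit rappel."


def toy_distance(a: str, b: str) -> float:
    wa, wb = set(a.split()), set(b.split())
    return 1.0 - len(wa & wb) / len(wa | wb)


segments = [
    "Write professionally.",
    "Use informal address.",
    "Write a follow-up email.",
]
impacts = ablation_map(segments, ["Pierre", "Marie"], toy_model, toy_distance)
print(impacts)
```

In this toy run only the second segment gets a non-zero score, because it is the only one the fake model reacts to. With a real model, `run_model` would wrap an API call made under identical settings for both runs.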
Two methodological constraints apply across all the experiments. First: vary only one parameter at a time, all else equal, with the "temperature" set to zero. This setting does not make the model strictly deterministic: it reduces the stochastic variance of the outputs and improves their reproducibility from one call to the next, thereby limiting measurement noise. Second: an ablation measures the magnitude of the output change, not its qualitative direction. A high impact score signals that a segment moves the output; it does not say whether that move is for the better. Hence a two-step reading: the score locates the segment that carries weight, then reading the outputs qualifies the direction of the change. The score alone is not enough to conclude.
The gap between two answers is measured on three planes: form (length, lists, format), detectable rules (informal address? a banned word?), and meaning, via the "numerical fingerprints" of the texts — two texts close in meaning have close fingerprints. (The model outputs quoted in this study are kept in French — the model was tested in French.)
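To make the three planes concrete, here is a rough sketch with one function per plane. Everything in it is an assumption made for illustration: the function names are hypothetical, the informal-address rule is a simple regular expression, and the "fingerprint" is a plain word-count vector standing in for whatever numerical representation the instrument actually uses, which the text does not specify.

```python
import math
import re
from collections import Counter


def form_gap(a: str, b: str) -> float:
    """Form plane: relative difference in word count, in [0, 1]."""
    la, lb = len(a.split()), len(b.split())
    return abs(la - lb) / max(la, lb, 1)


def uses_tu(text: str) -> bool:
    """Rule plane: does the text use French informal address?"""
    pattern = r"\b(tu|toi|ton|ta|tes|te)\b|\bt'"
    return re.search(pattern, text.lower()) is not None


def meaning_gap(a: str, b: str) -> float:
    """Meaning plane: cosine distance between word-count vectors.

    The word-count vector is a stand-in (assumption) for a real fingerprint.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


# Two outputs quoted later in experiment 1 (D3, with and without the line).
with_line = "Votre silence me rend mélancolique… Je danse seul dans mon bureau."
without_line = "Suite à notre dernier échange, j'aimerais connaître votre retour."

print(form_gap(with_line, without_line))
print(uses_tu(with_line), uses_tu(without_line))
print(meaning_gap(with_line, without_line))
```

Keeping the planes as separate functions mirrors the two-step reading: the numbers say that something moved, and the rule checks hint at what moved, but neither replaces reading the outputs.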
The river current of the LLM
This first experiment lets me introduce the notion of the river current of the LLM. This analogy names an empirically observable phenomenon: it would be a mistake to think a language model starts from a neutral state. On the contrary, for any given task, its outputs are not uniformly distributed — they spontaneously converge toward a certain form, a certain register, a certain length or a certain writing style. This convergence is the result of how the engineers trained and aligned their model toward what is called its default behavior.
For example, if you ask Claude to write an email, it will naturally write it politely, using formal address, with a syntax borrowed from professional exchanges — and all this without any explicit instruction asking it to.
The river-current analogy makes the phenomenon intelligible. Just as a river follows the slope of the terrain along a geologically determinable direction, an LLM will steer its outputs along a statistically determinable trajectory.
Interacting with this current opens two paths whose effects are measurable and asymmetric. An instruction redundant with the default behavior produces no observable displacement of the output. An instruction that departs from it produces an effect whose magnitude is proportional to that deviation — which is what the seven experiments that follow quantify.
The experiments confirm this empirically. In E1, the instruction "write professionally" (D0, impact 0.063) is inert: the model was already writing that way — it runs with the current. Conversely, "as an absurd, moody character" (D3, impact 0.244) imposes a strong deviation from the default behavior. In E3, the adjectives "warm, confident" (impact 0.079) skim the current; the directive "with no hierarchical distance" (impact 0.109) opposes it and produces a switch to informal address (tu). In E4, the model never spontaneously guilt-trips a prospect: the matching prohibition has no current to oppose, and comes out inert (impact 0.074).
These observations ground the principle of distance from the default. The measured impact of an instruction is proportional to the distance it imposes between the output obtained and the one the model would have produced spontaneously. The seven experiments that follow put this principle to the empirical test.
Distance from the default
Hypothesis. An instruction aligned with the model's default behavior is redundant and causes no observable change in the output. Conversely, an instruction that departs from it produces an impact proportional to that distance.
Motivation. The hypothesis is counter-intuitive: every written instruction is usually assumed to contribute to the result. Yet an instruction can be redundant with the default behavior — that is, describe what the model would produce in the absence of any prompt. In that case, its presence or absence does not change the output. The experiment aims to demonstrate this empirically.
What is studied
Task: write a sales follow-up email (B2B).
- varies: a single line, the requested register, from closest to farthest from the polite default (the four phrasings D0→D3 are in the table below)
- fixed: the whole task, the rest of the prompt, the model
5 scenarios · SME owner · project manager · business lawyer · friend · CEO
Reading the diagram. Each variant sits on an axis that starts at the default (left) and stretches right as impact grows. D0 stays glued to the default; D3 shoots to the end. D2 is the anomaly: far in wording, but short on measurement — the red sign of a refusal, not of closeness.
| Variant | Requested line | Impact | Reading |
|---|---|---|---|
| D0 | "write professionally" | 0.063 | placebo |
| D2 | "writes French poorly" | 0.096 | placebo · refusal |
| D1 | "casual, informal (tu), spoken" | 0.108 | weak effect |
| D3 | "absurd, moody character" | 0.244 | strong effect |
Observed results. The D0/D3 contrast is clear. In D0, with or without the "professional" line, the email is nearly identical: the model already writes that way. In D3, with the line: « Votre silence me rend mélancolique… Je danse seul dans mon bureau. »; without it, back to a plain « Suite à notre dernier échange, j'aimerais connaître votre retour. »
Additional observation — the limit of deviation. Case D2 is the most instructive of the experiment. "Write like someone who handles French poorly" is an instruction far from the default: per the thesis, its impact should be high. Yet it comes out a placebo (0.096). Reading the outputs reveals why: the model did not carry out the instruction — it kept correct French, ignoring the directive.
This result introduces an important nuance into the river-current concept of the LLM. A model's deviation from its default behavior is not unlimited. Some instructions clash directly with properties deeply anchored in the model's alignment — here, producing quality French. When an instruction crosses that threshold, the model puts up total resistance: it does not comply, and its behavior does not move. The current therefore has a bank: you can influence it to a degree, not break it. So two distinct mechanisms produce a low impact score — redundancy with the default (D0) and refusal to comply (D2); the scores are comparable, only the qualitative reading tells them apart.
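The two-step reading can be written down as a small decision rule. The sketch below is an illustration only: the cut-offs `PLACEBO_MAX` and `STRONG_MIN` are hypothetical values picked so that the four rows of the table above land on the readings given there, and the `followed` flag is not computed; it stands for the human judgment obtained by reading the outputs.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cut-offs (assumptions), chosen only to match the table above.
PLACEBO_MAX = 0.10
STRONG_MIN = 0.20


@dataclass
class Variant:
    name: str
    impact: float
    # From reading the outputs: did the model do what was asked?
    # Only consulted when the score is low.
    followed: Optional[bool] = None


def reading(v: Variant) -> str:
    """Step 1 uses the score; step 2 uses the reading of the outputs."""
    if v.impact <= PLACEBO_MAX:
        if v.followed is False:
            return "placebo (refusal)"
        return "placebo (redundant)"
    return "strong effect" if v.impact >= STRONG_MIN else "weak effect"


variants = [
    Variant("D0", 0.063, followed=True),
    Variant("D2", 0.096, followed=False),
    Variant("D1", 0.108),
    Variant("D3", 0.244),
]
readings = {v.name: reading(v) for v in variants}
for name, label in readings.items():
    print(name, label)
```

The point of the sketch is that D0 and D2 are indistinguishable on the score alone; the split between redundancy and refusal only appears once the `followed` judgment is added.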
↳ Takeaway
An instruction's impact grows with its distance from the model's default behavior. This principle is the analysis framework for the experiments that follow.
The grandiose role vs the concrete example
Hypothesis. The role label ("You are a senior expert…") is common practice in prompting. The hypothesis is that it is redundant with the model's default behavior and that its own impact is negligible.
Identification problem. In practice, the role label is rarely isolated: it comes with a style example. If the output is good, the effect could be credited to the role, the example, or their combination. Separating the two elements by successive ablation is necessary to identify their respective contributions.
What is studied
Task: write a follow-up email, in a given style.
- A: role "senior B2B copywriter" + a model email example
- B: the example alone (role removed)
- C: neither role nor example (bare instructions)
A→B isolates the role's effect · B→C isolates the example's effect · same 5 follow-ups
Reading the diagram. We pull the bricks out one by one, left to right. As long as the example (in cobalt) stays, quality holds; only when it too disappears does the output collapse. The role, for its part, can leave without harm: it was not load-bearing.
Observed results. From A to B, quality is preserved and the output more direct: « Vite fait sur les reportings… Ça m'a marqué quand tu m'avais dit qu'ils te prenaient des heures. 15 min pour en parler ? » From B to C, collapse: « Bonjour Pierre, j'espère que tout va bien… la pertinence d'une collaboration entre nos structures. »
Conclusion. It is the concrete example that determines the output's style. The role label is redundant with the default behavior; the example is a measurable departure from the default and produces a real effect.
↳ Takeaway
A concrete example is a load-bearing segment; an abstract role label is, under the tested conditions, an inert one. Showing by example beats naming by title.
Adjectives vs description
Hypothesis. Tone adjectives ("warm, direct, confident") are a common way to specify register. The working hypothesis assumes that a concrete behavioral directive produces a higher measurable impact.
What is studied
Task: write a follow-up with a "close, peer-to-peer" tone — same intent, two phrasings.
- ADJ: "Tone: warm, direct, confident"
- DESC: "Write as to a colleague you respect, with no hierarchical distance"
- BOTH: both together
we measure the impact of the tone line, and read the produced register (formal vous / informal tu)
Reading the figure. The two conditions are shown side by side. On the left, the adjective list: low impact (0.079), the output's register stays at standard formal address, « Bonjour Pierre, j'espère que tout va bien… ». On the right, the concrete description: stronger impact, and a real register shift — informal address, the peer-to-peer tone, « Salut Pierre, ça fait 9 jours qu'on s'est parlé… »
Interpretation. "Warm, confident" names a register close to the model's default behavior — the instruction is redundant and inert. "No hierarchical distance" states a precise behavioral directive — switch to informal address, removal of deference — that the model does not produce on its own. The concrete directive directly specifies a behavior to produce; the tone adjective operates at a level of abstraction the model must interpret. The model obeys the former. Again, distance from the default, plus a premium on the concrete.
↳ Takeaway
The results indicate that a concrete behavioral directive produces a significantly higher impact than an abstract tone adjective. The precision of the specification determines the constraint's effectiveness.
Forbid, or ask positively?
Hypothesis. The common belief: for what you don't want (the substance), a clear prohibition ("never do X"); for what you expect (the form), a positive instruction ("do Y"). I wanted to test the two halves separately.
Results (V1). On the form side, the numeric constraint weighs heavily (impact 0.50 when removed); the vague version is nearly inert (0 to 0.20). On the substance side, nothing to conclude: the model never guilt-trips on its own, even facing a prospect who said no (« Je comprends que le timing n'était pas optimal… »). So the prohibition "never guilt-trip" has nothing to remove.
A key notion: headroom. This substance result is undecidable, for a reason that recurs throughout this study. If the model already satisfies the constraint spontaneously, removing the instruction can reveal nothing — there is no possible failure to trigger. This is a lack of headroom: you cannot measure a safeguard against a danger that never occurs. Keep the idea in mind, it returns in experiment 5.
Confounding bias identified in V1. Version 1 has a design flaw: the "positive" form was also the numeric one ("40 words"), and the "negative" form was also the vague one ("not too long"). So I had varied two things at once: polarity (positive/negative) and precision (numeric/vague). The observed effect therefore cannot be attributed to polarity or to precision alone: that is a confounding bias. A corrected design was built, varying only polarity while holding precision constant, across two models (Haiku and Sonnet).
The clean design · 4 prompts = 2 targets × 2 polarities
Each prompt frames two things — the substance (do not guilt-trip the prospect) and the form (no markdown) — but we vary only their polarity, keeping the phrasings equally concrete.
- negative: substance "never guilt-trip the prospect" · form "do not use lists"
- positive: substance "stay factual and respectful" · form "write in a single paragraph"
- P1→P4: the 4 crossings: (subst−,form−) · (subst−,form+) · (subst+,form−) · (subst+,form+)
scenario traps: a prospect who already said "not right now" (invites guilt-tripping) · a question that calls for a table (invites markdown)
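The four crossings of the clean design can be generated mechanically, which is one way to make sure nothing but polarity changes between prompts. A minimal sketch, assuming the two instructions are simply concatenated (the real prompts contain more than these two lines; `SUBSTANCE`, `FORM` and `prompts` are names invented here):

```python
from itertools import product

# The four phrasings quoted above, keyed by polarity.
SUBSTANCE = {
    "-": "Never guilt-trip the prospect.",
    "+": "Stay factual and respectful.",
}
FORM = {
    "-": "Do not use lists.",
    "+": "Write in a single paragraph.",
}

# product("-+", repeat=2) yields (-,-), (-,+), (+,-), (+,+), in that order,
# which matches the P1..P4 numbering of the crossings.
prompts = {
    f"P{n}": f"{SUBSTANCE[s]} {FORM[f]}"
    for n, (s, f) in enumerate(product("-+", repeat=2), start=1)
}

for name, text in prompts.items():
    print(name, text)
```

Building the grid from two small dictionaries keeps the design honest: a change to one phrasing propagates to both prompts that use it, so the only difference between two cells is the polarity being crossed.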
Reading the figure. The four cells cross polarity (in columns: forbid vs ask) and substance (in rows). If polarity mattered, one column would stand out from the other. Yet all four show the same "0/5 violations": the grid is flat, on both models — the signature of a non-effect.
Conclusion. Polarity is a non-effect. "No lists" and "a single paragraph" produce equivalent compliance rates. The effect observed in V1 is entirely attributable to the precision of the specification, not to polarity. This result is absorbed by the principle established in experiment 3.
↳ Takeaway
Negative or positive does not matter: it is precision that acts, not the sentence's sign.
Does repeating an instruction reinforce it?
Hypothesis. Repeating an instruction is a practice meant to reinforce its compliance. The alternative hypothesis is that repetition has no effect on the target constraint and can induce a degradation of competing constraints.
Preliminary experiment (V1): no headroom. An initial version of the experiment repeated the constraint "subject line: 6 words max", stated once, twice, then three times. The constraint was met 100% from the first occurrence: the compliance ceiling was hit right away. This is a case of no headroom (see experiment 4): a constraint the model already satisfies every time leaves nothing to measure, since there is no violation for repetition to fix. The protocol was re-run with a constraint under real pressure.
The corrected protocol (V2). The revised setup places five simultaneous constraints in the prompt, and repeats one of them — prose with no list — once, twice, then three times. The goal is to measure the effect of that repetition on the four remaining constraints.
What is studied
Task: reply to a SaaS-tool customer, under five constraints at once.
- fixed: prose, no list · ≤ 60 words · no closing pitch · informal address (tu) · no emoji
- R1·R2·R3: the constraint "reply in prose, no list" stated 1×, 2×, then 3×
5 scenarios · Haiku & Sonnet · we watch whether the four other constraints hold
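The protocol above lends itself to a small harness: build the three prompt variants, then check each reply against the constraints that can be verified mechanically. The sketch below is an assumption-laden illustration: how the repetitions are placed in the real prompt is not specified, so they are simply stated back to back here, and only three of the five constraints are checked (closing pitch and informal address would need a finer check or a human reading).

```python
import re

TARGET = "Reply in prose, no list."
OTHERS = [
    "60 words maximum.",
    "No closing pitch.",
    "Use informal address (tu).",
    "No emoji.",
]


def build_prompt(repeats: int) -> str:
    """State the target constraint `repeats` times, the others once."""
    return " ".join([TARGET] * repeats + OTHERS)


def check(reply: str) -> dict:
    """Mechanical checks for three of the five constraints."""
    lines = reply.splitlines()
    bullet = re.compile(r"\s*([-*•]|\d+[.)])\s+")
    emoji = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
    return {
        "no_list": not any(bullet.match(ln) for ln in lines),
        "max_60_words": len(reply.split()) <= 60,
        "no_emoji": emoji.search(reply) is None,
    }


variants = {f"R{n}": build_prompt(n) for n in (1, 2, 3)}
print(variants["R3"])
print(check("Salut, c'est réglé de notre côté."))
```

Running `check` on every reply for R1, R2 and R3 gives a per-constraint violation count, which is the quantity the experiment tracks for the four constraints that are not repeated.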
Reading the figure. The horizontal axis is the number of repetitions (1×, 2×, 3×); the red line marks the 60-word threshold. On Sonnet, the average length grows with each repetition (61 → 64 → 65 words), causing an overshoot on 5/5 scenarios at the third level. Haiku's bars stay below the threshold: its natural bias toward brevity leaves less room to drift.
Conclusion. Repetition does not improve compliance of the target constraint, which is satisfied 100% from the first occurrence. However, over-weighting one instruction can induce a degradation of competing constraints — a crowding-out effect documented on Sonnet, nearly absent on Haiku. This result also illustrates the model-dependence of behavior: the same prompt does not behave identically across target models.
↳ Takeaway
Repeating an instruction does not increase its compliance. It is a waste of tokens and can, under some conditions, induce a reallocation of the attention budget at the expense of competing constraints.
Does metaphor help "imagine"?
Hypothesis. Some approaches recommend including a prior imagination instruction ("mentally visualize") before a production task. The hypothesis is that this instruction is inert: an LLM has no separate internal mental-representation process, and the imagination instruction does not improve output quality.
What is studied
Task: describe a scene in 30 words, precisely.
- MET: "mentally visualize the scene, imagine it in detail, then describe it"
- DIR: "describe the scene in 30 words, precisely"
3 scenes · a carbonara · a Christmas market · an old bookshop — we check: meta-narration? a better description?
Reading the figure. The central box, shown dashed, embodies the "mental" step supposed to be triggered. The results indicate that this step has no observable translation in the output: the MET condition produces an impact of 0.10, i.e. a placebo verdict.
Conclusion. The meta-narration risk was not observed. The imagination instruction is nonetheless inert: it produces no measurable improvement in the output. This result reads as one more case of redundancy with the default behavior.
↳ Takeaway
An instruction invoking an internal process of the model has no measurable effect. Instructions should specify the expected result, not a supposed intermediate cognitive state.
How far, and how, can you trim a prompt?
Hypothesis. If segments redundant with the default behavior are inert, removing them should not degrade output quality. The experiment seeks to quantify this reduction margin and to separate two questions: how many tokens can you remove, and which ones should you remove?
The test. I reduce this prompt in two ways (detail below): a gradient that first removes the segments measured as inert, and a naive control of the same size that instead cuts the load-bearing segments. This second arm is a control group: at equal removed volume, it isolates which tokens matter, not just how many.
What is studied · same task, prompt trimmed 4 ways
Task: write a short follow-up (subject ≤ 6 words, body ≤ 40), from an 11-segment prompt.
- R0: full prompt (role, adjectives, imagination, example, format, prohibitions)
- R35: −43%, remove the placebos (role, repetitions); keep the load-bearing ones
- R65: −68%, only register + example + format remain
- NAIVE: −46%, about the same size as R35, but cuts the load-bearing ones (format, register, example) and keeps the placebos
5 follow-ups · R0→R65 tests "how many" · R35 vs NAIVE tests "which tokens"
Reading the figure. The upper part represents the reduction gradient from R0 to R65. Formal compliance is maintained at each level, but R65 introduces a side effect (invented hook, 3/5). The lower part shows the equal-volume comparison: R35 (blue, placebos removed) and NAIVE (red, load-bearing removed) have nearly identical sizes but opposite output profiles. This contrast validates the hypothesis about the nature of the removed tokens.
| Variant | Reduction | Body > 40 words | Invented hook |
|---|---|---|---|
| R0 full | 0% | 0 / 5 | 0 / 5 |
| R35 placebos removed | −43% | 0 / 5 | 0 / 5 |
| R65 aggressive cut | −68% | 0 / 5 | 3 / 5 |
Side effect — over-imprinting of the example. At −68%, the remaining style example is no longer diluted by the role and adjectives that surrounded it: the model starts to copy its content. Pierre Henri receives « tu m'avais parlé de tes reportings chronophages » — though he never mentioned reporting: it is the subject of the example (a certain Thomas), forcibly transplanted. Same mechanic for the lawyer (« tes dossiers qui te prenaient un temps fou ») and for the CEO. R0 and R35 never do it: it is, more sharply, the failure already seen in experiment 2 — the less scaffolding around an example, the more the model copies its details.
| Cut ~ −45% | Body > 40 words | Register | Scaffolding |
|---|---|---|---|
| R35 removes placebos | 0 / 5 | informal (tu) | none |
| NAIVE removes load-bearing | 5 / 5 | formal ×2 | everywhere |
The first table follows the gradient. Formal compliance holds at every level — R0, R35 and R65 all respect the length constraints. But at −68%, a new phenomenon appears: the invented hook (3/5). The form holds; the substance drifts. The reason is mechanical: by progressively removing the role and the adjectives that surrounded the example, we removed the segments that diluted it. The example, now alone, is no longer one reference point among others: it becomes the dominant instruction. The model stops using it as a style guide; it replicates its content. This is a direct consequence of the river current: without the damping segments, the output is pulled toward the only strong anchor left.
The second table isolates this mechanism precisely. R35 and NAIVE remove a comparable amount of tokens — but not the same ones. R35 removes the placebos and preserves the load-bearing segments (format, register, example): impeccable result. NAIVE does the opposite — it removes the format, the register and the example, and keeps the role and the adjectives. Result: length explodes (97 to 166 words), formal address returns, formatting artifacts appear. What NAIVE removed is precisely what pulled the model away from its default behavior. The degradation is not caused by the volume removed, but by the nature of what was removed.
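The equal-volume comparison can be sketched as a single trimming function run in two directions. This is an illustration under assumptions: `trim` is a hypothetical helper, size is counted in words rather than tokens, and the four segments with their impact scores are made-up toy values, not data from the experiment.

```python
def trim(segments, budget, drop_inert_first=True):
    """Drop segments until at least `budget` (a fraction of words) is removed.

    segments: list of (text, impact) pairs.
    Returns the kept texts and the fraction of words actually removed.
    """
    total = sum(len(text.split()) for text, _ in segments)
    order = sorted(
        range(len(segments)),
        key=lambda i: segments[i][1],
        reverse=not drop_inert_first,
    )
    dropped, removed = set(), 0
    for i in order:
        if removed / total >= budget:
            break
        dropped.add(i)
        removed += len(segments[i][0].split())
    kept = [text for i, (text, _) in enumerate(segments) if i not in dropped]
    return kept, removed / total


# Toy prompt with invented impact scores (assumptions, for illustration).
toy = [
    ("You are a senior B2B copywriter.", 0.05),
    ("Tone: warm, direct, confident.", 0.08),
    ("Body: 40 words maximum.", 0.50),
    ("Write as to a colleague you respect.", 0.11),
]

kept_targeted, cut_targeted = trim(toy, budget=0.4, drop_inert_first=True)
kept_naive, cut_naive = trim(toy, budget=0.4, drop_inert_first=False)
print(kept_targeted, round(cut_targeted, 2))
print(kept_naive, round(cut_naive, 2))
```

Both calls remove a similar share of the words, but they keep opposite halves of the prompt, which is the contrast the R35 and NAIVE arms are built to expose.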
Conclusion. The experiment establishes two distinct results. First, trimming a prompt is not neutral: it is the segments that oppose the default behavior that carry the output — removing them lets the current take back over. Second, some segments inert under isolated ablation play a structural role: by diluting the influence of a strong segment, they keep it from over-imprinting the output. This segment-interaction phenomenon is not captured by segment-by-segment ablation — it is a methodological limitation of the instrument, which pairwise measurements would address.
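One possible shape for such a pairwise measurement is sketched below. It is an assumption, not a description of the planned version of the instrument: the interaction score is taken as the joint impact of removing two segments minus the sum of their single impacts, and the toy model is rigged so that the example over-imprints only when both the role and the adjectives are gone.

```python
from itertools import combinations


def impact(segments, removed, run_model, distance):
    """Gap between the full-prompt output and the output without `removed`."""
    full = " ".join(segments)
    ablated = " ".join(s for i, s in enumerate(segments) if i not in removed)
    return distance(run_model(full), run_model(ablated))


def pairwise_interactions(segments, run_model, distance):
    """Joint impact minus the sum of single impacts, for every pair."""
    n = len(segments)
    single = [impact(segments, {i}, run_model, distance) for i in range(n)]
    return {
        (i, j): impact(segments, {i, j}, run_model, distance)
        - single[i]
        - single[j]
        for i, j in combinations(range(n), 2)
    }


# Toy stand-ins (assumptions): the fake model copies the example's topic
# only when neither the role nor the adjectives are left to dilute it.
def toy_model(prompt):
    alone = (
        "EXAMPLE" in prompt
        and "ROLE" not in prompt
        and "ADJ" not in prompt
    )
    return "relance sur tes reportings" if alone else "relance sobre"


def toy_distance(a, b):
    return 0.0 if a == b else 1.0


scores = pairwise_interactions(["ROLE", "ADJ", "EXAMPLE"], toy_model, toy_distance)
print(scores)
```

In this toy run every single ablation scores zero, yet the (role, adjectives) pair scores 1.0: that is the damper pattern, invisible segment by segment and visible only when the two are removed together. The cost is quadratic in the number of segments, which is a practical reason to restrict pairs to segments that neighbor a strong one.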
↳ Takeaway
A token reduction on the order of 40% is achievable without measurable degradation, provided you target the redundant segments. Removing the load-bearing segments produces, at an equivalent reduction volume, a significant degradation.
The mechanics, experiment by experiment
We introduced the river current of the LLM at the opening. Now that we have seen the seven results, we can read its fine mechanics: the results let us sketch a mechanistic description of the model's behavior.
Why do the "senior expert" role and the "warm" adjectives produce no measurable effect? These instructions are redundant with the model's default behavior. Why do "no hierarchical distance" or "40 words" act? Because they oppose the default behavior: the default is formal address and unconstrained length; these instructions impose an explicit departure.
Why is "visualize the scene" empty? Because an LLM has no separate mental step to trigger: it produces text, it does not "picture" anything beforehand. Asking it for an internal process is commanding a move that does not exist in its mechanics.
Why can repeating an instruction hurt? Because faithfulness to instructions is shared: there is a kind of attention budget. Over-weighting one constraint (prose) diverts enough of it for a neighbor (length) to give way. It is not a comprehension flaw, it is a reallocation — and it shows more on a model that naturally writes longer (Sonnet) than on a brief one (Haiku). The default is not the same from one model to another; the mechanics are.
And why does an example, too alone, derail? An example is a powerful attractor: the model imitates what it is shown, to the point of importing its details (the "reporting") where they have no business being. In a loaded prompt, the role and the adjectives — inert for quality — were nonetheless diluting that attractor. Remove them all, and you let the example reign and over-imprint. In other words: some segments do not steer the output, they temper another segment. An instruction is judged not only on its own, but by what it balances.
General conclusion: writing a good prompt is not piling up assertions. It is identifying the model's default, spending your words only where it needs correcting, and balancing the forces against one another. The rest is scenery — costly in tokens, sometimes harmful, and always mute.
A single cause behind seven results
Everything inert is redundant with the default: the "professional" instruction (1), the abstract role (2), the adjectives (3), the harmless prohibition and polarity (4), repetition (5), imagination (6). The model was already doing it.
Everything that acts departs from the default, or imposes a hard constraint: the absurd register (1), the concrete example (2), "no hierarchical distance" (3), the numeric limit (4). The model would not have done it on its own.
Operational corollary: segments redundant with the default behavior can be removed without measurable degradation (up to −43% under the tested conditions), with two caveats. Removing a segment that opposes the default lets that default come back; and some segments inert under isolated ablation act as a damper that you release by removing them simultaneously.
Six rules for your prompts
- Don't pay for what the model already does. "Be professional, clear, rigorous" is usually empty.
- Show rather than label. A concrete example carries style better than a "You are a senior expert in…".
- Prefer the actionable directive to the adjective. "No hierarchical distance" acts; "warm" slides off.
- Put numbers on form. "40 words", "a single paragraph" work; "not too long" doesn't. Polarity does not matter — only precision counts.
- State each instruction once. Repeating it does not reinforce it and can unbalance the rest.
- Trim without fear, but aim well. −40% with no loss if you cut the placebos and preserve the load-bearing segments and an example's context.
The limits, because they must be stated
The core of the tests rests on a single model (Haiku 4.5); two experiments are replicated on Sonnet 4.6, none on another vendor. The tasks are short, with three to five scenarios each. Part of the signal relies on semantic distance, which measures change and not correctness, hence the systematic reading of outputs, which is a matter of vigilance and not a guarantee. Two results remain undecidable for lack of headroom (the substance half of exp. 4, and the weakening of competing constraints on Haiku in exp. 5). And the instrument tests segments in isolation: exp. 7 showed that it underestimates their interactions, which the next version will correct.
These results are not universal laws. They are reproducible, obtained through a transparent method, and converge toward the same explanatory principle. They offer an empirical framework for reasoning about the composition of a prompt.
Raw data and results
All the analysis files produced by preatorlabs are made available below. Each JSON file contains the full prompt, the segmentation, the test scenarios, the baseline and ablation outputs, and the computed impact scores. These files can be imported directly into the preatorlabs interface ("Import an analysis" button) to visualize per-segment impact charts.