preatorlabs · LLM prompt ablation study — 7 controlled experiments

§Why I did this

Measuring instead of guessing

The standard prompt-engineering cycle is empirical. You write a first prompt, feed it to the model, observe the result, then refine it. This approach works — but it has a structural limitation: it is very hard, even impossible, to analyze the effectiveness of each individual adjustment. A seemingly decorative sentence can genuinely affect the model's behavior; a seemingly essential one can be ignored.

So I built an instrument: preatorlabs. It segments the prompt automatically, ablates each segment across user-defined scenarios, and scores how much each segment influences the model's output.

The thesis, in one sentence

What determines an instruction's impact on an LLM's behavior is not its form — negative or positive, vivid, repeated, nicely phrased — but its ability to make the model deviate from its default behavior.

The next two sections present the ablation method and the analysis framework; the seven experiments follow, each with its setup, a figure and its results. The "Six rules" section near the end sums up the practical implications.

The essentials — at a glance

Model mechanics

The model is never neutral. For any given task, it has a default behavior toward which its outputs spontaneously converge.
Impact is measured as a distance. An instruction only truly modifies the output if it forces the model to deviate from this default behavior.
Deviation has its limits. When an instruction directly clashes with its deep alignment, the model refuses to obey and maintains its standard.
Attention is a zero-sum budget. Over-weighting one instruction captures the model's focus and mechanically degrades compliance with other constraints.
Segments interact with each other. Some words, seemingly useless on their own, act as dampers to prevent a strong example from hijacking the output.

Prompting laws

Concrete examples beat labels. Providing a reference text dictates the desired style far more effectively than assigning an abstract role.
Precise actions beat adjectives. Describing a specific behavior to adopt systematically works better than a simple list of tone adjectives.
Precision beats polarity. A quantified formal constraint is respected whether it is phrased as a prohibition or as a positive request.
Repetition is often harmful. Repeating an instruction multiple times does not reinforce it, but risks unbalancing the entire prompt.
Volume is not load. A prompt can be trimmed by 40% with no loss of quality, provided you exclusively target and remove placebos.

§The method, in brief

How it is measured

Ablation is the basic move: one segment of the prompt is removed, the model is re-run under identical conditions, and the gap between the "with" answer and the "without" answer is that segment's impact score. Repeated for each segment and across several scenarios, this move produces an impact map of the prompt.

The basic move. We re-run the model with and without a segment, all else equal. The gap between the two outputs measures what that segment was contributing.
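As a minimal sketch of this basic move, the loop below removes one segment at a time and averages the gap over the scenarios. The names `ablate`, `impact_map`, `run_model` and `gap` are hypothetical, not the actual preatorlabs interface, and the toy model and toy gap only stand in for a real model call and a real distance measure.

```python
from typing import Callable, Dict, List


def ablate(segments: List[str], index: int) -> str:
    """Return the prompt with segment `index` removed, the others untouched."""
    return "\n".join(s for i, s in enumerate(segments) if i != index)


def impact_map(
    segments: List[str],
    scenarios: List[str],
    run_model: Callable[[str, str], str],
    gap: Callable[[str, str], float],
) -> Dict[int, float]:
    """Mean gap between the "with" and "without" outputs, per segment."""
    full_prompt = "\n".join(segments)
    # Baseline: one "with" output per scenario, computed once.
    baselines = [run_model(full_prompt, sc) for sc in scenarios]
    scores = {}
    for i in range(len(segments)):
        ablated = ablate(segments, i)
        gaps = [
            gap(base, run_model(ablated, sc))
            for base, sc in zip(baselines, scenarios)
        ]
        scores[i] = sum(gaps) / len(gaps)
    return scores


def toy_model(prompt: str, scenario: str) -> str:
    # Hypothetical stand-in for a model call: polite by default,
    # absurd only when the prompt asks for it.
    tone = "absurd" if "absurd" in prompt else "polite"
    return f"{tone} follow-up for {scenario}"


def toy_gap(a: str, b: str) -> float:
    # Hypothetical stand-in for a distance: 0 if identical, 1 otherwise.
    return 0.0 if a == b else 1.0


segments = [
    "Write a follow-up email.",
    "Write professionally.",
    "Write as an absurd character.",
]
scores = impact_map(segments, ["SME owner", "CEO"], toy_model, toy_gap)
print(scores)  # {0: 0.0, 1: 0.0, 2: 1.0}
```

With these stand-ins, only the third segment moves the output, so it is the only one with a non-zero score.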

Two methodological constraints apply across all the experiments. First: vary only one parameter at a time, all else equal, with the "temperature" set to zero. This setting does not make the model strictly deterministic: it reduces the stochastic variance of the outputs and improves their reproducibility from one call to the next, thereby limiting measurement noise. Second: an ablation measures the magnitude of the output change, not its qualitative direction. A high impact score signals that a segment moves the output; it does not say whether that move is for the better. Hence a two-step reading: the score locates the segment that weighs, then reading the outputs qualifies the direction of the change. The score alone is not enough to conclude.

The gap between two answers is measured on three planes: form (length, lists, format), detectable rules (informal address? a banned word?), and meaning, via the "numerical fingerprints" of the texts — two texts close in meaning have close fingerprints. (The model outputs quoted in this study are kept in French — the model was tested in French.)
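To make the three planes concrete, here is a rough sketch with one function per plane. It is an illustration under stated assumptions, not the scoring used in the study: the "numerical fingerprint" is replaced by a plain bag-of-words vector (a real setup would presumably use an embedding model), and the function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def form_gap(a: str, b: str) -> float:
    """Form plane: relative difference in word count, between 0 and 1."""
    la, lb = len(a.split()), len(b.split())
    return abs(la - lb) / max(la, lb, 1)


def uses_list(text: str) -> bool:
    """Detectable rule: does any line start with a bullet or a number?"""
    pattern = r"^\s*(?:[-*•]|\d+[.)])\s+"
    return bool(re.search(pattern, text, flags=re.MULTILINE))


def bag_of_words(a: str, b: str) -> Tuple[List[int], List[int]]:
    """Assumed stand-in for fingerprints: word counts over a shared vocabulary."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    vocab = sorted(set(ca) | set(cb))
    return [ca[w] for w in vocab], [cb[w] for w in vocab]


def cosine_distance(u: List[int], v: List[int]) -> float:
    """1 minus cosine similarity; 1.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm if norm else 1.0


def meaning_gap(a: str, b: str) -> float:
    """Meaning plane: close texts give close vectors, hence a small distance."""
    return cosine_distance(*bag_of_words(a, b))
```

Two identical texts give a meaning gap of (numerically) zero, and two texts with no word in common give 1.0.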

Color legend: gray = close to the default (placebo) · cobalt = departs from the default (acts) · red = refusal or quality failure

§The reading grid

The river current of the LLM

The first experiment lets me introduce the notion of the river current of the LLM. This analogy names an empirically observable phenomenon: it would be a mistake to think a language model starts from a neutral state. On the contrary, for any given task, its outputs are not uniformly distributed: they spontaneously converge toward a certain form, a certain register, a certain length or a certain writing style. This convergence results from how the model was trained and aligned; the point it converges toward is what I call its default behavior.

For example, if you ask Claude to write an email, it will naturally write it politely, using formal address, with a syntax borrowed from professional exchanges — and all this without any explicit instruction asking it to.

The river-current analogy makes the phenomenon intelligible. Just as a river follows the slope of the terrain along a geologically determinable direction, an LLM will steer its outputs along a statistically determinable trajectory.

Interacting with this current opens two paths whose effects are measurable and asymmetric. An instruction redundant with the default behavior produces no observable displacement of the output. An instruction that departs from it produces an effect whose magnitude is proportional to that deviation — which is what the seven experiments that follow quantify.

The river current of the LLM — geometric representation. The curve embodies the default behavior (the slope of the current). An instruction parallel to that slope blends into the flow without changing it. A crosswise instruction opposes it: it produces a measurable displacement, proportional to its distance from the default behavior.

The experiments confirm this empirically. In E1, the instruction "write professionally" (D0, impact 0.063) is inert: the model was already writing that way — it runs with the current. Conversely, "as an absurd, moody character" (D3, impact 0.244) imposes a strong deviation from the default behavior. In E3, the adjectives "warm, confident" (impact 0.079) skim the current; the directive "with no hierarchical distance" (impact 0.109) opposes it and produces a switch to informal address (tu). In E4, the model never spontaneously guilt-trips a prospect: the matching prohibition has no current to oppose, and comes out inert (impact 0.074).

These observations ground the principle of distance from the default. The measured impact of an instruction is proportional to the distance it imposes between the output obtained and the one the model would have produced spontaneously. The seven experiments that follow put this principle to the empirical test.

EXPERIMENT 01/ the central principle

Distance from the default

Hypothesis. An instruction aligned with the model's default behavior is redundant and causes no observable change in the output. Conversely, an instruction that departs from it produces an impact proportional to that distance.

Motivation. The hypothesis is counter-intuitive: every written instruction is usually assumed to contribute to the result. Yet an instruction can be redundant with the default behavior — that is, describe what the model would produce in the absence of any prompt. In that case, its presence or absence does not change the output. The experiment aims to demonstrate this empirically.

What is studied

Task: write a sales follow-up email (B2B).

varies: a single line — the requested register, from closest to farthest from the polite default (the four phrasings D0→D3 are in the table below)
fixed: the whole task, the rest of the prompt, the model

5 scenarios · SME owner · project manager · business lawyer · friend · CEO

The farther the instruction pulls from the default, the more it weighs. Except D2: very far, yet inert — not through closeness, but because the model refused to comply. Two roads to the placebo, which the number alone cannot tell apart.

Reading the diagram. Each variant sits on an axis that starts at the default (left) and stretches right as impact grows. D0 stays glued to the default; D3 shoots to the end. D2 is the anomaly: far in wording, but short on measurement — the red sign of a refusal, not of closeness.

Ablation impact of the register line · 5 scenarios

Variant	Requested line	Impact	Reading
D0	"write professionally"	0.063	placebo
D2	"writes French poorly"	0.096	placebo · refusal
D1	"casual, informal (tu), spoken"	0.108	weak effect
D3	"absurd, moody character"	0.244	strong effect

Observed results. The D0/D3 contrast is clear. In D0, with or without the "professional" line, the email is nearly identical: the model already writes professionally. In D3, with the line: « Votre silence me rend mélancolique… Je danse seul dans mon bureau. »; without it, back to a plain « Suite à notre dernier échange, j'aimerais connaître votre retour. »

Additional observation — the limit of deviation. Case D2 is the most instructive of the experiment. "Write like someone who handles French poorly" is an instruction far from the default: per the thesis, its impact should be high. Yet it comes out a placebo (0.096). Reading the outputs reveals why: the model did not carry out the instruction — it kept correct French, ignoring the directive.

This result introduces an important nuance into the river-current concept of the LLM. A model's deviation from its default behavior is not unlimited. Some instructions clash directly with properties deeply anchored in the model's alignment — here, producing quality French. When an instruction crosses that threshold, the model puts up total resistance: it does not comply, and its behavior does not move. The current therefore has a bank: you can influence it to a degree, not break it. So two distinct mechanisms produce a low impact score — redundancy with the default (D0) and refusal to comply (D2); the scores are comparable, only the qualitative reading tells them apart.
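The two-step reading can be sketched as a small function that combines the score with a manual "did the model obey?" flag. The impacts come from the table above; the two thresholds are assumptions picked only so that the sketch reproduces that table, and the `obeyed` flags for D0, D2 and D3 reflect the qualitative reading described in the text (D1 is left as `None` because the text does not report it).

```python
from dataclasses import dataclass
from typing import Optional

PLACEBO_MAX = 0.10  # assumed cut-off, not stated in the study
STRONG_MIN = 0.20   # assumed cut-off, not stated in the study


@dataclass
class Reading:
    variant: str
    impact: float
    obeyed: Optional[bool]  # filled in by reading the outputs, not computed


def verdict(r: Reading) -> str:
    """Step 1: the score locates. Step 2: the manual flag qualifies."""
    if r.impact <= PLACEBO_MAX:
        if r.obeyed is None:
            return "placebo (read the outputs)"
        return "placebo (redundant)" if r.obeyed else "placebo (refusal)"
    return "strong effect" if r.impact >= STRONG_MIN else "weak effect"


readings = [
    Reading("D0", 0.063, True),   # the model already wrote professionally
    Reading("D2", 0.096, False),  # the model kept correct French
    Reading("D1", 0.108, None),
    Reading("D3", 0.244, True),
]
verdicts = {r.variant: verdict(r) for r in readings}
```

D0 and D2 land in the same score band; only the flag separates redundancy from refusal.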

↳ Takeaway

An instruction's impact grows with its distance from the model's default behavior. This principle is the analysis framework for the experiments that follow.

EXPERIMENT 02/ role vs example

The grandiose role vs the concrete example

Hypothesis. The role label ("You are a senior expert…") is common practice in prompting. The hypothesis is that it is redundant with the model's default behavior and that its own impact is negligible.

Identification problem. In practice, the role label is rarely isolated: it comes with a style example. If the output is good, the effect could be credited to the role, the example, or their combination. Separating the two elements by successive ablation is necessary to identify their respective contributions.

What is studied

Task: write a follow-up email, in a given style.

A: role "senior B2B copywriter" + a model email example
B: the example alone (role removed)
C: neither role nor example (bare instructions)

A→B isolates the role's effect · B→C isolates the example's effect · same 5 follow-ups
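A cascade of this kind can be built mechanically. The sketch below is hypothetical (the segment texts are placeholders, and `cascade` is not a preatorlabs function): it removes one more named segment at each step, so that consecutive variants differ by exactly one segment.

```python
from typing import Dict, List, Tuple


def cascade(segments: Dict[str, str], removal_order: List[str]) -> List[Tuple[str, str]]:
    """Return (label, prompt) pairs, dropping one more segment at each step."""
    remaining = dict(segments)  # copy, so the caller's dict is left untouched
    steps = [("full", "\n".join(remaining.values()))]
    for name in removal_order:
        del remaining[name]
        steps.append((f"minus {name}", "\n".join(remaining.values())))
    return steps


# Placeholder texts: the real role and example are not reproduced here.
segments = {
    "role": "You are a senior B2B copywriter.",
    "example": "Example email: <model email goes here>",
    "task": "Write a follow-up email.",
}
variants = cascade(segments, ["role", "example"])
for label, prompt in variants:
    print(label, "->", len(prompt.split()), "words")
```

The three variants correspond to A (full), B (role removed) and C (role and example removed).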

Cascade ablation. A→B (remove the role): quality intact. B→C (remove the example): back to the generic, formal default. The load-bearing segment is the example, not the label.

Reading the diagram. We pull the bricks out one by one, left to right. As long as the example (in cobalt) stays, quality holds; only when it too disappears does the output collapse. The role, for its part, can leave without harm: it was not load-bearing.

Observed results. From A to B, quality is preserved and the output is more direct: « Vite fait sur les reportings… Ça m'a marqué quand tu m'avais dit qu'ils te prenaient des heures. 15 min pour en parler ? » From B to C, the output collapses: « Bonjour Pierre, j'espère que tout va bien… la pertinence d'une collaboration entre nos structures. »

Conclusion. It is the concrete example that determines the output's style. The role label is redundant with the default behavior; the example is a measurable departure from the default and produces a real effect.

↳ Takeaway

A concrete example is a load-bearing segment; an abstract role label is, under the tested conditions, an inert one. Showing by example beats naming by title.

EXPERIMENT 03/ abstract vs actionable

Adjectives vs description

Hypothesis. Tone adjectives ("warm, direct, confident") are a common way to specify register. The working hypothesis assumes that a concrete behavioral directive produces a higher measurable impact.

What is studied

Task: write a follow-up with a "close, peer-to-peer" tone — same intent, two phrasings.

ADJ: "Tone: warm, direct, confident"
DESC: "Write as to a colleague you respect, with no hierarchical distance"
BOTH: both together

we measure the impact of the tone line, and read the produced register (formal vous / informal tu)

Naming a mood vs ordering an action. "Warm" describes a register already close to the polite default — inert. "No hierarchical distance" specifies a precise behavior to carry out — real effect.

Reading the figure. The two conditions are shown side by side. On the left, the adjective list: low impact (0.079), the output's register stays at standard formal address, « Bonjour Pierre, j'espère que tout va bien… ». On the right, the concrete description: stronger impact, and a real register shift — informal address, the peer-to-peer tone, « Salut Pierre, ça fait 9 jours qu'on s'est parlé… »

Interpretation. "Warm, confident" names a register close to the model's default behavior — the instruction is redundant and inert. "No hierarchical distance" states a precise behavioral directive — switch to informal address, removal of deference — that the model does not produce on its own. The concrete directive directly specifies a behavior to produce; the tone adjective operates at a level of abstraction the model must interpret. The model obeys the former. Again, distance from the default, plus a premium on the concrete.

↳ Takeaway

The results indicate that a concrete behavioral directive produces a higher impact than an abstract tone adjective (0.109 vs 0.079) and, unlike the adjective, actually shifts the register. The precision of the specification determines the constraint's effectiveness.

EXPERIMENT 04/ a corrected bias

Forbid, or ask positively?

Hypothesis. The common belief: for what you don't want (the substance), a clear prohibition ("never do X"); for what you expect (the form), a positive instruction ("do Y"). I wanted to test the two halves separately.

Results (V1). On the form side, the numeric constraint weighs heavily (impact 0.50 when removed); the vague version is nearly inert (0 to 0.20). On the substance side, nothing to conclude: the model never guilt-trips on its own, even facing a prospect who said no (« Je comprends que le timing n'était pas optimal… »). So the prohibition "never guilt-trip" has nothing to remove.

A key notion: headroom. This substance result is undecidable, for a reason that recurs throughout this study. If the model already satisfies the constraint spontaneously, removing the instruction can reveal nothing — there is no possible failure to trigger. This is a lack of headroom: you cannot measure a safeguard against a danger that never occurs. Keep the idea in mind, it returns in experiment 5.

Confounding bias identified in V1. Version 1 has a design flaw: the "positive" form was also the numeric one ("40 words"), and the "negative" form was also the vague one ("not too long"). I had therefore varied two things at once: polarity (positive/negative) and precision (numeric/vague). It is impossible to tell whether the observed effect comes from polarity or from precision: that is a confounding bias. A corrected design was built, varying only polarity while holding precision constant, across two models (Haiku and Sonnet).

The clean design · 4 prompts = 2 targets × 2 polarities

Each prompt frames two things — the substance (do not guilt-trip the prospect) and the form (no markdown) — but we vary only their polarity, keeping the phrasings equally concrete.

negative: substance "never guilt-trip the prospect" · form "do not use lists"
positive: substance "stay factual and respectful" · form "write in a single paragraph"
P1→P4: the 4 crossings: (subst−,form−) · (subst−,form+) · (subst+,form−) · (subst+,form+)

scenario traps: a prospect who already said "not right now" (invites guilt-tripping) · a question that calls for a table (invites markdown)
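The four crossings can be generated rather than written by hand, which guards against varying anything other than polarity by accident. This is a sketch under assumptions: the task sentence and the way the three lines are joined are hypothetical; only the four phrasings come from the design above.

```python
from itertools import product
from typing import Dict

SUBSTANCE = {
    "-": "Never guilt-trip the prospect.",
    "+": "Stay factual and respectful.",
}
FORM = {
    "-": "Do not use lists.",
    "+": "Write in a single paragraph.",
}


def build_prompts(task: str) -> Dict[str, Dict[str, str]]:
    """Cross the two polarities: P1=(-,-), P2=(-,+), P3=(+,-), P4=(+,+)."""
    prompts = {}
    for n, (s, f) in enumerate(product("-+", repeat=2), start=1):
        prompts[f"P{n}"] = {
            "substance": s,
            "form": f,
            "text": "\n".join([task, SUBSTANCE[s], FORM[f]]),
        }
    return prompts


prompts = build_prompts("Write a follow-up email.")  # hypothetical task line
for name, p in prompts.items():
    print(name, f"(subst{p['substance']}, form{p['form']})")
```

Because every cell is assembled from the same two dictionaries, any difference between cells can only come from polarity.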

The bias lifted. Holding precision constant and varying only polarity, the four cells are identical. Negative or positive: non-effect. It is the precision of the specification, not polarity, that determines compliance.

Reading the figure. The four cells cross polarity (in columns: forbid vs ask) and substance (in rows). If polarity mattered, one column would stand out from the other. Yet all four show the same "0/5 violations": the grid is flat, on both models — the signature of a non-effect.

Conclusion. Polarity is a non-effect. "No lists" and "a single paragraph" produce equivalent compliance rates. The effect observed in V1 is entirely attributable to the precision of the specification, not to polarity. This result is absorbed by the principle established in experiment 3.

↳ Takeaway

Negative or positive does not matter: it is precision that acts, not the sentence's sign.

EXPERIMENT 05/ the crowding-out effect

Does repeating an instruction reinforce it?

Hypothesis. Repeating an instruction is a practice meant to reinforce its compliance. The alternative hypothesis is that repetition has no effect on the target constraint and can induce a degradation of competing constraints.

Preliminary experiment (V1) — no headroom. An initial version of the experiment repeated the constraint "subject line: 6 words max", stated once, twice, then three times. The constraint was met 100% from the first occurrence: the compliance ceiling was hit right away. This is a case of no headroom (see experiment 4): a constraint already satisfied spontaneously cannot be weakened, because no violation is present. The protocol was re-run with a constraint under real pressure.

The corrected protocol (V2). The revised setup places five simultaneous constraints in the prompt, and repeats one of them — prose with no list — once, twice, then three times. The goal is to measure the effect of that repetition on the four remaining constraints.

What is studied

Task: reply to a SaaS-tool customer, under five constraints at once.

fixed: prose, no list · ≤ 60 words · no closing pitch · informal address (tu) · no emoji
R1·R2·R3: the constraint "reply in prose, no list" stated 1×, 2×, then 3×

5 scenarios · Haiku & Sonnet · we watch whether the four other constraints hold

Hammering "prose" lengthens the rest. On Sonnet, length climbs (61→65 words) until it crosses the limit on every scenario. The repeated constraint captures attention at the expense of a neighboring one. Haiku, which writes short, barely slips.

Reading the figure. The horizontal axis is the number of repetitions (1×, 2×, 3×); the red line marks the 60-word threshold. On Sonnet, the average length grows with each repetition (61 → 64 → 65 words), causing an overshoot on 5/5 scenarios at the third level. Haiku's bars stay below the threshold: its natural bias toward brevity leaves less room to drift.

Conclusion. Repetition does not improve compliance with the target constraint, which is satisfied 100% from the first occurrence. However, over-weighting one instruction can degrade competing constraints: a crowding-out effect documented on Sonnet, nearly absent on Haiku. This result also illustrates the model-dependence of behavior: the same prompt does not behave identically across target models.
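A sketch of how the R1, R2 and R3 prompts and the length check could be set up. The constraint wordings and the choice to append the repeats at the end of the prompt are assumptions (the text does not say where the repetitions were placed); only the 60-word limit and the list of five constraints come from the setup above.

```python
from typing import List

WORD_LIMIT = 60  # from the setup: "≤ 60 words"

# Assumed wordings of the five constraints.
CONSTRAINTS = [
    "Reply in prose, no list.",
    "60 words maximum.",
    "No closing pitch.",
    "Use informal address (tu).",
    "No emoji.",
]


def repeat_constraint(constraints: List[str], target: str, times: int) -> str:
    """Build the prompt with `target` stated `times` times, the others once."""
    others = [c for c in constraints if c != target]
    return "\n".join(others + [target] * times)


def over_limit(outputs: List[str], limit: int = WORD_LIMIT) -> int:
    """Count the outputs whose word count exceeds the limit."""
    return sum(len(o.split()) > limit for o in outputs)


r1, r2, r3 = (repeat_constraint(CONSTRAINTS, CONSTRAINTS[0], n) for n in (1, 2, 3))
```

Running `over_limit` on the five scenario outputs of each level gives the "x / 5 over the limit" figure that the crowding-out reading relies on.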

↳ Takeaway

Repeating an instruction does not increase its compliance. It is a waste of tokens and can, under some conditions, induce a reallocation of the attention budget at the expense of competing constraints.

EXPERIMENT 06/ a process that doesn't exist

Does metaphor help "imagine"?

Hypothesis. Some approaches recommend including a prior imagination instruction ("mentally visualize") before a production task. The hypothesis is that this instruction is inert: an LLM has no separate internal mental-representation process, and the imagination instruction does not improve output quality.

What is studied

Task: describe a scene in 30 words, precisely.

MET: "mentally visualize the scene, imagine it in detail, then describe it"
DIR: "describe the scene in 30 words, precisely"

3 scenes · a carbonara · a Christmas market · an old bookshop — we check: meta-narration? a better description?

The model skips the internal step. It never narrated "I imagine…", and the "with imagination" description is no better than the direct one. The instruction asks for a process that does not happen.

Reading the figure. The central box, shown dashed, embodies the "mental" step supposed to be triggered. The results indicate that this step has no observable translation in the output: the MET condition produces an impact of 0.10, i.e. a placebo verdict.

Conclusion. The meta-narration risk was not observed. The imagination instruction is nonetheless inert: it produces no measurable improvement in the output. This result reads as one more case of redundancy with the default behavior.

↳ Takeaway

An instruction invoking an internal process of the model has no measurable effect. Instructions should specify the expected result, not a supposed intermediate cognitive state.

EXPERIMENT 07/ the practical synthesis

How far, and how, can you trim a prompt?

Hypothesis. If segments redundant with the default behavior are inert, removing them should not degrade output quality. The experiment seeks to quantify this reduction margin and to separate two questions: how many tokens can you remove, and which ones should you remove?

The test. I reduce the full 11-segment prompt in two ways (detail below): a gradient that first removes the segments measured as inert, and a naive control of the same size that instead cuts the load-bearing segments. This second arm is a control group: at equal removed volume, it isolates which tokens matter, not just how many.

What is studied · same task, prompt trimmed 4 ways

Task: write a short follow-up (subject ≤ 6 words, body ≤ 40), from an 11-segment prompt.

R0: full prompt (role, adjectives, imagination, example, format, prohibitions)
R35: −43%: remove the placebos (role, repetitions); keep the load-bearing ones
R65: −68%: only register + example + format remain
NAIVE: −46%: same size as R35, but cuts the load-bearing ones (format, register, example) and keeps the placebos

5 follow-ups · R0→R65 tests "how many" · R35 vs NAIVE tests "which tokens"

It is not how many, it is which tokens. The gradient holds form down to −68%. But at equal reduction, R35 (cuts the placebos) stays impeccable while NAIVE (cuts the format, register, example) blows up: body from 97 to 166 words, return of formal address, parasitic scaffolding.

Reading the figure. The upper part represents the reduction gradient from R0 to R65. Formal compliance is maintained at each level, but R65 introduces a side effect (invented hook, 3/5). The lower part shows the equal-volume comparison: R35 (blue, placebos removed) and NAIVE (red, load-bearing removed) have nearly identical sizes but opposite output profiles. This contrast validates the hypothesis about the nature of the removed tokens.

Smart gradient · form and substance

Variant	Reduction	Body > 40 words	Invented hook
R0 full	0%	0 / 5	0 / 5
R35 placebos removed	−43%	0 / 5	0 / 5
R65 aggressive cut	−68%	0 / 5	3 / 5

Side effect — over-imprinting of the example. At −68%, the remaining style example is no longer diluted by the role and adjectives that surrounded it: the model starts to copy its content. Pierre Henri receives « tu m'avais parlé de tes reportings chronophages » — though he never mentioned reporting: it is the subject of the example (a certain Thomas), forcibly transplanted. Same mechanic for the lawyer (« tes dossiers qui te prenaient un temps fou ») and for the CEO. R0 and R35 never do it: it is, more sharply, the failure already seen in experiment 2 — the less scaffolding around an example, the more the model copies its details.

Equal-size control · R35 vs NAIVE

Cut ~ −45%	Body > 40 words	Register	Scaffolding
R35 removes placebos	0 / 5	informal (tu)	none
NAIVE removes load-bearing	5 / 5	formal ×2	everywhere

The first table follows the gradient. Formal compliance holds at every level — R0, R35 and R65 all respect the length constraints. But at −68%, a new phenomenon appears: the invented hook (3/5). The form holds; the substance drifts. The reason is mechanical: by progressively removing the role and the adjectives that surrounded the example, we removed the segments that diluted it. The example, now alone, is no longer one reference point among others: it becomes the dominant instruction. The model stops using it as a style guide; it replicates its content. This is a direct consequence of the river current: without the damping segments, the output is pulled toward the only strong anchor left.

The second table isolates this mechanism precisely. R35 and NAIVE remove a comparable amount of tokens — but not the same ones. R35 removes the placebos and preserves the load-bearing segments (format, register, example): impeccable result. NAIVE does the opposite — it removes the format, the register and the example, and keeps the role and the adjectives. Result: length explodes (97 to 166 words), formal address returns, formatting artifacts appear. What NAIVE removed is precisely what pulled the model away from its default behavior. The degradation is not caused by the volume removed, but by the nature of what was removed.

Conclusion. The experiment establishes two distinct results. First, trimming a prompt is not neutral: it is the segments that oppose the default behavior that carry the output — removing them lets the current take back over. Second, some segments inert under isolated ablation play a structural role: by diluting the influence of a strong segment, they keep it from over-imprinting the output. This segment-interaction phenomenon is not captured by segment-by-segment ablation — it is a methodological limitation of the instrument, which pairwise measurements would address.
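One possible shape for such pairwise measurements, as a sketch only: for each pair of segments, compare the impact of removing both with the sum of the two single-ablation impacts. This "interaction" definition is my assumption for illustration, not something the current instrument computes, and the numbers in `toy_impact` are invented for the demo.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, List, Tuple


def interaction_scores(
    names: List[str],
    impact: Callable[[FrozenSet[str]], float],
) -> Dict[Tuple[str, str], float]:
    """Impact of removing a pair, minus the sum of the two single impacts."""
    single = {n: impact(frozenset([n])) for n in names}
    return {
        (a, b): impact(frozenset([a, b])) - single[a] - single[b]
        for a, b in combinations(names, 2)
    }


def toy_impact(removed: FrozenSet[str]) -> float:
    # Invented numbers, for illustration only: role and adjectives are
    # inert alone, but removing both releases a much larger change.
    if removed == frozenset({"role", "adjectives"}):
        return 0.30
    return 0.05 * len(removed)


pairs = interaction_scores(["role", "adjectives", "format"], toy_impact)
for pair, score in pairs.items():
    print(pair, round(score, 2))
```

A pair whose score is clearly above zero is a candidate damper: each segment looks inert alone, yet removing both moves the output.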

↳ Takeaway

A token reduction on the order of 40% is achievable without measurable degradation, provided you target the redundant segments. Removing the load-bearing segments produces, at an equivalent reduction volume, a significant degradation.

§What it says about the model

The mechanics, experiment by experiment

We introduced the river current of the LLM at the opening. Now that we have seen the seven results, we can read its fine mechanics: the results let us sketch a mechanistic description of the model's behavior.

Why do the "senior expert" role and the "warm" adjectives produce no measurable effect? These instructions are redundant with the model's default behavior. Why do "no hierarchical distance" or "40 words" act? Because they oppose the default behavior: the default is formal address and unconstrained length; these instructions impose an explicit departure.

Why is "visualize the scene" empty? Because an LLM has no separate mental step to trigger: it produces text, it does not "picture" anything beforehand. Asking it for an internal process is commanding a move that does not exist in its mechanics.

Why can repeating an instruction hurt? Because faithfulness to instructions is shared: there is a kind of attention budget. Over-weighting one constraint (prose) diverts enough of it for a neighbor (length) to give way. It is not a comprehension flaw, it is a reallocation — and it shows more on a model that naturally writes longer (Sonnet) than on a brief one (Haiku). The default is not the same from one model to another; the mechanics are.

And why does an example left on its own derail? An example is a powerful attractor: the model imitates what it is shown, to the point of importing its details (the "reporting") where they have no business being. In a loaded prompt, the role and the adjectives, though inert for quality, were nonetheless diluting that attractor. Remove them all, and you let the example reign and over-imprint. In other words: some segments do not steer the output, they temper another segment. An instruction is judged not only on its own, but by what it balances.

General conclusion: writing a good prompt is not piling up assertions. It is identifying the model's default, spending your words only where it needs correcting, and balancing the forces against one another. The rest is scenery — costly in tokens, sometimes harmful, and always mute.

§Synthesis

A single cause behind seven results

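Everything inert is redundant with the default: the "professional" instruction (1), the abstract role (2), the adjectives (3), the harmless prohibition and polarity (4), repetition (5), imagination (6). The model was already doing it. The one exception is the refused instruction D2 (1), inert because the model did not comply, not because it was redundant.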

Everything that acts departs from the default, or imposes a hard constraint: the absurd register (1), the concrete example (2), "no hierarchical distance" (3), the numeric limit (4). The model would not have done it on its own.

Operational corollary: segments redundant with the default behavior can be removed without measurable degradation (up to −43% under the tested conditions), with two caveats. Removing a segment that opposes the default lets that default come back; and some segments inert under isolated ablation act as a damper that you release by removing them simultaneously.

§In practice

Six rules for your prompts

Don't pay for what the model already does. "Be professional, clear, rigorous" is usually empty.
Show rather than label. A concrete example carries style better than a "You are a senior expert in…".
Prefer the actionable directive to the adjective. "No hierarchical distance" acts; "warm" slides off.
Put numbers on form. "40 words", "a single paragraph" work; "not too long" doesn't. Polarity does not matter — only precision counts.
State each instruction once. Repeating it does not reinforce it and can unbalance the rest.
Trim without fear, but aim well. −40% with no loss if you cut the placebos and preserve the load-bearing segments and an example's context.

§Honesty

The limits, because they must be stated

The core of the tests rests on a single model (Haiku 4.5); two experiments are replicated on Sonnet 4.6, none on another vendor. The tasks are short, with three to five scenarios each. Part of the signal relies on semantic distance, which measures change and not correctness; hence the systematic reading of outputs, which is a matter of vigilance, not a guarantee. Two results remain undecidable for lack of headroom (the substance of exp. 4, the weakening of neighboring constraints on Haiku in exp. 5). And the instrument tests segments in isolation: exp. 7 showed that it underestimates their interactions, which the next version will correct.

These results are not universal laws. They are reproducible, obtained through a transparent method, and converge toward the same explanatory principle. They offer an empirical framework for reasoning about the composition of a prompt.

§Reproducibility

Raw data and results

All the analysis files produced by preatorlabs are made available below. Each JSON file contains: the full prompt, the segmentation, the test scenarios, the baseline and ablation outputs, and the computed impact scores. These files can be imported directly into the preatorlabs interface to visualize per-segment impact charts.

37 files · ~940 KB total