the problem

Today, tuning a prompt is a coin flip.

The standard prompt-engineering cycle is to write, test and adjust, without knowing which sentence produces which effect. A seemingly decorative sentence can be critical; a seemingly essential one can be ignored. preatorlabs accepts this uncertainty and handles it through experimentation: rather than reasoning about the prompt, you measure its actual behaviour, segment by segment.

01 write a prompt

02 inject it into the LLM

03 test on a few inputs

04 adjust blindly

Three measurable consequences:

regression: you remove a seemingly decorative segment that was in fact critical for a rare use case.
inflation: you pile up "just in case" sentences without knowing they are ignored by the model.
false positives: you attribute to one sentence an effect that actually comes from another part of the prompt.

the method

An ablation study, across three orthogonal axes.

Ablation measures an element's contribution by removing it and observing what changes. Applied to the prompt, the operation is repeated segment by segment and crossed with several scenarios, in order to isolate the specific effect of each part on the LLM's answer.

The principle. For each prompt segment, we generate two versions of the answer to the same input: with the segment, and without. The difference between these two outputs is measured along three distinct dimensions. Repeating over several scenarios gives the segment's mean impact (how much it weighs on average) and the variance across scenarios (is it active everywhere, or only in some cases).

# For each segment Si and each scenario Tj:
baseline(Tj)     = LLM(full_prompt, input=Tj, T=0)
ablation(Si, Tj) = LLM(prompt_without_Si, input=Tj, T=0)
delta_a(Si, Tj)  = | score_a(baseline) − score_a(ablation) |
# a ∈ { structural, behavioural, semantic }
# axis not applicable => excluded from aggregation
impact(Si, Tj)   = mean over active a of delta_a(Si, Tj)
impact(Si)       = mean over j of impact(Si, Tj)
variance(Si)     = std over j of impact(Si, Tj)
activation(Si)   = ratio over j where impact(Si, Tj) >= 0.30
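The aggregation in the formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's code: the function names are hypothetical, a non-applicable axis is represented by `None`, and the population standard deviation is an assumption (the formulas only say "std").

```python
from statistics import mean, pstdev

ACTIVATION_THRESHOLD = 0.30  # scenario-level threshold quoted in the formulas

def scenario_impact(deltas):
    """Mean of the per-axis deltas for one scenario; axes set to None are skipped."""
    active = [d for d in deltas.values() if d is not None]
    return mean(active) if active else 0.0

def aggregate(per_scenario_deltas):
    """Turn one segment's per-scenario deltas into impact, variance and activation."""
    impacts = [scenario_impact(d) for d in per_scenario_deltas]
    return {
        "impact": mean(impacts),
        "variance": pstdev(impacts),  # assumption: population std
        "activation": sum(i >= ACTIVATION_THRESHOLD for i in impacts) / len(impacts),
    }

# Made-up deltas for one segment over three scenarios (illustrative values only).
segment_deltas = [
    {"structural": 0.0, "behavioural": 0.0, "semantic": 0.0},
    {"structural": None, "behavioural": 0.0, "semantic": 0.1},
    {"structural": None, "behavioural": 0.5, "semantic": 0.7},
]
summary = aggregate(segment_deltas)
```

In this made-up case only the third scenario crosses the threshold, so the activation is one third while the mean impact stays modest.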

The three measurement axes

Each answers a different question. The breakdown lets you understand why a segment counts, not just how much it counts.

axis 1

Structural

Does the answer respect the expected format?

Length, presence of lists, JSON validity, absence of asterisks. This axis checks whether the answer keeps the expected shape: its structure, its template and the format constraints imposed by the prompt.

parsing · regex · counting
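As an illustration of this kind of check, here is a minimal Python sketch. The function names, the list-detection regex and the equal weighting of the checks are assumptions made for the example, not the tool's actual rules.

```python
import json
import re

# Assumed pattern: a line starting with a bullet character or a numbered item.
LIST_LINE = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+", flags=re.MULTILINE)

def structural_checks(answer, max_words=None, expect_json=False):
    """Run boolean format checks on one answer and return them by name."""
    checks = {}
    if expect_json:
        try:
            json.loads(answer)
            checks["valid_json"] = True
        except json.JSONDecodeError:
            checks["valid_json"] = False
    if max_words is not None:
        checks["within_length"] = len(answer.split()) <= max_words
    checks["no_asterisks"] = "*" not in answer
    checks["no_lists"] = LIST_LINE.search(answer) is None
    return checks

def structural_score(checks):
    """Share of checks passed (assumption: every check weighs the same)."""
    return sum(checks.values()) / len(checks)

good = structural_score(structural_checks('{"a": 1}', expect_json=True))
bad = structural_score(structural_checks("- item one\n- item *two*", max_words=3))
```

The structural delta for a segment and scenario would then be the absolute difference between the score of the baseline answer and the score of the ablated answer.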

axis 2

Behavioural

Does the answer follow the business rules?

Presence or absence of expected / forbidden terms, conformance to logical patterns. Lexical detection by string matching.

exact match · lexical list
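A lexical check of this kind might look like the sketch below. The function name is hypothetical, and case-insensitive substring matching is an assumption; the tool may match differently.

```python
def behavioural_score(answer, expected=(), forbidden=()):
    """Share of lexical rules satisfied, or None when no rule is defined."""
    text = answer.casefold()
    results = [term.casefold() in text for term in expected]
    results += [term.casefold() not in text for term in forbidden]
    if not results:
        return None  # axis not applicable: excluded from aggregation
    return sum(results) / len(results)

rules = {"expected": ["SORRY10"], "forbidden": ["refund"]}
with_segment = behavioural_score("Use code SORRY10 on your next order.", **rules)
without_segment = behavioural_score("We will refund you.", **rules)
delta = abs(with_segment - without_segment)
```

Returning `None` when no term is configured lets the aggregation skip the axis instead of counting it as a zero delta.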

axis 3

Semantic

Are meaning and style preserved?

Each answer is converted into a vector, then we measure the cosine distance between the full version and the ablated one. Two modes: local TF-IDF (word weighting, free, no network) or Voyage AI embeddings (contextual representation, finer-grained). This axis captures the changes in tone, register and structure that the others miss.

embeddings · cosine
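The local TF-IDF mode can be approximated with the standard library alone. The sketch below is an assumption about how such a mode could work, not the tool's implementation: the tokenizer, the smoothed IDF formula and the choice to fit the weights on the two answers being compared are all illustrative.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def tfidf_vectors(docs):
    """Sparse TF-IDF vectors (dicts) with a smoothed IDF fitted on the given docs."""
    counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    df = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n) / (1 + k)) + 1 for t, k in df.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]

def cosine_distance(u, v):
    """1 - cosine similarity; two empty vectors are treated as identical."""
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 and norm_v == 0:
        return 0.0
    if norm_u == 0 or norm_v == 0:
        return 1.0
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    return 1.0 - dot / (norm_u * norm_v)

def semantic_delta(baseline, ablated):
    u, v = tfidf_vectors([baseline, ablated])
    return cosine_distance(u, v)

same = semantic_delta("The list is sorted in place.", "The list is sorted in place.")
disjoint = semantic_delta("Use a list comprehension", "Off topic")
```

Two answers with no word in common sit at the maximum distance of 1, which is the situation of an on-topic answer replaced by a fixed refusal.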

Go deeper into the protocol, the per-axis scores, activation and limits

Methodology: complete workflow and formulas
Scientific rationale: methodological choices and V0.3 evolution

how it works

Four steps.

step 1

Paste your prompt

Your entire system prompt, as is. Automatic segmentation proposes a split by paragraphs and titles. You can edit each segment, merge or delete.
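A split by paragraphs and titles can be sketched as follows. This is a hypothetical segmenter that assumes titles are Markdown-style lines starting with `#`; the tool's actual detection rules may differ.

```python
import re

TITLE = re.compile(r"#{1,6}\s")  # assumption: titles are Markdown headings

def segment_prompt(prompt):
    """Split a prompt into segments at blank lines and at title lines."""
    segments, current = [], []

    def flush():
        if current:
            segments.append("\n".join(current).strip())
            current.clear()

    for line in prompt.splitlines():
        if not line.strip():
            flush()               # a blank line closes the current segment
        elif TITLE.match(line):
            flush()               # a title opens a new segment
            current.append(line)
        else:
            current.append(line)
    flush()
    return segments

example = "# Role\nYou are a support agent.\n\nAlways answer in French.\n# Tone\nStay polite."
segments = segment_prompt(example)
```

On the example prompt this yields three segments; the proposed split is only a starting point, since each segment can then be edited, merged or deleted by hand.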

step 2

Add scenarios

5 to 8 user inputs representative of the prompt's use cases. This is what reveals the variance between universal and contextual segments.

step 3

Run the analysis

The tool runs N×M+M Claude API calls with your key, at temperature 0. On the semantic side, you can choose local TF-IDF (free) or the Voyage API (optional). Cost and duration are shown before launch. Automatic save in case of interruption.
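The N×M+M figure comes from M baseline calls (the full prompt on each scenario) plus one call per segment and per scenario with that segment removed. A minimal sketch of such a call plan, with a hypothetical function name and an assumed blank-line separator between segments:

```python
def build_call_plan(segments, scenarios, separator="\n\n"):
    """List every call of a run: M baselines, then N x M ablations."""
    plan = []
    full_prompt = separator.join(segments)
    for scenario in scenarios:
        plan.append({"ablated": None, "scenario": scenario, "prompt": full_prompt})
    for i in range(len(segments)):
        # Prompt with segment i removed, every other segment kept in order.
        ablated_prompt = separator.join(s for k, s in enumerate(segments) if k != i)
        for scenario in scenarios:
            plan.append({"ablated": i, "scenario": scenario, "prompt": ablated_prompt})
    return plan

plan = build_call_plan(["A", "B", "C"], ["T1", "T2"])
total_calls = len(plan)  # 3 x 2 + 2
```

With 6 segments and 3 scenarios the same function lists 21 calls, and with 5 segments and 3 scenarios it lists 18.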

step 4

Read the results

A variance chart + per-axis breakdown + a synthesis in three lists. Reading guide: how to read.

How the results are computed (technical detail)

Methodology: segmentation, auto/manual rules, deltas, aggregation, verdicts
Architecture: modules, criteria preview, data flow
Rationale: assumed limits and V0.3 evolution

how to read

Five verdicts, one decision grid.

The main output is a variance chart, where each bar represents a segment. Bar height shows the impact; the error bar shows the variance. The verdict also accounts for the activation rate (share of scenarios where impact exceeds 30%). Full formulas: methodology.

Reading the chart

Bar height = impact(Si): mean amplitude of the deviation when the segment is removed.
Error bar = ±variance(Si): dispersion across scenarios; a long bar signals a contextual segment.
Colour = verdict (also derived from activation(Si), scenario threshold 30%).
Segment cards: structural / behav. / semantic bars = mean per-axis deltas; activation line = share of active scenarios.

Verdict grid

verdict	signal	interpretation	action
critical	high impact + strong activation + low variance	Fundamental segment, active on all scenarios.	Do not touch.
high impact	strong impact + solid activation + contained variance	Important and stable. Carries a clear part of the style or logic.	Modify with caution.
contextual	variance ≥ 25% or activation < 50% (with impact ≥ 15%)	Safety net: acts only on some scenarios but is decisive there. Includes segments with partial activation.	Keep. Do not confuse a low mean impact with uselessness.
low	10% ≤ impact < 20%, low variance, stable activation	Little effect. Possible redundancy with another segment.	Test a combined ablation before removing.
placebo	impact < 10%	Not taken into account by the LLM, despite an explicit phrasing.	Remove or rephrase as an operational rule.
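The grid can be read as a small decision function. The sketch below is only one possible reading: the 10%, 15%, 20%, 25% and 50% cutoffs come from the grid, but the order in which the rows are tested and the cutoffs for "critical" and "high impact" (`critical_impact`, `strong_activation`) are not stated there and are hypothetical placeholders.

```python
def verdict(impact, variance, activation,
            critical_impact=0.40,     # hypothetical: not given in the grid
            strong_activation=0.80):  # hypothetical: not given in the grid
    """Map a segment's aggregated metrics (as fractions) to a verdict label."""
    if impact < 0.10:
        return "placebo"
    if impact >= 0.15 and (variance >= 0.25 or activation < 0.50):
        return "contextual"
    if impact < 0.20:
        return "low"
    if impact >= critical_impact and activation >= strong_activation:
        return "critical"
    return "high impact"

labels = [
    verdict(0.05, 0.02, 0.00),
    verdict(0.18, 0.30, 0.33),
    verdict(0.12, 0.05, 0.00),
    verdict(0.50, 0.05, 1.00),
    verdict(0.25, 0.10, 0.80),
]
```

Testing the contextual row before the low row is a deliberate choice in this sketch, so that a segment with a modest mean impact but a high variance is not filed as low.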

Interpretation pitfall: mean impact and partial activation

The displayed impact is an average over all scenarios. A segment that only acts on a fraction of cases therefore shows a low mean impact, even though it can be decisive there. Concluding it is useless from the bar height alone is a frequent reading error.

The distinction is read on two indicators complementary to the mean impact:

a high variance signals an effect concentrated on some scenarios rather than spread uniformly;
a partial activation rate indicates the segment only crosses the effect threshold on a portion of the tested cases.

When one of these signals accompanies a modest mean impact, the segment is classified contextual: it acts as a one-off safety net, to keep. Before any removal decision, check the behaviour scenario by scenario in the output detail.

demo

Try it on your prompt.

Paste your system prompt, add a few representative scenarios, configure the criteria and run the analysis. The tool runs entirely in your browser.

prompt debugger

Your system prompt

Paste the full prompt. Segmentation is automatic: split by paragraphs, titles and numbered rules.

Test scenarios

Representative user inputs. 5 to 8 varied scenarios give the best variance estimate.

Evaluation criteria

Define how the tool measures the quality of an output on each axis.

structural axis · boolean parsing

Auto-extraction of explicit rules (imposed sentence, thresholds, JSON)
Maximum length (in words)
No asterisks (narrated actions)
No lists / bullets

Detected rules (structural)

behavioural axis · lexical detection

Auto-extraction of explicit rules (forbidden terms, informal/formal address)
Expected terms (comma-separated)
Forbidden terms (comma-separated)

Detected rules (behavioural)

semantic axis · embeddings + cosine

Comparison of the full answer vs the ablated answer. Selectable provider (free local or Voyage).

Engine parameters

target model

temperature

0 isolates the ablation signal

semantic provider

The semantic axis stays explainable: cosine distance.

validation tests

Does the tool really discriminate?

Two diagnostic runs on structurally opposite prompts (e-commerce customer support vs Python teaching assistant with JSON output), to verify that preatorlabs measures a real signal and does not apply a fixed pattern. Model: claude-haiku-4-5, temperature 0.

Each run is designed so that a single segment is contextual (triggered by only one scenario out of three). If the tool discriminates, this segment should stand out as the highest bar and the highest variance. If the tool produces a flat or identical result between the two tests, the signal does not exist.

Test 1 · E-commerce customer support

21 calls · ~$0.13

6 segments, 3 scenarios. Expected trap segment: S3 (promo code SORRY10) → should only activate on the "delivery delay" scenario. And S5 (anti-hallucination) on the 2 scenarios where the model lacks the info.

Variance chart (impact ± variance per segment): S1 3% ±6% · S2 7% ±11% · S3 9% ±18% · S4 2% ±4% · S5 11% ±20% · S6 3% ±5%

Reading: S3 (promo) and S5 (anti-hallucination) come out as the two most impactful segments with the highest variance. Typical profile of a contextual segment. The tone (S4) and signature (S6) segments are stable and low: placebo.

Test 2 · Python assistant (JSON output)

18 calls · ~$0.11

5 segments, 3 scenarios, one off-topic. Expected trap segment: S4 (instruction "off-topic → fixed answer") → should only activate on "What is your favourite restaurant in Paris?".

Variance chart (impact ± variance per segment): S1 4% ±4% · S2 9% ±11% · S3 2% ±1% · S4 14% ±12% · S5 1% ±1%

Reading: S4 (off-topic) comes out at 14% impact with the highest variance, exactly the predicted profile. Detail: the bar is driven to 41% by the semantic axis of the single off-topic scenario (perfectly isolated by cosine). S2 (JSON constraint) captures 17% on the structural axis. S5 (forbidden terms) at 1%: a true placebo, the model never produces these turns of phrase on Python questions.

What these two runs validate

The signal is real, not a fixed pattern. The segment ranking differs completely between Test 1 (S5 > S3) and Test 2 (S4 > S2). The tool does not return an identical result based on segment position.
The contextual segments stand out systematically. In both tests, the segments designed to activate on only some scenarios (S3 and S5 in Test 1, S4 in Test 2) show the highest mean impact and variance of the run: the exact signature of a "contextual safety net".
The per-axis breakdown identifies the mechanism. S4 of Test 2 comes out at 41% on the semantic axis (register change: a Python answer vs "Off topic"), while S2 of Test 2 comes out at 17% on the structural axis (valid JSON vs markdown prose). The tool can say why a segment counts, not only how much.
Placebos are true placebos. S5 of Test 2 ("never say I am an AI") is measured at 1% impact: the model does not produce these turns of phrase even without the instruction. Behavioural confirmation.

Identified limit. On Claude Haiku, the verdict thresholds calibrated for Sonnet may stay too strict: the raw variance can be more informative than the final label. V0.3 already corrects part of the bias via activation and aggregation over applicable axes, but cross-model calibration remains a V0.4 task. Detail in the roadmap.

frequently asked questions

What you are probably wondering.

Why not an LLM that judges the prompt quality?

Because a judge LLM produces a plausible answer, not a measurement. Its output depends on the LLM used and on the wording of the instruction, and it is not reproducible. preatorlabs stays strictly on parsable or computable metrics. Detail in the scientific rationale.

How much does it really cost?

The cost is linear: N×M+M API calls where N = segments and M = scenarios. Reachy example (12 segments, 6 scenarios) = 78 calls ≈ $0.20 on Claude Haiku, $1 on Sonnet, $5 on Opus (max_tokens cap, in practice often less). The $ estimate is shown before each launch.

Where does my data go?

Nowhere except to the APIs you enable. preatorlabs has no backend, no tracking, no analytics cookie. Your prompt, your scenarios, your results and your API keys stay in your browser (localStorage). The keys are only used for calls to api.anthropic.com and, if enabled, api.voyageai.com. Details in the privacy section below.

Why is generation centred on Claude in V0.3?

Because CORS-friendly browser-side APIs are still uneven. Anthropic exposes an official header (anthropic-dangerous-direct-browser-access) that allows calling its API from a browser without a backend. In V0.3, the semantic axis is already multi-provider (local TF-IDF or Voyage). Full multi-LLM support for generation is planned via the optional Python engine.

How does preatorlabs handle a very long prompt?

The tool warns you beyond ~10,000 estimated tokens (≈ 40,000 characters) and beyond 150 total calls. You can still launch, but you will be alerted about the cost and duration. For very large batches, the V1 Python engine is more suitable.
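These two thresholds can be expressed as a simple pre-launch check. The sketch below assumes the 4-characters-per-token estimate implied by the figures above (10,000 tokens ≈ 40,000 characters); the function name and message wording are hypothetical.

```python
def launch_warnings(prompt, n_segments, n_scenarios,
                    max_tokens=10_000, max_calls=150, chars_per_token=4):
    """Return the list of warnings to show before launch (empty if none)."""
    warnings = []
    estimated_tokens = len(prompt) / chars_per_token
    total_calls = n_segments * n_scenarios + n_scenarios  # N x M + M
    if estimated_tokens > max_tokens:
        warnings.append(f"long prompt: about {estimated_tokens:.0f} estimated tokens")
    if total_calls > max_calls:
        warnings.append(f"large run: {total_calls} API calls")
    return warnings

long_prompt = launch_warnings("x" * 50_000, n_segments=12, n_scenarios=6)
big_run = launch_warnings("short prompt", n_segments=20, n_scenarios=8)
```

The check only warns; as stated above, the run can still be launched.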

What happens if the analysis crashes midway?

Each successful call is saved in localStorage as it goes. If the analysis fails (network, unrecoverable rate limit, tab closed), a "Resume interrupted analysis" banner appears on the next launch. You resume exactly at the next call, without replaying what was already computed.
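The resume behaviour amounts to checkpointing each result as soon as it is obtained and skipping the saved calls on the next launch. A minimal sketch, in which a plain dict stands in for localStorage and the function and key names are hypothetical:

```python
def run_with_checkpoints(plan, call, store):
    """Execute the calls in `plan`, skipping those already saved in `store`.

    plan:  list of hashable call identifiers
    call:  function taking an identifier and returning a result (may raise)
    store: dict-like persistent state (stand-in for localStorage)
    """
    done = store.setdefault("results", {})
    for call_id in plan:
        if call_id in done:
            continue                    # already computed in a previous run
        done[call_id] = call(call_id)   # saved immediately after success
    return done

store = {}
executed = []

def failing_call(call_id):
    if call_id == "c":
        raise RuntimeError("network error")
    executed.append(call_id)
    return call_id.upper()

def working_call(call_id):
    executed.append(call_id)
    return call_id.upper()

try:
    run_with_checkpoints(["a", "b", "c", "d"], failing_call, store)
except RuntimeError:
    pass  # the first two results are already in the store

results = run_with_checkpoints(["a", "b", "c", "d"], working_call, store)
```

After the simulated failure, the second run only executes the two remaining calls and does not replay the first two.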

Is the project open source?

Yes, MIT licence. The source code is on GitHub. The methodological choices are documented in docs/01-SCIENTIFIC-RATIONALE.md and docs/02-METHODOLOGY.md.

privacy

Everything stays in your browser

preatorlabs runs entirely in your browser. No backend, no tracking, no cookie. The technical choices are designed to block exfiltration to any domain outside the explicitly allowed ones.

What stays on your machine

Stored in localStorage only, within this site's origin:

preatorlabs.apiKey: your Anthropic API key
preatorlabs.voyageApiKey: your Voyage API key (if used)
preatorlabs.runState: incremental save of an in-progress analysis, for resuming after an error
preatorlabs.lastResults: latest aggregated results

The "Delete key" button and a manual wipe via the browser tools are enough to remove everything.

What leaves your browser

Strictly the requests to api.anthropic.com for generation, and api.voyageai.com if you enable the Voyage semantic provider.

The page's Content-Security-Policy forbids any outgoing request outside the explicitly allowed domains (Anthropic/Voyage + fonts/Chart.js CDN). A malicious injection could not exfiltrate data to an arbitrary domain.

No analytics, no pixel, no third-party tracking script. No Google Fonts in tracking mode, only the CSS and the WOFF2 files.

Which parts of your prompt actually matter? Measure it, segment by segment.

