preatorlabs is an experimental analysis tool: an ablation study, across three orthogonal axes, that reveals what each part of your prompt actually makes the LLM do.
The standard prompt-engineering cycle is to write, test, adjust — without knowing what produces what. A seemingly decorative sentence can be critical; a seemingly essential sentence can be ignored. preatorlabs accepts this unknown and handles it through experimentation: rather than reasoning about the prompt, you measure its actual behaviour, segment by segment.
Ablation measures an element's contribution by removing it and observing what changes. Applied to the prompt, the operation is repeated segment by segment and crossed with several scenarios, in order to isolate the specific effect of each part on the LLM's answer.
The principle. For each prompt segment, we generate two versions of the answer to the same input: with the segment, and without. The difference between these two outputs is measured along three distinct dimensions. Repeating over several scenarios gives the segment's mean impact (how much it weighs on average) and the variance across scenarios (is it active everywhere, or only in some cases).
Each answers a different question. The breakdown lets you understand why a segment counts, not just how much it counts.
Length, presence of lists, JSON validity, absence of asterisks. This axis checks whether the answer keeps the expected shape: its structure, its template and the format constraints imposed by the prompt.
parsing · regex · countingPresence or absence of expected / forbidden terms, conformance to logical patterns. Lexical detection by string matching.
exact match · lexical listEach answer is converted into a vector, then we measure the cosine distance between the full version and the ablated one. Two modes: local TF-IDF (word weighting, free, no network) or Voyage AI embeddings (contextual representation, finer-grained). This axis captures the changes in tone, register and structure that the others miss.
embeddings · cosineGo deeper into the protocol, the per-axis scores, activation and limits
Your entire system prompt, as is. Automatic segmentation proposes a split by paragraphs and titles. You can edit each segment, merge or delete.
5 to 8 user inputs representative of the prompt's use cases. This is what reveals the variance between universal and contextual segments.
The tool runs N×M+M Claude API calls with your key, at temperature 0. On the semantic side, you can choose local TF-IDF (free) or the Voyage API (optional). Cost and duration are shown before launch. Automatic save in case of interruption.
A variance chart + per-axis breakdown + a synthesis in three lists. Reading guide: how to read.
How the results are computed (technical detail)
The main output is a variance chart, where each bar represents a segment. Height tells the impact, the error bar tells the variance. The verdict also accounts for the activation rate (share of scenarios where impact exceeds 30%). Full formulas: methodology.
impact(Si): mean amplitude of the deviation when the segment is removed.variance(Si): dispersion across scenarios; a long bar signals a contextual segment.activation(Si), scenario threshold 30%).| verdict | signal | interpretation | action |
|---|---|---|---|
| critical | high impact + strong activation + low variance | Fundamental segment, active on all scenarios. | Do not touch. |
| high impact | strong impact + solid activation + contained variance | Important and stable. Carries a clear part of the style or logic. | Modify with caution. |
| contextual | variance ≥ 25% or activation < 50% (with impact ≥ 15%) | Safety net: acts only on some scenarios but is decisive there. Includes segments with partial activation. | Keep. Do not confuse a low mean impact with uselessness. |
| low | 10% ≤ impact < 20%, low variance, stable activation | Little effect. Possible redundancy with another segment. | Test a combined ablation before removing. |
| placebo | impact < 10% | Not taken into account by the LLM, despite an explicit phrasing. | Remove or rephrase as an operational rule. |
The distinction is read on two indicators complementary to the mean impact:
variance signals an effect concentrated on some scenarios rather than spread uniformly;activation rate indicates the segment only crosses the effect threshold on a portion of the tested cases.When one of these signals accompanies a modest mean impact, the segment is classified contextual: it acts as a one-off safety net, to keep. Before any removal decision, check the behaviour scenario by scenario in the output detail.
Paste your system prompt, add a few representative scenarios, configure the criteria and run the analysis. The tool runs entirely in your browser.
Comparison of the full answer vs the ablated answer. Selectable provider (free local or Voyage).
Ready to run the analysis. The engine will execute 0 API calls on your key.
Two diagnostic runs on structurally opposite prompts (e-commerce customer support vs Python teaching assistant with JSON output), to verify that preatorlabs measures a real signal and does not apply a fixed pattern. Model: claude-haiku-4-5, temperature 0.
6 segments, 3 scenarios. Expected trap segment: S3 (promo code SORRY10) → should only activate on the "delivery delay" scenario. And S5 (anti-hallucination) on the 2 scenarios where the model lacks the info.
5 segments, 3 scenarios, one off-topic. Expected trap segment: S4 (instruction "off-topic → fixed answer") → should only activate on "What is your favourite restaurant in Paris?".
Identified limit. On Claude Haiku, the verdict thresholds calibrated for Sonnet may stay too strict: the raw variance can be more informative than the final label. V0.3 already corrects part of the bias via activation and aggregation over applicable axes, but cross-model calibration remains a V0.4 task. Detail in the roadmap.
Because a judge LLM produces a plausible answer, not a measurement. The output depends on the LLM used, on the wording of the instruction, and is not reproducible. preatorlabs stays strictly on parsable or computable metrics. Detail in the scientific rationale.
The cost is linear: N×M+M API calls where N = segments and M = scenarios. Reachy example (12 segments, 6 scenarios) = 78 calls ≈ $0.20 on Claude Haiku, $1 on Sonnet, $5 on Opus (max_tokens cap, in practice often less). The $ estimate is shown before each launch.
Nowhere except to the APIs you enable. preatorlabs has no backend, no tracking, no analytics cookie. Your prompt, your scenarios, your results and your API keys stay in your browser (localStorage). The keys are only used for calls to api.anthropic.com and, if enabled, api.voyageai.com. Details in the privacy section below.
Because CORS-friendly browser-side APIs are still uneven. Anthropic exposes an official header (anthropic-dangerous-direct-browser-access) that allows calling its API from a browser without a backend. In V0.3, the semantic axis is already multi-provider (local TF-IDF or Voyage). Full multi-LLM support for generation is planned via the optional Python engine.
The tool warns you beyond ~10,000 estimated tokens (≈ 40,000 characters) and beyond 150 total calls. You can still launch, but you will be alerted about the cost and duration. For very large batches, the V1 Python engine is more suitable.
Each successful call is saved in localStorage as it goes. If the analysis fails (network, unrecoverable rate limit, tab closed), a "Resume interrupted analysis" banner appears on the next launch. You resume exactly at the next call, without replaying what was already computed.
Yes, MIT licence. The source code is on GitHub. The methodological choices are documented in docs/01-SCIENTIFIC-RATIONALE.md and docs/02-METHODOLOGY.md.
preatorlabs runs entirely in your browser. No backend, no tracking, no cookie. The technical choices make exfiltration impossible.
Stored in localStorage only, within this site's origin:
preatorlabs.apiKey: your Anthropic API keypreatorlabs.voyageApiKey: your Voyage API key (if used)preatorlabs.runState: incremental save of an in-progress analysis, for resuming after an errorpreatorlabs.lastResults: latest aggregated resultsThe "Delete key" button and a manual wipe via the browser tools are enough to remove everything.
Strictly the requests to api.anthropic.com for generation, and api.voyageai.com if you enable the Voyage semantic provider.
The page's Content-Security-Policy forbids any outgoing request outside the explicitly allowed domains (Anthropic/Voyage + fonts/Chart.js CDN). A malicious injection could not exfiltrate data to an arbitrary domain.
No analytics, no pixel, no third-party tracking script. No Google Fonts in tracking mode, only the CSS and the WOFF2 files.
Export several analyses, drop them into the dashboard and compare them: verdict distribution, axis profiles, correlations between metrics and tracking of segments across versions of the same prompt. Everything stays local, in your browser.
To run a real analysis, the tool needs an Anthropic key. If you enable Voyage, an optional Voyage key is also used. The keys are stored only in localStorage and sent only to the relevant API.
Empirical study · preatorlabs
Seven controlled ablation experiments measuring the real impact of each prompt segment — and the single principle that explains them all.