05 — Roadmap
Current state, next steps, and what is explicitly out of scope.
V0.1 — Web MVP (deployed)
Status: deployable. Self-contained dist/ folder ready for Vercel / Netlify / Cloudflare Pages (see DEPLOY.md).
- Scientific presentation landing page
- Automatic prompt segmentation (heuristic algorithm)
- Manual editing of detected segments
- Entry of test scenarios
- Configuration of the 3-axis criteria
- Ablation engine wired to the Claude API (key provided by the user)
- Per-axis delta computation (structural, behavioural, semantic)
- Visualisation: variance chart + 3-axis cards + synthesis
- Local storage of the API key + results
- Exponential backoff on 429 / 5xx / 529 (3 attempts, respecting retry-after)
- Incremental save per call → resume without replaying what succeeded
- Anthropic error messages translated into readable text (401, 429, 529, …)
- Input validation (empty prompt, empty scenarios, prompt > 10k tokens)
- Explicit confirmation if the analysis exceeds 150 API calls
- Per-model $ cost estimate in the UI
- Restrictive CSP + HTTPS headers via vercel.json / netlify.toml / _headers
- Open Graph + Twitter Card meta + SVG favicon
- Explicit privacy section + FAQ + 4-step "how it works" journey
- Responsive 360 px → 1920 px, WCAG AA contrast
- Branded 404 page
- Complete documentation (rationale, methodology, architecture, interpretation, roadmap + 3 agent documents)
Intentionally out of scope for V0.1 (moved to V0.2 below): JSON export, import of a past analysis, per-scenario drill-down, combined ablation mode, signed counterproductivity.
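The retry policy listed above (three attempts, exponential backoff on 429 / 5xx / 529, honouring retry-after) can be sketched as follows. The deployed engine runs as JavaScript in the browser, so this Python version only illustrates the policy. The names `call_with_backoff`, `send` and `retry_delay`, the response tuple shape, and the 1-second base delay are assumptions made for the example, not taken from the code base.

```python
import time
from typing import Callable, Mapping, Tuple

# Hypothetical response shape: (status_code, headers, body).
# Header names are assumed to be lower-cased already.
Response = Tuple[int, Mapping[str, str], str]


def is_retryable(status: int) -> bool:
    """429 (rate limit) and any 5xx, which includes 529 (overloaded)."""
    return status == 429 or 500 <= status <= 599


def retry_delay(attempt: int, headers: Mapping[str, str], base: float = 1.0) -> float:
    """Prefer the server's retry-after (seconds) over exponential backoff."""
    value = headers.get("retry-after")
    if value is not None:
        try:
            return max(0.0, float(value))
        except ValueError:
            pass  # retry-after given as an HTTP date: fall back to backoff
    return base * (2 ** attempt)


def call_with_backoff(
    send: Callable[[], Response],
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send() up to max_attempts times, waiting between retryable failures."""
    response = send()
    for attempt in range(max_attempts - 1):
        status, headers, _body = response
        if not is_retryable(status):
            break
        sleep(retry_delay(attempt, headers))
        response = send()
    return response
```

Errors such as 401 are returned immediately, since retrying an invalid key cannot succeed. Passing `sleep` as a parameter keeps the policy testable without real waiting.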
V0.2 — Robustness (short term)
Incremental improvements with no architectural change.
- JSON export of results (for archiving and comparison)
- Import of a previous analysis
- Inline per-scenario output drill-down in the segment cards (baseline vs ablation comparison, collapsible panel, mobile-first)
- Refined calibration of verdict thresholds on a corpus larger than Reachy (observation 2026-05-26: on Haiku 4.5, 11 of 12 Reachy segments are classified "low" because the outputs are already very convergent; per-model normalisation or normalisation by the run's global variance to be considered; see 00-AGENT-SMOKE-TEST.md §B.2)
- "Combined ablation" mode: remove 2 segments at once to detect coalitions
- Counterproductivity detection (signed delta, not absolute)
- OG image optimisation (get below 200 KB)
- Call compaction (controlled concurrency at 3-5 req/s, respectful of Anthropic)
- Extraction of the inline <script> to dist/app.js to remove 'unsafe-inline' from script-src and return to a strict CSP (V0.1 had to keep 'unsafe-inline' after the 2026-05-26 regression on Vercel; see 00-AGENT-SMOKE-TEST.md §B.6). The operation is risk-free as long as no innerHTML with user-content is introduced into the HTML.
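To give a sense of what the "combined ablation" mode implies, here is a minimal sketch that enumerates every pair of segments and builds the prompt with both removed. The function names, the blank-line joiner between segments, and the assumption of one API call per pair and per scenario are illustrative choices, not the project's implementation.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def pairwise_ablations(segments: Sequence[str]) -> List[Tuple[Tuple[int, int], str]]:
    """Return ((i, j), prompt_without_segments_i_and_j) for every pair i < j."""
    variants = []
    for i, j in combinations(range(len(segments)), 2):
        kept = [s for k, s in enumerate(segments) if k not in (i, j)]
        # Assumption: segments are re-joined with a blank line.
        variants.append(((i, j), "\n\n".join(kept)))
    return variants


def combined_call_count(n_segments: int, n_scenarios: int) -> int:
    """Number of calls if each pair is run once per scenario: C(n, 2) * scenarios."""
    return (n_segments * (n_segments - 1) // 2) * n_scenarios
```

Under these assumptions, a 12-segment prompt yields 66 pairs, so three scenarios already mean 198 calls, which is above the 150-call confirmation threshold introduced in V0.1.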
V1 — Python reference engine
The browser JS engine has two limits: performance on large batches, and the quality of the semantic embedding (local TF-IDF in V0).
- engine/preatorlabs.py: standalone Python engine
- Real embeddings via Voyage AI (recommended) or local sentence-transformers
- CLI: preatorlabs analyze --prompt prompt.txt --scenarios scn.json
- Normalised JSON output, compatible with the web app format
- Unit tests + integration tests on the Reachy corpus
- PyPI distribution
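The planned command line could be parsed along the following lines. This is only a sketch of the argument handling using the standard `argparse` module. The function name `build_parser` and the help strings are assumptions, and the analysis itself and the JSON output format are deliberately left out because they are not specified here.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Parser for: preatorlabs analyze --prompt prompt.txt --scenarios scn.json"""
    parser = argparse.ArgumentParser(prog="preatorlabs")
    sub = parser.add_subparsers(dest="command", required=True)

    analyze = sub.add_parser("analyze", help="run an ablation analysis")
    analyze.add_argument("--prompt", type=Path, required=True,
                         help="path to the prompt text file")
    analyze.add_argument("--scenarios", type=Path, required=True,
                         help="path to the scenarios JSON file")
    return parser


# Usage: parse the command shown in the roadmap item.
args = build_parser().parse_args(
    ["analyze", "--prompt", "prompt.txt", "--scenarios", "scn.json"]
)
```

Using a subcommand keeps room for later verbs without changing the `analyze` interface.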
V2 — Multi-turn and advanced features
- Multi-turn history scenarios (testable on prompts with conversational memory)
- Mode A for the semantic axis: reference corpus provided by the user
- Configurable weighting of the 3 axes
- n=3 repetition per ablation to absorb residual stochasticity
- Comparative analysis between two versions of the same prompt (semantic diff)
V3 — Multi-LLM
- OpenAI adapter (GPT-4, GPT-4o, o3)
- Gemini adapter
- Mistral / Llama adapter via providers
- Unified LLMAdapter interface
- Comparative view: same prompt, several LLMs, cross-report
- Complementary logprobs signal when the API exposes it (OpenAI partially)
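One possible shape for the unified LLMAdapter interface is a structural protocol that each provider adapter satisfies. Only the name `LLMAdapter` comes from the roadmap. The `complete` method, its parameters, the `EchoAdapter` stand-in and the `run_across` helper are hypothetical and shown purely to illustrate the idea of running the same prompt through several models.

```python
from typing import Dict, List, Protocol


class LLMAdapter(Protocol):
    """Minimal contract a provider adapter would have to satisfy (assumed)."""

    name: str

    def complete(self, system_prompt: str, user_message: str) -> str:
        ...


class EchoAdapter:
    """Stand-in adapter for tests; a real one would call a provider API."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, system_prompt: str, user_message: str) -> str:
        return f"[{self.name}] {user_message}"


def run_across(adapters: List[LLMAdapter], system_prompt: str,
               user_message: str) -> Dict[str, str]:
    """Same prompt and scenario, several LLMs: one output per adapter name."""
    return {a.name: a.complete(system_prompt, user_message) for a in adapters}
```

Because `Protocol` is structural, adapters do not need to inherit from a base class, which keeps each provider integration independent.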
V4 — Community and ecosystem
- Library of reference prompts (benchmarked commons)
- VSCode / Cursor plugin to analyse a prompt from the editor
- Hosted REST API (optional, for CI/CD integration)
- Translated documentation (EN minimum)
Deliberate out-of-scope
These directions were considered and explicitly set aside:
❌ LLM-as-judge — no interpretive evaluation by a third-party LLM. The rigour of the method depends on this exclusion. See 01-SCIENTIFIC-RATIONALE.md §3.
❌ Full Shapley values — untenable combinatorial cost. Possible as an advanced V4+ option with approximate sampling (Monte-Carlo Shapley), not on the main path.
❌ Fine-tuning or learning — preatorlabs is an analysis tool, not an automatic-modification tool. It helps the human decide, it does not decide. This boundary is deliberate.
❌ Automatic prompt generation — out of scope. Many tools already exist for that. preatorlabs solves a downstream problem: understanding an existing prompt.
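For readers unfamiliar with the approximate sampling mentioned for Shapley values, the sketch below shows the standard permutation-sampling estimator. It is not part of preatorlabs. Here `value` is a hypothetical function that scores a set of kept segments, and each call to it would stand for a full ablation run in practice.

```python
import random
from typing import Callable, Dict, FrozenSet, Sequence


def monte_carlo_shapley(
    players: Sequence[str],
    value: Callable[[FrozenSet[str]], float],
    n_samples: int = 200,
    seed: int = 0,
) -> Dict[str, float]:
    """Estimate Shapley values by averaging marginal gains over random orderings."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(n_samples):
        order = list(players)
        rng.shuffle(order)
        coalition: FrozenSet[str] = frozenset()
        previous = value(coalition)
        for p in order:
            coalition = coalition | {p}
            current = value(coalition)
            totals[p] += current - previous  # marginal contribution of p
            previous = current
    return {p: total / n_samples for p, total in totals.items()}
```

Each sampled ordering costs one evaluation per segment plus one for the empty set, whereas the exact computation needs every subset: 2^12 = 4096 coalitions for a 12-segment prompt. That gap is why the full version is set aside.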
For future contributors
If you take over the project, here are the questions to ask before each PR:
- Does it preserve objectivity? No interpretive metric.
- Does it preserve frugality? No combinatorial blow-up.
- Does it preserve readability? A non-expert must be able to read the report.
- Is it documented? Every added method has its section in 02-METHODOLOGY.md.
- Is it falsifiable? A verdict produced by preatorlabs must be contradictable by an independent test.
If any of these answers is no, the change must be explicitly justified; otherwise it is refused.