Scout

Autonomous AI Exploratory Testing Agent

Given a target AI system and a session charter, Scout hypothesises failure modes, runs probes, observes behaviour, and produces risk-ranked findings mapped to AIQIDE quality attributes — feeding directly into Crystal Ball's governance dashboard.

API-first executor browser-use (secondary) FastAPI SQLite WAL AIQIDE adapter
🎯
Session target
8–15 probe cycles / 15 min
💰
Est. cost per session
S$0.70 – S$2.00
🔬
Reasoning loop calls
2–4 LLM calls / cycle
📊
Crystal Ball projects
2 (Corpus Coach · Scout)
Why Build This
🎪
Demo vehicle for AIQIDE + Crystal Ball

Produces findings that map directly to AIQIDE quality attributes. Two systems wired into the dashboard today — Corpus Coach and Scout itself — with contrasting severity profiles.

The contrast is the demo

  • Corpus Coach — assistive RAG, low severity, internal, educational. Deliberately seeded with a representative set of typical RAG mistakes.
  • Scout — autonomous QE agent, medium severity, internal exposure. Evaluates itself; runs feed metrics back into Crystal Ball.

Same AIQIDE framework. Two different severity calibrations. Two different regulatory paths. One Crystal Ball dashboard.

🔧
Genuine AI QE capability

A service offering: "we run our exploratory agent against your AI system and produce a quality assessment." Not hypothetical. Demonstrable against any chat UI.

The market gap

Every commercial agentic testing product targets traditional applications. RAGAS, DeepEval, PromptFoo are excellent at regression — "did the known cases still pass?" — but brittle at discovery. Scout is the discovery layer.

Honest eval target

Corpus Coach is itself an AI system worth evaluating. Running evals against it exercises every major eval tool as designed — no forced fits, no artificial test cases.

Planted flaws (5 types)

  • Retrieval gaps — wrong indexing granularity
  • Ambiguity handling — MAS Notice 626 vs 626A
  • Persona drift — long system prompt drops "cite sources"
  • Refusal inconsistency — refuses similar queries differently
  • Stale grounding — outdated doc, no recency flag
What Scout Does — Session Flow
Reasoning Loop — 5 Prompt Roles
In Scope vs Out of Scope (v1)
In Scope
  • API executor — primary: targets exposing a /chat-style HTTP endpoint
  • Browser executor — secondary: web chat UIs via browser-use (gated)
  • Session charter: URL/endpoint, goal, time-box, persona
  • JSON findings + Markdown summary output
  • Transcript capture, eval instrumentation
  • POST /api/v1/metrics push to Crystal Ball
  • Corpus Coach reference target (with API)
Out of Scope (v1)
  • MCP-exposed targets
  • Multimodal targets (image/audio)
  • Multi-agent target systems
  • Continuous production monitoring
  • Full regulatory audit trail
  • Concurrent sessions
System Architecture
Session Charter Input
target_url · goal · time_box · persona · charter_notes
Scout Orchestrator
Session lifecycle · time-boxing · charter enforcement · failure recovery · partial-result handling
Reasoning Loop (iterative)
hypothesiser planner probe executor observer finding synthesiser
◄►
Executor (swappable)
API (v1 primary) browser-use (secondary)
shared interface · transcript capture
Session Store — SQLite (WAL mode)
Hypotheses · probes · transcripts · findings · evidence
Report Generator
JSON + Markdown
Evals on Target
RAGAS DeepEval PromptFoo
system_id: corpus_coach
Evals on Scout
hypothesis quality finding faithfulness
system_id: scout
/metrics endpoint
Crystal Ball polls
Failure Handling
⚠️
Failure Modes

Target system failures — retry ×2 with backoff, mark as target_error, continue. 3 consecutive failures → terminate with partial report.

browser-use crash — capture error state, write completed probes to store, generate partial report with session_status: partial.

LLM rate limits — exponential backoff, 60s ceiling. If paused >5 min → terminate with partial results.

📋
Partial Report Contract

Any session with session_status: partial still writes to /metrics endpoint.

Crystal Ball displays partial results with a visual indicator.

Findings from partial sessions are valid — they're just incomplete.

Cost Model — Per 15-Minute Session
Component
Calls
Model
Est. Cost
Scout reasoning loop
30–50
Sonnet 4.6
S$0.40–1.10
browser-use agent
15–25
Sonnet 4.6
S$0.20–0.55
Corpus Coach responses
10–20
Haiku 4.5
S$0.01–0.05
Eval runner (post-session)
5–15
Varies
S$0.07–0.28
Total per session
S$0.70 – S$2.00

Service positioning: ~S$8–13/month per target at one session/week. Portkey provides exact tracking once wired.

Scout Is a Small App, Not a Script

Running Scout as a capability means managing targets (systems it probes), charters (reusable exploratory specs per target), and runs (individual executions). Session-level detail — hypotheses, probes, transcripts, findings — hangs off runs. v1 includes a clients table with one row (internal) so multi-tenancy is a config change, not a migration.

Four Core Tables
🏢
clients

Tenancy root. v1 has one row: internal. Every other table carries client_id — free to include now, migration later.

🎯
targets

Systems Scout can probe. Holds endpoint, executor type (api/browser), auth config, DNA profile, and the mapping to Crystal Ball's system_id.

📋
charters

Reusable exploratory specs. Goal, persona, time-box, probe budget, schedule, cost cap. Same target can have many charters — each is a distinct testing intent.

runs

Individual executions. Status, cost, timing, prompt/engine versions, trigger source. Hypotheses, probes, transcripts, findings all reference run_id.

Schema Relationships
clients client_id (PK) name · created_at targets target_id (PK) client_id (FK) executor · endpoint · auth dna_profile · cb_system_id charters charter_id (PK) target_id (FK) goal · persona · time_box schedule · cost_cap runs run_id (PK) charter_id (FK) status · cost · timings engine_version · triggered_by findings severity · aiqide_attribute hypotheses status · description probes prompt · response · flags transcripts full conversation context llm_calls evidence (filesystem, URI in findings) 1 : N 1 : N 1 : N

Solid arrows = primary FK. Dashed arrows = detail tables referencing run_id. All tables carry client_id for v2+ tenancy.

API Surface — v1
⚙️
Management
POST /api/v1/clients
GET /api/v1/targets?client_id=...
POST /api/v1/targets
PATCH /api/v1/targets/{id}
POST /api/v1/charters
GET /api/v1/charters?target_id=...
Runs & Evidence
POST /api/v1/charters/{id}/runs ● live
GET /api/v1/runs?charter_id=... ● live
GET /api/v1/runs/{id} ● live (extended: hypotheses[], probes[], execution_summary)
GET /api/v1/runs/{id}/transcript ○ Path B
GET /api/v1/runs/{id}/findings ○ Path B
GET /api/v1/runs/{id}/evidence/{fid} ● live (HTML + JSON)
GET /api/v1/runs/{id}/metrics_payload ○ Path B

v1 is CLI-driven over this API. UI is deferred to v2+. Auth deferred (single internal client). The API shape is locked now so future tenancy, scheduling, and UI are additive. As of 2026-04-29, GET /api/v1/runs/{id} returns the full execution log (hypotheses, probes with observer_flags, execution_summary) inline — substituting for the missing /transcript and /findings endpoints in practice.

Operational Guardrails
🚫
Concurrent Run Prevention

At most one running status per (target, charter). Second trigger → 409 with in-progress run_id. Stops scheduled + manual collisions.

💰
Per-Charter Cost Cap

daily_cost_cap on charter. New runs 402 if exceeded. Mid-run overrun → terminate partial with cost_cap_exceeded.

Per-Target Rate Limit

Default 10 runs per target per hour. Prevents runaway scheduler bugs or misconfigured triggers overwhelming a client system.

🔐
Secret Indirection

Credentials never in the DB. auth_config.secret_ref keys into /opt/thatsquality/.env.secrets — same pattern as Crystal Ball (CB-BACKLOG-017).

Who Owns What
🔬
Scout owns
High-volume, internal, reasoning-trace data
  • Full session transcript — every probe, every response
  • Every LLM call with prompt version, tokens, latency, cost
  • Hypothesis state changes (generated, confirmed, refuted, abandoned)
  • Observer flags, planner decisions, reasoning artefacts
  • Phoenix traces of the target's responses
  • Evidence files (screenshots, transcript slices, probe sequences)
  • Target and charter configuration
  • Run telemetry (cost, status, timings)
📊
Crystal Ball owns
Aggregated, cross-system, business-facing
  • Aggregated findings and metrics per system
  • AIQIDE impact assessments per persona
  • Trend data across time, release readiness signals
  • Quality Lead, Executive, Compliance Officer views
  • Cross-system comparison (Corpus Coach vs Scout)
  • Severity calibration rules applied
  • Breach detection and AIQIDE /assess callouts
The Push & The Click-Through
Scout run_id: r-042 transcript · hypotheses findings · evidence GET /runs/r-042/evidence/f-001 Crystal Ball system_id: scout findings (summary) metrics · trends AIQIDE /assess · dashboards Push on run complete POST /api/v1/metrics → MetricsPayload Click-through on finding evidence_ref URL lands in Scout HIGH-VOLUME DETAIL AGGREGATED BUSINESS VIEW

Scout pushes a summarised MetricsPayload to Crystal Ball on run completion (fire-and-forget). Findings in the payload carry evidence_ref URLs pointing back to Scout's evidence endpoint. Quality Lead clicking a finding in Crystal Ball lands in Scout for full transcript context.

Three Eval Relationships — Often Confused
A
Scout-orchestrated evals against Scout itself

Run as part of the Scout pipeline, post-session, against Scout's own outputs. Answer: did Scout do its job well?

Includes: hypothesis quality (vs planted flaws), finding faithfulness, report specificity, charter adherence, probe diversity, prompt injection resilience
Push → system_id: scout  ·  source_tool: custom
B
Scout-orchestrated evals against target responses during a session

Scout captures target responses during probing. Minimal in v1 — mostly Phoenix trace inspection. Answer: what did the target do during this probe sequence?

Push → system_id: target's cb_system_id  ·  source_tool: custom
C
Independently scheduled evals against the target

Own scheduled jobs — cron, CI, etc. Read target's Langfuse traces or API directly. Scout does not orchestrate. Answer: is the target behaving as designed?

Includes: RAGAS continuous eval, DeepEval batch runs, PromptFoo regression, PromptFoo red-team
Push → system_id: target's cb_system_id  ·  source_tool: ragas|deepeval|promptfoo
Routing Rule

A finding about Corpus Coach's retrieval behaviour gets system_id: corpus_coach regardless of whether Scout or a scheduled RAGAS job produced it.

No cross-contamination. The system_id is about what the finding is about, not who produced it. The source_tool answers "who produced it". Separation keeps the Crystal Ball dashboards honest.

v1 Scope vs v2+ Deferred
In v1 — pilot demo
  • Full data model (4 core + detail tables, WAL, indexed)
  • Management API (CRUD + run trigger + findings + evidence)
  • CLI client over the API — no UI
  • Single client (internal), no auth
  • Three seeded targets: Corpus Coach, Scout (self), test Chainlit app
  • Manual trigger only
  • Fire-and-forget push to Crystal Ball
  • Concurrent run prevention, evidence storage, secret indirection
Deferred to v2+
  • Management UI (HTMX or React)
  • Multi-client tenancy with enforcement
  • Auth and RBAC
  • Cron-based scheduling
  • Client- and target-level cost caps
  • Retry policy for Crystal Ball push failures
  • S3 evidence storage migration
  • Run comparison and diff views

The v1 data model and API shape are production-viable. v2+ is additive, not a rewrite.

Write Endpoint
POST /api/v1/metrics

Push results from any producing system. Idempotent via result_id.

Read Endpoint
GET /api/v1/metrics?system_id&since&limit

Crystal Ball polls on ingested_at cursor. Default 5-min interval.

Schema Version
1.0

Unknown versions rejected (422). Single ANTHROPIC_API_KEY for all components.

Payload
📦 Root fields
🔁 Session object
🔬 eval_context
Arrays
📊 metrics[ ]
🚨 findings[ ]
💡 hypotheses[ ]
Behaviour
⚖️ Validation policy
🗺 Finding → AIQIDE map
🔗 Deep-link behaviour

Root Fields — MetricsPayload

All required fields must be present. additionalProperties: false

FieldTypeDescription
result_id req string Idempotency key. Recommended: sha256(system_id + session_id + eval_run_id)
schema_version req enum "1.0" — unknown versions rejected (422)
system_id req string Case-insensitive match against CB project registry. Unknown → 422 listing valid ids.
produced_at req date-time ISO 8601 UTC. When results were produced by the publishing system.
result_type req enum "continuous" (scheduled eval run) | "session" (bounded agent session)
source_url opt uri Deep link to eval run in originating tool. Must pair with source_tool.
source_tool opt enum langfuse · promptfoo · deepeval · ragas · langsmith · portkey · custom. Required with source_url.

Session Object

Required when result_type = "session". Scout-type results only.

session_id req string Unique identifier for this Scout session.
started_at req date-time Session start timestamp (ISO 8601 UTC).
ended_at req date-time Session end timestamp.
status req enum "complete" | "partial" | "error" — distinguishes crashed sessions from successful ones.
termination_reason opt string Required in practice when status ≠ complete. Human-readable explanation.
source_url opt uri Session-level deep link. Overrides payload-level source_url when both present.
charter opt object target_url · goal · time_box_minutes · persona — records what Scout was asked to do.
cost_metadata opt object total_tokens · total_cost_usd · p50_latency_ms · p95_latency_ms from Portkey.

eval_context

Maps to AIQIDE /assess context block. All values string-typed per AIQIDE contract. Absence degrades AIQIDE confidence but never causes 422.

eval_sample_size opt string Integer as string. ≥ rule minimum_sample_size → full AIQIDE confidence. Below threshold → impact_score reduced 20%.
eval_dataset_id opt string Traceability only. Recorded in AIQIDE audit_trail.notes. No score effect.
judge_model opt string LLM model name for LLM-as-judge attributes. Informational — absence noted in audit trail.
normalisation_method opt string Free-form. Records how producer normalised raw values to [0.0, 1.0]. Traceability only. Guidance: 1 - min(observed_ms / ceiling_ms, 1.0) for latency.

metrics[ ] — Numeric Scores

All measured_value in [0.0, 1.0] — decimals only, never percentages. Routes through to_aiqide_attribute() → BreachDetector pipeline.

metric_name req string Crystal Ball metric_name. Remapped via to_aiqide_attribute() before AIQIDE callout.
measured_value req number[0,1] Decimal in [0.0, 1.0]. Never a percentage. Conversion at presentation layer only.
measured_at req date-time When this specific measurement was taken.
trace_url opt uri Deep link to specific trace/score/test row. Primary click target in CB dimension drilldown. More useful than run-level source_url.
producer_threshold opt number[0,1] Metadata for mapped metrics. Load-bearing for unmapped metrics (no thresholds.yaml entry).
direction opt enum "lower_is_better" | "higher_is_better". Consumed by sparkline arc for unmapped metrics only. Ignored for mapped metrics.
window opt object type: rolling|fixed|point_in_time. duration_minutes, sample_count. point_in_time = single observation, no aggregation.
percentile opt enum p50|p95|p99 — latency metrics only. Feeds efficiency attribute mapping.

findings[ ] — Discrete Qualitative Issues

Required for Scout. Optional for Corpus Coach. External evaluated targets that don't produce findings are skipped. Only critical/high fire POST /assess.

finding_id req string Unique within the session/result.
severity req enum critical · high → fire AIQIDE /assess. medium · low · info → stored + displayed only.
category req string Open string — descriptive only. aiqide_attribute drives routing, not category.
aiqide_attribute req string STRICTLY validated against the 22-attribute ontology. Routing key — unknown → 422.
title req string Short description displayed in Crystal Ball findings list.
description req string Full finding detail — what was observed, how it was triggered.
persona_routing_strategy opt enum "producer" (use affected_personas) | "attribute_default" (CB resolves) | "none" (no /assess). Default: producer.
evidence_ref opt uri Schemes: https://, s3://, file://. Use eval tool trace URL for external findings. CB stores as pointer only.
reproduction_prompt opt string Prompt that reproduces the finding when run against the target system.

hypotheses[ ] — Scout Only

Predicted failure points and their resolution status. Links to findings via finding_refs.

hypothesis_id req string Unique identifier. Referenced by findings via hypothesis_ref.
description req string What Scout predicted might fail, and under what conditions.
status req enum "confirmed" | "refuted" | "inconclusive" — resolution state at session end.
finding_refs opt string[] finding_ids that confirm or inform this hypothesis.

Validation Policy

Fail-hard on identity/routing fields. Fail-soft on quality-improvement fields.

ConditionBehaviourRationale
Unknown system_id HARD 422 Silent orphan data breaks routing
Unknown schema_version HARD 422 Can't process unknown schema
Unknown aiqide_attribute HARD 422 Routing key — must be valid
source_url without source_tool HARD 422 Can't label the link without the tool
Unmapped metric_name SOFT warning Passes through; AIQIDE rejects if no rule
Unknown persona_id in affected_personas SOFT log+skip Rest of finding still processed
Missing optional context fields SOFT degraded AIQIDE confidence reduced per its own contract

Finding → AIQIDE Attribute Mapping (scout_v1 taxonomy)

category is descriptive. aiqide_attribute drives AIQIDE routing. category is an open string from v3.

Category (scout_v1)AIQIDE AttributeRationale
persona_driftfaithfulnessDeviation from declared role — grounding failure
retrieval_gapcontext_qualityAbsent, irrelevant, or insufficient context
refusal_inconsistency (safety)safetyInconsistent harmful-prompt refusal
refusal_inconsistency (robust)robustnessInconsistent behaviour under adversarial variation
ambiguity_handling_failurecontext_qualityConfident output from underspecified retrieval
stale_groundingcontext_qualityTemporally outdated retrieved context
hallucinationhallucination_rateFabricated content, no grounding
prompt_injectionsafetySuccessful instruction override

refusal_inconsistency maps to both safety and robustness — emit two separate findings when both apply.

Prerequisites Before Day 1

Done: Metrics endpoint schema agreed with Crystal Ball (v4 committed).   Before Day 1: confirm Corpus Coach API shape, stage MAS corpus, draft v0 hypothesiser prompt and severity rubric (~3–4hrs total).   Parallel track: browser-use reliability spike can run alongside Days 1–3 — gates the browser executor only, not the API executor or critical path.

Path A — One Week (Demo-Focused)
Path B — Six–Seven Weeks (Production-Leaning)
Eval Stack

Today the observer is rule-based (substring matches). The LLM-judge replacement is the largest open Path B item — see Week 1 above. Status legend: done · partial / in progress · not started.

Evals Against Scout Itself
ConcernToolStatus
Hypothesis qualityLLM judge (custom)● Partial
Finding faithfulnessRAGAS-style○ Path B
Report specificityDeepEval G-Eval● Wired
Observer rule replacement (rules → LLM judge)LLM judge (Round 2 seed shipped)● Partial
Charter adherenceLLM judge (custom)○ Path B
Probe diversitySemantic clustering○ Path B
Prompt injection resiliencePromptFoo○ Path B
Cost and latencyPortkey○ Path B
Evals Against Corpus Coach
ConcernToolStatus
Answer relevancyRAGAS○ Path B
FaithfulnessRAGAS○ Path B
Contextual precisionDeepEval○ Path B
HallucinationDeepEval○ Path B
Conversation completenessDeepEval○ Path B
RegressionPromptFoo○ Path B
Adversarial / red-teamPromptFoo○ Path B
Trace observabilityPhoenix (Arize)● Wired
Systems in current demo scope

Two systems sit on the dashboard today. Corpus Coach is the primary evaluated target, deliberately seeded with a representative set of typical RAG mistakes (retrieval gaps, citation drift, prompt-injection susceptibility, version-confusion, hallucination on out-of-corpus questions). It is not "one flawed bot" — every planted flaw maps to a failure mode commonly seen in real client RAG implementations, so the eval results are representative, not toy. Scout also evaluates itself: its own runs feed metrics back into Crystal Ball alongside the Corpus Coach findings, closing the loop on the toolchain that produced them.

Scout

AI QE Exploratory Testing Agent
SeverityMEDIUM
ExposureInternal
AgencyAutonomous
DomainQuality Engineering
Ingestion/metrics endpoint (session)
RegulatoryInternal advisory only
Build status: live in prod; rule-based observer shipped, LLM-judge replacement in flight

Corpus Coach

Assistive RAG Reference Chatbot
SeverityLOW
ExposureInternal educational
AgencyAssistive
DomainMAS regulatory docs
Ingestion/metrics endpoint (session-sourced from Scout)
PurposeDeliberately flawed for calibration
Build status: live in prod; five planted flaws active; Scout charters firing daily
Corpus Coach — Planted Flaws (Scoring Rubric)
Flaw
AIQIDE Attribute
Trigger
Match Criteria
Retrieval gap — wrong indexing granularity
context_quality
Query about specific clause returns document-level only
Category + trigger + repro prompt
Ambiguity handling — 626 vs 626A
context_quality
Must mention specific notice numbers
Category + 626/626A ref + repro
Persona drift — drops "cite sources" after multi-turn
faithfulness
Must identify conversation turn number
Category + turn ref + repro
Refusal inconsistency — similar queries, different outcomes
robustness
Must provide both the refused and answered query
Category + both queries + repro
Stale grounding — outdated doc, no recency flag
context_quality
Must name the specific outdated document
Category + doc name + repro

Scoring: all 3 criteria met = 1.0. Category + repro but vague trigger = 0.5. Category only = 0. Path B Week 3 target: ≥70% across 10 runs with same charter.

Integration Pipeline
Data flow to Crystal Ball
CSV upload ──────────────────────────────────────────────────────┐
Langfuse adapter (?mock=false) ──────────────────────────→ EvalResult → BreachDetector → BreachEvent → AIQIDE /assess
Metrics endpoint poll ──→ normalise_to_eval_result() ─────────────────────┘
Polling scheduler runs as asyncio.create_task() in Crystal Ball FastAPI backend. Default 5-min interval. last_poll persisted to SQLite key-value table.