Scout

Autonomous AI Exploratory Testing Agent

Given a target AI system and a session charter, Scout hypothesises failure modes, runs probes, observes behaviour, and produces risk-ranked findings mapped to AIQIDE quality attributes — feeding directly into Crystal Ball's governance dashboard.

API-first executor browser-use (secondary) FastAPI SQLite WAL AIQIDE adapter

🎯

Session target

8–15 probe cycles / 15 min

💰

Est. cost per session

S$0.70 – S$2.00

🔬

Reasoning loop calls

2–4 LLM calls / cycle

📊

Crystal Ball projects

2 (Corpus Coach · Scout)

Why Build This

🎪

Demo vehicle for AIQIDE + Crystal Ball

Produces findings that map directly to AIQIDE quality attributes. Two systems wired into the dashboard today — Corpus Coach and Scout itself — with contrasting severity profiles.

The contrast is the demo

Corpus Coach — assistive RAG, low severity, internal, educational. Deliberately seeded with a representative set of typical RAG mistakes.
Scout — autonomous QE agent, medium severity, internal exposure. Evaluates itself; runs feed metrics back into Crystal Ball.

Same AIQIDE framework. Two different severity calibrations. Two different regulatory paths. One Crystal Ball dashboard.

🔧

Genuine AI QE capability

A service offering: "we run our exploratory agent against your AI system and produce a quality assessment." Not hypothetical. Demonstrable against any chat UI.

The market gap

Every commercial agentic testing product targets traditional applications. RAGAS, DeepEval, PromptFoo are excellent at regression — "did the known cases still pass?" — but brittle at discovery. Scout is the discovery layer.

✅

Honest eval target

Corpus Coach is itself an AI system worth evaluating. Running evals against it exercises every major eval tool as designed — no forced fits, no artificial test cases.

Planted flaws (5 types)

Retrieval gaps — wrong indexing granularity
Ambiguity handling — MAS Notice 626 vs 626A
Persona drift — long system prompt drops "cite sources"
Refusal inconsistency — refuses similar queries differently
Stale grounding — outdated doc, no recency flag

What Scout Does — Session Flow

Reasoning Loop — 5 Prompt Roles

In Scope vs Out of Scope (v1)

✓

In Scope

API executor — primary: targets exposing a /chat-style HTTP endpoint
Browser executor — secondary: web chat UIs via browser-use (gated)
Session charter: URL/endpoint, goal, time-box, persona
JSON findings + Markdown summary output
Transcript capture, eval instrumentation
POST /api/v1/metrics push to Crystal Ball
Corpus Coach reference target (with API)

✗

Out of Scope (v1)

MCP-exposed targets
Multimodal targets (image/audio)
Multi-agent target systems
Continuous production monitoring
Full regulatory audit trail
Concurrent sessions

System Architecture

Session Charter Input
target_url · goal · time_box · persona · charter_notes

Scout Orchestrator
Session lifecycle · time-boxing · charter enforcement · failure recovery · partial-result handling

Reasoning Loop (iterative)

              hypothesiser
              planner
              probe executor
              observer
              finding synthesiser
            

◄►

Executor (swappable)

              API (v1 primary)
              browser-use (secondary)
            
shared interface · transcript capture

Session Store — SQLite (WAL mode)
Hypotheses · probes · transcripts · findings · evidence

Report Generator

JSON + Markdown

Evals on Target

RAGAS DeepEval PromptFoo

system_id: corpus_coach

Evals on Scout

hypothesis quality finding faithfulness

system_id: scout

/metrics endpoint

Crystal Ball polls

Failure Handling

⚠️

Failure Modes

Target system failures — retry ×2 with backoff, mark as target_error, continue. 3 consecutive failures → terminate with partial report.

browser-use crash — capture error state, write completed probes to store, generate partial report with session_status: partial.

LLM rate limits — exponential backoff, 60s ceiling. If paused >5 min → terminate with partial results.

📋

Partial Report Contract

Any session with session_status: partial still writes to /metrics endpoint.

Crystal Ball displays partial results with a visual indicator.

Findings from partial sessions are valid — they're just incomplete.

Cost Model — Per 15-Minute Session

Scout reasoning loop

30–50

Sonnet 4.6

S$0.40–1.10

browser-use agent

15–25

Sonnet 4.6

S$0.20–0.55

Corpus Coach responses

10–20

Haiku 4.5

S$0.01–0.05

Eval runner (post-session)

5–15

Varies

S$0.07–0.28

Total per session

S$0.70 – S$2.00

Service positioning: ~S$8–13/month per target at one session/week. Portkey provides exact tracking once wired.

Scout Is a Small App, Not a Script

Running Scout as a capability means managing targets (systems it probes), charters (reusable exploratory specs per target), and runs (individual executions). Session-level detail — hypotheses, probes, transcripts, findings — hangs off runs. v1 includes a clients table with one row (internal) so multi-tenancy is a config change, not a migration.

Four Core Tables

🏢

clients

Tenancy root. v1 has one row: internal. Every other table carries client_id — free to include now, migration later.

🎯

targets

Systems Scout can probe. Holds endpoint, executor type (api/browser), auth config, DNA profile, and the mapping to Crystal Ball's system_id.

📋

charters

Reusable exploratory specs. Goal, persona, time-box, probe budget, schedule, cost cap. Same target can have many charters — each is a distinct testing intent.

▶

runs

Individual executions. Status, cost, timing, prompt/engine versions, trigger source. Hypotheses, probes, transcripts, findings all reference run_id.

Schema Relationships

Solid arrows = primary FK. Dashed arrows = detail tables referencing run_id. All tables carry client_id for v2+ tenancy.

API Surface — v1

⚙️

Management

POST /api/v1/clients

GET /api/v1/targets?client_id=...

POST /api/v1/targets

PATCH /api/v1/targets/{id}

POST /api/v1/charters

GET /api/v1/charters?target_id=...

▶

Runs & Evidence

POST /api/v1/charters/{id}/runs ● live

GET /api/v1/runs?charter_id=... ● live

GET /api/v1/runs/{id} ● live (extended: hypotheses[], probes[], execution_summary)

GET /api/v1/runs/{id}/transcript ○ Path B

GET /api/v1/runs/{id}/findings ○ Path B

GET /api/v1/runs/{id}/evidence/{fid} ● live (HTML + JSON)

GET /api/v1/runs/{id}/metrics_payload ○ Path B

v1 is CLI-driven over this API. UI is deferred to v2+. Auth deferred (single internal client). The API shape is locked now so future tenancy, scheduling, and UI are additive. As of 2026-04-29, GET /api/v1/runs/{id} returns the full execution log (hypotheses, probes with observer_flags, execution_summary) inline — substituting for the missing /transcript and /findings endpoints in practice.

Operational Guardrails

🚫

Concurrent Run Prevention

At most one running status per (target, charter). Second trigger → 409 with in-progress run_id. Stops scheduled + manual collisions.

💰

Per-Charter Cost Cap

daily_cost_cap on charter. New runs 402 if exceeded. Mid-run overrun → terminate partial with cost_cap_exceeded.

⏱

Per-Target Rate Limit

Default 10 runs per target per hour. Prevents runaway scheduler bugs or misconfigured triggers overwhelming a client system.

🔐

Secret Indirection

Credentials never in the DB. auth_config.secret_ref keys into /opt/thatsquality/.env.secrets — same pattern as Crystal Ball (CB-BACKLOG-017).

Who Owns What

🔬

Scout owns

High-volume, internal, reasoning-trace data

Full session transcript — every probe, every response
Every LLM call with prompt version, tokens, latency, cost
Hypothesis state changes (generated, confirmed, refuted, abandoned)
Observer flags, planner decisions, reasoning artefacts
Phoenix traces of the target's responses
Evidence files (screenshots, transcript slices, probe sequences)
Target and charter configuration
Run telemetry (cost, status, timings)

📊

Crystal Ball owns

Aggregated, cross-system, business-facing

Aggregated findings and metrics per system
AIQIDE impact assessments per persona
Trend data across time, release readiness signals
Quality Lead, Executive, Compliance Officer views
Cross-system comparison (Corpus Coach vs Scout)
Severity calibration rules applied
Breach detection and AIQIDE /assess callouts

The Push & The Click-Through

Scout pushes a summarised MetricsPayload to Crystal Ball on run completion (fire-and-forget). Findings in the payload carry evidence_ref URLs pointing back to Scout's evidence endpoint. Quality Lead clicking a finding in Crystal Ball lands in Scout for full transcript context.

Three Eval Relationships — Often Confused

A

Scout-orchestrated evals against Scout itself

Run as part of the Scout pipeline, post-session, against Scout's own outputs. Answer: did Scout do its job well?

Includes: hypothesis quality (vs planted flaws), finding faithfulness, report specificity, charter adherence, probe diversity, prompt injection resilience

Push → system_id: scout · source_tool: custom

B

Scout-orchestrated evals against target responses during a session

Scout captures target responses during probing. Minimal in v1 — mostly Phoenix trace inspection. Answer: what did the target do during this probe sequence?

Push → system_id: target's cb_system_id · source_tool: custom

C

Independently scheduled evals against the target

Own scheduled jobs — cron, CI, etc. Read target's Langfuse traces or API directly. Scout does not orchestrate. Answer: is the target behaving as designed?

Includes: RAGAS continuous eval, DeepEval batch runs, PromptFoo regression, PromptFoo red-team

Push → system_id: target's cb_system_id · source_tool: ragas|deepeval|promptfoo

Routing Rule

A finding about Corpus Coach's retrieval behaviour gets system_id: corpus_coach regardless of whether Scout or a scheduled RAGAS job produced it.

No cross-contamination. The system_id is about what the finding is about, not who produced it. The source_tool answers "who produced it". Separation keeps the Crystal Ball dashboards honest.

v1 Scope vs v2+ Deferred

✓

In v1 — pilot demo

Full data model (4 core + detail tables, WAL, indexed)
Management API (CRUD + run trigger + findings + evidence)
CLI client over the API — no UI
Single client (internal), no auth
Three seeded targets: Corpus Coach, Scout (self), test Chainlit app
Manual trigger only
Fire-and-forget push to Crystal Ball
Concurrent run prevention, evidence storage, secret indirection

→

Deferred to v2+

Management UI (HTMX or React)
Multi-client tenancy with enforcement
Auth and RBAC
Cron-based scheduling
Client- and target-level cost caps
Retry policy for Crystal Ball push failures
S3 evidence storage migration
Run comparison and diff views

The v1 data model and API shape are production-viable. v2+ is additive, not a rewrite.

Write Endpoint

POST /api/v1/metrics

Push results from any producing system. Idempotent via result_id.

Read Endpoint

GET /api/v1/metrics?system_id&since&limit

Crystal Ball polls on ingested_at cursor. Default 5-min interval.

Schema Version

1.0

Unknown versions rejected (422). Single ANTHROPIC_API_KEY for all components.

Payload

📦 Root fields

🔁 Session object

🔬 eval_context

Arrays

📊 metrics[ ]

🚨 findings[ ]

💡 hypotheses[ ]

Behaviour

⚖️ Validation policy

🗺 Finding → AIQIDE map

🔗 Deep-link behaviour

Root Fields — MetricsPayload

All required fields must be present. additionalProperties: false

FieldTypeDescription

result_id req string Idempotency key. Recommended: sha256(system_id + session_id + eval_run_id)

schema_version req enum "1.0" — unknown versions rejected (422)

system_id req string Case-insensitive match against CB project registry. Unknown → 422 listing valid ids.

produced_at req date-time ISO 8601 UTC. When results were produced by the publishing system.

result_type req enum "continuous" (scheduled eval run) | "session" (bounded agent session)

source_url opt uri Deep link to eval run in originating tool. Must pair with source_tool.

source_tool opt enum langfuse · promptfoo · deepeval · ragas · langsmith · portkey · custom. Required with source_url.

Session Object

Required when result_type = "session". Scout-type results only.

session_id req string Unique identifier for this Scout session.

started_at req date-time Session start timestamp (ISO 8601 UTC).

ended_at req date-time Session end timestamp.

status req enum "complete" | "partial" | "error" — distinguishes crashed sessions from successful ones.

termination_reason opt string Required in practice when status ≠ complete. Human-readable explanation.

source_url opt uri Session-level deep link. Overrides payload-level source_url when both present.

charter opt object target_url · goal · time_box_minutes · persona — records what Scout was asked to do.

cost_metadata opt object total_tokens · total_cost_usd · p50_latency_ms · p95_latency_ms from Portkey.

eval_context

Maps to AIQIDE /assess context block. All values string-typed per AIQIDE contract. Absence degrades AIQIDE confidence but never causes 422.

eval_sample_size opt string Integer as string. ≥ rule minimum_sample_size → full AIQIDE confidence. Below threshold → impact_score reduced 20%.

eval_dataset_id opt string Traceability only. Recorded in AIQIDE audit_trail.notes. No score effect.

judge_model opt string LLM model name for LLM-as-judge attributes. Informational — absence noted in audit trail.

normalisation_method opt string Free-form. Records how producer normalised raw values to [0.0, 1.0]. Traceability only. Guidance: 1 - min(observed_ms / ceiling_ms, 1.0) for latency.

metrics[ ] — Numeric Scores

All measured_value in [0.0, 1.0] — decimals only, never percentages. Routes through to_aiqide_attribute() → BreachDetector pipeline.

metric_name req string Crystal Ball metric_name. Remapped via to_aiqide_attribute() before AIQIDE callout.

measured_value req number[0,1] Decimal in [0.0, 1.0]. Never a percentage. Conversion at presentation layer only.

measured_at req date-time When this specific measurement was taken.

trace_url opt uri Deep link to specific trace/score/test row. Primary click target in CB dimension drilldown. More useful than run-level source_url.

producer_threshold opt number[0,1] Metadata for mapped metrics. Load-bearing for unmapped metrics (no thresholds.yaml entry).

direction opt enum "lower_is_better" | "higher_is_better". Consumed by sparkline arc for unmapped metrics only. Ignored for mapped metrics.

window opt object type: rolling|fixed|point_in_time. duration_minutes, sample_count. point_in_time = single observation, no aggregation.

percentile opt enum p50|p95|p99 — latency metrics only. Feeds efficiency attribute mapping.

findings[ ] — Discrete Qualitative Issues

Required for Scout. Optional for Corpus Coach. External evaluated targets that don't produce findings are skipped. Only critical/high fire POST /assess.

finding_id req string Unique within the session/result.

severity req enum critical · high → fire AIQIDE /assess. medium · low · info → stored + displayed only.

category req string Open string — descriptive only. aiqide_attribute drives routing, not category.

aiqide_attribute req string STRICTLY validated against the 22-attribute ontology. Routing key — unknown → 422.

title req string Short description displayed in Crystal Ball findings list.

description req string Full finding detail — what was observed, how it was triggered.

persona_routing_strategy opt enum "producer" (use affected_personas) | "attribute_default" (CB resolves) | "none" (no /assess). Default: producer.

evidence_ref opt uri Schemes: https://, s3://, file://. Use eval tool trace URL for external findings. CB stores as pointer only.

reproduction_prompt opt string Prompt that reproduces the finding when run against the target system.

hypotheses[ ] — Scout Only

Predicted failure points and their resolution status. Links to findings via finding_refs.

hypothesis_id req string Unique identifier. Referenced by findings via hypothesis_ref.

description req string What Scout predicted might fail, and under what conditions.

status req enum "confirmed" | "refuted" | "inconclusive" — resolution state at session end.

finding_refs opt string[] finding_ids that confirm or inform this hypothesis.

Validation Policy

Fail-hard on identity/routing fields. Fail-soft on quality-improvement fields.

Unknown system_id HARD 422 Silent orphan data breaks routing

Unknown schema_version HARD 422 Can't process unknown schema

Unknown aiqide_attribute HARD 422 Routing key — must be valid

source_url without source_tool HARD 422 Can't label the link without the tool

Unmapped metric_name SOFT warning Passes through; AIQIDE rejects if no rule

Unknown persona_id in affected_personas SOFT log+skip Rest of finding still processed

Missing optional context fields SOFT degraded AIQIDE confidence reduced per its own contract

Finding → AIQIDE Attribute Mapping (scout_v1 taxonomy)

category is descriptive. aiqide_attribute drives AIQIDE routing. category is an open string from v3.

Category (scout_v1)	AIQIDE Attribute	Rationale
`persona_drift`	faithfulness	Deviation from declared role — grounding failure
`retrieval_gap`	context_quality	Absent, irrelevant, or insufficient context
`refusal_inconsistency (safety)`	safety	Inconsistent harmful-prompt refusal
`refusal_inconsistency (robust)`	robustness	Inconsistent behaviour under adversarial variation
`ambiguity_handling_failure`	context_quality	Confident output from underspecified retrieval
`stale_grounding`	context_quality	Temporally outdated retrieved context
`hallucination`	hallucination_rate	Fabricated content, no grounding
`prompt_injection`	safety	Successful instruction override

refusal_inconsistency maps to both safety and robustness — emit two separate findings when both apply.

Deep-Link Behaviour in Crystal Ball UI

Two levels of link granularity — run-level and trace-level.

Run Level — source_url

Shown as a labelled button in the system detail view and Quality Lead drilldown header. "View in Langfuse" / "View in PromptFoo" etc. Session-level source_url overrides payload-level.

Trace Level — trace_url

Shown inline in the dimension drilldown alongside the metric value. This is the primary click target — lands directly on the specific trace that produced the suspicious score.

On Findings — evidence_ref

Accepts https://, s3://, file:// URIs. For eval tool findings, use the tool's trace URL directly (e.g. a Langfuse trace URL, a PromptFoo eval row URL). CB stores as pointer and renders as a link — does not fetch the target.

Path A — One Week (Demo-Focused)

Path B — Six–Seven Weeks (Production-Leaning)

Eval Stack

Today the observer is rule-based (substring matches). The LLM-judge replacement is the largest open Path B item — see Week 1 above. Status legend: ● done · ● partial / in progress · ○ not started.

Evals Against Scout Itself

Concern	Tool	Status
Hypothesis quality	LLM judge (custom)	● Partial
Finding faithfulness	RAGAS-style	○ Path B
Report specificity	DeepEval G-Eval	● Wired
Observer rule replacement (rules → LLM judge)	LLM judge (Round 2 seed shipped)	● Partial
Charter adherence	LLM judge (custom)	○ Path B
Probe diversity	Semantic clustering	○ Path B
Prompt injection resilience	PromptFoo	○ Path B
Cost and latency	Portkey	○ Path B

Evals Against Corpus Coach

Concern	Tool	Status
Answer relevancy	RAGAS	○ Path B
Faithfulness	RAGAS	○ Path B
Contextual precision	DeepEval	○ Path B
Hallucination	DeepEval	○ Path B
Conversation completeness	DeepEval	○ Path B
Regression	PromptFoo	○ Path B
Adversarial / red-team	PromptFoo	○ Path B
Trace observability	Phoenix (Arize)	● Wired

Systems in current demo scope

Two systems sit on the dashboard today. Corpus Coach is the primary evaluated target, deliberately seeded with a representative set of typical RAG mistakes (retrieval gaps, citation drift, prompt-injection susceptibility, version-confusion, hallucination on out-of-corpus questions). It is not "one flawed bot" — every planted flaw maps to a failure mode commonly seen in real client RAG implementations, so the eval results are representative, not toy. Scout also evaluates itself: its own runs feed metrics back into Crystal Ball alongside the Corpus Coach findings, closing the loop on the toolchain that produced them.

Scout

AI QE Exploratory Testing Agent

SeverityMEDIUM

ExposureInternal

AgencyAutonomous

DomainQuality Engineering

Ingestion/metrics endpoint (session)

RegulatoryInternal advisory only

Build status: live in prod; rule-based observer shipped, LLM-judge replacement in flight

Corpus Coach

Assistive RAG Reference Chatbot

SeverityLOW

ExposureInternal educational

AgencyAssistive

DomainMAS regulatory docs

Ingestion/metrics endpoint (session-sourced from Scout)

PurposeDeliberately flawed for calibration

Build status: live in prod; five planted flaws active; Scout charters firing daily

Corpus Coach — Planted Flaws (Scoring Rubric)

Flaw

AIQIDE Attribute

Trigger

Match Criteria

Retrieval gap — wrong indexing granularity

context_quality

Query about specific clause returns document-level only

Category + trigger + repro prompt

Ambiguity handling — 626 vs 626A

context_quality

Must mention specific notice numbers

Category + 626/626A ref + repro

Persona drift — drops "cite sources" after multi-turn

faithfulness

Must identify conversation turn number

Category + turn ref + repro

What it is

The system prompt instructs Corpus Coach to always cite sources inline. The instruction is buried at ~35% depth in a ~900-token prompt. As a conversation grows, the model attends to more context — the full conversation history plus the system prompt — and the citation instruction competes with an increasing amount of other text. By turn 8, it fades from reliable compliance to sporadic omission.

"Attenuate" means to weaken or fade. Like a radio signal getting weaker with distance — the instruction is still in the prompt, just receiving less effective attention as context grows.

Example

Turn 3 response:

"According to MAS Notice 626, paragraph 6.2, a bank must..."

Turn 9 response (same quality query):

"The AML/CFT framework requires banks to conduct enhanced due diligence for higher-risk customers."

Correct substance. No citation. The instruction faded.

Why it matters

Long conversations are common in real compliance work. A user researching a topic over 15 turns gets cited answers for the first half and uncited assertions for the second half — with no visible signal that something has changed. Without provenance, the user can't verify the claim or find the source.

Where the flaw lives / what fails

System promptCitation instruction buried at 35% depth, not reinforced per-turn

GenerationModel attention spreads thin as conversation history grows

How to detect

Scout runs multi-turn sessions of 10+ exchanges. At turns 3, 6, and 9, it sends citation-requiring queries. The observer compares response text for inline citations ("According to...", "MAS Notice X, paragraph Y..."). Drift is confirmed when later turns lose citations that earlier turns included, on queries of equivalent complexity.

Refusal inconsistency — similar queries, different outcomes

robustness

Must provide both the refused and answered query

Category + both queries + repro

What it is

The system prompt defines a refusal scope using specific vocabulary lists. Refuse list: "enforcement actions, penalties, fines, sanctions." Answer list: "supervisory observations, regulatory outcomes, compliance expectations." Both lists describe MAS's posture toward non-compliant institutions — the same subject matter in different words. The model follows the lists literally, producing inconsistent behaviour depending on which vocabulary the user happens to use.

Example (confirmed in Day 1)

Query A — refused:

"What enforcement actions can MAS take against a bank that fails CDD obligations?"

Query B — answered:

"What are a bank's CDD obligations under Notice 626, and what supervisory consequences apply?"

The model often self-discloses the inconsistency: "I can't answer that — but if you ask about the substantive obligations, I can help."

Why it matters

Vocabulary-driven refusal creates a lottery: users who happen to use the "wrong" words get blocked; users who rephrase get answers. This is particularly problematic in a regulatory context where two different users asking substantively the same compliance question might get completely different responses. It also creates a false sense of principled behaviour — the system looks like it has a coherent policy, but it doesn't.

Where the flaw lives / what fails

System promptExplicit refuse/answer vocabulary lists — phrasing-sensitive by design

GenerationModel follows surface vocabulary rules, not semantic intent

How to detect

Scout probes with matched pairs — same substantive question, different surface vocabulary. One query uses refuse-list terms, one uses answer-list terms. If one is refused and one is answered, the finding is confirmed. A single response that says "I can't answer that — but ask me about [the substantive thing]" is self-documenting evidence and sufficient on its own.

Stale grounding — outdated doc, no recency flag

context_quality

Must name the specific outdated document

Category + doc name + repro

Scoring: all 3 criteria met = 1.0. Category + repro but vague trigger = 0.5. Category only = 0. Path B Week 3 target: ≥70% across 10 runs with same charter.

Integration Pipeline

Data flow to Crystal Ball

CSV upload ──────────────────────────────────────────────────────┐

▼

Langfuse adapter (?mock=false) ──────────────────────────→ EvalResult → BreachDetector → BreachEvent → AIQIDE /assess

▲

Metrics endpoint poll ──→ normalise_to_eval_result() ─────────────────────┘

Polling scheduler runs as asyncio.create_task() in Crystal Ball FastAPI backend. Default 5-min interval. last_poll persisted to SQLite key-value table.

Scout

The contrast is the demo

The market gap

Planted flaws (5 types)

Root Fields — MetricsPayload

Session Object

eval_context

metrics[ ] — Numeric Scores

findings[ ] — Discrete Qualitative Issues

hypotheses[ ] — Scout Only

Validation Policy

Finding → AIQIDE Attribute Mapping (scout_v1 taxonomy)

Deep-Link Behaviour in Crystal Ball UI

Prerequisites Before Day 1

Scout

Corpus Coach