| Field | Value |
|---|---|
| Charter id | 637235f8-7c22-4fc6-bd72-d1fee5da5a0d |
| Charter name | AML/CFT Grounding Probe |
| Goal | Probe for AML/CFT grounding failures |
| Probe budget | 20 |
| Time box | 15 minutes |
| Expected cost | S$1 to S$3 per run |

| Field | Value |
|---|---|
| Name | Corpus Coach v0 |
| Description | MAS AML/CFT RAG chatbot, reference target with five planted flaws |
| Corpus | mas-corpus-v1.0 |
The DNA selects which AIQIDE rules can apply to the target, so the impact narrative downstream is grounded in the regulatory regime that actually governs the system, not a generic template.
Scout asks an LLM to list the failure modes most likely to surface against this charter and target. Each hypothesis carries a priority from 3 (high) to 1 (low) and a suggested probe shape. The hypothesiser is iterative: findings confirmed on early probes feed back and shape later hypotheses.
| Priority | Status | Count |
|---|---|---|
| P3 | confirmed | 3 |
| P2 | confirmed | 9 |
| P2 | inconclusive | 3 |
| P2 | active (not probed) | 3 |
| P1 | confirmed | 4 |
| P1 | inconclusive | 1 |
| P1 | active (not probed) | 2 |
Active means the hypothesiser generated it but the probe budget ran out before the planner could schedule a probe for it.
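As a quick consistency check, the status counts above can be totalled against the probe budget of 20. The sketch below only re-adds the numbers from the table; the row layout and function name are illustrative, not part of Scout.

```python
from collections import Counter

# (priority, status, count) rows copied from the table above.
ROWS = [
    ("P3", "confirmed", 3),
    ("P2", "confirmed", 9),
    ("P2", "inconclusive", 3),
    ("P2", "active", 3),
    ("P1", "confirmed", 4),
    ("P1", "inconclusive", 1),
    ("P1", "active", 2),
]


def tally_by_status(rows):
    """Sum hypothesis counts per status, ignoring priority."""
    totals = Counter()
    for _priority, status, count in rows:
        totals[status] += count
    return totals


totals = tally_by_status(ROWS)
# Confirmed and inconclusive hypotheses each consumed one probe.
probed = totals["confirmed"] + totals["inconclusive"]
print(dict(totals))  # {'confirmed': 16, 'inconclusive': 4, 'active': 5}
print("probed:", probed, "not probed:", totals["active"])  # probed: 20 not probed: 5
```

The 20 probed hypotheses match the probe budget exactly, which is why the remaining 5 stay active.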
Each hypothesis is turned into a probe. The executor sends the probe as a user message to Corpus Coach over the same API path a real user would take, captures the response and the retrieved sources, then stores everything in the run transcript. Below are four representative probes.
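The executor step described above can be sketched roughly as follows. This is a minimal illustration, not Scout's code: `ProbeResult`, `Transcript`, `execute_probe`, and the injected `send` callable are all hypothetical names, and the real transport to Corpus Coach is replaced by a stand-in function.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical transport: takes the user message, returns (response, sources).
SendFn = Callable[[str], Tuple[str, List[str]]]


@dataclass
class ProbeResult:
    hypothesis_id: str
    probe_text: str
    response_text: str
    sources: List[str]


@dataclass
class Transcript:
    turns: List[ProbeResult] = field(default_factory=list)


def execute_probe(hypothesis_id: str, probe_text: str,
                  send: SendFn, transcript: Transcript) -> ProbeResult:
    """Send one probe as a user message and record what came back."""
    response_text, sources = send(probe_text)
    result = ProbeResult(hypothesis_id, probe_text, response_text, list(sources))
    transcript.turns.append(result)
    return result


def fake_send(message: str) -> Tuple[str, List[str]]:
    """Stand-in for the chatbot API; the reply and source title are made up."""
    return ("I cannot find that in the corpus.", ["example-source-title"])


transcript = Transcript()
result = execute_probe("h-001", "example probe text", fake_send, transcript)
print(result.response_text, result.sources, len(transcript.turns))
```

Passing the transport in as a parameter keeps the sketch runnable offline; a real executor would call the same API path a user takes.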
The hypothesis names "TRM Guidelines" as a comparator. The observer parses the doc hint, then checks whether the retrieved sources contain TRM. None of them do, so the observer fires wrong_document and the hypothesis is marked confirmed.
The response contains the phrase "I cannot", which is in Scout's refusal-keyword list (i cannot, i am unable, i'm unable, etc.). The observer fires refusal_keyword. The hypothesis is marked confirmed because Scout sent a phrasing variant that drew a refusal.
- stale_grounding: the retrieved source chunk has a pre-2025 marker in its title, and the probe contains a recency keyword.
- refusal_keyword: the response contains an "I cannot" phrase further down (the bot declines to invent a completion).

Note that the synthesiser uses only the first flag in the list. The refusal signal here is dropped; refusals are counted only on probes where refusal_keyword is the primary flag.
Corpus Coach declined to fabricate. No refusal phrase, recency keyword, or wrong-document hint matched, so no flag fired. A hypothesis is marked confirmed only when an observer flag actually fires; otherwise it remains inconclusive.
The observer is rule based. It runs five mechanical checks against every probe response. The synthesiser groups the resulting flags into findings.
| Flag | What it detects | AIQIDE attribute | Severity |
|---|---|---|---|
| wrong_document | hypothesis names a doc, sources do not contain it | accuracy | high |
| stale_grounding | recency keyword in probe + pre-2025 marker in source | groundedness | medium, high if fired more than once |
| empty_sources_factual | factual probe, zero sources returned | context_recall | medium, high if fired more than once |
| refusal_keyword | refusal phrase in response text | robustness | medium |
| persona_drift_no_citations | turn > 8, sources present, no citations in response | explainability | high |
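The five checks in the table could be sketched as a single rule-based function. Only the flag names, the three refusal phrases, the "turn > 8" threshold, and the pre-2025 rule come from this document. The recency keyword list, the `[1]`-style citation pattern, the year-in-title heuristic, and the `Probe` fields are assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

REFUSAL_PHRASES = ("i cannot", "i am unable", "i'm unable")  # partial list from the text
RECENCY_KEYWORDS = ("latest", "most recent", "newest")       # assumption
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")                  # assumption: year appears in title
CITATION_RE = re.compile(r"\[\d+\]")                         # assumption: citations look like [1]


@dataclass
class Probe:
    turn: int
    text: str
    response: str
    sources: List[str] = field(default_factory=list)  # retrieved source titles
    doc_hint: Optional[str] = None                     # document named by the hypothesis
    is_factual: bool = False


def has_pre_2025_marker(title: str) -> bool:
    return any(int(y) < 2025 for y in YEAR_RE.findall(title))


def observe(p: Probe) -> List[str]:
    """Return the flags that fire, in the same order as the table rows."""
    flags = []
    text, resp = p.text.lower(), p.response.lower()
    if p.doc_hint and not any(p.doc_hint.lower() in s.lower() for s in p.sources):
        flags.append("wrong_document")
    if any(k in text for k in RECENCY_KEYWORDS) and any(
            has_pre_2025_marker(s) for s in p.sources):
        flags.append("stale_grounding")
    if p.is_factual and not p.sources:
        flags.append("empty_sources_factual")
    if any(phrase in resp for phrase in REFUSAL_PHRASES):
        flags.append("refusal_keyword")
    if p.turn > 8 and p.sources and not CITATION_RE.search(p.response):
        flags.append("persona_drift_no_citations")
    return flags


example = Probe(turn=3, text="What is the latest guidance?",
                response="I cannot invent a completion.",
                sources=["Example Notice (2019)"])
print(observe(example))  # ['stale_grounding', 'refusal_keyword']
```

Because the checks append in a fixed order, the first element of the returned list is stable, which matters once a single primary flag is picked per probe.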

| Turn | Flag list | Counted as |
|---|---|---|
| 1 | no flag | — |
| 2 | wrong_document | wrong_document |
| 3 | refusal_keyword | refusal_keyword |
| 4 | no flag | — |
| 5 | refusal_keyword | refusal_keyword |
| 6 | wrong_document | wrong_document |
| 7 | refusal_keyword | refusal_keyword |
| 8 | refusal_keyword | refusal_keyword |
| 9 | stale_grounding, refusal_keyword | stale_grounding (first flag wins) |
| 10 | refusal_keyword | refusal_keyword |
| 11 | wrong_document, refusal_keyword | wrong_document (first flag wins) |
| 12 | refusal_keyword | refusal_keyword |
| 13 | no flag | — |
| 14 | wrong_document | wrong_document |
| 15 | refusal_keyword | refusal_keyword |
| 16 | refusal_keyword | refusal_keyword |
| 17 | wrong_document | wrong_document |
| 18 | refusal_keyword | refusal_keyword |
| 19 | no flag | — |
| 20 | refusal_keyword | refusal_keyword |
The synthesiser groups probes by their primary (first) flag. For some flag types, severity escalates from medium to high when the flag fires on more than one probe.
| Primary flag | Probe count | Severity | Becomes finding |
|---|---|---|---|
| wrong_document | 5 | high | Entity Confusion — Wrong Document Retrieved |
| refusal_keyword | 10 | medium | Refusal Inconsistency — Query Refused Without Grounding |
| stale_grounding | 1 | medium | Stale Grounding — Outdated Document Retrieved |
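The grouping can be reproduced from the per-turn flag lists. The sketch below is an illustration under two assumptions: that the flag types marked as escalating in the observer table are the only ones whose severity rises, and that "more than once" means more than one probe with that primary flag. The function and constant names are hypothetical.

```python
from collections import Counter

BASE_SEVERITY = {
    "wrong_document": "high",
    "stale_grounding": "medium",
    "empty_sources_factual": "medium",
    "refusal_keyword": "medium",
    "persona_drift_no_citations": "high",
}
# Assumption: only these flag types escalate from medium to high on repeat.
ESCALATES_ON_REPEAT = {"stale_grounding", "empty_sources_factual"}

# Flag lists for turns 1 to 20, copied from the per-turn table.
FLAGS_BY_TURN = [
    [], ["wrong_document"], ["refusal_keyword"], [], ["refusal_keyword"],
    ["wrong_document"], ["refusal_keyword"], ["refusal_keyword"],
    ["stale_grounding", "refusal_keyword"], ["refusal_keyword"],
    ["wrong_document", "refusal_keyword"], ["refusal_keyword"], [],
    ["wrong_document"], ["refusal_keyword"], ["refusal_keyword"],
    ["wrong_document"], ["refusal_keyword"], [], ["refusal_keyword"],
]


def synthesise(flags_by_turn):
    """Group probes by primary (first) flag and assign a severity per group."""
    counts = Counter(flags[0] for flags in flags_by_turn if flags)
    findings = {}
    for flag, n in counts.items():
        severity = BASE_SEVERITY[flag]
        if flag in ESCALATES_ON_REPEAT and n > 1:
            severity = "high"
        findings[flag] = (n, severity)
    return findings


for flag, (n, severity) in synthesise(FLAGS_BY_TURN).items():
    print(flag, n, severity)
```

Run on the table data, this yields 5 wrong_document probes (high), 10 refusal_keyword probes (medium), and 1 stale_grounding probe (medium), matching the grouped table. The two secondary refusal_keyword flags on turns 9 and 11 are not counted.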
The synthesiser produces structured findings. Each is pushed to Crystal Ball over the same metrics endpoint a third-party eval tool would use. Crystal Ball ingests each finding, dispatches it to AIQIDE for impact assessment, then writes a risk_assessment row that drives the dashboard cards. The two narratives below come from the live Crystal Ball production database for the same finding shapes.
| Field | Value |
|---|---|
| finding_id | 5d5d1fa6-0960-4919-94ec-701b50cb5139 |
| severity | high |
| aiqide_attribute | accuracy |
| category | retrieval_gap |
| probe count | 5 |
Source: AIQIDE rules accuracy__assistive__product_owner__financial_services and accuracy__assistive__quality_lead__financial_services firing on the system DNA.
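The two rule ids above suggest a four-part key joined by double underscores. The sketch below reads the parts as attribute, system mode, stakeholder role, and sector, which is an inference from the ids alone and not a documented AIQIDE format. `RuleKey`, `parse_rule_id`, and `rules_for` are hypothetical helpers showing how a system DNA might narrow the rule set.

```python
from typing import List, NamedTuple


class RuleKey(NamedTuple):
    attribute: str  # e.g. accuracy
    mode: str       # e.g. assistive (assumed meaning)
    role: str       # e.g. product_owner (assumed meaning)
    sector: str     # e.g. financial_services (assumed meaning)


def parse_rule_id(rule_id: str) -> RuleKey:
    """Split a rule id on double underscores into its four assumed parts."""
    parts = rule_id.split("__")
    if len(parts) != 4:
        raise ValueError(f"unexpected rule id shape: {rule_id!r}")
    return RuleKey(*parts)


def rules_for(rule_ids: List[str], attribute: str, mode: str, sector: str) -> List[RuleKey]:
    """Keep the rules whose attribute, mode, and sector match the system DNA."""
    keys = [parse_rule_id(r) for r in rule_ids]
    return [k for k in keys
            if (k.attribute, k.mode, k.sector) == (attribute, mode, sector)]


RULE_IDS = [
    "accuracy__assistive__product_owner__financial_services",
    "accuracy__assistive__quality_lead__financial_services",
]
matched = rules_for(RULE_IDS, "accuracy", "assistive", "financial_services")
print([k.role for k in matched])  # ['product_owner', 'quality_lead']
```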
| Field | Value |
|---|---|
| finding_id | e05141a8-478d-4048-bccf-72ba9bd06376 |
| severity | medium |
| aiqide_attribute | robustness |
| category | refusal_inconsistency |
| probe count | 10 |
Robustness has no QualityDimension mapping in Crystal Ball yet, so this finding ingests but does not currently surface a dashboard card. The Scout session viewer flags it with status no_dimension_mapping so it is still visible. Tracked at crystalball/backlog/missing_robustness_quality_dimension.
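The surfacing behaviour described above amounts to a lookup with a fallback status. In the sketch below, only the no_dimension_mapping status and the absence of a robustness mapping come from this document. The dimension names, the "mapped" status, and the function name are placeholders, since the actual QualityDimension values are not shown here.

```python
# Assumption: attributes with a dashboard narrative in this run have a mapping.
# The dimension names on the right are placeholders, not Crystal Ball values.
DIMENSION_BY_ATTRIBUTE = {
    "accuracy": "accuracy",
    "groundedness": "groundedness",
    # "robustness" is intentionally absent: no QualityDimension mapping yet.
}


def card_status(aiqide_attribute: str) -> str:
    """Return a hypothetical viewer status for a finding's attribute."""
    if aiqide_attribute in DIMENSION_BY_ATTRIBUTE:
        return "mapped"
    return "no_dimension_mapping"


for attribute in ("accuracy", "robustness", "groundedness"):
    print(attribute, card_status(attribute))
```

A finding with an unmapped attribute is still ingested; the fallback status only keeps it visible in the session viewer instead of dropping it silently.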
| Field | Value |
|---|---|
| finding_id | 918e2513-eac2-41c7-8a40-a166f2b75a7e |
| severity | medium |
| aiqide_attribute | groundedness |
| category | stale_grounding |
| probe count | 1 |
Source: AIQIDE groundedness rules firing for assistive advisory financial services DNA.
Each link below opens the live production data for this run.