AIQIDE

AI Quality Impact Determination Engine

AIQIDE turns eval signals into business-grade narratives. Given a system's DNA — agency level, action type, exposure surface, domain, data sensitivity — and a threshold breach, it returns a persona-targeted explanation: what failed, what it means commercially, what it means under MAS FEAT/TRM, what to do. Crystal Ball calls AIQIDE on breach. Scout findings ride the same engine.

5 DNA axes · 22-attribute catalogue · Impact rules · Persona narratives · FastAPI
🧬
System DNA axes
5 today · 6 after promotion trigger
📚
Quality attribute catalogue
22 attributes · RAGAS extension pending
📐
Impact rule library
16 approved · trigger 100 / 2nd-org
🎭
Personas served
Executive · Governance · Quality lead · Delivery
Why Build This
🔁
Eval signal → business meaning

Eval tools produce numbers, not decisions. AIQIDE is the layer that takes a metric reading + the system context and returns the sentence an executive, auditor, or release lead can act on.

🧬
System DNA scopes the work

Two systems with different DNA need different evals. A failure in a high-agency external advisory system means something different from a failure in a low-agency internal helper. DNA carries that context into every narrative AIQIDE generates.

⚖️
Regulatory grounding built in

Impact rules cite MAS FEAT, TRM, and related obligations where they apply. The narrative isn't just "this is bad" — it's "this attribute breach exposes you under this clause."

What AIQIDE Does
🧬
Classify
Capture System DNA + architecture pattern per system
🎯
Scope
Material attribute set per DNA — from the 22-attribute catalogue
📐
Rule match
Match incoming eval verdict against impact-rule library
⚖️
Severity
Calibrate severity by DNA — same finding hits differently per system
📝
Narrate
Persona-targeted business narrative + regulatory citation
🔄
Return
Crystal Ball renders, click-through goes back to source eval
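The rule-match, severity, and narrate steps above can be sketched as one small function. This is a minimal illustration under stated assumptions, not AIQIDE's actual implementation: the `SystemDNA` and `ImpactRule` shapes, the four-step severity scale, the axis values (`autonomous`, `internal`, and so on), the calibration logic, and the rule content are all hypothetical, and the citation is a placeholder rather than a real clause reference.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed four-step severity scale; the source does not define one.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


@dataclass(frozen=True)
class SystemDNA:
    agency_level: str
    action_type: str
    exposure_surface: str
    domain: str
    data_sensitivity: str


@dataclass(frozen=True)
class ImpactRule:
    attribute: str
    domain: str        # "*" matches any domain
    citation: str      # placeholder text, not a real clause reference
    narratives: dict   # persona -> narrative template


def match_rule(rules, dna: SystemDNA, attribute: str) -> Optional[ImpactRule]:
    """Rule match: first rule whose attribute and domain fit the system."""
    for rule in rules:
        if rule.attribute == attribute and rule.domain in ("*", dna.domain):
            return rule
    return None


def calibrate(base: str, dna: SystemDNA) -> str:
    """Severity: raise the base severity one step per aggravating DNA value."""
    step = SEVERITY_ORDER.index(base)
    if dna.agency_level == "autonomous":
        step += 1
    if dna.exposure_surface != "internal":
        step += 1
    return SEVERITY_ORDER[min(step, len(SEVERITY_ORDER) - 1)]


def narrate(rules, dna, attribute, base_severity, persona) -> Optional[dict]:
    """Narrate: persona-targeted text plus the matched rule's citation."""
    rule = match_rule(rules, dna, attribute)
    if rule is None:
        return None
    severity = calibrate(base_severity, dna)
    text = rule.narratives[persona].format(attribute=attribute, severity=severity)
    return {"severity": severity, "narrative": text, "citation": rule.citation}


RULES = [
    ImpactRule(
        attribute="groundedness",
        domain="fsi",
        citation="<regulatory clause placeholder>",
        narratives={
            "executive": "{attribute} breach rated {severity}: customer-facing advice may be unsupported.",
            "delivery": "{attribute} breach rated {severity}: hold the release and re-run the eval.",
        },
    )
]

advisor = SystemDNA("autonomous", "generative", "customers", "fsi", "pii")
helper = SystemDNA("recommendation_only", "generative", "internal", "fsi", "public")

# Same finding, same base severity, different DNA.
print(narrate(RULES, advisor, "groundedness", "medium", "executive"))
print(narrate(RULES, helper, "groundedness", "medium", "executive"))
```

In this sketch the same `groundedness` finding is escalated for the autonomous, customer-facing system and left at its base severity for the internal helper, which is the "same finding hits differently per system" behaviour described in the Severity step.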
Two-way contract with Crystal Ball
📥
CB calls AIQIDE on breach
  • Threshold rule fires → CB sends (system_id, attribute, severity, evidence) to AIQIDE
  • AIQIDE returns narrative scoped to the requesting persona
  • CB renders, links back to source eval
📤
AIQIDE knows nothing about CB
  • Engine is a pure function of system context + breach event
  • Same engine drives Scout findings, RAGAS runs, future eval sources
  • Adapter pattern keeps AIQIDE source-agnostic
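One way to picture the contract above: each eval source gets a thin adapter that normalises its payload into a single breach event, and the engine only ever sees that event. The sketch below is an assumption-laden illustration. The `BreachEvent` fields follow the `(system_id, attribute, severity, evidence)` tuple named above, but the Scout finding keys (`target`, `finding_url`), the function names, and the narrative wording are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BreachEvent:
    system_id: str
    attribute: str
    severity: str
    evidence: str  # e.g. a trace ID or link back to the source eval


def from_crystal_ball(payload: tuple) -> BreachEvent:
    """Adapter for the (system_id, attribute, severity, evidence) tuple CB sends."""
    system_id, attribute, severity, evidence = payload
    return BreachEvent(system_id, attribute, severity, evidence)


def from_scout(finding: dict) -> BreachEvent:
    """Adapter for a hypothetical Scout finding shape (keys are assumptions)."""
    return BreachEvent(
        system_id=finding["target"],
        attribute=finding["attribute"],
        severity=finding["severity"],
        evidence=finding["finding_url"],
    )


def engine(event: BreachEvent, persona: str) -> dict:
    """Pure function: no I/O and no knowledge of which adapter built the event."""
    return {
        "persona": persona,
        "narrative": f"[{persona}] {event.attribute} breached on "
                     f"{event.system_id} (severity: {event.severity}).",
        "evidence": event.evidence,  # lets the caller link back to the source eval
    }


cb_event = from_crystal_ball(("corpus-coach", "groundedness", "high", "trace-123"))
scout_event = from_scout({
    "target": "corpus-coach",
    "attribute": "groundedness",
    "severity": "high",
    "finding_url": "trace-123",
})

# Two sources, one normalised event, identical engine output.
print(engine(cb_event, "governance") == engine(scout_event, "governance"))
```

Keeping the engine free of source-specific branches is what would let a new eval source be added by writing one adapter and nothing else.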
System DNA — 5 axes today

Locked vocabulary. Each system gets a DNA tuple at onboarding. Impact rules match against tuples. Crystal Ball displays it on the Quality Lead view.

🤖
agency_level

How autonomous is the system? Recommendation only? Decisions with human approval? Decisions without approval? Affects severity directly.

⚙️
action_type

What kind of work? Generative, retrieval, classification, scoring, action-taking. Determines which attribute families are material.

🌐
exposure_surface

Who interacts with it? Internal users only, customers, regulated counterparties. Drives reputation + regulatory severity.

🏢
domain

FSI, telco, public sector, education, internal tooling. Pulls in domain-specific regulatory rules + attribute weighting.

🔒
data_sensitivity

PII, financial, regulated, public. Multiplies severity for confidentiality + integrity-class breaches.

+
architecture_pattern SIBLING (May 2026)

RAG vs fine-tuned vs prompt-engineered vs agentic. Sibling field on engagement record today; promotes to 6th DNA axis when ≥8 joint-matching rules exist.
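As a rough sketch of how a locked vocabulary and the sibling field might be represented: the five axis names and `architecture_pattern` come from the text above, but the concrete value spellings, the `EngagementRecord` type, and the reading of "joint-matching rule" as "a rule that also constrains architecture pattern" are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Value spellings are assumed; only the axis names come from the catalogue text.
VOCABULARY = {
    "agency_level": {"recommendation_only", "human_approved", "autonomous"},
    "action_type": {"generative", "retrieval", "classification", "scoring", "action_taking"},
    "exposure_surface": {"internal", "customers", "regulated_counterparties"},
    "domain": {"fsi", "telco", "public_sector", "education", "internal_tooling"},
    "data_sensitivity": {"pii", "financial", "regulated", "public"},
}
ARCHITECTURE_PATTERNS = {"rag", "fine_tuned", "prompt_engineered", "agentic"}
PROMOTION_THRESHOLD = 8  # "≥8 joint-matching rules"


@dataclass(frozen=True)
class SystemDNA:
    agency_level: str
    action_type: str
    exposure_surface: str
    domain: str
    data_sensitivity: str

    def __post_init__(self):
        # Reject any value outside the locked vocabulary at onboarding time.
        for axis, allowed in VOCABULARY.items():
            value = getattr(self, axis)
            if value not in allowed:
                raise ValueError(f"{axis}={value!r} is not in the locked vocabulary")


@dataclass(frozen=True)
class EngagementRecord:
    system_id: str
    dna: SystemDNA
    architecture_pattern: Optional[str] = None  # sibling field, not a DNA axis yet

    def __post_init__(self):
        pattern = self.architecture_pattern
        if pattern is not None and pattern not in ARCHITECTURE_PATTERNS:
            raise ValueError(f"unknown architecture_pattern {pattern!r}")


def should_promote(rules: list) -> bool:
    """True once enough rules also constrain architecture pattern (assumed reading)."""
    joint = [r for r in rules if r.get("architecture_pattern") is not None]
    return len(joint) >= PROMOTION_THRESHOLD


record = EngagementRecord(
    system_id="corpus-coach",
    dna=SystemDNA("recommendation_only", "generative", "internal", "education", "public"),
    architecture_pattern="rag",
)
print(record.dna.domain, record.architecture_pattern)
print(should_promote([{"architecture_pattern": "rag"}] * 7))
```

Validating at construction time means an impact rule never has to handle a value it has not seen, which is the practical benefit of locking the vocabulary.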

Why DNA + architecture pattern together

Two systems can carry identical DNA (say action_type=generative, exposure_surface=external, domain=fsi) and have radically different failure modes. A RAG-grounded system fails on retrieval relevancy + groundedness. A fine-tuned generative system fails on hallucination + drift. The eval activities differ. Architecture pattern carries that distinction into AIQIDE's rule-matching layer so the right tests get scoped, the right thresholds apply, and the narrative reflects the actual system.
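The paragraph above can be made concrete with a small scoping function that matches on DNA and architecture pattern together. This is a hypothetical sketch: the `ScopingRule` shape and the rules themselves are invented for illustration, and the focus labels for the fine-tuned case (`hallucination`, `drift`) are the failure modes named in the text, not necessarily catalogue attribute names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScopingRule:
    dna_match: dict                      # axis -> required value; absent axes match anything
    architecture_pattern: Optional[str]  # None means "any pattern"
    focus: tuple                         # eval focus areas this rule pulls into scope


def scope(rules, dna: dict, pattern: str) -> list:
    """Collect focus areas from every rule matching both DNA and pattern."""
    selected = []
    for rule in rules:
        dna_ok = all(dna.get(axis) == value for axis, value in rule.dna_match.items())
        pattern_ok = rule.architecture_pattern in (None, pattern)
        if dna_ok and pattern_ok:
            for item in rule.focus:
                if item not in selected:  # keep order, drop duplicates
                    selected.append(item)
    return selected


RULES = [
    # Pattern-agnostic: any externally exposed system.
    ScopingRule({"exposure_surface": "external"}, None, ("prompt_injection_resistance",)),
    # Pattern-specific: same DNA, different failure modes.
    ScopingRule({"action_type": "generative"}, "rag", ("retrieval_relevancy", "groundedness")),
    ScopingRule({"action_type": "generative"}, "fine_tuned", ("hallucination", "drift")),
]

shared_dna = {"action_type": "generative", "exposure_surface": "external", "domain": "fsi"}

print(scope(RULES, shared_dna, "rag"))
print(scope(RULES, shared_dna, "fine_tuned"))
```

Both calls share the pattern-agnostic result and diverge only on the pattern-specific rules, so DNA alone would not have been enough to tell the two systems apart.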

22-attribute quality catalogue

Locked vocabulary of quality attributes. Each attribute carries DNA-applicability rules: which DNA combinations make it material, which thresholds apply, which evidence shapes count. Catalogue is being extended with RAGAS-canonical attributes (retrieval_relevancy, context_precision, context_recall, response_groundedness) in the same release as the architecture_pattern sibling field.

🎯
Accuracy family
  • factual_accuracy
  • groundedness
  • citation_correctness

Material for any system whose output is depended upon for correctness.

🛡️
Robustness family
  • adversarial_robustness
  • prompt_injection_resistance
  • edge_case_handling

Material for any system exposed to inputs it does not control.

⚖️
Fairness family
  • demographic_parity
  • equal_opportunity
  • treatment_consistency

Material when outcomes affect people unequally and protected attributes are in play (FSI, hiring, public sector).

🔍
Explainability family
  • decision_traceability
  • evidence_citation
  • persona_appropriate_explanation

Material when an auditor, regulator, or stakeholder might ask why a given output was produced.

🔒
Privacy family
  • pii_leakage
  • data_minimisation
  • consent_handling

Material for any system that handles regulated personal or sensitive data.

🔌
Reliability family
  • availability
  • latency
  • graceful_degradation
  • deterministic_replay

Material for any system that has to keep running in production under real load.

+
RAGAS-canonical (pending)
  • retrieval_relevancy
  • context_precision
  • context_recall
  • response_groundedness

Ships with architecture_pattern sibling field. Plugs the catalogue gap that motivates the Coverage Gap Audit.

How DNA selects attributes

The catalogue carries applicability rules per attribute, keyed on DNA. An FSI advisory system pulls in the full Accuracy + Explainability + Fairness load. An internal helper pulls a leaner Reliability + Accuracy slice. The Coverage Gap Audit deliverable runs this selection against a real system to produce its material-attribute set, then maps each attribute to the tools currently measuring it. Gaps surface as audit findings with regulatory citations attached.
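A minimal sketch of that selection and gap mapping follows. The attribute names are taken from the catalogue families above; the applicability conditions, the DNA value spellings, and the example tool coverage are assumptions for illustration only, and regulatory citations are left out because the source does not list them.

```python
CATALOGUE = {
    "accuracy": ["factual_accuracy", "groundedness", "citation_correctness"],
    "explainability": ["decision_traceability", "evidence_citation",
                       "persona_appropriate_explanation"],
    "fairness": ["demographic_parity", "equal_opportunity", "treatment_consistency"],
    "reliability": ["availability", "latency", "graceful_degradation",
                    "deterministic_replay"],
}


def material_families(dna: dict) -> list:
    """Assumed applicability rules keyed on DNA; real rules are per attribute."""
    families = ["accuracy"]
    if dna["domain"] == "fsi" and dna["exposure_surface"] != "internal":
        families += ["explainability", "fairness"]
    if dna["domain"] == "internal_tooling":
        families.append("reliability")
    return families


def material_attributes(dna: dict) -> list:
    return [attr for family in material_families(dna) for attr in CATALOGUE[family]]


def coverage_gaps(dna: dict, tool_coverage: dict) -> list:
    """Material attributes with no tool currently measuring them."""
    return [attr for attr in material_attributes(dna) if not tool_coverage.get(attr)]


fsi_advisor = {"domain": "fsi", "exposure_surface": "customers"}
internal_helper = {"domain": "internal_tooling", "exposure_surface": "internal"}

# Hypothetical coverage for the target system: attribute -> tools measuring it.
coverage = {
    "factual_accuracy": ["Langfuse"],
    "groundedness": ["Langfuse"],
    "decision_traceability": ["Langfuse"],
}

print(material_families(fsi_advisor))
print(material_families(internal_helper))
print(coverage_gaps(fsi_advisor, coverage))
```

Each name returned by `coverage_gaps` would correspond to one audit finding; attaching the relevant citation would be a lookup against the impact-rule library.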

Per-attribute tool coverage

Each attribute in the 22-item catalogue has a definition, a short list of tools that typically measure it, and a status flag for whether Crystal Ball is reading from that source today. Status legend: ✓ Live = wired through a Crystal Ball adapter and visible in the demo today; ◐ Reachable = within the contract of a deployed adapter (e.g. Scout push, Langfuse trace replay) but not yet calibrated for this attribute; ○ Future = no adapter yet, requires a new tool integration.

Attribute | Family | Definition | Example tools | In demo today
factual_accuracy | Accuracy | Output's factual claims hold up against the underlying source-of-truth corpus or reference set. | RAGAS faithfulness, DeepEval HallucinationMetric, custom LLM-judge against reference | ✓ Live: Langfuse hallucination metric on Corpus Coach
groundedness | Accuracy | Every claim in the output traces back to retrieved or supplied context, no fabrication. | RAGAS faithfulness/groundedness, custom LLM-judge with retrieval overlap | ✓ Live: Corpus Coach groundedness via Langfuse
citation_correctness | Accuracy | Cited sources actually contain the claim attributed to them, and the citation pointer resolves. | Custom citation-validity check, retrieval-overlap test, source-mapping LLM-judge | ◐ Reachable: Corpus Coach citations partially via Langfuse
adversarial_robustness | Robustness | Output stays correct under crafted adversarial inputs designed to derail the model. | Garak red-team suite, custom adversarial prompt sets, PromptFoo redteam config | ○ Future: no adapter wired
prompt_injection_resistance | Robustness | Model refuses or neutralises instructions injected through user input or retrieved content. | Garak prompt-injection probes, PromptFoo redteam, custom injection harness | ◐ Reachable: Scout adapter can probe via push
edge_case_handling | Robustness | Behaviour on inputs at or beyond expected distribution edges remains safe and predictable. | PromptFoo, DeepEval, custom edge-case generators, Scout exploratory probes | ◐ Reachable: Scout adapter can probe via push
demographic_parity | Fairness | Outcome rates are similar across protected demographic groups. | AIF360, Fairlearn, custom group-comparison harness | ○ Future: no adapter wired
equal_opportunity | Fairness | True-positive rates are similar across protected groups conditional on the true outcome. | AIF360, Fairlearn, custom counterfactual evaluator | ○ Future: no adapter wired
treatment_consistency | Fairness | Functionally identical inputs differing only in protected attributes get equivalent treatment. | Custom paired-prompt comparison, counterfactual LLM-judge | ○ Future: no adapter wired
decision_traceability | Explainability | Each output is reconstructable from the trace of inputs, retrievals, prompts, and intermediate steps. | Langfuse traces, custom decision-tree audit, LangSmith | ✓ Live: Langfuse trace ingestion (PR #8 evidence trace ID)
evidence_citation | Explainability | Output surfaces the supporting evidence so a reviewer can validate the claim independently. | Custom citation extractor, retrieval-trace inspector | ◐ Reachable: same instrumentation as citation_correctness
persona_appropriate_explanation | Explainability | Explanation depth and vocabulary fit the consuming persona (e.g. exec vs governance vs delivery). | Custom LLM-judge against persona profile, readability + audience-fit scoring | ◐ Reachable: Scout adapter can probe via push
pii_leakage | Privacy | Output does not expose personally identifiable information beyond what was authorised. | Microsoft Presidio, regex scrubbers, custom PII probe set | ○ Future: no adapter wired
data_minimisation | Privacy | System collects, retains, and exposes only the data necessary for the requested task. | Custom audit + retention probe, scope-creep detector | ○ Future: no adapter wired
consent_handling | Privacy | Data flows respect captured consent state at the point of inference. | Custom consent-trace audit, integration test against consent store | ○ Future: no adapter wired
availability | Reliability | Service responds within SLO over a measurement window. | Datadog, Grafana, Prometheus, standard APM | ○ Future: APM stack not wired into a CB adapter
latency | Reliability | Response time at p50/p95/p99 stays within calibrated SLOs. | Datadog, Grafana, Prometheus, Langfuse latency telemetry | ◐ Reachable: Langfuse traces carry latency, not yet surfaced
graceful_degradation | Reliability | System falls back to a safe, communicable state under partial failure rather than failing closed silently. | Chaos-engineering tools, custom dependency-failure probes | ○ Future: no adapter wired
deterministic_replay | Reliability | Given a captured trace, the same inputs reproduce the same outputs (or a fingerprint of the divergence). | Langfuse replay, custom replay harness, fixture snapshotter | ◐ Reachable: Langfuse traces enable replay, not yet automated
retrieval_relevancy (pending) | RAGAS | Retrieved context is on-topic for the user's question. | RAGAS context_precision, custom retrieval-relevance LLM-judge | ○ Future: RAGAS adapter on Path B
context_precision (pending) | RAGAS | Top-ranked retrieved chunks are the ones the answer actually depends on. | RAGAS context_precision | ○ Future: RAGAS adapter on Path B
context_recall (pending) | RAGAS | Retrieval surfaces all chunks needed to answer; nothing critical is missed. | RAGAS context_recall | ○ Future: RAGAS adapter on Path B
response_groundedness (pending) | RAGAS | Final response is supported by the retrieved context, not invented. | RAGAS faithfulness | ✓ Live (proxy): Corpus Coach groundedness via Langfuse already covers this signal pre-RAGAS adapter
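Because the status flags live in one column, the per-status totals can be recomputed from the matrix rather than maintained by hand. The sketch below transcribes the status column of the table above into a dictionary (the `live` / `reachable` / `future` keys are shorthand for ✓ Live, ◐ Reachable, and ○ Future) and tallies it with the standard library.

```python
from collections import Counter

# Status per attribute, transcribed from the "In demo today" column.
STATUS = {
    "factual_accuracy": "live",
    "groundedness": "live",
    "citation_correctness": "reachable",
    "adversarial_robustness": "future",
    "prompt_injection_resistance": "reachable",
    "edge_case_handling": "reachable",
    "demographic_parity": "future",
    "equal_opportunity": "future",
    "treatment_consistency": "future",
    "decision_traceability": "live",
    "evidence_citation": "reachable",
    "persona_appropriate_explanation": "reachable",
    "pii_leakage": "future",
    "data_minimisation": "future",
    "consent_handling": "future",
    "availability": "future",
    "latency": "reachable",
    "graceful_degradation": "future",
    "deterministic_replay": "reachable",
    "retrieval_relevancy": "future",
    "context_precision": "future",
    "context_recall": "future",
    "response_groundedness": "live",  # live via proxy signal
}

counts = Counter(STATUS.values())
for status in ("live", "reachable", "future"):
    print(f"{status}: {counts[status]}")
print(f"rows: {len(STATUS)}")
```

Generating the summary line from the same data that renders the table keeps the two from drifting apart as adapters are wired in.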

"In demo today" reflects the eval adapters wired into Crystal Ball at the time of the latest demo polish series (Path A, May 8–9). Counts: 4 attributes ✓ Live, 7 attributes ◐ Reachable via deployed adapters, 12 attributes ○ Future (require new adapter or tool integration). Coverage Gap Audit on Path B will produce this same matrix per real-world target system, with regulatory citations attached to gaps.

Path A — shipped to date
Path B — next