ai-governanceeu-ai-acteu-ai-act-robustnesscompliancesecurity

EU AI Act Article 15: Accuracy and Robustness Requirements

Nikola Kovtun · · 8 min read
EU AI Act Article 15: Accuracy and Robustness Requirements

A production AI agent in a fintech lending workflow produced inaccurate income verification results on 0.3% of applications — roughly 15 cases per month. The accuracy figure looked acceptable in testing: 99.7%.

Three months after deployment, those 15 monthly cases had accumulated into a pattern: the inaccuracy was not random. It was correlated with a specific income document format used by two banks in a single region. Customers from those banks were systematically disadvantaged.

The Article 15 problem wasn’t the accuracy rate — 99.7% would pass most benchmarks. The problem was the absence of continuous accuracy monitoring, bias analysis across subpopulations, and documented error handling for the specific failure mode.

TL;DR

  • Article 15 requires high-risk AI systems to achieve appropriate levels of accuracy, robustness, and cybersecurity
  • Accuracy is not a single benchmark — it must be maintained across the system’s lifetime and measured per relevant subgroup
  • Robustness covers resilience to errors, faults, and inconsistencies — including adversarial inputs
  • Cybersecurity requires protection against attacks that could alter system behavior
  • Error handling and fallback behavior must be defined, documented, and tested

What Article 15 Covers

Article 15 of the EU AI Act is titled “Accuracy, Robustness and Cybersecurity.” It has three distinct but overlapping requirements.

Accuracy

Article 15 requires high-risk AI systems to achieve “appropriate levels of accuracy.” The word “appropriate” is doing significant legal work — it means the accuracy standard is context-dependent.

A medical triage system and a product recommendation system have different accuracy requirements. The EU AI Act doesn’t specify a universal threshold; it requires that providers define the appropriate level for their context, demonstrate that level is achieved, and maintain it throughout the system’s operation.

Critically, the accuracy requirement extends to subpopulations. A system with 99% overall accuracy that systematically underperforms for specific demographic or geographic groups does not satisfy Article 15 if that underperformance creates discriminatory effects.

Robustness

Article 15 requires systems to be “resilient as regards errors, faults or inconsistencies that may occur within the system or the environment in which the system operates.”

For AI agents, robustness covers:

  • Input robustness — How does the agent handle malformed inputs, unexpected edge cases, or inputs outside its training distribution?
  • Operational robustness — How does the agent behave when upstream dependencies (APIs, databases, external services) fail or produce unexpected outputs?
  • Degraded-mode behavior — What happens when a component of the system is unavailable? Does the agent fail safely, or does it proceed with missing information in ways that increase risk?
  • Adversarial robustness — Can the system be manipulated through carefully crafted inputs designed to change its behavior? Prompt injection, data poisoning, and model extraction attacks are the primary adversarial threats for AI agents.

Cybersecurity

Article 15 explicitly includes cybersecurity as a robustness dimension. AI systems must be designed and developed “with a view to ensuring an appropriate level of cybersecurity.”

For AI agents, the primary cybersecurity concerns are:

  1. Prompt injection — Malicious content in the agent’s environment that causes it to behave contrary to its instructions
  2. Model extraction — Queries designed to reconstruct the model or its training data
  3. Training data poisoning — Manipulation of training data to create backdoor behaviors
  4. Inference manipulation — Real-time manipulation of model inputs or outputs

These attack classes require both technical mitigations (input validation, output filtering, rate limiting) and governance mitigations (constitutional rules that resist manipulation, anomaly detection for behavior drift).

Practical Implementation

Accuracy implementation

Define accuracy metrics per use case. For a credit scoring agent: false positive rate, false negative rate, calibration across income brackets, and geographic distribution of errors. These metrics must be defined before deployment and measured continuously after.

Measure accuracy per subgroup. Aggregate accuracy metrics can mask systematic underperformance. Define the relevant subgroups for your use case — demographic, geographic, product type, document format — and measure accuracy per subgroup.

Establish accuracy thresholds and monitoring. Define acceptable accuracy ranges and the action triggered when accuracy falls outside range. For a lending agent, a drop in accuracy for a specific document type should trigger: review, halt, or escalation to a human review queue.

Document accuracy evidence in Annex IV technical documentation. The accuracy evidence — test results, benchmark methodology, subgroup analysis — must appear in the technical documentation required by Annex IV.

Robustness implementation

Robustness dimensionTest typeMinimum required
Input edge casesBoundary testingDefined behavior for all inputs
Dependency failureChaos testingDocumented degraded-mode behavior
Adversarial inputsRed team / prompt injectionTested, documented residual risk
Distribution shiftShadow deploymentMonitoring alerts on shift detection

Define fail-safe behavior. Every failure mode must have a defined response. “The agent returns an error” is a fail-safe. “The agent proceeds with partial information” without documented justification is not.

Test degraded-mode explicitly. Failure scenarios for AI agents include: model API unavailable, database inaccessible, governance layer offline, escalation queue full. What does the agent do in each case? Test it. Document it.

Cybersecurity implementation

Prompt injection mitigation. For agents that process external content (web pages, user documents, third-party data), implement input sanitization that detects and handles injected instructions. Validate that external content cannot override system prompt instructions.

Behavioral monitoring. Constitutional rule violation rates, output distribution shifts, and anomalous tool call patterns are all potential indicators of adversarial manipulation. Monitor these and alert on significant deviation.

Access control and isolation. The agent’s access to tools, databases, and APIs should be minimal for its stated function. Principle of least privilege limits the blast radius of an adversarially manipulated agent.

For governance-layer implementation that connects robustness to compliance evidence, see EU AI Act Article 12: Logging Requirements Decoded.

The Testing Obligation

Article 15 implies a testing obligation — you cannot demonstrate appropriate accuracy and robustness without testing. The NIST AI Risk Management Framework (a widely used reference for Article 15 implementation) specifies that testing should include:

  1. Pre-deployment testing against representative datasets
  2. Adversarial testing (red teaming, prompt injection attempts)
  3. Subgroup performance analysis
  4. Failure mode analysis and documentation
  5. Ongoing post-deployment monitoring with documented baselines

The NIST AI Risk Management Framework provides detailed guidance on each of these testing categories that is consistent with EU AI Act expectations.

Error Handling Requirements

Article 15’s robustness requirement has a direct implication for error handling: “appropriate measures to minimize and mitigate” potential errors must be in place.

For AI agents, this means:

Every error type must have a defined handler. Not just technical errors (API timeout, model failure) but governance errors (constitutional rule violation, missing authorization) and business errors (agent output outside expected range, decision that can’t be justified).

Error handling must be tested. The handlers must be verified to work in practice, not just defined in documentation.

Errors must be logged. Under Article 12, errors are among the events that must be logged. An error that isn’t logged is an error that can’t be monitored, and an Article 15 monitoring requirement that isn’t satisfied.

Fallback behavior must be safe. A system that defaults to automatic approval when its governance layer is unavailable has an unsafe fallback. Safe defaults in governance terms mean: when uncertain, escalate or deny.

FAQ

Q: Does Article 15 require a specific accuracy threshold?

No. Article 15 requires “appropriate” accuracy — a context-dependent standard. You must define what appropriate means for your system, demonstrate you achieve it, and maintain it. A system that achieves its defined threshold in aggregate but fails systematically for specific subgroups may still violate Article 15.

Q: How does Article 15 interact with GDPR’s requirements on automated decision accuracy?

GDPR Article 22 grants data subjects rights regarding purely automated decisions with significant effects, including the right to contest. Article 15 is a design obligation: build the system to achieve and maintain accuracy. They’re complementary. Article 15 requires you to build an accurate system; GDPR Article 22 requires you to handle cases where it’s wrong.

Q: What counts as adversarial testing for EU AI Act purposes?

Adversarial testing means systematically attempting to produce failures, errors, or manipulated behavior through crafted inputs. For AI agents: prompt injection attempts (can external content override system instructions?), boundary probing (what happens at the edge of defined operating conditions?), and stress testing (how does the system behave under high load or degraded dependencies?). Red teaming — using a dedicated team to attempt realistic attacks — is the recommended approach.

Q: Do model updates from our provider require re-testing under Article 15?

Yes. Model updates can change accuracy and robustness characteristics. Article 15’s continuous accuracy requirement means significant model updates trigger a review of your accuracy evidence. This doesn’t necessarily require full retest from scratch — a targeted evaluation focused on the capabilities changed in the update is often proportionate.

Q: How does Article 15 interact with the Article 9 risk management system?

Article 15 defines the quality standard for accuracy and robustness. Article 9 requires a continuous risk management system. The Article 9 risk management system is the operational mechanism for maintaining Article 15 standards. Risk management measures for identified accuracy risks (monitoring, retraining triggers, human review for low-confidence outputs) are how you comply with both articles simultaneously.


By Nikola Kovtun, founder of Infracortex AI Studio. Cortex’s governance layer generates the evidence needed to demonstrate Article 15 compliance — behavioral monitoring, anomaly detection, and tamper-evident records of system performance. Book a 30-minute call to map your Article 15 gaps.

See also: EU AI Act Article 9: Continuous Risk Management for AI Agents | EU AI Act Article 12: Logging Requirements Decoded | AI Agent Observability vs Governance: What’s the Difference?

Cortex build: 0.1.35-260423

Nikola Kovtun
Nikola Kovtun
AI Knowledge Architect, Founder at Infracortex
Get Started

Find Out Where AI Can Save You the Most Time

Start with an AI System Health Check. 1-2 days, from $500, zero commitment. You get a structured report with your biggest opportunities.

Get Your Health Check From $500 · 1-2 days · Zero commitment