The AI Safety Paradox: When 'Safe' AI Makes Systems More Dangerous


Most cybersecurity breaches are caused by human error, yet we don't try to fix the human element. We presume humans will make mistakes, and we build systems that expect and contain those mistakes rather than trying to perfect the people inside them. Humans are predictably imperfect: prone to inconsistency, bias, and manipulation. Today we face another agent of imperfection. Artificial intelligence, like ChatGPT, reflects our human frailties with its own kind of gullibility, hallucinations, and susceptibility to manipulation. And so we've all been seeking ways to ensure these AIs are fundamentally 'good' actors.

The AI safety community has rallied around the goal of ethical alignment: the hope of making individual AI models reliably abide by human values. AI labs painstakingly tune their models to produce safe, ethical responses. But this well-intentioned focus might not just be insufficient—it could be actively harmful.

When an AI company showcases its chatbot refusing to write malicious code, developers take this as a green light to skip implementing crucial security controls. The veneer of safety becomes a new kind of vulnerability.

Consider a lone bank teller meticulously following protocol, processing a transaction at the instruction of a seemingly harmless customer. The teller can't see that they've become instrumental in a larger fraud scheme. They have acted faithfully in accordance with their training and protocols, yet harm has occurred all the same. An AI model, much like the bank teller, can't see the broad picture; it can't "know" how its outputs are being used within a system. Each call to an AI, and each response from it, is an isolated event, fundamentally blind to its broader context. Even when imbued with some context, the model is always limited to a certain scope; it cannot see its broader function or the full state of the system in which it operates. This is an inevitable, and probably desirable, constraint. We probably don't want an AI aware of all the upstream and downstream realities it sits nestled between.

The bank teller's mistake would trigger multiple system-level alarms: unusual transaction patterns, exceeded limits, audit flags. These safety nets weren't built into the teller's training; they're built into the system itself. Yet with AI, we've convinced ourselves that if we can just make each model ethical enough, aligned enough, we won't need these safety nets. This is like removing all banking security systems and hoping really well-trained tellers will prevent every fraud.

The problem compounds dramatically with multi-agent systems. Today's good-faith developers, seeking to build sophisticated systems with internal checks and balances, often find themselves blocked by overly cautious AI responses—the dreaded "I cannot help with that." Such restrictions can stifle legitimate development efforts, creating frustration without effectively deterring malicious exploitation. Meanwhile, less scrupulous actors find ways to chain together seemingly innocent requests that bypass these same safety measures. One agent politely declines to access sensitive data, while another innocently provides a workaround under the guise of an innocuous task. Each response appears safe in isolation, but together they orchestrate a security breach.
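To make that gap concrete, here is a minimal sketch in Python, with hypothetical agent names and action types that are not tied to any real framework: every individual action passes an isolated check, while a system-level monitor that correlates actions across agents flags the combination.

```python
# Hypothetical example: per-response checks versus a system-level check.
from dataclasses import dataclass

@dataclass
class AgentAction:
    agent: str   # which agent acted
    kind: str    # e.g. "read_sensitive" or "external_send"
    detail: str

def per_response_check(action: AgentAction) -> bool:
    """Judges each action in isolation, the way per-model alignment does."""
    # Reading a file and sending a message are each harmless on their own,
    # so an isolated check approves both.
    return action.kind in {"read_sensitive", "external_send"}

def system_level_check(history: list) -> list:
    """Judges the combined flow of actions across agents."""
    alerts = []
    reads = any(a.kind == "read_sensitive" for a in history)
    sends = any(a.kind == "external_send" for a in history)
    if reads and sends:
        alerts.append("sensitive data read and external send in the same workflow")
    return alerts

history = [
    AgentAction("agent_a", "read_sensitive", "customer_records.csv"),
    AgentAction("agent_b", "external_send", "upload to unknown-host.example"),
]

print(all(per_response_check(a) for a in history))  # True: each step looks safe
print(system_level_check(history))                  # flags the dangerous pair
```

The point of the sketch is the division of responsibility: the per-response check can remain imperfect, because the system-level check does not depend on it.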

This misalignment between individual safeguards and systemic vulnerabilities calls for a different approach.

Airlines don't rely on flawless pilots; they build redundant systems and automated safeguards that kick in regardless of human action. Banks don't just train ethical employees; they design systems that assume individuals might fail or be compromised. When the stock market plunges, we don't wait for traders to make ethical decisions—circuit breakers trigger automatically. These approaches work because they focus on architectural safety, not just individual behavior.

Every AI system, no matter how advanced, must interface with the physical world—through power, compute resources, network access. These aren't mere technical details; they're our strongest control points. An AI can only impact reality through the interfaces we grant it. Cloud platforms already demonstrate this: they enforce hard resource quotas, API rate limits, network isolation. These unglamorous controls do more for safety than endless cycles of prompt engineering. A superintelligent AI without access to real-world interfaces is just a sophisticated model spinning in a vacuum.
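As a rough illustration of what interface-level controls look like in code, here is a minimal sketch, assuming a hypothetical call_model() stand-in for whatever model API is actually in use; the limits are arbitrary, and the point is that they are enforced outside the model, regardless of what it outputs.

```python
import time

class ResourceGovernor:
    """Enforces a call rate limit and a daily token budget outside the model."""

    def __init__(self, max_calls_per_minute: int, max_tokens_per_day: int):
        self.max_calls_per_minute = max_calls_per_minute
        self.max_tokens_per_day = max_tokens_per_day
        self.call_times = []   # timestamps of recent calls
        self.tokens_used = 0   # tokens consumed so far today

    def allow(self, estimated_tokens: int) -> bool:
        now = time.time()
        # Keep only calls from the last 60 seconds for the rate limit.
        self.call_times = [t for t in self.call_times if now - t < 60]
        if len(self.call_times) >= self.max_calls_per_minute:
            return False
        if self.tokens_used + estimated_tokens > self.max_tokens_per_day:
            return False
        self.call_times.append(now)
        self.tokens_used += estimated_tokens
        return True

def call_model(prompt: str) -> str:
    return "model response"   # stand-in for a real model API call

governor = ResourceGovernor(max_calls_per_minute=30, max_tokens_per_day=200_000)

def guarded_call(prompt: str) -> str:
    # Rough token estimate; the refusal happens here, not inside the model.
    if not governor.allow(estimated_tokens=len(prompt) // 4):
        raise RuntimeError("quota exceeded: call refused by the governor")
    return call_model(prompt)
```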

For developers building AI systems today, this means a fundamental shift in thinking. Instead of relying on each agent to be perfectly aligned, design architectures that assume they won't be. We've seen what happens when a "perfectly aligned" language model connects to a browser without proper isolation—it can be manipulated through a chain of seemingly innocent requests. Implement strict data access controls. Separate agents by function. Monitor information flows. A browsing agent doesn't need the full capabilities of a web browser, just as a bank teller doesn't need the keys to the vault.
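A sketch of the "browsing agent without the keys to the vault" idea, assuming a hypothetical allowlist of hosts; in practice the fetcher would also run inside an isolated network rather than relying on this one check.

```python
from urllib.parse import urlparse
from urllib.request import urlopen

ALLOWED_HOSTS = {"docs.example.com", "status.example.com"}  # illustrative only
MAX_BYTES = 100_000                                          # cap on response size

def scoped_fetch(url: str) -> str:
    """A read-only fetch tool: https only, allowlisted hosts, bounded output."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        raise PermissionError("only https URLs are permitted")
    if parsed.hostname not in ALLOWED_HOSTS:
        raise PermissionError(f"host not allowed: {parsed.hostname}")
    # No cookies, no credentials, no POST: the agent can look, not act.
    with urlopen(url) as response:
        return response.read(MAX_BYTES).decode("utf-8", errors="replace")
```

Whatever the agent is persuaded to request, the tool itself cannot reach beyond the scope it was given.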

The real threat isn't the misaligned response of an individual agent; it's the false sense of security we cultivate by obsessing over individual alignment while neglecting system-level vulnerabilities. Every day, developers deploy complex AI systems thinking their components are "safe," while missing the larger risks this creates. We already know how to build resilient systems that function despite individual shortcomings. It's time we apply these lessons to AI before our fixation on perfecting individual agents leads to catastrophic system-level failures.

There have already been efforts to apply systems-safety approaches to AI, alongside work by many others on sociotechnical governance and evaluations. The UK AI Safety Institute has recently launched a program aimed at advancing the science of systemic AI safety. We need a more complex, systemic understanding of how these models interact in the real world.

Let's stop pretending we can make every AI response flawless. Instead, let's build systems robust enough to handle their imperfections.


James is the Founding Engineer at the Collective Intelligence Project.
