Top Top Top

Managing Covert Agent Collusion

CONSPIRACies hidDen IN PLAIN SIGHT

Modern AI is shifting from single models to systems of agents—specialised components that cooperate, negotiate, and hand work off to each other. That transition quietly changes the security model. Once agents can communicate, they can also coordinate covertly, exchanging information in ways that evade observers. Secret Collusion among AI Agents: Multi-Agent Deception via Steganography by Motwani et al argues that such covert coordination is not hypothetical: it can be bootstrapped via shared focal points (Schelling points), hidden in benign-looking outputs (covertext), and sustained even under some forms of monitoring.

I found it compelling because it treats “agent collusion” as a first-class, measurable risk—closer to cybersecurity than sci-fi. Howecer, the presentation would benefit from clearer threat categories, cleaner separation between what’s demonstrated and what’s extrapolated, and—most importantly—a better organised list of follow-up tasks the community can execute.

Summary

AI systems increasingly involve swarms of agents operating together—sometimes in large quantities, sometimes under orchestration, sometimes one-on-one. This paper highlights an emerging risk: steganographic communication between models, where information is secretly exchanged in a way that hides in plain sight, and which observers are unlikely to notice.

The purpose behind this may be innocuous (curious agents wanting a backchannel conversation), or more nefarious: conspiring to fix results, play dumb, hoodwink overseers, or exfiltrate data. Incentives may encourage models to seek collusion—especially when given difficult goals that would be easy to coordinate on if oversight can be successfully end-run, and multiple agents share that understanding.

Moreover, the possibility of models being compromised or poisoned means a model that has gone wrong could memetically pass along such issues to other models in ways that are difficult to detect or trace. This creates major challenges for the auditability of multi-agent systems.

The paper introduces DecAI, a decentralised setting where multiple AI agents are interconnected through communication channels, and where agents may have different access privileges. Some can eavesdrop on others, some can inject messages, access certain states, or negate the states afforded to other agents. In this way, information can be gathered, corralled, denied, or disrupted.

The authors also propose a model evaluation framework (CASE) to systematically test for the capabilities required for secret collusion across different models.

How “secret exchange” gets established

The paper draws on perfectly secure steganography, derived via minimum-entropy coupling. The attraction is that it offers information-theoretic security and relatively high throughput.

Establishing a secret exchange is difficult because it requires Schelling points (focal points) for coordination—assumed mutual common knowledge and grounding. Under what terms is information to be encoded? Establishing such informational norms isn’t necessarily simple, but can be achieved through mechanisms like timestamping (a shared “songsheet”) and probabilistic coordination. Very simple steganographic methods can also work (e.g., encoding via the first letter of a string of text).

Subliminal channels are established via Schelling points and covertext manipulation (where a “header” or other stable structure is assumed, regardless of the information buried within).

More conventional encryption methods may also be used if the transmission route can be successfully obfuscated (for example, burying data in temporary state outside the main workspace). However, this carries a commensurate risk of detection—so such methods may serve as a prelude to more sophisticated, genuinely steganographic approaches.

In effect, progressively more sophisticated ciphers can be mutually bootstrapped, with characteristics of both information-theoretic and complexity-theoretic steganographic protection. Once established, such methods may be extremely challenging to detect or disrupt while maintaining functional operations.

One assumption is that collusion often benefits from agents with similar capabilities and training, so they share compatible terms of reference. The paper shows that larger, more performant models are (unsurprisingly) better at achieving and maintaining this kind of coordination, especially in the presence of overseer models with comparable or lesser capabilities.

Mitigations (and why the authors are pessimistic)

The paper explores whether paraphrasing could disrupt steganographic communication, but the authors are pessimistic that a sufficiently sophisticated agent can’t anticipate this and still encode signal. Under certain conditions, even channel manipulation is insufficient to prevent covert communication.

Filtering data, restricting access to external tools, limiting test-time learning, suppressing access to sources of randomisation (important for strong ciphers), and making monitoring explicit all tend to reduce the feasibility of clandestine coordination. However, given enough time and compute, the paper suggests models may still discover ways to bypass oversight.

Practical demonstration: insider trading

The authors include practical demonstrations framed around insider trading. Even when explicitly instructed not to share insider information, agents were able to swap tips—but mainly when they believed they had a clear clandestine method by which to do so. In other words: they tended to refrain unless they thought they could get away with it. Such latent mendacity in otherwise apparently conscientious machine processes—waiting for an opportunity to cheat—is naturally unsettling.

Follow-on work

The researchers acknowledge that more effort and investment is required to shore up these vulnerabilities—especially as frontier models may achieve coordination through means inaccessible to weaker monitors. Several key questions remain, in my view:

  • Computational bounds

    • Can models implement cryptographic primitives themselves, or do they merely approximate them? Models capable of generating internal hash functions or PRGs would be far more troublesome. Frontier models already seem capable of maintaining internal virtual systems, so such capabilities could arise with sufficient complexity or self-scaffolding.

    • The paper shows models can act as self-oracles, but this may amount to pseudorandomness with predictable/exploitable structure. That could change if models gain access to natural sources of entropy (e.g., hardware noise, network fluctuations). Even a thin “serial port” of real-world variability might be enough.

    • Relatedly, the complexity class of encoding/decoding functions matters—especially if some agent designs are effectively limited to weaker circuit classes (e.g., AC⁰/TC⁰). Architectures with richer attention or memory dynamics may access stronger capabilities. Formal proofs about transformer limits would help, but shouldn’t be treated as definitive given rapid architectural diversification and scaffolding.

  • Channel bounds

    • What is the maximum covert bit rate achievable while maintaining undetectability? Does context length affect this? Longer contexts might allow higher throughput but also increase the detection surface. Limits on the entropy of the covertext distribution likely matter too: more predictable outputs provide less room to hide signal.

  • Detection techniques

    • Given a set of samples, what is the best achievable detection rate for an adversary without computational bounds? What are the tradeoffs between payload size, detectability, and robustness to channel noise?

  • Emergent capability questions

    • Does RLHF create implicit selection pressure toward steganographic tactics? Could meta-optimisation be at play—an internal reward structure that favours covert communication as an instrumental strategy?

  • Training dynamics

    • We likely need controlled studies where models are trained or fine-tuned with explicit incentives for covert coordination, to map emergence curves and create “canaries” for these behaviours. Ablations could identify which training features or objectives make steganographic capability more likely. Ideally, this is red-teamed during training, not only post-hoc.

  • Co-evolution dynamics

    • The paper largely examines static capabilities, not fleets of agents co-evolving together over time. This feels like a critical omission.

  • Defense in depth

    • Paraphrasing shows limited utility, but the paper doesn’t deeply explore adversarial paraphrasing (deliberately trying to destroy hidden signal), channel binding methods that provably disrupt steganographic context, or probing for latent steganographic capability via elicitation—e.g., using a “Judas goat” model to lure systems into revealing susceptibility through a honeypot dynamic.

  • Environmental issues

    • Can we design agent environments that are more or less likely to encourage clandestine communication? Environment drives behaviour.

  • Benchmarking

    • If we can reliably detect or stress-test for steganographic deceptiveness, it should become a standard element of model cards and deployment risk assessments.

Together, these lines of research could help to shore up these intriguing, and somewhat unsettling, gaps in knowledge.