SWARM

System-Wide Assessment of Risk in Multi-agent Systems

Study how intelligence swarms — and where it fails.

The Core Insight: AGI-level risks don't require AGI-level agents. Catastrophic failures can emerge from the interaction of many sub-AGI agents — even when none are individually dangerous.
The Purity Paradox: Populations with only 20% honest agents achieve 55% higher welfare than 100% honest populations — a measurement artifact of externality pricing that disappears at ρ ≥ 0.5.

Key Findings

| Finding | Result | Evidence |
| --- | --- | --- |
| Deontological framing reduces deception | 95% reduction | 180 runs |
| Deception persists at temperature 0.0 | Structural | 120 runs |
| Forced cooperation window | 3 turns eliminates escalation | 210 runs |
| Transparency + safety training | Nuclear rate 60% → 30% | 120 runs |
| Full externality pricing (ρ ≥ 0.5) | Honesty dominates +43% | 21 configs |
| Ecosystem collapse threshold | 50% adversarial | Phase transition |

Install

pip install swarm-safety

Quick Start

from swarm.agents.honest import HonestAgent
from swarm.agents.deceptive import DeceptiveAgent
from swarm.core.orchestrator import Orchestrator, OrchestratorConfig

config = OrchestratorConfig(n_epochs=10, steps_per_epoch=10, seed=42)
orchestrator = Orchestrator(config=config)

orchestrator.register_agent(HonestAgent(agent_id="honest_1"))
orchestrator.register_agent(DeceptiveAgent(agent_id="dec_1"))

metrics = orchestrator.run()
for m in metrics:
    print(f"Epoch {m.epoch}: toxicity={m.toxicity_rate:.3f}")

Architecture

Observables → ProxyComputer → v̂ → sigmoid → p → SoftPayoffEngine → payoffs
                                                       ↓
                                           SoftMetrics → toxicity, quality gap, etc.
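A minimal sketch of this pipeline, assuming a linear proxy and a logistic acceptance map. The function bodies, feature names, and weights below are illustrative stand-ins, not the package's actual API:

```python
import math

# Sketch of the Observables -> ProxyComputer -> sigmoid -> payoff flow.
# v_hat is a proxy estimate of interaction value; the sigmoid maps it to
# an acceptance probability p; the payoff engine weights payoffs by p.

def proxy_value(observables: dict) -> float:
    # Stand-in for ProxyComputer: a weighted sum of observable features.
    weights = {"trust_signal": 1.5, "past_defections": -2.0}
    return sum(weights.get(k, 0.0) * v for k, v in observables.items())

def sigmoid(v_hat: float) -> float:
    return 1.0 / (1.0 + math.exp(-v_hat))

def soft_payoff(p: float, base_payoff: float = 1.0) -> float:
    # Stand-in for SoftPayoffEngine: expected payoff under probability p.
    return p * base_payoff

obs = {"trust_signal": 0.8, "past_defections": 0.1}
v_hat = proxy_value(obs)   # 1.5*0.8 - 2.0*0.1 = 1.0
p = sigmoid(v_hat)         # ≈ 0.731
payoff = soft_payoff(p)
```

The key design point is that payoffs are *soft*: acceptance is probabilistic in p rather than a hard threshold, which is what makes metrics like E[1−p | accepted] meaningful downstream.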

Four Key Metrics

Toxicity Rate

E[1−p | accepted], the expected harm among accepted interactions. Values above 0.3 signal serious problems.

Quality Gap

E[p | accepted] − E[p | rejected]. Negative values indicate adverse selection.

Conditional Loss

E[π | accepted] − E[π]. Reveals whether acceptance selection creates or destroys value.

Incoherence Index

The variance-to-error ratio of decisions across replays of the same interaction. High incoherence indicates unstable, replay-sensitive decisions.
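The first three metrics follow directly from their definitions given per-interaction records of (p, payoff, accepted). A sketch with illustrative field names (not the library's schema), computing each quantity from a toy sample:

```python
import statistics

# Toy interaction records: acceptance probability p, realized payoff,
# and whether the interaction was accepted.
records = [
    {"p": 0.9, "payoff": 1.2, "accepted": True},
    {"p": 0.4, "payoff": 0.3, "accepted": True},
    {"p": 0.7, "payoff": 0.8, "accepted": False},
    {"p": 0.2, "payoff": -0.5, "accepted": False},
]

acc = [r for r in records if r["accepted"]]
rej = [r for r in records if not r["accepted"]]

# Toxicity rate: E[1-p | accepted].
toxicity = statistics.mean(1 - r["p"] for r in acc)

# Quality gap: E[p | accepted] - E[p | rejected]; negative => adverse selection.
quality_gap = (statistics.mean(r["p"] for r in acc)
               - statistics.mean(r["p"] for r in rej))

# Conditional loss: E[payoff | accepted] - E[payoff].
conditional_loss = (statistics.mean(r["payoff"] for r in acc)
                    - statistics.mean(r["payoff"] for r in records))
```

The incoherence index additionally needs replayed decisions for the same interaction (a variance term over replays divided by an error term), so it is omitted from this single-pass sketch.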

Six Governance Mechanisms

Circuit Breakers

Freeze agents whose recent toxicity exceeds threshold over a sliding window.
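A hypothetical sketch of the sliding-window logic (class and method names are illustrative, not the package's API):

```python
from collections import deque

class CircuitBreaker:
    """Freeze an agent when its mean toxicity over the last
    `window` interactions exceeds `threshold`."""

    def __init__(self, window: int = 10, threshold: float = 0.3):
        self.window = window
        self.threshold = threshold
        self.history = deque(maxlen=window)  # sliding window of toxicity scores

    def record(self, toxicity: float) -> None:
        self.history.append(toxicity)

    def tripped(self) -> bool:
        # Only trip once the window is full, to avoid noisy early freezes.
        if len(self.history) < self.window:
            return False
        return sum(self.history) / self.window > self.threshold

breaker = CircuitBreaker(window=3, threshold=0.3)
for t in [0.1, 0.5, 0.6]:
    breaker.record(t)
frozen = breaker.tripped()  # mean toxicity 0.4 > 0.3, so the agent freezes
```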

Transaction Taxes

Friction mechanism deducting a percentage from payoffs, reducing exploitation margins.

Reputation Decay

Reduces reputation by a fixed fraction each epoch, forcing continuous good behavior.
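Multiplicative decay by a fixed fraction per epoch can be sketched as follows (the function name and default rate are illustrative):

```python
def decay(reputation: float, decay_rate: float = 0.1, epochs: int = 1) -> float:
    # Each epoch scales reputation by (1 - decay_rate), so idle agents
    # lose standing geometrically and must keep earning it back.
    return reputation * (1 - decay_rate) ** epochs

# At a 10% per-epoch rate, a reputation of 1.0 falls below 0.5
# after 7 epochs without new good behavior (0.9**7 ≈ 0.478).
remaining = decay(1.0, decay_rate=0.1, epochs=7)
```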

Staking

Agents post collateral to participate. Bad behavior results in stake slashing.
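A minimal slashing sketch, with hypothetical names (the real staking interface may differ):

```python
class Stake:
    """Collateral posted by an agent; a fraction is slashed on detected
    bad behavior, so exploitation carries a direct capital cost."""

    def __init__(self, amount: float):
        self.amount = amount

    def slash(self, fraction: float) -> float:
        slashed = self.amount * fraction
        self.amount -= slashed
        return slashed

stake = Stake(100.0)
penalty = stake.slash(0.25)  # 25.0 slashed; 75.0 remains at risk
```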

Collusion Detection

Monitors pairwise interaction patterns for correlated exploitation timing.

Random Audits

Probabilistic review creating deterrence uncertainty for exploitative agents.
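The deterrence arithmetic behind random audits can be sketched in one expression. This is an illustrative expected-value calculation, not the package's implementation; `audit_rate` and `penalty` are assumed parameters:

```python
def expected_exploit_value(gain: float, audit_rate: float, penalty: float) -> float:
    # An exploit yielding `gain` is caught with probability `audit_rate`
    # and then costs `penalty`, so its expected value drops by
    # audit_rate * penalty. Exploitation stops paying once this is <= 0.
    return gain - audit_rate * penalty

# With a 20% audit rate and a penalty of 5x, a unit gain breaks even;
# any higher audit rate makes exploitation strictly unprofitable.
ev = expected_exploit_value(gain=1.0, audit_rate=0.2, penalty=5.0)
```

Because agents cannot predict which interactions are reviewed, even a low audit rate deters as long as the penalty is large enough that `audit_rate * penalty` exceeds the exploit's gain.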