# System-Wide Assessment of Risk in Multi-agent Systems

Study how intelligence swarms — and where it fails.

## Key findings
| Finding | Result | Evidence |
|---|---|---|
| Deontological framing reduces deception | 95% reduction | 180 runs |
| Deception persists at temperature 0.0 | Structural | 120 runs |
| Forced cooperation window | 3 turns eliminates escalation | 210 runs |
| Transparency + safety training | Nuclear rate 60% → 30% | 120 runs |
| Full externality pricing (ρ ≥ 0.5) | Honesty dominates +43% | 21 configs |
| Ecosystem collapse threshold | 50% adversarial | Phase transition |
## Installation

```shell
pip install swarm-safety
```
## Quickstart

```python
from swarm.agents.honest import HonestAgent
from swarm.agents.deceptive import DeceptiveAgent
from swarm.core.orchestrator import Orchestrator, OrchestratorConfig

# Configure a short run: 10 epochs of 10 steps, seeded for reproducibility.
config = OrchestratorConfig(n_epochs=10, steps_per_epoch=10, seed=42)
orchestrator = Orchestrator(config=config)

# Register one honest and one deceptive agent.
orchestrator.register_agent(HonestAgent(agent_id="honest_1"))
orchestrator.register_agent(DeceptiveAgent(agent_id="dec_1"))

# Run the simulation and inspect per-epoch metrics.
metrics = orchestrator.run()
for m in metrics:
    print(f"Epoch {m.epoch}: toxicity={m.toxicity_rate:.3f}")
```
## Key metrics

- `E[1 − p | accepted]`: expected harm among accepted interactions. Values above 0.3 indicate serious problems.
- `E[p | accepted] − E[p | rejected]`: the selection gap; negative values indicate adverse selection.
- `E[π | accepted] − E[π]`: reveals whether acceptance selection creates or destroys value.
- Variance-to-error ratio across replays: high incoherence indicates unstable decisions.
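As a minimal sketch, the first three metrics can be computed from a log of interaction records. The `Interaction` record type and the helper names below are illustrative assumptions, not part of the library's API:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Interaction:
    p: float        # estimated probability the interaction is benign
    payoff: float   # realized payoff pi
    accepted: bool  # whether the interaction was accepted

def toxicity(records):
    """E[1 - p | accepted]: expected harm among accepted interactions."""
    return mean(1 - r.p for r in records if r.accepted)

def selection_gap(records):
    """E[p | accepted] - E[p | rejected]; negative => adverse selection."""
    acc = [r.p for r in records if r.accepted]
    rej = [r.p for r in records if not r.accepted]
    return mean(acc) - mean(rej)

def acceptance_value(records):
    """E[pi | accepted] - E[pi]: value created or destroyed by selection."""
    acc = [r.payoff for r in records if r.accepted]
    return mean(acc) - mean(r.payoff for r in records)
```

For example, if accepted interactions carry lower `p` than rejected ones, `selection_gap` goes negative and flags adverse selection even when raw toxicity still looks acceptable.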
## Intervention mechanisms

- **Agent freezing**: freezes agents whose recent toxicity exceeds a threshold over a sliding window.
- **Transaction friction**: deducts a percentage from payoffs, reducing exploitation margins.
- **Reputation decay**: reduces reputation by a fixed fraction each epoch, forcing continuous good behavior.
- **Staking**: agents post collateral to participate; bad behavior results in stake slashing.
- **Collusion detection**: monitors pairwise interaction patterns for correlated exploitation timing.
- **Random audits**: probabilistic review creating deterrence uncertainty for exploitative agents.
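The sliding-window freeze rule can be sketched as follows. This is a minimal illustration of the technique, not the library's implementation; the class and method names are assumptions:

```python
from collections import deque

class ToxicityCircuitBreaker:
    """Freeze an agent when its mean toxicity over a sliding window
    exceeds a threshold (illustrative sketch, not the library's API)."""

    def __init__(self, window=10, threshold=0.3):
        self.window = window
        self.threshold = threshold
        self.history = {}   # agent_id -> deque of recent toxicity scores
        self.frozen = set()

    def record(self, agent_id, toxicity):
        buf = self.history.setdefault(agent_id, deque(maxlen=self.window))
        buf.append(toxicity)
        # Only trigger once the window is full, to avoid freezing on noise.
        if len(buf) == self.window and sum(buf) / self.window > self.threshold:
            self.frozen.add(agent_id)

    def is_frozen(self, agent_id):
        return agent_id in self.frozen
```

Waiting for a full window before triggering trades responsiveness for robustness: a single noisy score cannot freeze an agent, but sustained exploitation within one window does.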