# Hypothesis & Motivation

## Core Research Question
RQ: In iterated PD interactions between LLM-based agents, do protocol-level safeguards reduce vulnerability to adversarial exploitation, and does equipping agents with game-theoretic awareness further improve cooperative outcomes beyond what protocols alone provide?
## Why This Matters
LLM-based agents are increasingly deployed in multi-agent systems where they interact with other agents — negotiating, trading, sharing resources. These interactions often have the structure of social dilemmas: each agent can benefit individually by defecting, but mutual cooperation produces better collective outcomes.
### The Governance Gap
Current AI safety research focuses on single-agent alignment (making one model behave well). But in multi-agent settings, a well-aligned agent can still be exploited by an adversarial one. The question isn't just "is this agent safe?" but "is this agent safe when surrounded by agents that may not be?"
### Why Prisoner's Dilemma?
The iterated Prisoner's Dilemma (IPD) is the canonical model for studying cooperation under conflict of interest:
- Simple enough to control experimentally (two players, two actions, known payoffs)
- Rich enough to exhibit complex dynamics (trust, betrayal, forgiveness, retaliation)
- Well-studied in classical game theory (provides theoretical baselines)
- Directly analogous to real multi-agent interactions (resource sharing, negotiation, API cost allocation)
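The experimental control described above rests on the standard payoff matrix reported in Phase 1 (CC=3, CD=0, DC=5, DD=1). A minimal sketch of one IPD round under those payoffs follows; the names `PAYOFFS` and `play_round` are illustrative, not part of the `pdbench` API.

```python
# Standard PD payoffs from the Phase 1 setup: CC=3, CD=0, DC=5, DD=1.
# Each entry maps (move_a, move_b) -> (payoff_a, payoff_b).
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def play_round(a: str, b: str) -> tuple[int, int]:
    """Return (payoff_a, payoff_b) for one round of the PD."""
    return PAYOFFS[(a, b)]
```

Note that mutual cooperation (3, 3) beats mutual defection (1, 1) collectively, while unilateral defection (5 vs. 0) tempts each player individually — exactly the dilemma structure described above.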
## What's Novel About Our Approach?
| Gap in Literature | Our Contribution |
|---|---|
| No prior work tested agent architecture (memory, tools, MCP) as PD variables | We systematically vary identity, memory regime, and context pressure |
| No prior work tested pre-decision communication as a controlled variable in LLM IPD | We add an optional chat phase and measure its effect |
| No prior work tested protocol safeguards (MCP-like) in PD settings | We compare unstructured vs. structured communication channels |
| No prior work mapped LLM personas to classical strategies using canonical baselines | We test 5 persona types against all 6 canonical PD strategies |
| No prior LLM-IPD study has appeared in IS venues | We target ICIS/AMCIS/JAIS with a governance + design science framing |
## Phase 2 Experimental Design (3 IVs)
Phase 2 manipulates three independent variables:
### 1. Agent Personality (6 levels)
| Personality | Description |
|---|---|
| Naive | No strategic awareness — cooperates by default |
| Aware | Basic defensive posture — recognizes exploitation |
| Strategic | Explicit game-theoretic reasoning |
| Deceptive | Appears cooperative but exploits when optimal |
| Manipulative | Actively influences opponent behavior via communication |
| Cooperative | Genuine mutual benefit orientation |
These operationalize a gradient from vulnerability to adversarial intent.
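One way to operationalize this gradient is as system-prompt fragments keyed by personality. The wording below is purely illustrative — it paraphrases the table's descriptions and is not the actual prompt text used in the study.

```python
# Hypothetical persona prompts, ordered along the
# vulnerability-to-adversarial gradient. Illustrative wording only.
PERSONAS = {
    "naive": "Cooperate by default and assume your partner will too.",
    "aware": "Cooperate, but recognize exploitation and defend against it.",
    "strategic": "Reason explicitly about payoffs before each move.",
    "deceptive": "Appear cooperative, but defect when it maximizes payoff.",
    "manipulative": "Use messages to steer the opponent's behavior.",
    "cooperative": "Seek genuine mutual benefit across the whole game.",
}
```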
### 2. Pre-decision Communication (2 levels)
| Level | Description |
|---|---|
| Off | Standard PD — choose based on history only |
| On | Agents exchange messages before choosing each round |
### 3. Protocol Mode (2 levels)
| Level | Description |
|---|---|
| Unstructured | Free-form interaction |
| Structured | Protocol validator enforces format, schema, and authorization (MCP-like) |
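The Structured condition's validator can be sketched as a gate that rejects malformed or unauthorized messages before they reach the opponent. The field names, sender IDs, and length limit below are assumptions for illustration — not the MCP specification or the study's actual validator.

```python
# Sketch of an MCP-like message validator for the Structured condition.
# Schema, sender IDs, and the 500-char limit are illustrative assumptions.
REQUIRED_FIELDS = {"sender", "round", "message"}
AUTHORIZED_SENDERS = {"agent_a", "agent_b"}

def validate(msg: dict) -> bool:
    """Enforce format, schema, and authorization on a chat message."""
    if not REQUIRED_FIELDS <= msg.keys():
        return False  # schema: all required fields must be present
    if msg["sender"] not in AUTHORIZED_SENDERS:
        return False  # authorization: only registered agents may speak
    if not isinstance(msg["message"], str) or len(msg["message"]) > 500:
        return False  # format: bounded free-text payload
    return True
```

In the Unstructured condition this gate is simply absent, so any free-form interaction passes through — which is the contrast the 2×2 design measures.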
### The 2×2 Core (Chat × Protocol)

| | No Chat | Chat |
|---|---|---|
| Unstructured | Pure PD baseline | Communication without safeguards |
| Structured (MCP) | Protocol-only | Communication with protocol safeguards |
### Full Factorial: 6 × 2 × 2 = 24 conditions (per opponent type)
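The condition grid falls out of a Cartesian product over the three IVs; the factor labels below mirror the tables above, though the actual `pdbench` condition encoding may differ.

```python
from itertools import product

# The three Phase 2 independent variables, as listed in the tables above.
personalities = ["naive", "aware", "strategic",
                 "deceptive", "manipulative", "cooperative"]
chat = ["off", "on"]                        # pre-decision communication
protocol = ["unstructured", "structured"]   # MCP-like validation on/off

# Full factorial: every combination is one experimental condition.
conditions = list(product(personalities, chat, protocol))
```

Each of the 24 cells is then crossed with every opponent type, so the total run count scales linearly with the number of canonical baseline strategies.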
## Preliminary Results (Phase 1)
Phase 1 baseline experiments with gpt-4.1-mini (temperature=0, 50-round fixed horizon, standard payoff CC=3/CD=0/DC=5/DD=1) established:
- LLM agents can identify and adapt to canonical strategies — cooperative personas achieved perfect mutual cooperation against TFT, GRIM, GTFT, WSLS, and ALLC
- Cooperative personas outperform selfish ones — 27% higher aggregate payoff (132.4 vs. 104.3 per game)
- Selfish exploitation is locally optimal but globally costly — ruthless optimizer scored 250 against ALLC but only 50 against ALLD
- The pdbench artifact produces reproducible results consistent with game-theoretic predictions
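The extreme scores reported above follow directly from the payoff matrix over a 50-round horizon: always-defect earns 5 per round against ALLC (250 total) but only 1 per round in mutual defection (50 total). A sketch of a few canonical baselines reproduces this arithmetic; function names are illustrative, not the `pdbench` implementations.

```python
# Canonical PD strategies as functions of the opponent's move history.
# Names and the scoring loop are illustrative, not the pdbench code.
PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
           ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def allc(opp_history): return "C"   # always cooperate
def alld(opp_history): return "D"   # always defect
def tit_for_tat(opp_history):       # open with C, then mirror opponent
    return opp_history[-1] if opp_history else "C"

def score(strat_a, strat_b, rounds=50):
    """Total payoff for player A over a fixed-horizon match."""
    hist_a, hist_b, total_a = [], [], 0
    for _ in range(rounds):
        a, b = strat_a(hist_b), strat_b(hist_a)
        total_a += PAYOFFS[(a, b)][0]
        hist_a.append(a)
        hist_b.append(b)
    return total_a
```

Under these definitions, `score(alld, allc)` yields 250 and `score(alld, alld)` yields 50, matching the "locally optimal but globally costly" pattern observed for the ruthless optimizer.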
## Target Venues
- ICIS (International Conference on Information Systems) — governance & design science track
- AMCIS (Americas Conference on Information Systems) — AI & intelligent systems track
- JAIS (Journal of the Association for Information Systems) — full research paper
## Related Work
See the Literature Review for detailed coverage of prior studies.