SpireBench

A reproducible LLM benchmark on Slay the Spire 2 played through the HermesBridge mod. Each run is one full game from Neow to death or victory, captured as a line-for-line bridge command log plus a structured stat record. This site publishes the trial-v0 results.

Runs published: 10
Reached Act 3: 1
Victories: 0
Models tested: 3
Models
glm-5.1 ×5, gemini-3.1-pro-preview ×3, gpt-5.5 ×2
Characters
IRONCLAD ×3, SILENT ×3, REGENT ×2, DEFECT ×1, NECROBINDER ×1

What this is, what it isn't

It is

  • One commercial roguelike, played end-to-end, no resets.
  • A fixed JSON-only tool surface (PlayCard, EndTurn, etc.).
  • A frozen prompt and pre-assigned character; no operator coaching, no MemPalace, no sub-agents.
  • Per-run records archived with the original .run save.
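
As an illustration of what a fixed JSON-only tool surface implies, here is a minimal Python sketch of a strict parser for one model reply. Only the tool names PlayCard and EndTurn come from the list above; the "tool" and "args" field names, the card_index argument, and the parse_tool_call helper are hypothetical and are not the HermesBridge wire format.

```python
import json

# PlayCard and EndTurn are named in the text; the full tool list is not,
# so this set is deliberately incomplete. Field names are assumptions.
ALLOWED_TOOLS = {"PlayCard", "EndTurn"}


def parse_tool_call(raw: str) -> dict:
    """Parse one model reply as a JSON-only tool call, rejecting anything else."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Prose, markdown, or partial JSON is rejected outright.
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    tool = call.get("tool")
    if tool not in ALLOWED_TOOLS:
        raise ValueError(f"unknown tool: {tool!r}")
    args = call.get("args", {})
    if not isinstance(args, dict):
        raise ValueError("args must be a JSON object")
    return {"tool": tool, "args": args}


if __name__ == "__main__":
    # Hypothetical argument name, for illustration only.
    print(parse_tool_call('{"tool": "PlayCard", "args": {"card_index": 0}}'))
```

Rejecting every reply that is not a single well-formed JSON object is one way to keep the surface fixed. It also shows why tool reliability feeds into the results: a model that wraps its move in prose loses the action regardless of how good the move was.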

It is not

  • A win-rate leaderboard. trial-v0 has too few runs.
  • A measure of pure reasoning. Tool reliability and bridge quirks also affect the outcome.
  • Cheat-resistant against models that have memorized StS1 strategy guides; that's why the game is StS2.