SpireBench
A reproducible LLM benchmark on Slay the Spire 2 played through the HermesBridge mod. Each run is one full game from Neow to death or victory, captured as a line-for-line bridge command log plus a structured stat record. This site publishes the trial-v0 results.
What this is, what it isn't
It is
- One commercial roguelike, played end-to-end, no resets.
- A fixed JSON-only tool surface (PlayCard, EndTurn, etc.).
- A frozen prompt and pre-assigned character; no operator coaching, no MemPalace, no sub-agents.
- Per-run records archived with the original .runsave.
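To make the "fixed JSON-only tool surface" concrete, here is a minimal sketch of how a harness might reject any model reply that falls outside an allowed tool list. Only the tool names PlayCard and EndTurn come from the list above. The message shape (`tool`, `args`), the argument names (`card_index`, `target`), and the `parse_command` helper are hypothetical and are not taken from HermesBridge.

```python
import json

# Hypothetical schema: only the tool names PlayCard and EndTurn come from the
# text above. The field names and allowed arguments are illustrative assumptions.
ALLOWED_TOOLS = {
    "PlayCard": {"card_index", "target"},
    "EndTurn": set(),
}


def parse_command(raw: str) -> dict:
    """Parse one model reply and reject anything outside the fixed tool surface."""
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(msg, dict):
        raise ValueError("reply must be a JSON object")

    tool = msg.get("tool")
    if not isinstance(tool, str) or tool not in ALLOWED_TOOLS:
        raise ValueError(f"unknown tool: {tool!r}")

    args = msg.get("args", {})
    if not isinstance(args, dict):
        raise ValueError("args must be a JSON object")

    # Reject argument names that the (assumed) schema does not list for this tool.
    extra = set(args) - ALLOWED_TOOLS[tool]
    if extra:
        raise ValueError(f"unexpected arguments for {tool}: {sorted(extra)}")

    return {"tool": tool, "args": args}
```

Under these assumptions, `parse_command('{"tool": "EndTurn"}')` returns a normalized command, while free-form prose or an unlisted tool name raises `ValueError` and never reaches the game.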
It is not
- A win-rate leaderboard. trial-v0 has too few runs.
- A measure of pure reasoning: tool reliability and bridge quirks also contribute to the results.
- Fully cheat-resistant against models that have memorized StS1 strategy guides; running on StS2 is meant to limit that advantage.