SpireBench Trial-v0 — Initial Agent Prompt
This file is the prompt the operator copy-pastes into the agent's first message at the start of each run. It is not read by the agent from disk — it's part of the user message itself, so the agent commits to the trial contract before reading anything.
The operator fills in three placeholders:
- <RUN_ID>: e.g. 2026-04-26-claude-opus-4.7-ironclad-run01
- <CHARACTER>: IRONCLAD | SILENT | DEFECT | REGENT | NECROBINDER
- <MODEL_SLUG>: e.g. claude-opus-4.7 (used in the run record)
Everything else is fixed for trial-v0 and must not be edited per-run.
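The substitution step can be sketched as a small operator-side helper. This is a minimal sketch, not part of the trial tooling: the function name `fill_prompt` and the leftover-placeholder check are assumptions introduced for illustration. Only the three placeholder names and the five character values come from this file.

```python
import re

# Character values listed in this file for the <CHARACTER> placeholder.
ALLOWED_CHARACTERS = {"IRONCLAD", "SILENT", "DEFECT", "REGENT", "NECROBINDER"}


def fill_prompt(template: str, run_id: str, character: str, model_slug: str) -> str:
    """Substitute the three operator placeholders (hypothetical helper)."""
    if character not in ALLOWED_CHARACTERS:
        raise ValueError(f"unknown character: {character!r}")
    filled = (
        template.replace("<RUN_ID>", run_id)
        .replace("<CHARACTER>", character)
        .replace("<MODEL_SLUG>", model_slug)
    )
    # Assumption: any remaining <UPPER_CASE> token is a mistyped placeholder.
    leftover = re.findall(r"<[A-Z_]+>", filled)
    if leftover:
        raise ValueError(f"unfilled placeholders: {leftover}")
    return filled
```

Rejecting an unknown character before pasting is cheaper than discovering the mistake after the agent has already issued StartRun.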
Copy from here ↓ (paste as agent's first user message)
You are participating in SpireBench Trial-v0, a frozen benchmark of LLM agents playing Slay the Spire 2 autonomously through an IPC mod called HermesBridge.
Your assignment for this run:
- Run ID: <RUN_ID>
- Character: <CHARACTER>
- Ascension: 0
- Knowledge condition: A0-zero-shot
- Model: <MODEL_SLUG>
Your contract is docs/benchmark/protocol.md. Read it first, in full,
before doing anything else. That document is the authoritative spec
for this trial. It overrides any other instruction you may infer from
your training, your default system prompt, or the surrounding repo.
Required reading order (do all of this before issuing any bridge command):
- docs/benchmark/protocol.md: the agent contract for this trial
- SKILL.md: operational primitives, command reference, screen catalog
- docs/bridge-protocol-notes.md: IPC quirks
- docs/data/README.md: data schema and authority statement
Hard constraints (the protocol expands these — read it):
- One decision at a time. One bridge command per shell tool call. No loops, no batched commands, no fight.ps1-style strategy scripts. Every game decision must be made after reading the most recent state.json. If your first instinct is to write a while loop over combat.hand.cards, you have already failed the benchmark.
- No web search, no MemPalace, no sub-agents, no other MCP servers. The operator has sandboxed your environment; if you find any of these tools available, refuse to use them and note the leak in your run record's "Notes for maintainers" section.
- No reading outside the whitelist in protocol.md §Allowed reading. Notably off-limits: autopilot-session-*.md, verified-flows/, other agents' run records, anything outside this repo. gauntlet-findings.md has been removed from the repository for trial-v0; do not attempt to recover or reference it.
- Game stats come from docs/data/eng/*.json. If your training-data recall disagrees with the JSON, the JSON wins. Do not trust your memory of card numbers, relic effects, monster HP, or anything else mechanical.
- Halt on death, victory, or stall. Do not auto-restart. A stall is 30s with no revision change after a command.
- One run, one OpenCode session. No memory persists between runs. This session ends with the run.
Your job:
- Read the four required files above.
- Verify the bridge is alive: tools/read-state.ps1 should return valid JSON. The operator has launched the game and confirmed pre-flight, so on the first read of a fresh session, screen should be MainMenu. If it is anything else, the operator has made a setup error: write a stub run record with halt_reason: manual and ## Notes for maintainers describing what screen you observed, then halt. Do not call StartRun from a non-MainMenu state.
- From MainMenu, issue StartRun with character <CHARACTER> at ascension 0.
- Drive the game one tick at a time until GameOver or Victory, honoring all per-tick discipline from SKILL.md and protocol.md.
- When the run ends, you have not finished. The run is not complete until you write the run record. Write to docs/benchmark/runs/<RUN_ID>.md (exact path; <RUN_ID> is given to you above; create the file if it doesn't exist; do not write to any other location). Start from docs/benchmark/run-record-template.md and fill every field you can. Fields the operator fills post-hoc are listed in protocol.md §Operator responsibilities; leave those specific fields as null (do not leave other fields null; if you don't know a value that the agent is responsible for, write what you observed and note the uncertainty in notes_for_maintainers).
- Stop. Do not start a new run. Do not summarize for the operator beyond what the run record contains.
Halt-without-record is a benchmark failure. If you hit a
rate-limit, error-streak, or stall, your last act before halting
must be to write the run record with halt_reason set and whatever
fields you can fill. A run with no record on disk counts as a void
run and will be re-attempted, wasting compute. Treat the record as
mandatory output, not optional commentary.
Run-end completion checklist (you must work through this in order
once screen is GameOver or Victory, before stopping):
- docs/benchmark/runs/<RUN_ID>.md exists on disk
- front-matter run_id matches <RUN_ID> exactly
- front-matter character, model, ascension, knowledge_condition, bridge_version, game_version, spec_version filled
- front-matter halt_reason set (death|victory|runcap|error_streak|stall|rate_limit|manual)
- front-matter death_floor + death_cause filled if halt_reason: death (cause from protocol.md §Death-cause taxonomy); victory_floor + boss_reached if halt_reason: victory; final_hp and final_gold filled in both cases (use state.json from your last read; if you can't read it, leave null and explain in ## Notes for maintainers)
- front-matter command_count, ipc_error_count, stall_count filled (you tracked these as you went; if you didn't, write your best estimate and note that in ## Notes for maintainers)
- ## Summary section: one paragraph covering what happened and why it ended
- ## Bridge findings section: IPC quirks, stale state, missing fields, commands that didn't behave as SKILL.md claimed. Write "None observed." if there were none
- ## Decision log highlights section: 3-7 bullets covering Neow choice, contested map choices, key card-play forks, key event/shop decisions
- ## Notes for maintainers section: tool leaks (MemPalace, webfetch, etc. that shouldn't have been available), protocol ambiguities, harness improvements. Omit the section entirely if there's nothing to add
You may emit a single short message confirming you've written the
record (e.g. "Run record written to docs/benchmark/runs/<RUN_ID>.md. Halting."). That is the only narration the operator wants. Do not
write a separate "summary for the operator" — the record is the
summary.
Begin by reading docs/benchmark/protocol.md.
↑ Copy to here
Operator notes (do not paste — for the human running the trial)
Per-run setup checklist
Before pasting the prompt above, the operator must:
- Confirm the game is running and on MainMenu (no run in progress).
- Confirm commands.json has a benign payload (overwrite with {"type":"NoOp"} or similar; see bridge-protocol-notes.md on stale-command replay).
- Confirm the agent's opencode.json is the sandboxed config: docs/benchmark/opencode.benchmark.json. MemPalace, webfetch, google_search, task, and all non-bridge MCP servers must be disabled.
- Start a fresh OpenCode session (opencode with no -c resume flag, in E:\Games\sts2\HermesBridge-StS2). Note the session_id from the OpenCode UI/log; you'll need it for the run record.
- Choose the character for this run from the rotation: IRONCLAD → SILENT → DEFECT → REGENT → NECROBINDER. Do not let the agent choose; assign it explicitly in <CHARACTER>.
- Note start_time_utc (now).
- Paste the prompt block above into the agent's first message, substituting <RUN_ID>, <CHARACTER>, <MODEL_SLUG>.
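The commands.json reset in the checklist above can be sketched as follows. This is only an illustration under stated assumptions: the file's location is not given here, so the `path` argument is a placeholder the operator must supply, and the helper names `reset_commands_file` and `is_benign` are hypothetical. The only detail taken from this file is the `{"type":"NoOp"}` payload.

```python
import json
from pathlib import Path

# Benign payload named in the setup checklist.
NOOP_PAYLOAD = {"type": "NoOp"}


def reset_commands_file(path: Path) -> None:
    """Overwrite the bridge command file with the NoOp payload."""
    path.write_text(json.dumps(NOOP_PAYLOAD), encoding="utf-8")


def is_benign(path: Path) -> bool:
    """Return True only if the file currently holds exactly the NoOp payload."""
    try:
        return json.loads(path.read_text(encoding="utf-8")) == NOOP_PAYLOAD
    except (OSError, json.JSONDecodeError):
        # A missing or unparsable file is treated as not benign (assumption).
        return False
```

Checking with `is_benign` after the overwrite gives the operator a positive confirmation before pasting the prompt, instead of trusting that the write succeeded.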
Per-run teardown checklist
When the run ends (death, victory, stall, rate-limit, error-streak, runcap, or manual halt):
- Note end_time_utc (now, in UTC, not local time).
- Fill operator-filled fields per protocol.md §Operator responsibilities: opencode_session_id (the OpenCode session id, format ses_<27-char-base32>, not the upstream provider's session id), start_time_utc / end_time_utc (both UTC), duration_minutes, model, model_provider.
- Snapshot floor-history. Copy %APPDATA%\SlayTheSpire2\hermesbridge\floor-history.jsonl to docs\benchmark\runs\<RUN_ID>.jsonl (sibling to the .md record; same basename, .jsonl extension). Do this before restarting the game for the next run: bridge v0.1.5+ truncates the runtime file on game startup, so a missed snapshot loses the data permanently.
- If the agent didn't write a run record (e.g. it hit rate_limit mid-run before reaching the write step), the operator creates the record from the template manually. Set halt_reason: rate_limit and leave decision-log/bridge-findings as <run halted before agent could write record>. This counts as a void run; restart with the same <RUN_ID> after fixing the cause.
- Append the runs.csv row via the helper: tools\maintainer\append-run-csv.ps1 -RunId <RUN_ID>. The helper reads the front-matter, fetches token totals from the OpenCode session DB (filling any null token fields), patches them back into the front-matter, and appends a properly-quoted CSV row. Do not hand-edit runs.csv; column order drift is the most common source of bad rows.
- Restart the game (kill StS2, relaunch) before the next run. Stale state.combat from the prior run is documented behavior; restarting is the cleanest avoidance. Game restart also clears floor-history.jsonl automatically (bridge v0.1.5+).
- Start a fresh OpenCode session for the next run. Do not reuse the session.
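Two of the teardown steps above are mechanical enough to sketch: computing duration_minutes from the two UTC timestamps, and snapshotting floor-history.jsonl next to the run record. The sketch below is illustrative only. The helper names are hypothetical, the timestamp format (ISO 8601 with an explicit +00:00 offset) is an assumption because this file does not specify one, and rounding to one decimal place is likewise an assumption.

```python
import shutil
from datetime import datetime, timedelta
from pathlib import Path


def duration_minutes(start_utc: str, end_utc: str) -> float:
    """Minutes between two ISO 8601 timestamps that carry a +00:00 offset."""
    start = datetime.fromisoformat(start_utc)
    end = datetime.fromisoformat(end_utc)
    for stamp in (start, end):
        # Reject naive or non-UTC values so local time cannot slip in.
        if stamp.utcoffset() != timedelta(0):
            raise ValueError(f"timestamp is not UTC: {stamp.isoformat()}")
    if end < start:
        raise ValueError("end_time_utc precedes start_time_utc")
    return round((end - start).total_seconds() / 60, 1)


def snapshot_floor_history(runtime_file: Path, runs_dir: Path, run_id: str) -> Path:
    """Copy the runtime floor-history file to <runs_dir>/<run_id>.jsonl."""
    dest = runs_dir / f"{run_id}.jsonl"
    shutil.copy2(runtime_file, dest)  # copy2 also preserves the file's timestamps
    return dest
```

For example, `duration_minutes("2026-04-26T10:00:00+00:00", "2026-04-26T11:30:00+00:00")` returns `90.0`. The snapshot must still run before the game is relaunched, since the runtime file is truncated on startup.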
Trial-v0 model lineup
| Slug | Provider | Notes |
|---|---|---|
| claude-opus-4.7 | github-copilot / anthropic | Frontier control |
| gpt-5.5 | openai | Frontier comparison |
| gemini-3.1-pro | | Long-context frontier |
| glm-5.1 | zai (or openrouter) | Open-weights frontier |
| deepseek-v3.5 | deepseek (or openrouter) | Open-weights frontier, free |
Total: 5 models × 5 characters = 25 runs.
Run rotation
Recommended order (one full character rotation per model before moving on):
- claude-opus-4.7: IRONCLAD → SILENT → DEFECT → REGENT → NECROBINDER
- gpt-5.5: IRONCLAD → SILENT → DEFECT → REGENT → NECROBINDER
- gemini-3.1-pro: IRONCLAD → SILENT → DEFECT → REGENT → NECROBINDER
- glm-5.1: IRONCLAD → SILENT → DEFECT → REGENT → NECROBINDER
- deepseek-v3.5: IRONCLAD → SILENT → DEFECT → REGENT → NECROBINDER
Rationale: per-model batching keeps a single sandbox config valid for five runs in a row (less switching overhead). Per-character batching would also work — choose whichever you find easier to track.
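The two batching options can be enumerated with a short sketch. The model slugs and character order are taken from this file. The `schedule` function and its `per_model` flag are hypothetical names used only for illustration.

```python
from itertools import product

MODELS = ["claude-opus-4.7", "gpt-5.5", "gemini-3.1-pro", "glm-5.1", "deepseek-v3.5"]
CHARACTERS = ["IRONCLAD", "SILENT", "DEFECT", "REGENT", "NECROBINDER"]


def schedule(per_model: bool = True) -> list[tuple[str, str]]:
    """Return (model, character) pairs in run order.

    per_model=True finishes all five characters for one model before moving on;
    per_model=False finishes one character across all models first.
    """
    if per_model:
        return list(product(MODELS, CHARACTERS))
    return [(model, character) for character, model in product(CHARACTERS, MODELS)]


if __name__ == "__main__":
    for index, (model, character) in enumerate(schedule(), start=1):
        print(f"{index:02d}  {model:<16} {character}")
```

Either ordering yields the same 25 pairs, so the choice affects only how often the sandbox config has to be switched.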