Commit Graph

3 Commits

Author SHA1 Message Date
Evan 7fa81c6bb9 perf: reduce core live-memory footprint by 45% on large maps (#4507)
## Summary

Reduces the simulation's steady-state memory footprint. On Giant World
Map at 20 game-minutes (12 000 ticks, 400 bots, seed `perf-default`),
live memory after a full GC drops **293 MB → 161 MB (−45%)**; unforced
peak heap drops **326 MB → 165 MB**. The simulation also runs ~10%
faster (85 → 94 ticks/s). The final game-state hash is **bit-identical**
(`57830793797434300`) — no behavior change.

## Measurement (first commit)

The full-game perf harness gains a footprint mode:

- `--footprint` — forces a full GC at every `--window` boundary and
records the live heap / ArrayBuffer / RSS curve across the game
(requires `NODE_OPTIONS=--expose-gc`).
- `--snapshot-at 0,2000,12000` — writes V8 `.heapsnapshot` files at
chosen ticks.
- `HeapSnapshotRetainers.ts` — attributes every heap node to its nearest
meaningfully-named retainer (e.g. `PlayerImpl._tiles`), plus prints
retainer chains for all nodes ≥128 KB. `HeapSnapshotSummary.ts` is a
streaming fallback for snapshots too large to `JSON.parse`.

Baseline attribution at tick 12 000: player `_tiles`/`_borderTiles` Sets
**83 MB**, GameMap `refToX`/`refToY` lookup tables **38 MB**, two
duplicate 30.5 MB visited-scratch arrays, trade-ship stepper paths **15
MB**, a construction-only flood-fill queue **9.5 MB**.

## Optimizations

**Map-sized buffers (second commit):**
- `GameMap.x()/y()` compute `ref % width` / `(ref / width) | 0` instead
of reading two per-tile Uint16 tables (−38 MB). The arithmetic is
cheaper than the tables' random-access cache misses — this is where the
speedup comes from.
- `PlayerExecution` and `SpatialQuery` each kept their own per-game
generation-stamped visited `Uint32Array`; both now share one via
`TileTraversalScratch` (−30 MB).
- `PathFinderStepper` stores numeric paths as `Uint32Array` (half the
bytes; steppers hold their full path for a unit's whole journey).
- `ConnectedComponents` frees its flood-fill queue after `initialize()`.

**Player tile sets (third commit):**
- New `TileSet`: insertion-ordered set of tile refs backed by a dense
`Uint32Array` plus an open-addressing hash index — ~12 bytes/element vs
~34 for a native `Set<number>`. Deletes tombstone; compaction is
deferred while iteration is in progress so positions never shift under
an iterator.
- Iteration semantics match `Set` exactly (insertion order, entries
added mid-iteration visited, deleted ones skipped, delete+re-add moves
to end) — the simulation relies on this order for determinism, and the
unchanged hash confirms it.
- `Player.borderTiles()` now returns `ReadonlyTileSet` (a native `Set`
still satisfies it structurally); `GameRunner.playerBorderTiles` copies
into a real `Set` since that result crosses the worker boundary via
structured clone.

## Footprint curve (giant world map, live MB after forced GC)

| checkpoint | before | after |
|---|---|---|
| spawn end | 20 + 100 buf | 20 + 55 buf |
| tick 6301 | 119 + 161 buf | 29 + 127 buf |
| tick 12301 | 130 + 161 buf | 32 + 129 buf |

## Validation

- Final hash `57830793797434300` identical across baseline / round 1 /
round 2 runs (12 000 ticks).
- Full suite passes (1798 + 126 tests), including new `TileSet` tests:
order semantics, mutation-during-iteration parity with `Set`, tombstone
compaction, and a 20 000-op randomized differential test against native
`Set`.
- Runs recorded in
`tests/perf/output/footprint-{baseline,round1,round2}-giant.txt`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-07-04 15:25:29 -07:00
Evan 5e4b2791aa perf: reduce core-sim GC churn 42% and add GC-churn profiling to the perf harness (#4494)
## Summary

Reduces core-simulation GC churn by **42%** on a 20-game-minute Giant
World Map run, and extends the headless full-game perf harness so churn
is measurable and regressions are visible.

### 1. GC-churn measurement (`tests/perf/fullgame/GcProfiler.ts`)

`npm run perf:game` now reports:

- **GC pauses** by kind (minor/major/incremental) via a
`PerformanceObserver` on `'gc'` entries, bucketed into tick windows by
timestamp (V8 only delivers these entries on a timer task, so they're
flushed after the run)
- **Allocation rate** per `--window N` ticks (default 1000) from
used-heap deltas sampled every tick, so churn can be tracked across game
phases
- **Top allocating functions** from the V8 sampling heap profiler with
`includeObjectsCollectedBy{Major,Minor}GC` — i.e. actual churn including
short-lived garbage, not live memory — plus a `.heapprofile` loadable in
Chrome DevTools (Memory → Allocation sampling)

New flags: `--window N`, `--no-gc-profile`, `--no-alloc-profile`.

### 2. Allocation reductions in the hot paths it found

| Site | Change |
|---|---|
| `GameMap.bfs` | inline neighbor enumeration instead of an array per
visited tile |
| `GameMap`/`Game` | new `forEachNeighborNSWE` — allocation-free
iterator matching `neighbors()` N,S,W,E order for order-sensitive
callers (`forEachNeighbor` visits W,E,N,S, so substituting it would
change sim behavior) |
| `PlayerImpl.nearby` / `sharesBorderWith` / `shoreReachableNeighbors` |
no per-call neighbor arrays; no materialized shore-tile array |
| `PlayerImpl.units(types)` | gather into a reusable scratch buffer,
return one exact-size slice (still a fresh snapshot array per call) |
| `AiAttackBehavior.maybeAttack` | single pass over border neighbors
replacing the `flatMap`/`filter`/`map` chain over every border tile |
| `AiAttackBehavior.isBorderingNukedTerritory` | reusable `neighbors4`
buffer with early exit |
| `SharedWaterCache.build` | allocation-free neighbor iteration |
| `SpatialQuery.bfsNearest` | first-minimum scan instead of
collect-then-stable-sort (identical result incl. tie-breaking) |

### Results (Giant World Map, 400 bots, 12,000 ticks ≈ 20 game-minutes,
seed `perf-default`)

| Metric | Before | After |
|---|---|---|
| Sampled allocations (incl. collected) | 97.7 GB | **56.9 GB (−42%)** |
| GC count / total pause | 1,682 / 3,313 ms (1.8% of wall) | 1,058 /
2,087 ms (1.2%) |
| Ticks/sec | 66 | 70 |
| p99 / max tick | 49.9 ms / 988 ms | 43.5 ms / 689 ms |
| Ticks over 100 ms budget | 31 | 19 |

## Determinism

Every rewrite preserves exact iteration order (the new NSWE iterator
exists precisely for the order-sensitive sites). Verified by identical
final game-state hashes on three runs: Giant World Map 12,000 ticks
(`67286276735690560`), Giant World Map 2,000 ticks, and World 1,800
ticks.

## Test plan

- [x] Full suite green (1,896 tests)
- [x] New tests: `forEachNeighborNSWE` order contract vs `neighbors()`
over every tile; `units()` filtering semantics (insertion order,
fresh-array guarantee, duplicate types, Set path)
- [x] Final-hash equality on 3 seeded headless runs (2 maps)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 12:30:28 -07:00
Evan 8da2291a49 Add full-game perf harness for the core simulation (#4228)
## Summary

Adds a full-game performance harness under `tests/perf/fullgame/` that
runs the **real simulation pipeline** headlessly — `GameRunner` +
`Executor` with the real `Config`, nations from the map manifest, and
bots on a production map from `resources/maps/` — for a configurable
number of ticks, then reports where the time goes.

```bash
npm run perf:game                                        # world, 400 bots, 1800 ticks
npm run perf:game -- --map giantworldmap --ticks 3600
npm run perf:game -- --no-exec-profile                   # purest CPU profile (no timing wrappers)
```

## What it reports

1. **Per-tick wall time** — mean / p50 / p95 / p99 / max, count of ticks
over the 100ms budget, and the slowest ticks by tick number.
2. **Time per Execution class** — every `Execution`'s `init()`/`tick()`
is timed and aggregated by class name (`AttackExecution`,
`NationExecution`, …).
3. **Top functions by self time** — via the V8 sampling profiler
(`node:inspector`), so no instrumentation skew. Also writes a
`.cpuprofile` to `tests/perf/output/` (gitignored) that opens in Chrome
DevTools as a flame graph.

## Determinism

The run is fully deterministic for a given `--seed`/`--map`/`--bots`
(verified: identical final hashes across runs), and the final game-state
hash is printed — so an optimization can be checked to not change
simulation behavior.

## Sample output (world, 400 bots, 1800 ticks)

```
--- Per-tick wall time (game phase) ---
mean 9.04ms | p50 7.90ms | p95 17.1ms | p99 21.5ms | max 31.7ms
Over 100ms budget: 0 / 1800 ticks

--- Time by Execution class ---
execution                      total ms  %     tick ms  init ms  ticks   instances
AttackExecution                6568      48.8  6288     280      212536  4200
PlayerExecution                2832      21.0  2832     0.36     492049  472
NationExecution                2508      18.6  2508     0.23     144654  72
TransportShipExecution         703       5.2   96.0     607      30440   257
...

--- Top functions by self time (V8 sampling profiler) ---
self ms  %    function                 location
1065     6.5  forEachNeighborWithDiag  src/core/game/GameImpl.ts
979      6.0  conquer                  src/core/game/GameImpl.ts
948      5.8  (anonymous)              src/core/execution/AttackExecution.ts
595      3.6  toFullUpdate             src/core/game/PlayerImpl.ts
...
```

The harness lives in a subdirectory so the existing `npm run perf`
micro-benchmark runner (which globs `tests/perf/*.ts`) doesn't pick it
up.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 18:52:18 -07:00