🌊 refactor: Local Snapshot for Aggregate Key Cache to Avoid Redundant Redis GETs (#12422)

* perf: Add local snapshot to aggregate key cache to avoid redundant Redis GETs

getAll() was being called 20+ times per chat request (once per tool,
per server config lookup, per connection check). Each call hit Redis
even though the data doesn't change within a request cycle.

Add an in-memory snapshot with 5s TTL that collapses all reads within
the window into a single Redis GET. Writes (add/update/remove/reset)
invalidate the snapshot immediately so mutations are never stale.

Also removes the debug logger that was producing noisy per-call logs.

* fix: Prevent snapshot mutation and guarantee cleanup on write failure

- Never mutate the snapshot object in-place during writes. Build a new
  object (spread) so concurrent readers never observe uncommitted state.
- Move invalidateLocalSnapshot() into withWriteLock's finally block so
  cleanup is guaranteed even when successCheck throws on Redis failure.
- After successful writes, populate the snapshot with the committed state
  to avoid an unnecessary Redis GET on the next read.
- Use Date.now() after the await in getAll() so the TTL window isn't
  shortened by Redis latency.
- Strengthen tests: spy on underlying Keyv cache to verify N getAll()
  calls collapse into 1 Redis GET, verify snapshot reference immutability.

* fix: Remove dead populateLocalSnapshot calls from write callbacks

populateLocalSnapshot was called inside withWriteLock callbacks, but
the finally block in withWriteLock always calls invalidateLocalSnapshot
immediately after — undoing the populate on every execution path.

Remove the dead method and its three call sites. The snapshot is
correctly cleared by finally on both success and failure paths. The
next getAll() after a write hits Redis once to fetch the committed
state, which is acceptable since writes only occur during init and
rare manual reinspection.

* fix: Derive local snapshot TTL from MCP_REGISTRY_CACHE_TTL config

Use cacheConfig.MCP_REGISTRY_CACHE_TTL (default 5000ms) instead of a
hardcoded 5s constant. When TTL is 0 (operator explicitly wants no
caching), the snapshot is disabled entirely — every getAll() hits Redis.

* fix: Add TTL expiry test, document 2×TTL staleness, clarify comments

- Add missing test for snapshot TTL expiry path (force-expire via
  localSnapshotExpiry mutation, verify Redis is hit again)
- Document 2×TTL max cross-instance staleness in localSnapshot JSDoc
- Document reset() intentionally bypasses withWriteLock
- Add inline comments explaining why early invalidateLocalSnapshot()
  in write callbacks is distinct from the finally-block cleanup
- Update cacheConfig.MCP_REGISTRY_CACHE_TTL JSDoc to reflect both
  use sites and the staleness implication
- Rename misleading test name for snapshot reference immutability
- Add epoch sentinel comment on localSnapshotExpiry initialization
This commit is contained in:
Danny Avila 2026-03-26 16:39:09 -04:00 committed by GitHub
parent 8e2721011e
commit 5e3b7bcde3
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
3 changed files with 159 additions and 19 deletions

View file

@ -128,8 +128,13 @@ const cacheConfig = {
REDIS_SCAN_COUNT: math(process.env.REDIS_SCAN_COUNT, 1000),
/**
* TTL in milliseconds for MCP registry read-through cache.
* This cache reduces redundant lookups within a single request flow.
* TTL in milliseconds for MCP registry caches. Used by both:
* - `MCPServersRegistry` read-through caches (`readThroughCache`/`readThroughCacheAll`)
* - `ServerConfigsCacheRedisAggregateKey` local snapshot (avoids redundant Redis GETs)
*
* Both layers use this value, so the effective max cross-instance staleness is up
* to 2× this value in multi-instance deployments. Set to 0 to disable the local
* snapshot entirely (every `getAll()` hits Redis directly).
* @default 5000 (5 seconds)
*/
MCP_REGISTRY_CACHE_TTL: math(process.env.MCP_REGISTRY_CACHE_TTL, 5000),