refactor: Use in-memory cache for App MCP configs to avoid Redis SCAN (#12410)

* perf: Use in-memory cache for App MCP configs to avoid Redis SCAN

The 'App' namespace holds static YAML-loaded configs identical on every
instance. Storing them in Redis and retrieving via SCAN + batch-GET
caused 60s+ stalls under concurrent load (#11624). Since these configs
are already loaded into memory at startup, bypass Redis entirely by
always returning ServerConfigsCacheInMemory for the 'App' namespace.

* ♻️ refactor: Extract APP_CACHE_NAMESPACE constant and harden tests

- Extract magic string 'App' to a shared `APP_CACHE_NAMESPACE` constant
  used by both ServerConfigsCacheFactory and MCPServersRegistry
- Document that `leaderOnly` is ignored for the App namespace
- Reset `cacheConfig.USE_REDIS` in test `beforeEach` to prevent
  ordering-dependent flakiness
- Fix import order in test file (longest to shortest)

* 🐛 fix: Populate App cache on follower instances in cluster mode

In cluster deployments, only the leader runs MCPServersInitializer to
inspect and cache MCP server configs. Followers previously read these
from Redis, but with the App namespace now using in-memory storage,
followers would have an empty cache.

Add populateLocalCache() so follower processes independently initialize
their own in-memory App cache from the same YAML configs after the
leader signals completion. The method is idempotent — if the cache is
already populated (leader case), it's a no-op.

* 🐛 fix: Use static flag for populateLocalCache idempotency

Replace getAllServerConfigs() idempotency check with a static
localCachePopulated flag. The previous check merged App + DB caches,
causing false early returns in deployments with publicly shared DB
configs, and poisoned the TTL read-through cache with stale results.

The static flag is zero-cost (no async/Redis/DB calls), immune to
DB config interference, and is reset alongside hasInitializedThisProcess
in resetProcessFlag() for test teardown.

Also set localCachePopulated=true after leader initialization completes,
so subsequent calls on the leader don't redundantly re-run populateLocalCache.
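
The flag-based idempotency described above can be sketched roughly as follows. This is a minimal illustration, not the actual initializer: only the names `localCachePopulated`, `populateLocalCache`, and `resetProcessFlag` come from this message, while the class name, the `load` callback, and `markPopulated` are hypothetical.

```typescript
// Hypothetical stand-in for the initializer; bodies are assumptions.
class InitializerSketch {
  private static localCachePopulated = false;

  /** Runs `load` only while the flag is unset; returns whether it ran. */
  static async populateLocalCache(load: () => Promise<void>): Promise<boolean> {
    if (InitializerSketch.localCachePopulated) {
      return false; // already populated: no async, Redis, or DB work
    }
    await load();
    InitializerSketch.localCachePopulated = true;
    return true;
  }

  /** Hypothetical hook: the leader sets the flag after its own init. */
  static markPopulated(): void {
    InitializerSketch.localCachePopulated = true;
  }

  /** Test teardown: allow the next call to populate again. */
  static resetProcessFlag(): void {
    InitializerSketch.localCachePopulated = false;
  }
}

// Counts how often the loader actually runs across three calls and one reset.
async function runFlagDemo(): Promise<number> {
  let loads = 0;
  const load = async (): Promise<void> => {
    loads += 1;
  };
  await InitializerSketch.populateLocalCache(load); // runs the loader
  await InitializerSketch.populateLocalCache(load); // no-op, flag is set
  InitializerSketch.resetProcessFlag();
  await InitializerSketch.populateLocalCache(load); // runs again after reset
  return loads;
}
```

Note that this sketch does not guard against two overlapping calls that both pass the check before either sets the flag.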

* 📝 docs: Document process-local reset() semantics for App cache

With the App namespace using in-memory storage, reset() only clears the
calling process's cache. Add JSDoc noting this behavioral change so
callers in cluster deployments know each instance must reset independently.

* test: Add follower cache population tests for MCPServersInitializer

Cover the populateLocalCache code path:
- Follower populates its own App cache after leader signals completion
- localCachePopulated flag prevents redundant re-initialization
- Fresh follower process independently initializes all servers

* 🧹 style: Fix import order to longest-to-shortest convention

* 🔬 test: Add Redis perf benchmark to isolate getAll() bottleneck

Add benchmarks that run against a live Redis instance to measure:
1. SCAN vs batched GET phases independently
2. SCAN cost scaling with total keyspace size (noise keys)
3. Concurrent getAll() at various concurrency levels (1/10/50/100)
4. Alternative: single aggregate key vs SCAN+GET
5. Alternative: raw MGET vs Keyv batch GET (serialization overhead)

Run with: npx jest --config packages/api/jest.config.mjs \
  --testPathPatterns="perf_benchmark" --coverage=false

* feat: Add aggregate-key Redis cache for MCP App configs

ServerConfigsCacheRedisAggregateKey stores all configs under a single
Redis key, making getAll() a single GET instead of SCAN + N GETs.

This eliminates the O(keyspace_size) SCAN that caused 60s+ stalls in
large deployments while preserving cross-instance visibility — all
instances read/write the same Redis key, so reinspection results
propagate automatically after readThroughCache TTL expiry.
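
The difference between the two layouts can be illustrated with a small sketch that counts round trips. It makes no claim about the real class: `FakeStore` is a hypothetical in-memory stand-in for a Redis client, and the key names and helper functions are made up for the example.

```typescript
// Hypothetical in-memory stand-in for Redis that counts round trips.
class FakeStore {
  private readonly data = new Map<string, string>();
  calls = 0;

  async get(key: string): Promise<string | undefined> {
    this.calls += 1;
    return this.data.get(key);
  }

  async set(key: string, value: string): Promise<void> {
    this.calls += 1;
    this.data.set(key, value);
  }

  // Counted as one call here; a real SCAN walks the whole keyspace in pages.
  async keysWithPrefix(prefix: string): Promise<string[]> {
    this.calls += 1;
    return Array.from(this.data.keys()).filter((k) => k.startsWith(prefix));
  }
}

type Configs = Record<string, { url: string }>;

// Per-key layout: one key listing plus one GET per config.
async function getAllPerKey(store: FakeStore, prefix: string): Promise<Configs> {
  const result: Configs = {};
  for (const key of await store.keysWithPrefix(prefix)) {
    const raw = await store.get(key);
    if (raw !== undefined) result[key.slice(prefix.length)] = JSON.parse(raw);
  }
  return result;
}

// Aggregate layout: all configs live in one JSON value under one key.
async function getAllAggregate(store: FakeStore, key: string): Promise<Configs> {
  const raw = await store.get(key);
  return raw === undefined ? {} : JSON.parse(raw);
}

async function compareRoundTrips(): Promise<{ perKey: number; aggregate: number }> {
  const store = new FakeStore();
  const configs: Configs = {
    a: { url: 'http://a.example' },
    b: { url: 'http://b.example' },
    c: { url: 'http://c.example' },
  };
  for (const name of Object.keys(configs)) {
    await store.set(`per-key:${name}`, JSON.stringify(configs[name]));
  }
  await store.set('aggregate', JSON.stringify(configs));

  store.calls = 0;
  await getAllPerKey(store, 'per-key:');
  const perKey = store.calls; // 1 listing + 3 GETs

  store.calls = 0;
  await getAllAggregate(store, 'aggregate');
  const aggregate = store.calls; // 1 GET
  return { perKey, aggregate };
}
```

With three configs, the per-key read costs four round trips in this model and the aggregate read costs one. The gap widens on a real server because the listing step scales with total keyspace size.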

* ♻️ refactor: Use aggregate-key cache for App namespace in factory

Update ServerConfigsCacheFactory to return ServerConfigsCacheRedisAggregateKey
for the App namespace when Redis is enabled, instead of ServerConfigsCacheInMemory.

This preserves cross-instance visibility (reinspection results propagate
through Redis) while eliminating SCAN. Non-App namespaces still use the
standard per-key ServerConfigsCacheRedis.
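
A rough sketch of the selection logic described above. The class names and `APP_CACHE_NAMESPACE` are taken from this message. The stub bodies, the `createCache` function, the `useRedis` parameter, and the in-memory fallback when Redis is disabled are assumptions for illustration.

```typescript
const APP_CACHE_NAMESPACE = 'App';

// Stub classes: names mirror the commit message, bodies are placeholders.
class ServerConfigsCacheInMemory {
  readonly kind = 'in-memory';
}
class ServerConfigsCacheRedis {
  readonly kind = 'redis-per-key';
  constructor(readonly namespace: string) {}
}
class ServerConfigsCacheRedisAggregateKey {
  readonly kind = 'redis-aggregate';
  constructor(readonly namespace: string) {}
}

type ServerConfigsCache =
  | ServerConfigsCacheInMemory
  | ServerConfigsCacheRedis
  | ServerConfigsCacheRedisAggregateKey;

// Hypothetical factory function; the real factory's signature may differ.
function createCache(namespace: string, useRedis: boolean): ServerConfigsCache {
  if (!useRedis) {
    return new ServerConfigsCacheInMemory(); // assumed fallback without Redis
  }
  if (namespace === APP_CACHE_NAMESPACE) {
    return new ServerConfigsCacheRedisAggregateKey(namespace); // single key, no SCAN
  }
  return new ServerConfigsCacheRedis(namespace); // standard per-key layout
}
```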

* 🗑️ revert: Remove populateLocalCache — no longer needed with aggregate key

With App configs stored under a single Redis key (aggregate approach),
followers read from Redis like before. The populateLocalCache mechanism
and its localCachePopulated flag are no longer necessary.

Also reverts the process-local reset() JSDoc since reset() is now
cluster-wide again via Redis.

* 🐛 fix: Add write mutex to aggregate cache and exclude perf benchmark from CI

- Add promise-based write lock to ServerConfigsCacheRedisAggregateKey to
  prevent concurrent read-modify-write races during parallel initialization
  (Promise.allSettled runs multiple addServer calls concurrently, causing
  last-write-wins data loss on the aggregate key)
- Rename perf benchmark to cache_integration pattern so CI skips it
  (requires live Redis)
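
The race and its fix can be reproduced with a small sketch. It is not the code from this change: `WriteLock`, `addServerSketch`, and the module-level `aggregate` object are hypothetical, and a timer stands in for the Redis round trip between read and write.

```typescript
// Process-local mutex built from a promise chain (hypothetical helper).
class WriteLock {
  private tail: Promise<void> = Promise.resolve();

  run<T>(task: () => Promise<T>): Promise<T> {
    const result = this.tail.then(task);
    // Keep the chain usable even if a task rejects.
    this.tail = result.then(
      () => undefined,
      () => undefined,
    );
    return result;
  }
}

// Stand-in for the JSON value stored under the single aggregate key.
let aggregate: Record<string, string> = {};

// Read-modify-write with a simulated round trip between read and write.
async function addServerSketch(name: string): Promise<void> {
  const snapshot = { ...aggregate }; // read
  await new Promise<void>((resolve) => setTimeout(resolve, 5));
  snapshot[name] = 'config'; // modify
  aggregate = snapshot; // write (overwrites whatever was written meanwhile)
}

async function runLockDemo(): Promise<{ unlocked: number; locked: number }> {
  const names = ['a', 'b', 'c', 'd'];

  // Without a lock every call reads the same empty snapshot: last write wins.
  aggregate = {};
  await Promise.allSettled(names.map((n) => addServerSketch(n)));
  const unlocked = Object.keys(aggregate).length;

  // With the lock the calls are serialized, so each sees the previous write.
  aggregate = {};
  const lock = new WriteLock();
  await Promise.allSettled(names.map((n) => lock.run(() => addServerSketch(n))));
  const locked = Object.keys(aggregate).length;

  return { unlocked, locked };
}
```

In this model the unlocked run keeps one of four entries and the locked run keeps all four. The lock only serializes writers inside one process, so writers in other processes are not covered.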

* 🔧 fix: Rename perf benchmark to *.manual.spec.ts to exclude from all CI

The cache_integration pattern is picked up by test:cache-integration:mcp
in CI. Rename to *.manual.spec.ts which isn't matched by any CI runner.

* test: Add cache integration tests for ServerConfigsCacheRedisAggregateKey

Tests against a live Redis instance covering:
- CRUD operations (add, get, update, remove)
- getAll with empty/populated cache
- Duplicate add rejection, missing update/remove errors
- Concurrent write safety (20 parallel adds without data loss)
- Concurrent read safety (50 parallel getAll calls)
- Reset clears all configs

* 🔧 fix: Rename perf benchmark to *.manual.spec.ts to exclude from all CI

The perf benchmark file was renamed to *.manual.spec.ts but no
testPathIgnorePatterns existed for that convention. Add .*manual\.spec\.
to both test and test:ci scripts, plus jest.config.mjs, so manual-only
tests never run in CI unit test jobs.
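
As a quick sanity check of what the updated ignore pattern excludes, the alternation from the `test` and `test:ci` scripts can be evaluated directly. The regular expression below is copied from those scripts after JSON unescaping. The `isIgnored` helper and the file paths are hypothetical examples.

```typescript
// Alternation from the updated "test"/"test:ci" scripts, JSON-unescaped.
const ignorePattern = new RegExp(
  '\\.*integration\\.|\\.*helper\\.|__tests__/helpers/|\\.*manual\\.spec\\.',
);

// Jest treats each ignore pattern as a regex tested against the file path.
function isIgnored(testPath: string): boolean {
  return ignorePattern.test(testPath);
}

// Hypothetical paths, for illustration only.
const samplePaths = [
  'src/mcp/example.manual.spec.ts', // ignored: matches manual\.spec\.
  'src/mcp/example.cache_integration.spec.ts', // ignored: matches integration\.
  'src/mcp/example.spec.ts', // not ignored: runs in the unit test jobs
];

const ignoredFlags = samplePaths.map(isIgnored);
```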

* fix: Address review findings for aggregate key cache

- Add successCheck() to all write paths (add/update/remove) so Redis
  SET failures throw instead of being silently swallowed
- Override reset() to use targeted cache.delete(AGGREGATE_KEY) instead
  of inherited SCAN-based cache.clear() — consistent with eliminating
  SCAN operations
- Document cross-instance write race invariant in class JSDoc: the
  promise-based writeLock is process-local only; callers must enforce
  single-writer semantics externally (leader-only init)
- Use definite-assignment assertion (let resolve!:) instead of non-null
  assertion at call site
- Fix import type convention in integration test
- Verify Promise.allSettled rejections explicitly in concurrent write test
- Fix broken run command in benchmark file header
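
For readers unfamiliar with the definite-assignment assertion mentioned above, here is a generic sketch of the pattern. `createDeferred` is a hypothetical helper, not code from this change. The `!` after the variable name tells TypeScript the variable is assigned before use, which holds because the Promise executor runs synchronously.

```typescript
// Hypothetical helper showing `let resolve!:` in a deferred-promise pattern.
function createDeferred(): { promise: Promise<void>; resolve: () => void } {
  // Definite-assignment assertion: assigned inside the executor below,
  // which runs synchronously during `new Promise(...)`.
  let resolve!: () => void;
  const promise = new Promise<void>((r) => {
    resolve = r;
  });
  // No non-null assertion (`resolve!`) is needed where it is used.
  return { promise, resolve };
}

// Resolves the deferred and reports whether the promise settled.
async function runDeferredDemo(): Promise<boolean> {
  const deferred = createDeferred();
  let settled = false;
  const waiter = deferred.promise.then(() => {
    settled = true;
  });
  deferred.resolve();
  await waiter;
  return settled;
}
```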

* style: Fix import ordering per AGENTS.md convention

Local/project imports sorted longest to shortest.

* chore: Update import ordering and clean up unused imports in MCPServersRegistry.ts

* chore: import order

* chore: import order
Danny Avila 2026-03-26 14:44:31 -04:00 committed by GitHub
parent 1123f96e6a
commit 359cc63b41
8 changed files with 779 additions and 43 deletions

@@ -18,8 +18,8 @@
 "build:dev": "npm run clean && NODE_ENV=development rollup -c --bundleConfigAsCjs",
 "build:watch": "NODE_ENV=development rollup -c -w --bundleConfigAsCjs",
 "build:watch:prod": "rollup -c -w --bundleConfigAsCjs",
-"test": "jest --coverage --watch --testPathIgnorePatterns=\"\\.*integration\\.|\\.*helper\\.|__tests__/helpers/\"",
-"test:ci": "jest --coverage --ci --testPathIgnorePatterns=\"\\.*integration\\.|\\.*helper\\.|__tests__/helpers/\"",
+"test": "jest --coverage --watch --testPathIgnorePatterns=\"\\.*integration\\.|\\.*helper\\.|__tests__/helpers/|\\.*manual\\.spec\\.\"",
+"test:ci": "jest --coverage --ci --testPathIgnorePatterns=\"\\.*integration\\.|\\.*helper\\.|__tests__/helpers/|\\.*manual\\.spec\\.\"",
 "test:cache-integration:core": "jest --testPathPatterns=\"src/cache/.*\\.cache_integration\\.spec\\.ts$\" --coverage=false",
 "test:cache-integration:cluster": "jest --testPathPatterns=\"src/cluster/.*\\.cache_integration\\.spec\\.ts$\" --coverage=false --runInBand",
 "test:cache-integration:mcp": "jest --testPathPatterns=\"src/mcp/.*\\.cache_integration\\.spec\\.ts$\" --coverage=false",