🩹 fix: MCP Server Recovery from Startup Inspection Failures (#12145)
Some checks are pending
Docker Dev Branch Images Build / build (Dockerfile, lc-dev, node) (push) Waiting to run
Docker Dev Branch Images Build / build (Dockerfile.multi, lc-dev-api, api-build) (push) Waiting to run
Docker Dev Images Build / build (Dockerfile, librechat-dev, node) (push) Waiting to run
Docker Dev Images Build / build (Dockerfile.multi, librechat-dev-api, api-build) (push) Waiting to run
Sync Locize Translations & Create Translation PR / Sync Translation Keys with Locize (push) Waiting to run
Sync Locize Translations & Create Translation PR / Create Translation PR on Version Published (push) Blocked by required conditions

* feat: MCP server reinitialization recovery mechanism

- Added functionality to store a stub configuration for MCP servers that fail inspection at startup, allowing for recovery via reinitialization.
- Introduced `reinspectServer` method in `MCPServersRegistry` to handle reinspection of previously failed servers.
- Enhanced `MCPServersInitializer` to log and manage server initialization failures, ensuring proper handling of inspection failures.
- Added integration tests to verify the recovery process for unreachable MCP servers, ensuring that stub configurations are stored and can be reinitialized successfully.
- Updated type definitions to include `inspectionFailed` flag in server configurations for better state management.

* fix: MCP server handling for inspection failures

- Updated `reinitMCPServer` to return a structured response when the server is unreachable, providing clearer feedback on the failure.
- Modified `ConnectionsRepository` to prevent connections to servers marked as inspection failed, improving error handling.
- Adjusted `MCPServersRegistry` methods to ensure proper management of server states, including throwing errors for non-failed servers during reinspection.
- Enhanced integration tests to validate the behavior of the system when dealing with unreachable MCP servers and inspection failures, ensuring robust recovery mechanisms.

* fix: Clear all cached server configurations in MCPServersRegistry

- Added a comment to clarify the necessity of clearing all cached server configurations when updating a server's configuration, as the cache is keyed by userId without a reverse index for enumeration.

* fix: Update integration test for file_tools_server inspection handling

- Modified the test to verify that the `file_tools_server` is stored as a stub when inspection fails, ensuring it can be reinitialized correctly.
- Adjusted expectations to confirm that the `inspectionFailed` flag is set to true for the stub configuration, enhancing the robustness of the recovery mechanism.

* test: Add unit tests for reinspecting servers in MCPServersRegistry

- Introduced tests for the `reinspectServer` method to validate error handling when called on a healthy server and when the server does not exist.
- Ensured that appropriate exceptions are thrown for both scenarios, enhancing the robustness of server state management.

* test: Add integration test for concurrent reinspectServer calls

- Introduced a new test to validate that multiple concurrent calls to reinspectServer do not crash or corrupt the server state.
- Ensured that at least one call succeeds and any failures are due to the server not being in a failed state, enhancing the reliability of the reinitialization process.

* test: Enhance integration test for concurrent MCP server reinitialization

- Added a new test to validate that concurrent calls to reinitialize the MCP server do not crash or corrupt the server state.
- Ensured that at least one call succeeds and that failures are handled gracefully, improving the reliability of the reinitialization process.
- Reset MCPManager instance after each test to maintain a clean state for subsequent tests.
This commit is contained in:
Danny Avila 2026-03-08 21:49:04 -04:00 committed by GitHub
parent 8b18a16446
commit 32cadb1cc5
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
9 changed files with 627 additions and 15 deletions

View file

@ -1,8 +1,8 @@
const { logger } = require('@librechat/data-schemas');
const { CacheKeys, Constants } = require('librechat-data-provider');
const { getMCPManager, getMCPServersRegistry, getFlowStateManager } = require('~/config');
const { findToken, createToken, updateToken, deleteTokens } = require('~/models');
const { updateMCPServerTools } = require('~/server/services/Config');
const { getMCPManager, getFlowStateManager } = require('~/config');
const { getLogStores } = require('~/cache');
/**
@ -41,6 +41,33 @@ async function reinitMCPServer({
let oauthUrl = null;
try {
const registry = getMCPServersRegistry();
const serverConfig = await registry.getServerConfig(serverName, user?.id);
if (serverConfig?.inspectionFailed) {
logger.info(
`[MCP Reinitialize] Server ${serverName} had failed inspection, attempting reinspection`,
);
try {
const storageLocation = serverConfig.dbId ? 'DB' : 'CACHE';
await registry.reinspectServer(serverName, storageLocation, user?.id);
logger.info(`[MCP Reinitialize] Reinspection succeeded for server: ${serverName}`);
} catch (reinspectError) {
logger.error(
`[MCP Reinitialize] Reinspection failed for server ${serverName}:`,
reinspectError,
);
return {
availableTools: null,
success: false,
message: `MCP server '${serverName}' is still unreachable`,
oauthRequired: false,
serverName,
oauthUrl: null,
tools: null,
};
}
}
const customUserVars = userMCPAuthMap?.[`${Constants.mcp_prefix}${serverName}`];
const flowManager = _flowManager ?? getFlowStateManager(getLogStores(CacheKeys.FLOWS));
const mcpManager = getMCPManager();