LibreChat/packages/data-schemas/misc/ferretdb/ferretdb-multitenancy-plan.md
Danny Avila a06aba9ef4
🐘 feat: FerretDB Compatibility (#11769)
* feat: replace unsupported MongoDB aggregation operators for FerretDB compatibility

Replace $lookup, $unwind, $sample, $replaceRoot, and $addFields aggregation
stages which are unsupported on FerretDB v2.x (postgres-documentdb backend).

- Prompt.js: Replace $lookup/$unwind/$project pipelines with find().select().lean()
  + attachProductionPrompts() batch helper. Replace $group/$replaceRoot/$sample
  in getRandomPromptGroups with distinct() + Fisher-Yates shuffle.
- Agent/Prompt migration scripts: Replace $lookup anti-join pattern with
  distinct() + $nin two-step queries for finding un-migrated resources.

All replacement patterns verified against FerretDB v2.7.0.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix: use $pullAll for simple array removals, fix memberIds type mismatches

Replace $pull with $pullAll for exact-value scalar array removals. Both
operators work on MongoDB and FerretDB, but $pullAll is more explicit for
exact matching (no condition expressions).

Fix critical type mismatch bugs where ObjectId values were used against
String[] memberIds arrays in Group queries:
- config/delete-user.js: use string uid instead of ObjectId user._id
- e2e/setup/cleanupUser.ts: convert userId.toString() before query

Harden PermissionService.bulkUpdateResourcePermissions abort handling to
prevent crash when abortTransaction is called after commitTransaction.

All changes verified against FerretDB v2.7.0 and MongoDB Memory Server.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix: harden transaction support probe for FerretDB compatibility

Commit the transaction before aborting in supportsTransactions probe, and
wrap abortTransaction in try-catch to prevent crashes when abort is called
after a successful commit (observed behavior on FerretDB).

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat: add FerretDB compatibility test suite, retry utilities, and CI config

Add comprehensive FerretDB integration test suite covering:
- $pullAll scalar array operations
- $pull with subdocument conditions
- $lookup replacement (find + manual join)
- $sample replacement (distinct + Fisher-Yates)
- $bit and $bitsAllSet operations
- Migration anti-join pattern
- Multi-tenancy (useDb, scaling, write amplification)
- Sharding proof-of-concept
- Production operations (backup/restore, schema migration, deadlock retry)

Add production retryWithBackoff utility for deadlock recovery during
concurrent index creation on FerretDB/DocumentDB backends.

Add UserController.spec.js tests for deleteUserController (runs in CI).

Configure jest and eslint to isolate FerretDB tests from CI pipelines:
- packages/data-schemas/jest.config.mjs: ignore misc/ directory
- eslint.config.mjs: ignore packages/data-schemas/misc/

Include Docker Compose config for local FerretDB v2.7 + postgres-documentdb,
dedicated jest/tsconfig for the test files, and multi-tenancy findings doc.

Co-authored-by: Cursor <cursoragent@cursor.com>

* style: brace formatting in aclEntry.ts modifyPermissionBits

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor: reorganize retry utilities and update imports

- Moved retryWithBackoff utility to a new file `retry.ts` for better structure.
- Updated imports in `orgOperations.ferretdb.spec.ts` to reflect the new location of retry utilities.
- Removed old import statement for retryWithBackoff from index.ts to streamline exports.

* test: add $pullAll coverage for ConversationTag and PermissionService

Add integration tests for deleteConversationTag verifying $pullAll
removes tags from conversations correctly, and for
syncUserEntraGroupMemberships verifying $pullAll removes user from
non-matching Entra groups while preserving local group membership.

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-26 20:29:18 -05:00


FerretDB Multi-Tenancy Plan

Status: Active Investigation

Goal

Database-per-org data isolation using FerretDB (PostgreSQL-backed) with horizontal sharding across multiple FerretDB+Postgres pairs. MongoDB and AWS DocumentDB are not options.


Findings

1. FerretDB Architecture (DocumentDB Backend)

FerretDB with postgres-documentdb does not create separate PostgreSQL schemas per MongoDB database. All data lives in a single documentdb_data PG schema:

  • Each MongoDB collection → documents_<id> + retry_<id> table pair
  • Catalog tracked in documentdb_api_catalog.collections and .collection_indexes
  • mongoose.connection.useDb('org_X') creates a logical database in DocumentDB's catalog

Implication: No PG-level schema isolation, but logical isolation is enforced by FerretDB's wire protocol layer. Backup/restore must go through FerretDB, not raw pg_dump.
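The useDb mapping above can be sketched as a small cached resolver. This is a hypothetical sketch: `makeOrgDbResolver` and `openDb` are illustrative names, with `mongoose.connection.useDb(name, { useCache: true })` as the real factory in practice.

```typescript
// Hypothetical sketch: cache one logical-DB handle per org so repeated
// requests reuse the same useDb() result. `openDb` stands in for
// mongoose.connection.useDb(name, { useCache: true }).
type DbFactory<T> = (name: string) => T;

function makeOrgDbResolver<T>(openDb: DbFactory<T>): (orgId: string) => T {
  const cache = new Map<string, T>();
  return (orgId: string): T => {
    const name = `org_${orgId}`; // logical database name in DocumentDB's catalog
    let db = cache.get(name);
    if (db === undefined) {
      db = openDb(name);
      cache.set(name, db);
    }
    return db;
  };
}
```

With mongoose, the factory would be `(name) => mongoose.connection.useDb(name, { useCache: true })`, so model registration per logical database happens at most once.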

2. Schema & Index Compatibility

All 29 LibreChat Mongoose models and 98 custom indexes work on FerretDB v2.7.0:

| Index Type | Count | Status |
|---|---|---|
| Sparse + unique | 9 (User OAuth IDs) | Working |
| TTL (expireAfterSeconds) | 8 models | Working |
| partialFilterExpression | 2 (File, Group) | Working |
| Compound unique | 5+ | Working |
| Concurrent creation | All 29 models | No deadlock (single org) |

3. Scaling Curve (Empirically Tested)

| Orgs | Collections | Catalog Indexes | Data Tables | pg_class | Init/org | Query avg | Query p95 |
|---|---|---|---|---|---|---|---|
| 10 | 450 | 1,920 | 900 | 5,975 | 501ms | 1.03ms | 1.44ms |
| 50 | 1,650 | 7,040 | 3,300 | 20,695 | 485ms | 1.00ms | 1.46ms |
| 100 | 3,150 | 13,440 | 6,300 | 39,095 | 483ms | 0.83ms | 1.13ms |

Key finding: Init time and query latency are flat through 100 orgs. No degradation.

4. Write Amplification

Writes to the User model (11+ indexes) took 1.11x as long as writes to a zero-index collection — only 11% overhead. DocumentDB's JSONB index management is efficient.

5. Sharding PoC

Tenant router proven with:

  • Pool assignment with capacity limits (fill-then-spill)
  • Warm cache routing overhead: 0.001ms (sub-microsecond)
  • Cold routing (DB lookup + connection + model registration): 6ms
  • Cross-pool data isolation confirmed
  • Express middleware pattern (req.getModel('User')) works transparently
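The fill-then-spill assignment and warm-cache routing above can be sketched as follows. `Pool`, `assignOrg`, and the in-memory `assignments` map are illustrative names, not the actual router implementation.

```typescript
// Hypothetical sketch of the tenant router's fill-then-spill assignment:
// each FerretDB+Postgres pool accepts orgs up to its capacity before the
// next pool is used. A Map lookup serves the warm (already-assigned) path.
interface Pool {
  id: string;
  capacity: number;
  orgs: string[];
}

const assignments = new Map<string, string>(); // orgId -> poolId (warm cache)

function assignOrg(orgId: string, pools: Pool[]): string {
  const cached = assignments.get(orgId);
  if (cached !== undefined) return cached; // warm path: sub-microsecond map hit
  const pool = pools.find((p) => p.orgs.length < p.capacity); // fill-then-spill
  if (pool === undefined) throw new Error("all pools at capacity");
  pool.orgs.push(orgId);
  assignments.set(orgId, pool.id);
  return pool.id;
}
```

Cold routing in the real PoC additionally opens the connection and registers models, which is where the measured 6ms goes; the assignment logic itself is the cheap part.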

6. Scaling Thresholds

| Org Count | Postgres Instances | Notes |
|---|---|---|
| 1–300 | 1 | Default config |
| 300–700 | 1 | Tune autovacuum, PgBouncer, shared_buffers |
| 700–1,000 | 1–2 | Split when monitoring signals pressure |
| 1,000+ | N (~500 each) | One FerretDB+Postgres pair per ~500 orgs |

7. Deadlock Behavior

  • Single org, concurrent index creation: No deadlock (DocumentDB handles it)
  • Bulk provisioning (10 orgs sequential): Deadlock occurred on Pool B, recovered via retry
  • Production requirement: Exponential backoff + jitter retry on createIndexes()
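A minimal sketch of the required retry pattern, assuming full jitter with the 100ms base and 10s cap quoted later in this doc; the real utility lives at packages/data-schemas/src/utils/retryWithBackoff.ts and may differ in detail.

```typescript
// Hedged sketch of exponential backoff + jitter for deadlock recovery.
// Assumed parameters: 100ms base, 10s cap, full jitter.
const BASE_MS = 100;
const CAP_MS = 10_000;

/** Full jitter: uniform in [0, min(cap, base * 2^attempt)). */
function backoffDelayMs(attempt: number): number {
  const ceiling = Math.min(CAP_MS, BASE_MS * 2 ** attempt);
  return Math.random() * ceiling;
}

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err; // e.g. a Postgres deadlock surfaced through createIndexes()
      await new Promise((r) => setTimeout(r, backoffDelayMs(attempt)));
    }
  }
  throw lastErr;
}
```

Full jitter (rather than equal or decorrelated jitter) is the simplest way to de-synchronize a sequential batch of org provisions that all contend on the same catalog tables.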

Open Items

A. Production Deadlock Retry

  • Build retryWithBackoff utility with exponential backoff + jitter
  • Integrate into initializeOrgCollections and migrateOrg scripts
  • Tested against FerretDB — real deadlocks detected and recovered:
    • retry_4 hit a deadlock on createIndexes(User), recovered via backoff (1,839ms total)
    • retry_5 also hit retry path (994ms vs ~170ms clean)
    • Production utility at packages/data-schemas/src/utils/retryWithBackoff.ts

B. Per-Org Backup/Restore

  • mongodump/mongorestore CLI not available — tested programmatic driver-level approach
  • Backup: listCollections() → find({}).toArray() per collection → in-memory OrgBackup struct
  • Restore: collection.insertMany(docs) per collection into fresh org database
  • BSON type preservation verified: ObjectId, Date, String all round-trip correctly
  • Data integrity verified: _id values, field values, document counts match exactly
  • Performance: Backup 24ms, Restore 15ms (8 docs across 29 collections)
  • Scales linearly with document count — no per-collection overhead beyond the query
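The backup/restore loop above can be sketched as below. `CollectionLike` is an illustrative stand-in for the driver's Collection so the shape can be shown without a live FerretDB; the real code would use db.listCollections() and collection.find({}).toArray().

```typescript
// Hedged sketch of driver-level per-org backup/restore.
// `CollectionLike.dump` stands in for find({}).toArray().
interface CollectionLike {
  name: string;
  dump(): Promise<Record<string, unknown>[]>;
  insertMany(docs: Record<string, unknown>[]): Promise<void>;
}

// In-memory backup struct: collection name -> documents.
type OrgBackup = Record<string, Record<string, unknown>[]>;

async function backupOrg(collections: CollectionLike[]): Promise<OrgBackup> {
  const backup: OrgBackup = {};
  for (const c of collections) backup[c.name] = await c.dump();
  return backup;
}

async function restoreOrg(
  backup: OrgBackup,
  target: (name: string) => CollectionLike,
): Promise<void> {
  for (const [name, docs] of Object.entries(backup)) {
    if (docs.length > 0) await target(name).insertMany(docs); // skip empty collections
  }
}
```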

C. Schema Migration Across Orgs

  • createIndexes() is idempotent — re-init took 86ms with 12 indexes unchanged
  • New collection propagation: Added AuditLog collection with 4 indexes to 5 orgs — 109ms total
  • New index propagation: Added compound {username:1, createdAt:-1} index to users across 5 orgs — 22ms total
  • Full migration run: 5 orgs × 29 models = 88ms/org average (with deadlock retry)
  • Data preservation confirmed: All existing user data intact after migration
  • Extrapolating: 1,000 orgs × 88ms/org = ~88 seconds for a full migration sweep

Test Files

| File | Purpose |
|---|---|
| `packages/data-schemas/src/methods/multiTenancy.ferretdb.spec.ts` | 5-phase benchmark (useDb mapping, indexes, scaling, write amp, shared collection) |
| `packages/data-schemas/src/methods/sharding.ferretdb.spec.ts` | Sharding PoC (router, assignment, isolation, middleware pattern) |
| `packages/data-schemas/src/methods/orgOperations.ferretdb.spec.ts` | Production operations (backup/restore, migration, deadlock retry) |
| `packages/data-schemas/src/utils/retryWithBackoff.ts` | Production retry utility |

Docker

| File | Purpose |
|---|---|
| `docker-compose.ferretdb.yml` | Single FerretDB + Postgres (dev/test) |

Detailed Empirical Results

Deadlock Retry Behavior

The retryWithBackoff utility was exercised under real FerretDB load. Key observations:

| Scenario | Attempts | Total Time | Notes |
|---|---|---|---|
| Clean org init (no contention) | 1 | 165-199ms | Most orgs complete in one shot |
| Deadlock on User indexes | 2 | 994ms | Single retry recovers cleanly |
| Deadlock with compounding retries | 2-3 | 1,839ms | Worst case in 5-org sequential batch |

The User model (11+ indexes including 9 sparse unique) is the most deadlock-prone collection. The retry utility's exponential backoff with jitter (100ms base, 10s cap) handles this gracefully.

Backup/Restore Round-Trip

Tested with a realistic org containing 4 populated collections:

| Operation | Time | Details |
|---|---|---|
| Backup (full org) | 24ms | 8 docs across 29 collections (25 empty) |
| Restore (to new org) | 15ms | Including insertMany() for each collection |
| Index re-creation | ~500ms | Separate initializeOrgCollections call |

Round-trip verified:

  • _id (ObjectId) preserved exactly
  • createdAt / updatedAt (Date) preserved
  • String, Number, ObjectId ref fields preserved
  • Document counts match source

For larger orgs (thousands of messages/conversations), backup time scales linearly with document count. The bottleneck is network I/O to FerretDB, not serialization.

Schema Migration Performance

| Operation | Time | Per Org |
|---|---|---|
| Idempotent re-init (no changes) | 86ms | 86ms |
| New collection + 4 indexes | 109ms | 22ms/org |
| New compound index on users | 22ms | 4.4ms/org |
| Full migration sweep (29 models) | 439ms | 88ms/org |

Migration is safe to run while the app is serving traffic — createIndexes and createCollection are non-blocking operations that don't lock existing data.

5-Org Provisioning with Production Retry

retry_1: 193ms (29 models) — clean
retry_2: 199ms (29 models) — clean
retry_3: 165ms (29 models) — clean
retry_4: 1839ms (29 models) — deadlock on User indexes, recovered
retry_5: 994ms (29 models) — deadlock on User indexes, recovered
Total: 3,390ms for 5 orgs (678ms avg, but 199ms median)

Production Recommendations

1. Org Provisioning

Use initializeOrgCollections() from packages/data-schemas/src/utils/retryWithBackoff.ts for all new org setup. Process orgs in batches of 10 with Promise.all() to parallelize across pools while minimizing per-pool contention.
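A sketch of that batching pattern: `chunk` and `provisionAll` are illustrative names, and `provisionOrg` stands in for initializeOrgCollections() with the deadlock retry described above.

```typescript
// Hypothetical sketch: parallel within a batch, sequential across batches.
// Bounding concurrency per pool limits the concurrent index creation
// where deadlocks were observed.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

async function provisionAll(
  orgIds: string[],
  provisionOrg: (orgId: string) => Promise<void>,
  batchSize = 10,
): Promise<void> {
  for (const batch of chunk(orgIds, batchSize)) {
    await Promise.all(batch.map(provisionOrg)); // one batch at a time
  }
}
```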

2. Backup Strategy

Implement driver-level backup (not mongodump):

  • Enumerate collections via listCollections()
  • Stream documents via find({}).batchSize(1000) for large collections
  • Write to object storage (S3/GCS) as NDJSON per collection
  • Restore via insertMany() in batches of 1,000
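The NDJSON step can be sketched as below, assuming plain-JSON-safe documents. A real backup must preserve BSON types (ObjectId, Date), e.g. via an extended-JSON encoding, so plain JSON.stringify is used here only to show the per-collection shape.

```typescript
// Hedged sketch: one NDJSON file per collection, one document per line.
// Assumes documents are plain-JSON-safe; BSON types need extended JSON.
interface Doc { _id: string; [key: string]: unknown }

function toNdjson(docs: Doc[]): string {
  return docs.map((d) => JSON.stringify(d)).join("\n");
}

function fromNdjson(ndjson: string): Doc[] {
  if (ndjson === "") return []; // empty collection -> empty file
  return ndjson.split("\n").map((line) => JSON.parse(line) as Doc);
}
```

NDJSON keeps restores streamable: each line is an independent document, so a restore can read, batch, and insertMany() without loading the whole file.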

3. Schema Migrations

Run migrateAllOrgs() as a deployment step:

  • Enumerate all org databases from the assignment table
  • For each org: register models, createCollection(), createIndexesWithRetry()
  • createIndexes() is idempotent — safe to re-run
  • At 88ms/org, 1,000 orgs complete in ~90 seconds
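A hedged sketch of such a migrateAllOrgs deployment step. `OrgOps` is an illustrative seam so the sweep can be shown without a live database; the real code would register mongoose models and call createCollection()/createIndexes() per org.

```typescript
// Hypothetical sketch of the per-org migration sweep described above.
interface OrgOps {
  listOrgs(): Promise<string[]>;             // from the assignment table
  ensureCollections(orgId: string): Promise<void>;
  ensureIndexes(orgId: string): Promise<void>; // idempotent, retried on deadlock
}

async function migrateAllOrgs(ops: OrgOps): Promise<number> {
  const orgs = await ops.listOrgs();
  for (const orgId of orgs) {
    await ops.ensureCollections(orgId);
    await ops.ensureIndexes(orgId); // safe to re-run: createIndexes() is idempotent
  }
  return orgs.length;
}
```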

4. Monitoring

Track per-org provisioning and migration times. If the median provisioning time rises above 500ms/org, investigate PostgreSQL catalog pressure:

  • pg_stat_user_tables.n_dead_tup for autovacuum health
  • pg_stat_bgwriter.buffers_backend for buffer pressure
  • documentdb_api_catalog.collections count for total table count