* feat: replace unsupported MongoDB aggregation operators for FerretDB compatibility Replace $lookup, $unwind, $sample, $replaceRoot, and $addFields aggregation stages which are unsupported on FerretDB v2.x (postgres-documentdb backend). - Prompt.js: Replace $lookup/$unwind/$project pipelines with find().select().lean() + attachProductionPrompts() batch helper. Replace $group/$replaceRoot/$sample in getRandomPromptGroups with distinct() + Fisher-Yates shuffle. - Agent/Prompt migration scripts: Replace $lookup anti-join pattern with distinct() + $nin two-step queries for finding un-migrated resources. All replacement patterns verified against FerretDB v2.7.0. Co-authored-by: Cursor <cursoragent@cursor.com> * fix: use $pullAll for simple array removals, fix memberIds type mismatches Replace $pull with $pullAll for exact-value scalar array removals. Both operators work on MongoDB and FerretDB, but $pullAll is more explicit for exact matching (no condition expressions). Fix critical type mismatch bugs where ObjectId values were used against String[] memberIds arrays in Group queries: - config/delete-user.js: use string uid instead of ObjectId user._id - e2e/setup/cleanupUser.ts: convert userId.toString() before query Harden PermissionService.bulkUpdateResourcePermissions abort handling to prevent crash when abortTransaction is called after commitTransaction. All changes verified against FerretDB v2.7.0 and MongoDB Memory Server. Co-authored-by: Cursor <cursoragent@cursor.com> * fix: harden transaction support probe for FerretDB compatibility Commit the transaction before aborting in supportsTransactions probe, and wrap abortTransaction in try-catch to prevent crashes when abort is called after a successful commit (observed behavior on FerretDB). Co-authored-by: Cursor <cursoragent@cursor.com> * feat: add FerretDB compatibility test suite, retry utilities, and CI config Add comprehensive FerretDB integration test suite covering: - $pullAll scalar array operations - $pull with subdocument conditions - $lookup replacement (find + manual join) - $sample replacement (distinct + Fisher-Yates) - $bit and $bitsAllSet operations - Migration anti-join pattern - Multi-tenancy (useDb, scaling, write amplification) - Sharding proof-of-concept - Production operations (backup/restore, schema migration, deadlock retry) Add production retryWithBackoff utility for deadlock recovery during concurrent index creation on FerretDB/DocumentDB backends. Add UserController.spec.js tests for deleteUserController (runs in CI). Configure jest and eslint to isolate FerretDB tests from CI pipelines: - packages/data-schemas/jest.config.mjs: ignore misc/ directory - eslint.config.mjs: ignore packages/data-schemas/misc/ Include Docker Compose config for local FerretDB v2.7 + postgres-documentdb, dedicated jest/tsconfig for the test files, and multi-tenancy findings doc. Co-authored-by: Cursor <cursoragent@cursor.com> * style: brace formatting in aclEntry.ts modifyPermissionBits Co-authored-by: Cursor <cursoragent@cursor.com> * refactor: reorganize retry utilities and update imports - Moved retryWithBackoff utility to a new file `retry.ts` for better structure. - Updated imports in `orgOperations.ferretdb.spec.ts` to reflect the new location of retry utilities. - Removed old import statement for retryWithBackoff from index.ts to streamline exports. * test: add $pullAll coverage for ConversationTag and PermissionService Add integration tests for deleteConversationTag verifying $pullAll removes tags from conversations correctly, and for syncUserEntraGroupMemberships verifying $pullAll removes user from non-matching Entra groups while preserving local group membership. --------- Co-authored-by: Cursor <cursoragent@cursor.com>
8.9 KiB
FerretDB Multi-Tenancy Plan
Status: Active Investigation
Goal
Database-per-org data isolation using FerretDB (PostgreSQL-backed) with horizontal sharding across multiple FerretDB+Postgres pairs. MongoDB and AWS DocumentDB are not options.
Findings
1. FerretDB Architecture (DocumentDB Backend)
FerretDB with postgres-documentdb does not create separate PostgreSQL schemas per MongoDB database. All data lives in a single documentdb_data PG schema:
- Each MongoDB collection →
documents_<id>+retry_<id>table pair - Catalog tracked in
documentdb_api_catalog.collectionsand.collection_indexes mongoose.connection.useDb('org_X')creates a logical database in DocumentDB's catalog
Implication: No PG-level schema isolation, but logical isolation is enforced by FerretDB's wire protocol layer. Backup/restore must go through FerretDB, not raw pg_dump.
2. Schema & Index Compatibility
All 29 LibreChat Mongoose models and 98 custom indexes work on FerretDB v2.7.0:
| Index Type | Count | Status |
|---|---|---|
| Sparse + unique | 9 (User OAuth IDs) | Working |
| TTL (expireAfterSeconds) | 8 models | Working |
| partialFilterExpression | 2 (File, Group) | Working |
| Compound unique | 5+ | Working |
| Concurrent creation | All 29 models | No deadlock (single org) |
3. Scaling Curve (Empirically Tested)
| Orgs | Collections | Catalog Indexes | Data Tables | pg_class | Init/org | Query avg | Query p95 |
|---|---|---|---|---|---|---|---|
| 10 | 450 | 1,920 | 900 | 5,975 | 501ms | 1.03ms | 1.44ms |
| 50 | 1,650 | 7,040 | 3,300 | 20,695 | 485ms | 1.00ms | 1.46ms |
| 100 | 3,150 | 13,440 | 6,300 | 39,095 | 483ms | 0.83ms | 1.13ms |
Key finding: Init time and query latency are flat through 100 orgs. No degradation.
4. Write Amplification
User model (11+ indexes) vs zero-index collection: 1.11x — only 11% overhead. DocumentDB's JSONB index management is efficient.
5. Sharding PoC
Tenant router proven with:
- Pool assignment with capacity limits (fill-then-spill)
- Warm cache routing overhead: 0.001ms (sub-microsecond)
- Cold routing (DB lookup + connection + model registration): 6ms
- Cross-pool data isolation confirmed
- Express middleware pattern (
req.getModel('User')) works transparently
6. Scaling Thresholds
| Org Count | Postgres Instances | Notes |
|---|---|---|
| 1–300 | 1 | Default config |
| 300–700 | 1 | Tune autovacuum, PgBouncer, shared_buffers |
| 700–1,000 | 1-2 | Split when monitoring signals pressure |
| 1,000+ | N / ~500 each | One FerretDB+Postgres pair per ~500 orgs |
7. Deadlock Behavior
- Single org, concurrent index creation: No deadlock (DocumentDB handles it)
- Bulk provisioning (10 orgs sequential): Deadlock occurred on Pool B, recovered via retry
- Production requirement: Exponential backoff + jitter retry on
createIndexes()
Open Items
A. Production Deadlock Retry ✅
- Build
retryWithBackoffutility with exponential backoff + jitter - Integrate into
initializeOrgCollectionsandmigrateOrgscripts - Tested against FerretDB — real deadlocks detected and recovered:
retry_4hit a deadlock oncreateIndexes(User), recovered via backoff (1,839ms total)retry_5also hit retry path (994ms vs ~170ms clean)- Production utility at
packages/data-schemas/src/utils/retryWithBackoff.ts
B. Per-Org Backup/Restore ✅
mongodump/mongorestoreCLI not available — tested programmatic driver-level approach- Backup:
listCollections()→find({}).toArray()per collection → in-memoryOrgBackupstruct - Restore:
collection.insertMany(docs)per collection into fresh org database - BSON type preservation verified: ObjectId, Date, String all round-trip correctly
- Data integrity verified:
_idvalues, field values, document counts match exactly - Performance: Backup 24ms, Restore 15ms (8 docs across 29 collections)
- Scales linearly with document count — no per-collection overhead beyond the query
C. Schema Migration Across Orgs ✅
createIndexes()is idempotent — re-init took 86ms with 12 indexes unchanged- New collection propagation: Added
AuditLogcollection with 4 indexes to 5 orgs — 109ms total - New index propagation: Added compound
{username:1, createdAt:-1}index tousersacross 5 orgs — 22ms total - Full migration run: 5 orgs × 29 models = 88ms/org average (with deadlock retry)
- Data preservation confirmed: All existing user data intact after migration
- Extrapolating: 1,000 orgs × 88ms/org = ~88 seconds for a full migration sweep
Test Files
| File | Purpose |
|---|---|
packages/data-schemas/src/methods/multiTenancy.ferretdb.spec.ts |
5-phase benchmark (useDb mapping, indexes, scaling, write amp, shared collection) |
packages/data-schemas/src/methods/sharding.ferretdb.spec.ts |
Sharding PoC (router, assignment, isolation, middleware pattern) |
packages/data-schemas/src/methods/orgOperations.ferretdb.spec.ts |
Production operations (backup/restore, migration, deadlock retry) |
packages/data-schemas/src/utils/retryWithBackoff.ts |
Production retry utility |
Docker
| File | Purpose |
|---|---|
docker-compose.ferretdb.yml |
Single FerretDB + Postgres (dev/test) |
Detailed Empirical Results
Deadlock Retry Behavior
The retryWithBackoff utility was exercised under real FerretDB load. Key observations:
| Scenario | Attempts | Total Time | Notes |
|---|---|---|---|
| Clean org init (no contention) | 1 | 165-199ms | Most orgs complete in one shot |
| Deadlock on User indexes | 2 | 994ms | Single retry recovers cleanly |
| Deadlock with compounding retries | 2-3 | 1,839ms | Worst case in 5-org sequential batch |
The User model (11+ indexes including 9 sparse unique) is the most deadlock-prone collection. The retry utility's exponential backoff with jitter (100ms base, 10s cap) handles this gracefully.
Backup/Restore Round-Trip
Tested with a realistic org containing 4 populated collections:
| Operation | Time | Details |
|---|---|---|
| Backup (full org) | 24ms | 8 docs across 29 collections (25 empty) |
| Restore (to new org) | 15ms | Including insertMany() for each collection |
| Index re-creation | ~500ms | Separate initializeOrgCollections call |
Round-trip verified:
_id(ObjectId) preserved exactlycreatedAt/updatedAt(Date) preserved- String, Number, ObjectId ref fields preserved
- Document counts match source
For larger orgs (thousands of messages/conversations), backup time scales linearly with document count. The bottleneck is network I/O to FerretDB, not serialization.
Schema Migration Performance
| Operation | Time | Per Org |
|---|---|---|
| Idempotent re-init (no changes) | 86ms | 86ms |
| New collection + 4 indexes | 109ms | 22ms/org |
| New compound index on users | 22ms | 4.4ms/org |
| Full migration sweep (29 models) | 439ms | 88ms/org |
Migration is safe to run while the app is serving traffic — createIndexes and createCollection are non-blocking operations that don't lock existing data.
5-Org Provisioning with Production Retry
retry_1: 193ms (29 models) — clean
retry_2: 199ms (29 models) — clean
retry_3: 165ms (29 models) — clean
retry_4: 1839ms (29 models) — deadlock on User indexes, recovered
retry_5: 994ms (29 models) — deadlock on User indexes, recovered
Total: 3,390ms for 5 orgs (678ms avg, but 165ms median)
Production Recommendations
1. Org Provisioning
Use initializeOrgCollections() from packages/data-schemas/src/utils/retryWithBackoff.ts for all new org setup. Process orgs in batches of 10 with Promise.all() to parallelize across pools while minimizing per-pool contention.
2. Backup Strategy
Implement driver-level backup (not mongodump):
- Enumerate collections via
listCollections() - Stream documents via
find({}).batchSize(1000)for large collections - Write to object storage (S3/GCS) as NDJSON per collection
- Restore via
insertMany()in batches of 1,000
3. Schema Migrations
Run migrateAllOrgs() as a deployment step:
- Enumerate all org databases from the assignment table
- For each org: register models,
createCollection(),createIndexesWithRetry() createIndexes()is idempotent — safe to re-run- At 88ms/org, 1,000 orgs complete in ~90 seconds
4. Monitoring
Track per-org provisioning and migration times. If the median provisioning time rises above 500ms/org, investigate PostgreSQL catalog pressure:
pg_stat_user_tables.n_dead_tupfor autovacuum healthpg_stat_bgwriter.buffers_backendfor buffer pressuredocumentdb_api_catalog.collectionscount for total table count