LibreChat/packages/data-provider
Pol Burkardt Freire 7e74165c3c
📖 feat: Add Native ODT Document Parser Support (#12303)
* fix: add ODT support to native document parser

* fix: replace execSync with jszip for ODT parsing

* docs: update documentParserMimeTypes comment to include odt

* fix: improve ODT XML extraction and add empty.odt fixture

- Scope extraction to <office:body> to exclude metadata/style nodes
- Map </text:p> and </text:h> closings to newlines, preserving paragraph
  structure instead of collapsing everything to a single line
- Handle <text:line-break/> as explicit newlines
- Strip remaining tags, normalize horizontal whitespace, cap consecutive
  blank lines at one
- Regenerate sample.odt as a two-paragraph fixture so the test exercises
  multi-paragraph output
- Add empty.odt fixture and test asserting 'No text found in document'

* fix: address review findings in ODT parser

- Use static `import JSZip from 'jszip'` instead of dynamic import;
  jszip is CommonJS-only with no ESM/Jest-isolation concern (F1)
- Decode the five standard XML entities after tag-stripping so
  documents with &, <, >, ", ' send correct text to the LLM (F2)
- Remove @types/jszip devDependency; jszip ships bundled declarations
  and @types/jszip is a stale 2020 stub that would shadow them (F3)
- Handle <text:tab/> → \t and <text:s .../> → ' ' before the generic
  tag stripper so tab-aligned and multi-space content is preserved (F4)
- Add sample-entities.odt fixture and test covering entity decoding,
  tab, and spacing-element handling (F5)
- Rename 'throws for empty odt' → 'throws for odt with no extractable
  text' to distinguish from a zero-byte/corrupt file case (F8)

* fix: add decompressed content size cap to odtToText (F6)

Reads uncompressed entry sizes from the JSZip internal metadata before
extracting any content. Throws if the total exceeds 50MB, preventing a
crafted ODT with a high-ratio compressed payload from exhausting heap.

Adds a corresponding test using a real DEFLATE-compressed ZIP (~51KB on
disk, 51MB uncompressed) to verify the guard fires before any extraction.

* fix: add java to codeTypeMapping for file upload support

.java files were rejected with "Unable to determine file type" because
browsers send an empty MIME type for them and codeTypeMapping had no
'java' entry for inferMimeType() to fall back on.

text/x-java was already present in all five validation lists
(fullMimeTypesList, codeInterpreterMimeTypesList, retrievalMimeTypesList,
textMimeTypes, retrievalMimeTypes), so mapping to it (not text/plain)
ensures .java uploads work for both File Search and Code Interpreter.

Closes #12307

* fix: address follow-up review findings (A-E)

A: regenerate package-lock.json after removing @types/jszip from
   package.json; without this npm ci was still installing the stale
   2020 type stubs and TypeScript was resolving against them
B: replace dynamic import('jszip') in the zip-bomb test with the same
   static import already used in production; jszip is CJS-only with no
   ESM/Jest isolation concern
C: document that the _data.uncompressedSize guard fails open if jszip
   renames the private field (accepted limitation, test would catch it)
D: rename 'preserves tabs' test to 'normalizes tab and spacing elements
   to spaces' since <text:tab> is collapsed to a space, not kept as \t
E: fix test.each([ formatting artifact (missing newline after '[')

---------

Co-authored-by: Danny Avila <danny@librechat.ai>
2026-03-19 15:49:52 -04:00
..
react-query 📦 chore: Bump Dependabot Packages (#11836) 2026-02-17 18:55:28 -05:00
specs 🧯 fix: Prevent Env-Variable Exfil. via Placeholder Injection (#12260) 2026-03-16 08:48:24 -04:00
src 📖 feat: Add Native ODT Document Parser Support (#12303) 2026-03-19 15:49:52 -04:00
.gitignore 🔄 refactor: Consolidate Ask/Edit Controllers (#1365) 2023-12-15 15:47:40 -05:00
babel.config.js chore: add back data-provider 2023-07-30 11:50:24 -04:00
check_updates.sh 🔧 feat: Initial MCP Support (Tools) (#5015) 2024-12-17 13:12:57 -05:00
jest.config.js refactor: Parallelize CI Workflows with Isolated Caching and Fan-Out Test Jobs (#12088) 2026-03-05 13:56:07 -05:00
package.json v0.8.4-rc1 (#12285) 2026-03-17 16:08:48 -04:00
rollup.config.js ⚙️ chore: Update Build Config due to Windows Tests (#9511) 2025-09-08 14:16:49 -04:00
server-rollup.config.js 🚀 feat: Add Code API Proxy Support and Update MCP SDK (#6203) 2025-03-06 12:47:59 -05:00
tsconfig.json feat: OAuth for Actions (#5693) 2025-02-10 15:56:08 -05:00
tsconfig.spec.json feat: Assistants API, General File Support, Side Panel, File Explorer (#1696) 2024-02-13 20:42:27 -05:00