refactor: Optimize & Standardize Tokenizer Usage (#10777)

* refactor: Token Limit Processing with Enhanced Efficiency

- Added a new test suite for `processTextWithTokenLimit`, covering text that falls under, lands exactly at, and exceeds the token limit (a sketch of this style of test follows this list).
- Refactored the `processTextWithTokenLimit` function to utilize a ratio-based estimation method, significantly reducing the number of token counting function calls compared to the previous binary search approach.
- Improved handling of edge cases and variable token density, ensuring accurate truncation and performance across diverse text inputs.
- Included direct comparisons with the old implementation to validate correctness and efficiency improvements.
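
A minimal sketch of the kind of coverage described above, assuming a Jest-style harness; the import path and the characters-per-token heuristic are illustrative, not the repo's actual test code:

```ts
import { processTextWithTokenLimit } from '@librechat/api'; // path assumed for illustration

// Hypothetical counter: roughly 4 characters per token, rounded up.
const tokenCountFn = (text: string) => Math.ceil(text.length / 4);

describe('processTextWithTokenLimit', () => {
  it('returns text unchanged when under the limit', async () => {
    const result = await processTextWithTokenLimit({ text: 'hello world', tokenLimit: 100, tokenCountFn });
    expect(result.wasTruncated).toBe(false);
    expect(result.text).toBe('hello world');
  });

  it('truncates text that exceeds the limit', async () => {
    const result = await processTextWithTokenLimit({ text: 'a'.repeat(10_000), tokenLimit: 100, tokenCountFn });
    expect(result.wasTruncated).toBe(true);
    expect(result.tokenCount).toBeLessThanOrEqual(100);
  });
});
```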

* refactor: Remove Tokenizer Route and Related References

- Deleted the tokenizer route from the server and removed its references from the routes index and server files, streamlining the API structure.
- This change simplifies the routing configuration by eliminating unused endpoints.

* refactor: Migrate countTokens Utility to API Module

- Removed the local countTokens utility and integrated it into the @librechat/api module for centralized access.
- Updated various files to reference the new countTokens import from the API module, ensuring consistent usage across the application (see the before/after sketch below).
- Cleaned up unused references and imports related to the previous countTokens implementation.
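
As a hedged illustration of what this migration looks like at a call site (the old local path is hypothetical; only the `@librechat/api` import is confirmed by this commit):

```ts
// Before: consumers reached into a local server utility (hypothetical path)
// import { countTokens } from '~/server/utils/countTokens';

// After: one centralized import from the shared API package
import { countTokens } from '@librechat/api';

// countTokens may be async, so awaiting covers both cases
const tokens = await countTokens('How many tokens is this?');
```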

* refactor: Centralize escapeRegExp Utility in API Module

- Moved the escapeRegExp function from local utility files to the @librechat/api module for consistent usage across the application (see the sketch after this list).
- Updated imports in various files to reference the new centralized escapeRegExp function, ensuring cleaner code and reducing redundancy.
- Removed duplicate implementations of escapeRegExp from multiple files, streamlining the codebase.
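
For reference, a sketch of the now-shared helper; the body shown in the comment is the canonical `escapeRegExp` pattern, which may differ in detail from the module's actual implementation:

```ts
import { escapeRegExp } from '@librechat/api';

// Canonical shape: escape every RegExp metacharacter so arbitrary
// user input can be embedded in a pattern literally.
// function escapeRegExp(str: string): string {
//   return str.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
// }

const userInput = 'price is $5 (USD)';
const pattern = new RegExp(escapeRegExp(userInput), 'g'); // matches the literal string
```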

* refactor: Enhance Token Counting Flexibility in Text Processing

- Updated the `processTextWithTokenLimit` function to accept both synchronous and asynchronous token counting functions, improving its versatility (illustrated in the sketch after this list).
- Introduced a new `TokenCountFn` type to define the token counting function signature.
- Added comprehensive tests to validate the behavior of `processTextWithTokenLimit` with both sync and async token counting functions, ensuring consistent results.
- Implemented a wrapper to track call counts for the `countTokens` function, making the reduction in tokenizer calls directly measurable in tests.
- Enhanced existing tests to compare the performance of the new implementation against the old one, demonstrating significant improvements in efficiency.
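
A sketch of the flexibility `TokenCountFn` provides; the module path is assumed, and both counters are illustrative stand-ins for real tokenizers:

```ts
import { countTokens, processTextWithTokenLimit, type TokenCountFn } from '@librechat/api';

const text = 'example input '.repeat(500);

// Synchronous counter: a cheap heuristic (~4 characters per token)
const syncCounter: TokenCountFn = (t) => Math.ceil(t.length / 4);

// Asynchronous counter: defers to the shared countTokens utility
const asyncCounter: TokenCountFn = async (t) => countTokens(t);

// Both satisfy TokenCountFn; the result is awaited internally either way.
const a = await processTextWithTokenLimit({ text, tokenLimit: 512, tokenCountFn: syncCounter });
const b = await processTextWithTokenLimit({ text, tokenLimit: 512, tokenCountFn: asyncCounter });
```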

* chore: Document Truncation Safety Buffer in Token Processing

- Added a safety buffer multiplier to the character position estimates during text truncation to prevent overshooting token limits.
- Updated the `processTextWithTokenLimit` function to utilize the new `TRUNCATION_SAFETY_BUFFER` constant, enhancing the accuracy of token limit processing.
- Improved documentation to clarify the rationale behind the buffer and its impact on performance and efficiency in token counting.

Danny Avila 2025-12-02 12:22:04 -05:00 committed by GitHub
parent b2387cc6fa
commit 8bdc808074
19 changed files with 925 additions and 107 deletions


```diff
@@ -1,11 +1,39 @@
 import { logger } from '@librechat/data-schemas';
+
+/** Token count function that can be sync or async */
+export type TokenCountFn = (text: string) => number | Promise<number>;
+
+/**
+ * Safety buffer multiplier applied to character position estimates during truncation.
+ *
+ * We use 98% (0.98) rather than 100% to intentionally undershoot the target on the first attempt.
+ * This is necessary because:
+ * - Token density varies across text (some regions may have more tokens per character than the average)
+ * - The ratio-based estimate assumes uniform token distribution, which is rarely true
+ * - Undershooting is safer than overshooting: exceeding the limit requires another iteration,
+ *   while being slightly under is acceptable
+ * - In practice, this buffer reduces refinement iterations from 2-3 down to 0-1 in most cases
+ *
+ * @example
+ * // If text has 1000 chars and 250 tokens (4 chars/token average), targeting 100 tokens:
+ * // Without buffer: estimate = 1000 * (100/250) = 400 chars → might yield 105 tokens (over!)
+ * // With 0.98 buffer: estimate = 400 * 0.98 = 392 chars → likely yields 97-99 tokens (safe)
+ */
+const TRUNCATION_SAFETY_BUFFER = 0.98;
+
 /**
  * Processes text content by counting tokens and truncating if it exceeds the specified limit.
+ * Uses ratio-based estimation to minimize expensive tokenCountFn calls.
  *
  * @param text - The text content to process
  * @param tokenLimit - The maximum number of tokens allowed
- * @param tokenCountFn - Function to count tokens
+ * @param tokenCountFn - Function to count tokens (can be sync or async)
  * @returns Promise resolving to object with processed text, token count, and truncation status
+ *
+ * @remarks
+ * This function uses a ratio-based estimation algorithm instead of binary search.
+ * Binary search would require O(log n) tokenCountFn calls (~17 for 100k chars),
+ * while this approach typically requires only 2-3 calls for a 90%+ reduction in CPU usage.
  */
 export async function processTextWithTokenLimit({
   text,
@@ -14,7 +42,7 @@ export async function processTextWithTokenLimit({
 }: {
   text: string;
   tokenLimit: number;
-  tokenCountFn: (text: string) => number;
+  tokenCountFn: TokenCountFn;
 }): Promise<{ text: string; tokenCount: number; wasTruncated: boolean }> {
   const originalTokenCount = await tokenCountFn(text);
 
@@ -26,40 +54,34 @@ export async function processTextWithTokenLimit({
     };
   }
 
-  /**
-   * Doing binary search here to find the truncation point efficiently
-   * (May be a better way to go about this)
-   */
-  let low = 0;
-  let high = text.length;
-  let bestText = '';
-
   logger.debug(
     `[textTokenLimiter] Text content exceeds token limit: ${originalTokenCount} > ${tokenLimit}, truncating...`,
   );
 
-  while (low <= high) {
-    const mid = Math.floor((low + high) / 2);
-    const truncatedText = text.substring(0, mid);
-    const tokenCount = await tokenCountFn(truncatedText);
-
-    if (tokenCount <= tokenLimit) {
-      bestText = truncatedText;
-      low = mid + 1;
-    } else {
-      high = mid - 1;
-    }
+  const ratio = tokenLimit / originalTokenCount;
+  let charPosition = Math.floor(text.length * ratio * TRUNCATION_SAFETY_BUFFER);
+
+  let truncatedText = text.substring(0, charPosition);
+  let tokenCount = await tokenCountFn(truncatedText);
+
+  const maxIterations = 5;
+  let iterations = 0;
+
+  while (tokenCount > tokenLimit && iterations < maxIterations && charPosition > 0) {
+    const overageRatio = tokenLimit / tokenCount;
+    charPosition = Math.floor(charPosition * overageRatio * TRUNCATION_SAFETY_BUFFER);
+    truncatedText = text.substring(0, charPosition);
+    tokenCount = await tokenCountFn(truncatedText);
+    iterations++;
   }
 
-  const finalTokenCount = await tokenCountFn(bestText);
-
   logger.warn(
-    `[textTokenLimiter] Text truncated from ${originalTokenCount} to ${finalTokenCount} tokens (limit: ${tokenLimit})`,
+    `[textTokenLimiter] Text truncated from ${originalTokenCount} to ${tokenCount} tokens (limit: ${tokenLimit})`,
   );
 
   return {
-    text: bestText,
-    tokenCount: finalTokenCount,
+    text: truncatedText,
+    tokenCount,
     wasTruncated: true,
   };
 }
```
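
To make the `@remarks` claim above concrete, here is a hedged sketch of the call-count wrapper idea from the commit message (names and module path are illustrative):

```ts
import { processTextWithTokenLimit } from '@librechat/api'; // path assumed

// Wrap a counter so each invocation is tallied.
function withCallCount(fn: (text: string) => number) {
  let calls = 0;
  const wrapped = (text: string) => {
    calls++;
    return fn(text);
  };
  return { wrapped, getCalls: () => calls };
}

const { wrapped, getCalls } = withCallCount((text) => Math.ceil(text.length / 4));

await processTextWithTokenLimit({
  text: 'x'.repeat(100_000), // ~25,000 tokens under the heuristic
  tokenLimit: 1000,
  tokenCountFn: wrapped,
});

// Ratio-based estimation typically lands in 2-3 calls here,
// versus ~17 for a binary search over 100k characters.
console.log(getCalls());
```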