📚 docs: Separate LiteLLM and Ollama Documentation (#1948)

* Separate LiteLLM and Ollama Documentation

* Clarify Ollama Setup

* Fix litellm config
bsu3338 2024-03-02 11:42:02 -06:00 committed by GitHub
parent b2ef75e009
commit 78f52859c4
4 changed files with 179 additions and 199 deletions


@@ -1,5 +1,5 @@
---
title: 🚅 LiteLLM and Ollama
title: 🚅 LiteLLM
description: Using LibreChat with LiteLLM Proxy
weight: -7
---
@@ -12,12 +12,18 @@ Use **[LiteLLM Proxy](https://docs.litellm.ai/docs/simple_proxy)** for:
* Authentication & Spend Tracking Virtual Keys
## Start LiteLLM Proxy Server
### Pip install litellm
```shell
pip install litellm
```
### 1. Uncomment the desired sections in docker-compose.override.yml
The override file contains sections for the LiteLLM features below:
### Create a config.yaml for LiteLLM proxy
#### Caching with Redis
LiteLLM supports in-memory, Redis, and S3 caching. Note: caching currently works only with exact matches.
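To make "exact matching" concrete: a cached response is reused only when a new request is identical to an earlier one. The sketch below is a conceptual illustration, not LiteLLM's actual cache-key logic; `cache_key` is a hypothetical helper and `cksum` is a stand-in hash. It shows that two prompts differing by a single space produce different keys, so the second one misses the cache.

```shell
# Conceptual sketch only (not LiteLLM's implementation): derive a cache key
# from the model name and prompt text. cksum prints "<checksum> <byte count>".
cache_key() {
  printf '%s\n%s' "$1" "$2" | cksum
}

k1=$(cache_key mistral "What is Redis?")
k2=$(cache_key mistral "What is Redis?")
k3=$(cache_key mistral "What is Redis ?")

[ "$k1" = "$k2" ] && echo "identical request: cache hit"
[ "$k1" != "$k3" ] && echo "one extra space: cache miss"
```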
#### Performance Monitoring with Langfuse
LiteLLM supports various logging and observability options. The settings below enable Langfuse, which provides a `cache_hit` tag showing which conversations were served from the cache.
### 2. Create a config.yaml for LiteLLM proxy
LiteLLM requires a configuration file in addition to the override file. The file below includes options for proxying requests to various LLM providers, load balancing, Redis caching, and Langfuse monitoring. Review the LiteLLM documentation for other configuration options.
More information on LiteLLM configurations here: **[docs.litellm.ai/docs/simple_proxy](https://docs.litellm.ai/docs/simple_proxy)**
```yaml
@@ -40,42 +46,29 @@ model_list:
      api_base: https://openai-france-1234.openai.azure.com/
      api_key:
      rpm: 1440
  - model_name: mixtral
    litellm_params:
      model: ollama/mixtral:8x7b-instruct-v0.1-q5_K_M
      api_base: http://ollama:11434
      stream: True
  - model_name: mistral
    litellm_params:
      model: ollama/mistral
      api_base: http://ollama:11434
      stream: True
litellm_settings:
  success_callback: ["langfuse"]
  cache: True
  cache_params:
    type: "redis"
    supported_call_types: ["acompletion", "completion", "embedding", "aembedding"]
general_settings:
  master_key: sk_live_SetToRandomValue
```
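The `master_key` above is a placeholder. One way to produce a random value is sketched below. The `sk-` prefix follows the key format used in LiteLLM's own examples; treat it as a convention rather than a verified requirement. Once `master_key` is set, clients must send it as `Authorization: Bearer <key>`.

```shell
# Read 24 random bytes from /dev/urandom, print them as hex with od,
# strip od's spaces and line breaks, and prefix the result with "sk-".
MASTER_KEY="sk-$(od -An -N24 -tx1 /dev/urandom | tr -d ' \n')"
echo "$MASTER_KEY"
```

Paste the printed value into `general_settings.master_key`.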
### Start the proxy
```shell
litellm --config /path/to/config.yaml
### 3. Configure LibreChat
#INFO: Proxy running on http://0.0.0.0:8000
```
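After the proxy starts, a single request can confirm that routing and authentication work. This is a minimal sketch: `litellm_smoke_test` is a hypothetical helper name, and the model argument must match a `model_name` entry in your `config.yaml` (for example `mistral`). Pass the `master_key` value as the key when one is configured.

```shell
# Hypothetical helper: send one chat completion request through the LiteLLM proxy.
# $1 = proxy base URL, $2 = key (the master_key, if set), $3 = model_name from config.yaml
litellm_smoke_test() {
  base_url="$1"
  key="$2"
  model="$3"
  curl -sS "${base_url}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer ${key}" \
    -d "{\"model\": \"${model}\", \"messages\": [{\"role\": \"user\", \"content\": \"Say hello\"}]}"
}
```

Usage: `litellm_smoke_test http://localhost:8000 sk_live_SetToRandomValue mistral`. A JSON response containing a `choices` array indicates that the request reached a backend model.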
## Use LiteLLM Proxy Server with LibreChat
#### 1. Clone the repo
```shell
git clone https://github.com/danny-avila/LibreChat.git
```
#### 2. Modify LibreChat's `docker-compose.yml`
```yaml
services:
  api:
    environment:
      - OPENAI_REVERSE_PROXY=http://host.docker.internal:8000/v1/chat/completions
```
**Important**: As of v0.6.6, it's recommended that you use the `librechat.yaml` [Configuration file (guide here)](./custom_config.md) to add Reverse Proxies as separate endpoints.
#### 3. Save a fake OpenAI key in LibreChat's `.env`
Copy LibreChat's `.env.example` to `.env` and overwrite the default `OPENAI_API_KEY` (by default, LibreChat requires each user to provide their own key).
```env
OPENAI_API_KEY=sk-1234
```
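Step 3 can also be scripted. In the sketch below, `write_env_with_fake_key` is a hypothetical helper that copies `.env.example` to `.env` while replacing the `OPENAI_API_KEY` line. If `general_settings.master_key` is set in the LiteLLM config, pass that value instead of `sk-1234`, because the proxy then requires that key (or a virtual key issued with it).

```shell
# Hypothetical helper: create .env from .env.example with OPENAI_API_KEY replaced.
# Run it from the LibreChat repository root; it overwrites any existing .env.
# Writing sed output to a new file (instead of using sed -i) behaves the same
# with GNU sed and macOS/BSD sed. The key is placed inside a sed expression,
# so it must not contain "/" or "&".
write_env_with_fake_key() {
  key="${1:-sk-1234}"
  sed "s/^OPENAI_API_KEY=.*/OPENAI_API_KEY=${key}/" .env.example > .env
}
```

Usage: `write_env_with_fake_key sk-1234`.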
#### 4. Run LibreChat:
```shell
docker compose up
```
Use `librechat.yaml` [Configuration file (guide here)](./ai_endpoints.md) to add Reverse Proxies as separate endpoints.
---
@@ -102,162 +95,4 @@ Key components and features include:
- **Deployment and Performance**: Information on deploying LiteLLM Proxy and its performance metrics.
- **Proxy CLI Arguments**: A wide range of command-line arguments for customization.
Overall, LiteLLM Server offers a comprehensive suite of tools for managing, deploying, and interacting with a variety of LLMs, making it a versatile choice for large-scale AI applications.
## Ollama
Use [Ollama](https://ollama.ai/) for:
* Running large language models on local hardware
* Hosting multiple models
* Dynamically loading models on request
### GPU Acceleration
- **Linux**: Requires a Linux distribution supported by the official Nvidia drivers. [Nvidia CUDA Toolkit](https://developer.nvidia.com/cuda-downloads?target_os=Linux)
- **Windows**: Requires Windows Subsystem for Linux (WSL). Follow the Nvidia instructions in the [Nvidia WSL User Guide](https://docs.nvidia.com/cuda/wsl-user-guide/index.html)
- **macOS**: [macOS Ollama Download](https://ollama.ai/download/mac)
### docker-compose.override.yml with GPU
```yaml
version: "3.8"
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    volumes:
      - ./litellm/litellm-config.yaml:/app/config.yaml
    command: [ "--config", "/app/config.yaml", "--port", "8000", "--num_workers", "8" ]
  ollama:
    image: ollama/ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [compute, utility]
    ports:
      - "11434:11434"
    volumes:
      - ./ollama:/root/.ollama
```
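With the Nvidia driver and container runtime in place, the `utility` capability requested above makes `nvidia-smi` available inside the container, so a quick check can confirm that the GPU is visible to Ollama. `check_ollama_gpu` is a hypothetical helper; it assumes the container is named `ollama`, as in the `docker exec` command used for loading models. Compose may name it differently (for example `<project>-ollama-1`); `docker ps` shows the actual name.

```shell
# Hypothetical helper: list the GPUs visible inside a running container.
# $1 = container name (defaults to "ollama").
check_ollama_gpu() {
  container="${1:-ollama}"
  if docker exec "$container" nvidia-smi -L; then
    echo "GPU visible in container: $container"
  else
    echo "nvidia-smi failed in $container; check the Nvidia driver and container runtime" >&2
    return 1
  fi
}
```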
### Loading Models in Ollama
1. Browse the available models at [Ollama Library](https://ollama.ai/library)
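2. Run `docker exec -it ollama /bin/bash`
3. Copy the command from the model's Tags tab on the library website (it begins with `ollama run`) and paste it into the container shell to download and start the model.
4. Check the model size. Models that fit entirely in GPU memory perform best.
5. Type `/bye` to exit the Ollama prompt, then `exit` to leave the container shell.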
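The download part of these steps can also be done without an interactive shell by using `ollama pull`, which fetches a model without starting a chat session. `pull_ollama_models` is a hypothetical helper; like step 2, it assumes the container is named `ollama` unless you pass another name.

```shell
# Hypothetical helper: download one or more model tags into the Ollama container.
# $1 = container name, remaining arguments = model tags from the Ollama library.
pull_ollama_models() {
  container="$1"
  shift
  for model in "$@"; do
    # "ollama pull" fetches the model without starting an interactive session.
    docker exec "$container" ollama pull "$model" || return 1
  done
}
```

Usage, with the tags referenced in the LiteLLM configuration: `pull_ollama_models ollama mistral mixtral:8x7b-instruct-v0.1-q5_K_M`.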
### LiteLLM Ollama Configuration
Add the lines below to the `model_list` section of the LiteLLM config to access the Ollama models:
```yaml
  - model_name: mixtral
    litellm_params:
      model: ollama/mixtral:8x7b-instruct-v0.1-q5_K_M
      api_base: http://ollama:11434
      stream: True
  - model_name: mistral
    litellm_params:
      model: ollama/mistral
      api_base: http://ollama:11434
      stream: True
```
## Caching with Redis
LiteLLM supports in-memory, Redis, and S3 caching. Note: caching currently works only with exact matches.
### Update docker-compose.override.yml to enable Redis
Add the service below to the `services` section of your docker-compose.override.yml:
```yaml
  redis:
    image: redis:7-alpine
    command:
      - sh
      - -c # this is to evaluate the $REDIS_PASSWORD from the env
      - redis-server --appendonly yes --requirepass $$REDIS_PASSWORD ## $$ because of docker-compose
    environment:
      REDIS_PASSWORD: RedisChangeMe
    volumes:
      - ./redis:/data
```
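In the `command` above, Compose turns `$$REDIS_PASSWORD` into a literal `$REDIS_PASSWORD`, so the variable is expanded by `sh -c` inside the container from the service's `environment` value rather than by Compose on the host. The sketch below reproduces that container-side step locally, with `echo` standing in for `redis-server`, to show which arguments the server receives.

```shell
# Simulate what the container's shell runs after Compose has unescaped "$$" to "$".
# The single quotes stop the host shell from expanding the variable; "sh -c"
# expands it from the environment it was given, as happens inside the container.
REDIS_ARGS=$(REDIS_PASSWORD=RedisChangeMe sh -c 'echo redis-server --appendonly yes --requirepass $REDIS_PASSWORD')
echo "$REDIS_ARGS"
```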
Add the following environment variables to the litellm service in docker-compose.override.yml:
```yaml
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    volumes:
      - ./litellm/litellm-config.yaml:/app/config.yaml
    command: [ "--config", "/app/config.yaml", "--port", "8000", "--num_workers", "8" ]
    environment:
      REDIS_HOST: redis
      REDIS_PORT: 6379
      REDIS_PASSWORD: RedisChangeMe
```
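After restarting the stack, you can check the Redis service name and password before LiteLLM relies on them. `check_redis` is a hypothetical helper that runs `redis-cli ping` inside the `redis` service via `docker compose exec`; Redis replies `PONG` when the password is accepted.

```shell
# Hypothetical helper: confirm the redis service accepts the configured password.
# $1 = the REDIS_PASSWORD value from docker-compose.override.yml.
# -T disables TTY allocation so the reply can be captured cleanly;
# --no-auth-warning hides redis-cli's warning about passing -a on the command line.
check_redis() {
  reply=$(docker compose exec -T redis redis-cli --no-auth-warning -a "$1" ping)
  if [ "$reply" = "PONG" ]; then
    echo "Redis is reachable and the password is correct"
  else
    echo "Unexpected reply from Redis: $reply" >&2
    return 1
  fi
}
```

Usage: `check_redis RedisChangeMe`.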
### Update LiteLLM Config File
Add the options below to the LiteLLM config file:
```yaml
litellm_settings: # module level litellm settings - https://github.com/BerriAI/litellm/blob/main/litellm/__init__.py
  cache: True # set cache responses to True, litellm defaults to using a redis cache
  cache_params: # cache_params are optional
    type: "redis" # The type of cache to initialize. Can be "local" or "redis". Defaults to "local".
    # Optional configurations
    supported_call_types: ["acompletion", "completion", "embedding", "aembedding"] # defaults to all litellm call types
```
## Performance Monitoring with Langfuse
LiteLLM supports various logging and observability options. The settings below enable Langfuse, which provides a `cache_hit` tag showing which conversations were served from the cache.
### Update docker-compose.override.yml to enable Langfuse
Langfuse requires a PostgreSQL database, so add both the postgres (`db`) and `langfuse-server` services to docker-compose.override.yml:
```yaml
  langfuse-server:
    image: ghcr.io/langfuse/langfuse:latest
    depends_on:
      - db
    ports:
      - "3000:3000"
    environment:
      - NODE_ENV=production
      - DATABASE_URL=postgresql://postgres:PostgresChangeMe@db:5432/postgres
      - NEXTAUTH_SECRET=ChangeMe
      - SALT=ChangeMe
      - NEXTAUTH_URL=http://localhost:3000
      - TELEMETRY_ENABLED=${TELEMETRY_ENABLED:-true}
      - NEXT_PUBLIC_SIGN_UP_DISABLED=${NEXT_PUBLIC_SIGN_UP_DISABLED:-false}
      - LANGFUSE_ENABLE_EXPERIMENTAL_FEATURES=${LANGFUSE_ENABLE_EXPERIMENTAL_FEATURES:-false}
  db:
    image: postgres
    restart: always
    environment:
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=PostgresChangeMe
      - POSTGRES_DB=postgres
    volumes:
      - ./postgres:/var/lib/postgresql/data
```
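The `NEXTAUTH_SECRET` and `SALT` values above are placeholders. One way to generate random replacements is sketched below: it reads 32 random bytes and base64-encodes them, the same shape of value produced by the `openssl rand -base64 32` command that the NextAuth.js documentation suggests for `NEXTAUTH_SECRET` (that command works too if OpenSSL is installed).

```shell
# Generate one random value per placeholder: 32 bytes from /dev/urandom,
# base64-encoded (44 characters). tr removes any line break base64 adds.
NEXTAUTH_SECRET=$(head -c 32 /dev/urandom | base64 | tr -d '\n')
SALT=$(head -c 32 /dev/urandom | base64 | tr -d '\n')
printf 'NEXTAUTH_SECRET=%s\nSALT=%s\n' "$NEXTAUTH_SECRET" "$SALT"
```

Replace the two `ChangeMe` entries in the `langfuse-server` environment with the printed lines.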
Once Langfuse is running, create an account in the web interface on port 3000. Then create a new project to obtain the public and secret keys that LiteLLM needs.
Add these environment variables to the litellm service in docker-compose.override.yml:
```yaml
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "8000:8000"
    volumes:
      - /srv/litellm/config/litellm-config.yaml:/app/config.yaml
    command: [ "--config", "/app/config.yaml", "--port", "8000", "--num_workers", "8" ]
    environment:
      LANGFUSE_PUBLIC_KEY: pk-lf-RandomStringFromLangfuseWebInterface
      LANGFUSE_SECRET_KEY: sk-lf-RandomStringFromLangfuseWebInterface
      LANGFUSE_HOST: http://langfuse-server:3000
```
### Update LiteLLM Config File
```yaml
litellm_settings:
  success_callback: ["langfuse"]
```