This commit is contained in:
Dane Fetterman 2026-04-05 01:31:15 +00:00 committed by GitHub
commit 11a17cf346
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
18 changed files with 5077 additions and 0 deletions

deploy/aws-sam/.gitignore vendored Normal file

@@ -0,0 +1,77 @@
# SAM build artifacts
.aws-sam/
samconfig.toml.bak
# Environment files
.env
.env.local
.env.production
.env.staging
.env.development
.librechat-deploy-config*
# AWS credentials
.aws/
# Logs
*.log
logs/
# OS generated files
.DS_Store
.DS_Store?
._*
.Spotlight-V100
.Trashes
ehthumbs.db
Thumbs.db
# IDE files
.vscode/
.idea/
*.swp
*.swo
*~
# Temporary files
*.tmp
*.temp
.cache/
donotcommit.txt
repo/
# Node modules (if any)
node_modules/
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
# Secrets and sensitive data
secrets.yaml
secrets.json
*.pem
*.key
*.crt
# Backup files
*.bak
*.backup

deploy/aws-sam/README.md Normal file

@@ -0,0 +1,724 @@
# LibreChat AWS SAM Deployment
This repository contains AWS SAM templates and scripts to deploy LibreChat on AWS with maximum scalability and high availability.
## What is LibreChat?
LibreChat is an enhanced, open-source ChatGPT clone that provides:
- **Multi-AI Provider Support**: OpenAI, Anthropic, Google Gemini, AWS Bedrock, Azure OpenAI, and more
- **Advanced Features**: Agents, function calling, file uploads, conversation search, code interpreter
- **Secure Multi-User**: Authentication, user management, conversation privacy
- **Extensible**: Plugin system, custom endpoints, RAG integration
- **Self-Hosted**: Complete control over your data and infrastructure
## Architecture Overview
This deployment creates a highly scalable, production-ready LibreChat environment optimized for enterprise use:
### Core Infrastructure (Scalability-First Design)
- **ECS Fargate**: Serverless container orchestration with auto-scaling (2-20 instances)
- **Application Load Balancer**: High availability with health checks and SSL termination
- **VPC**: Multi-AZ setup with public/private subnets and flexible internet connectivity options
- **Internet Connectivity**: Choose between NAT Gateways (standard AWS pattern) or Transit Gateway (existing infrastructure)
- **Auto Scaling**: CPU-based scaling with target tracking (70% CPU utilization)
### Data & Storage Layer
- **DocumentDB**: MongoDB-compatible database with multi-AZ deployment and automatic failover
- **ElastiCache Redis**: In-memory caching, session storage, and conversation search with failover
- **S3**: Encrypted file storage for user uploads, avatars, documents, and static assets
### Internet Connectivity Options
The deployment supports two network connectivity patterns:
**Option 1: NAT Gateway (Standard AWS Pattern)**
- **High Availability**: NAT Gateways in each AZ with automatic failover
- **Enterprise Performance**: Up to 45 Gbps bandwidth per gateway
- **Zero Maintenance**: Fully managed by AWS with 99.95% SLA
- **Cost**: ~$90/month for 2 NAT Gateways + data processing fees
- **Use Case**: New deployments or when maximum reliability is required
**Option 2: Transit Gateway (Existing Infrastructure)**
- **Cost Optimization**: No NAT Gateway costs (~$90/month savings)
- **Existing Infrastructure**: Leverages existing Transit Gateway setup
- **Controlled Routing**: Uses existing network policies and routing
- **Use Case**: Organizations with existing Transit Gateway infrastructure
### Security & Monitoring
- **Secrets Manager**: Secure storage for database passwords, JWT secrets, and API keys
- **CloudWatch**: Centralized logging, monitoring, and alerting
- **Security Groups**: Network-level security with least privilege access
- **IAM Roles**: Fine-grained permissions for ECS tasks and AWS service access
### Advanced Scalability Features
- **Fargate Spot Integration**: 80% Spot instances + 20% On-Demand for cost optimization
- **Multi-AZ High Availability**: Automatic failover across multiple availability zones
- **Horizontal Auto Scaling**: Scales from 2-20 instances based on CPU utilization
- **Load Balancing**: Intelligent traffic distribution across healthy instances
- **Container Health Checks**: Automatic replacement of unhealthy containers
- **Database Read Replicas**: DocumentDB supports read scaling for high-traffic scenarios
- **Redis Clustering**: ElastiCache supports cluster mode for memory scaling
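The 80/20 Spot split above is typically declared as an ECS capacity provider strategy on the service. A minimal sketch of what that looks like in CloudFormation/SAM (resource and parameter names here are illustrative, not necessarily those used in this repo's `template.yaml`):

```yaml
# Illustrative sketch only -- check template.yaml for the actual resources.
LibreChatService:
  Type: AWS::ECS::Service
  Properties:
    Cluster: !Ref ECSCluster
    CapacityProviderStrategy:
      - CapacityProvider: FARGATE_SPOT   # interruptible capacity, large discount
        Weight: 80
      - CapacityProvider: FARGATE        # on-demand baseline for availability
        Weight: 20
```

With this strategy, ECS places roughly 4 of every 5 tasks on Spot capacity and falls back to On-Demand when Spot is interrupted.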
## Prerequisites
1. **AWS CLI** - [Installation Guide](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html)
2. **SAM CLI** - [Installation Guide](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/install-sam-cli.html)
3. **AWS Account** with appropriate permissions and network topology
4. **Domain & SSL Certificate** (for custom domain)
5. **AWS Cognito User Pool** (optional - for SSO authentication)
### SSO Prerequisites (Optional)
If you plan to use SSO authentication:
- **AWS Cognito User Pool** with configured identity providers
- **App Client** created in the Cognito User Pool with appropriate settings
- **Identity Provider** (SAML, OIDC, or social) configured in Cognito
- **Attribute mappings** configured in Cognito for user claims (name, email)
### Required AWS Permissions
Your AWS user/role needs permissions for:
- CloudFormation (full access)
- ECS (full access)
- EC2 (VPC, Security Groups, Load Balancers)
- DocumentDB (full access)
- ElastiCache (full access)
- S3 (bucket creation and management)
- IAM (role creation)
- Secrets Manager (secret creation)
- CloudWatch (log groups)
- STS (checking caller identity)
## Quick Start
### Interactive Deployment (Recommended)
1. **Clone and configure:**
```bash
git clone <this-repo>
cd librechat-aws-sam
# Configure AWS credentials
aws configure
```
2. **Run interactive deployment:**
```bash
./deploy-clean.sh
```
The script will interactively prompt for:
- Environment (dev/staging/prod)
- AWS region
- Stack name
- Internet connectivity option (NAT Gateway vs Transit Gateway)
- VPC ID (with helpful VPC listing)
- Public subnet IDs (for load balancer)
- Private subnet IDs (for ECS tasks and databases)
- AWS Bedrock credentials for AI model access
- Optional SSO configuration with AWS Cognito
- Optional domain name and SSL certificate
3. **Save configuration for future deployments:**
The script automatically offers to save your configuration to `.librechat-deploy-config`
4. **Redeploy with saved configuration:**
```bash
./deploy-clean.sh --load-config
```
5. **Update the YAML config only:**
To update the `librechat.yaml` config file and restart the containers without a full redeploy:
```bash
./deploy-clean.sh --update-config
```
## Deployment Options
### Interactive Deployment (Recommended)
```bash
# First-time deployment
./deploy-clean.sh
# Redeploy with saved configuration
./deploy-clean.sh --load-config
# Reset saved configuration
./deploy-clean.sh --reset-config
# Update the YAML config file and restart containers only
./deploy-clean.sh --update-config
```
The interactive deployment provides:
- **Guided Setup**: Step-by-step prompts for all parameters
- **AWS Resource Discovery**: Lists available VPCs and subnets
- **Validation**: Checks VPC and subnet accessibility
- **Configuration Persistence**: Saves settings for future deployments
- **Smart Defaults**: Remembers previous choices
## Configuration
### Deploy script configuration (`.librechat-deploy-config`)
The deploy script saves your choices to `.librechat-deploy-config` and reloads them with `--load-config`. You can also edit this file to set or change options without re-prompting.
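The file itself is plain shell-variable syntax (it is `source`d by the scripts). A hypothetical excerpt, using only keys that appear elsewhere in this repo's docs (`STACK_NAME` and `REGION` are referenced in `scripts/README`; `LIBRECHAT_IMAGE` is described below):

```bash
# Hypothetical excerpt of .librechat-deploy-config -- the deploy script
# writes the authoritative version with whatever keys it actually uses.
STACK_NAME="librechat"
REGION="us-east-1"
LIBRECHAT_IMAGE=""   # leave empty to use the template's default image
```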
**Optional: Custom container image (`LIBRECHAT_IMAGE`)**
By default, the stack uses the container image defined in the template (e.g. the official `librechat/librechat:latest` or a template default). To use a custom image (e.g. your own ECR build), set `LIBRECHAT_IMAGE` in your deploy config:
```bash
LIBRECHAT_IMAGE="<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>"
```
Then deploy with the config loaded so the parameter is applied:
```bash
./deploy-clean.sh --load-config
```
If `LIBRECHAT_IMAGE` is unset or empty, the template's default image is used.
### Environment Variables
The deployment automatically configures these environment variables for LibreChat:
**Core Application Settings:**
- `NODE_ENV`: Set to "production"
- `MONGO_URI`: DocumentDB connection string with SSL and authentication
- `REDIS_URI`: ElastiCache Redis connection string
- `NODE_TLS_REJECT_UNAUTHORIZED`: Set to "0" for DocumentDB SSL compatibility
- `ALLOW_REGISTRATION`: Set to "false" (configure SAML post-deployment)
**Security & Authentication:**
- `JWT_SECRET`: Auto-generated secure JWT secret (stored in Secrets Manager)
- `JWT_REFRESH_SECRET`: Auto-generated refresh token secret (stored in Secrets Manager)
- `CREDS_KEY`: Auto-generated credentials encryption key (stored in Secrets Manager)
- `CREDS_IV`: Auto-generated encryption IV (stored in Secrets Manager)
**SSO Authentication (Optional):**
- `ENABLE_SSO`: Set to "true" to enable SSO authentication
- `COGNITO_USER_POOL_ID`: AWS Cognito User Pool ID
- `OPENID_CLIENT_ID`: App Client ID from Cognito User Pool
- `OPENID_CLIENT_SECRET`: App Client Secret from Cognito User Pool
- `OPENID_SCOPE`: OpenID scope for authentication (default: `openid profile email`)
- `OPENID_BUTTON_LABEL`: Login button text (default: `Sign in with SSO`)
- `OPENID_NAME_CLAIM`: Name attribute mapping (default: `name`)
- `OPENID_EMAIL_CLAIM`: Email attribute mapping (default: `email`)
- `OPENID_SESSION_SECRET`: Auto-generated session secret (stored in Secrets Manager)
- `OPENID_ISSUER`: Auto-configured Cognito issuer URL
- `OPENID_CALLBACK_URL`: Auto-configured callback URL (`/oauth/openid/callback`)
**AWS Bedrock Configuration:**
- `AWS_REGION`: Deployment region for AWS services
- `BEDROCK_AWS_DEFAULT_REGION`: AWS region for Bedrock API calls
- `BEDROCK_AWS_ACCESS_KEY_ID`: AWS access key for Bedrock access (from deployment parameters)
- `BEDROCK_AWS_SECRET_ACCESS_KEY`: AWS secret key for Bedrock access (from deployment parameters)
- `BEDROCK_AWS_MODELS`: Pre-configured Bedrock models including:
- `us.anthropic.claude-3-7-sonnet-20250219-v1:0`
- `us.anthropic.claude-opus-4-20250514-v1:0`
- `us.anthropic.claude-sonnet-4-20250514-v1:0`
- `us.anthropic.claude-3-5-haiku-20241022-v1:0`
- `us.meta.llama3-3-70b-instruct-v1:0`
- `us.amazon.nova-pro-v1:0`
**Configuration Management:**
- `CONFIG_PATH`: Set to "/app/config/librechat.yaml" (mounted from EFS)
- `CACHE`: Set to "false" to disable prompt caching (avoids Bedrock caching issues)
### EFS Configuration System
The deployment includes an EFS-based configuration management system:
- **Real-time Updates**: Configuration changes without container rebuilds
- **S3 → EFS Pipeline**: Automated sync from S3 to EFS via Lambda
- **Container Mounting**: EFS volume mounted at `/app/config/librechat.yaml`, with the `CONFIG_PATH` environment variable set to match
- **Update Commands**: Use `./deploy-clean.sh --update-config` for config-only updates
### Scaling Configuration
Default scaling settings:
- **Min Capacity**: 2 instances
- **Max Capacity**: 20 instances
- **Target CPU**: 70% utilization
- **Scale Out Cooldown**: 5 minutes
- **Scale In Cooldown**: 5 minutes
To modify scaling, edit the `ECSAutoScalingTarget` and `ECSAutoScalingPolicy` resources in `template.yaml`.
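For orientation, a target-tracking setup matching the defaults above looks roughly like this. This is a sketch following CloudFormation's Application Auto Scaling resource types; the actual resources in this repo's `template.yaml` are authoritative and may differ:

```yaml
# Sketch only -- edit the real ECSAutoScalingTarget/ECSAutoScalingPolicy
# resources in template.yaml rather than copying this verbatim.
ECSAutoScalingTarget:
  Type: AWS::ApplicationAutoScaling::ScalableTarget
  Properties:
    ServiceNamespace: ecs
    ScalableDimension: ecs:service:DesiredCount
    MinCapacity: 2
    MaxCapacity: 20

ECSAutoScalingPolicy:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyName: cpu-target-tracking
    PolicyType: TargetTrackingScaling
    ScalingTargetId: !Ref ECSAutoScalingTarget
    TargetTrackingScalingPolicyConfiguration:
      PredefinedMetricSpecification:
        PredefinedMetricType: ECSServiceAverageCPUUtilization
      TargetValue: 70.0
      ScaleOutCooldown: 300   # 5 minutes
      ScaleInCooldown: 300    # 5 minutes
```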
### Database Configuration
**DocumentDB (MongoDB-compatible):**
- Instance Class: `db.t3.medium` (2 instances)
- Backup Retention: 7 days
- Encryption: Enabled
- Multi-AZ: Yes
**ElastiCache Redis:**
- Node Type: `cache.t3.micro` (2 nodes)
- Engine Version: 7.0
- Encryption: At-rest and in-transit
- Multi-AZ: Yes with automatic failover
## LibreChat Dependencies & Features
### Core Dependencies Deployed
- **MongoDB/DocumentDB**: Primary database for conversations, users, and metadata
- **Redis/ElastiCache**: Session management, caching, and real-time features
- **S3**: File storage with support for multiple strategies:
- **Avatars**: User and agent profile images
- **Images**: Chat image uploads and generations
- **Documents**: PDF uploads, text files, and attachments
- **Static Assets**: CSS, JavaScript, and other static content
### Optional Components (Can Be Added)
- **Meilisearch**: Full-text search for conversation history with typo tolerance
- **Vector Database**: For RAG (Retrieval-Augmented Generation) functionality
- **CDN**: CloudFront integration for global content delivery
### File Storage Strategies
LibreChat supports multiple storage strategies that can be mixed:
- **S3**: Scalable cloud storage (configured in this deployment)
## Post-Deployment Setup
### 1. Access LibreChat
After deployment completes (15-20 minutes), access LibreChat using the Load Balancer URL:
```bash
# Get the application URL
aws cloudformation describe-stacks \
--stack-name librechat \
--query 'Stacks[0].Outputs[?OutputKey==`LoadBalancerURL`].OutputValue' \
--output text
```
The application will be available at: `http://your-load-balancer-url` (or `https://` if you configured SSL)
### 2. Initial Admin Setup
1. **First User Registration**: The first user to register becomes the admin
<!-- 2. **Admin Panel Access**: Navigate to `/admin` after logging in as admin
3. **User Management**: Control user registration and permissions -->
### 3. Configure SSO Authentication (Optional)
**Prerequisites:**
- AWS Cognito User Pool created and configured
- App Client created in the User Pool with appropriate settings
- Identity Provider configured in Cognito (SAML, OIDC, or social providers)
- Attribute mappings configured in Cognito
**SSO Configuration Options:**
The deployment supports optional SSO authentication through AWS Cognito with OpenID Connect:
**Required SSO Settings:**
- `ENABLE_SSO`: Set to "true" to enable SSO authentication
- `COGNITO_USER_POOL_ID`: Your AWS Cognito User Pool ID (e.g., `us-east-1_8o9DM3lHZ`)
- `OPENID_CLIENT_ID`: App Client ID from your Cognito User Pool
- `OPENID_CLIENT_SECRET`: App Client Secret from your Cognito User Pool
**Optional SSO Settings:**
- `OPENID_SCOPE`: OpenID scope for authentication (default: `openid profile email`)
- `OPENID_BUTTON_LABEL`: Login button text (default: `Sign in with SSO`)
- `OPENID_NAME_CLAIM`: Name attribute mapping (default: `name`)
- `OPENID_EMAIL_CLAIM`: Email attribute mapping (default: `email`)
**Automatic Configuration:**
The deployment automatically configures:
- `OPENID_ISSUER`: Cognito issuer URL (`https://cognito-idp.{region}.amazonaws.com/{user-pool-id}`)
- `OPENID_CALLBACK_URL`: OAuth callback URL (`/oauth/openid/callback`)
- `OPENID_SESSION_SECRET`: Secure session secret (auto-generated and stored in Secrets Manager)
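Concretely, the issuer URL is derived mechanically from the region and User Pool ID alone. A small sketch of that construction (the pool ID below is a placeholder, not a real pool):

```bash
# Build the Cognito issuer URL the same way the deployment derives it.
# USER_POOL_ID is a hypothetical example value.
REGION="us-east-1"
USER_POOL_ID="us-east-1_EXAMPLE1"
OPENID_ISSUER="https://cognito-idp.${REGION}.amazonaws.com/${USER_POOL_ID}"
echo "$OPENID_ISSUER"
```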
**Configuration Methods:**
1. **During Deployment**: The interactive deployment script will prompt for SSO settings
2. **Post-Deployment**: Update the CloudFormation stack with SSO parameters
3. **Environment Variables**: Configure directly in the ECS task definition
**SSO Setup Steps:**
1. **Create AWS Cognito User Pool**:
- Create a new User Pool in AWS Cognito
- Configure sign-in options (email, username, etc.)
- Set up password policies and MFA if desired
- Configure attribute mappings for name and email
2. **Create App Client**:
- Create an App Client in your User Pool
- Enable "Generate client secret"
- Configure OAuth 2.0 settings:
- Allowed OAuth Flows: Authorization code grant
- Allowed OAuth Scopes: openid, profile, email
- Callback URLs: `https://your-domain/oauth/openid/callback`
- Sign out URLs: `https://your-domain`
3. **Configure Identity Provider (Optional)**:
- Add SAML, OIDC, or social identity providers to Cognito
- Configure attribute mappings between IdP and Cognito
- Test the identity provider integration
4. **Deploy with SSO**:
```bash
./deploy-clean.sh
# Choose "y" when prompted for SSO configuration
# Provide the required Cognito User Pool ID, Client ID, and Client Secret
```
5. **Verify SSO Integration**:
- Access LibreChat URL
- Click the SSO login button (customizable label)
- Complete authentication flow through Cognito
- Verify user attributes are mapped correctly
**Important Notes:**
- SSO configuration is completely optional
- If SSO is not configured, LibreChat uses standard email/password authentication
- SSO settings can be added or modified after initial deployment
- Ensure Cognito User Pool and App Client configuration is complete before enabling SSO
- The callback URL must match exactly what's configured in your Cognito App Client
**Adding SSO After Initial Deployment:**
If you deployed without SSO initially, you can add it later:
1. **Update CloudFormation Stack**:
```bash
aws cloudformation update-stack \
--stack-name your-stack-name \
--use-previous-template \
--parameters ParameterKey=EnableSSO,ParameterValue="true" \
ParameterKey=CognitoUserPoolId,ParameterValue="your-user-pool-id" \
ParameterKey=OpenIdClientId,ParameterValue="your-client-id" \
ParameterKey=OpenIdClientSecret,ParameterValue="your-client-secret" \
--capabilities CAPABILITY_IAM
```
2. **Or Re-run Deployment Script**:
```bash
./deploy-clean.sh --load-config
# Choose "y" for SSO configuration when prompted
```
**Supported Identity Providers:**
Through AWS Cognito, you can integrate with:
- **SAML 2.0**: Enterprise identity providers (Active Directory, Okta, etc.)
- **OpenID Connect**: OIDC-compliant providers
- **Social Providers**: Google, Facebook, Amazon, Apple
- **Custom Providers**: Any OAuth 2.0 or SAML 2.0 compliant system
### 4. Set Up AI Provider API Keys
Configure your AI providers in the LibreChat interface:
**Supported Providers:**
- **OpenAI**: GPT-4, GPT-3.5, DALL-E, Whisper
- **Anthropic**: Claude 3.5 Sonnet, Claude 3 Opus/Haiku
- **Google**: Gemini Pro, Gemini Vision
- **Azure OpenAI**: Enterprise OpenAI models
- **AWS Bedrock**: Claude, Titan, Llama models
- **Groq**: Fast inference for Llama, Mixtral
- **OpenRouter**: Access to multiple model providers
- **Custom Endpoints**: Any OpenAI-compatible API
**Configuration Methods:**
- **Environment Variables**: Pre-configure in deployment (more secure)
- **YAML File**: Certain configuration options are set via `librechat.yaml`
<!-- ### 5. File Upload & Storage Configuration
The deployment automatically configures S3 for file storage:
- **Upload Limits**: Configure max file sizes in admin panel
- **Supported Formats**: PDFs, images, text files, code files
- **Storage Strategy**: S3 (configured automatically)
- **CDN Integration**: Ready for CloudFront if needed -->
### 5. Advanced Configuration Options
<!-- **Conversation Search (Optional):**
- Deploy Meilisearch for full-text conversation search
- Enables typo-tolerant search across chat history
- Can be added as additional ECS service
**RAG Integration (Optional):**
- Configure vector database for document Q&A
- Supports PDF uploads with semantic search
- Integrates with embedding providers
**Rate Limiting:**
- Configure per-user rate limits
- Set up token usage tracking
- Monitor costs across providers -->
### 6. Monitoring & Maintenance
**CloudWatch Dashboards:**
- ECS service metrics (CPU, memory, task count)
- Load balancer performance (response time, error rates)
- Database metrics (DocumentDB and Redis)
- Application logs and error tracking
**Automated Scaling:**
- Monitors CPU utilization (target: 70%)
- Scales from 2-20 instances automatically
- Uses 80% Spot instances for cost optimization
**Health Checks:**
- Application-level health checks
- Database connectivity monitoring
- Automatic unhealthy task replacement
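On the load balancer side, these checks can be sketched as a target group resource. The values below are illustrative assumptions (port 3080 is LibreChat's default, and `/health` is a common liveness path); verify both against the repo's `template.yaml`:

```yaml
# Sketch only -- check path, port, and thresholds against template.yaml.
LibreChatTargetGroup:
  Type: AWS::ElasticLoadBalancingV2::TargetGroup
  Properties:
    TargetType: ip                 # required for Fargate tasks
    Port: 3080                     # LibreChat's default container port
    Protocol: HTTP
    VpcId: !Ref VpcId
    HealthCheckPath: /health
    HealthCheckIntervalSeconds: 30
    HealthyThresholdCount: 2
    UnhealthyThresholdCount: 3
```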
## Monitoring and Maintenance
### CloudWatch Logs
View application logs:
```bash
aws logs tail /ecs/librechat --follow
```
### ECS Service Status
Check service health:
```bash
aws ecs describe-services --cluster librechat-cluster --services librechat-service
```
### Database Monitoring
- DocumentDB metrics available in CloudWatch
- ElastiCache Redis metrics and performance insights
- Set up CloudWatch alarms for critical metrics
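As a starting point, a CPU alarm on the ECS service can be expressed as a CloudFormation resource like the following. The threshold, names, and the commented SNS topic are assumptions to adapt, not values from this repo:

```yaml
# Hypothetical alarm sketch; adjust names and threshold to your stack.
HighCPUAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: ECS service CPU sustained above 85%
    Namespace: AWS/ECS
    MetricName: CPUUtilization
    Dimensions:
      - Name: ClusterName
        Value: librechat-cluster
      - Name: ServiceName
        Value: librechat-service
    Statistic: Average
    Period: 300
    EvaluationPeriods: 2
    Threshold: 85
    ComparisonOperator: GreaterThanThreshold
    # AlarmActions: [ !Ref AlertsTopic ]   # hypothetical SNS topic
```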
### Cost Optimization
- Monitor Fargate Spot vs On-Demand usage
- Review DocumentDB and ElastiCache instance sizes
- Set up billing alerts
## Scaling Considerations
### Horizontal Scaling (Automatic)
The deployment automatically handles horizontal scaling:
**ECS Auto Scaling:**
- **Minimum**: 2 instances (high availability)
- **Maximum**: 20 instances (configurable)
- **Trigger**: 70% CPU utilization average
- **Scale Out**: Add instances when CPU > 70% for 5 minutes
- **Scale In**: Remove instances when CPU < 70% for 5 minutes
- **Cooldown**: 5-minute intervals between scaling actions
**Database Scaling:**
- **DocumentDB**: Supports up to 15 read replicas for read scaling
- **ElastiCache Redis**: Supports cluster mode for memory scaling
- **Connection Pooling**: Efficient database connection management
### Vertical Scaling (Manual)
For higher per-instance performance:
**ECS Task Scaling:**
```yaml
# In template.yaml, modify:
Cpu: 2048 # Double CPU (1024 -> 2048)
Memory: 4096 # Double memory (2048 -> 4096)
```
**Database Scaling:**
```yaml
# Upgrade DocumentDB instances:
DBInstanceClass: db.r5.large # From db.t3.medium
DBInstanceClass: db.r5.xlarge # For heavy workloads
# Upgrade Redis instances:
NodeType: cache.r6g.large # From cache.t3.micro
```
### Global Scaling (Multi-Region)
For worldwide deployment:
<!-- 1. **Deploy in Multiple Regions**:
```bash
./deploy.sh -r us-east-1 -s librechat-us-east
./deploy.sh -r eu-west-1 -s librechat-eu-west
./deploy.sh -r ap-southeast-1 -s librechat-asia
```
2. **Route 53 Setup**:
- Health checks for each region
- Latency-based routing
- Automatic failover
3. **Data Synchronization**:
- DocumentDB Global Clusters
- S3 Cross-Region Replication
- Redis Global Datastore -->
### Load Testing
Before production deployment, perform load testing:
```bash
# Example load test with Apache Bench
ab -n 10000 -c 100 http://your-load-balancer-url/
# Or use more sophisticated tools:
# - Artillery.io for API testing
# - JMeter for comprehensive testing
# - Locust for Python-based testing
```
### Capacity Planning
Plan for growth with these guidelines:
**User Scaling:**
- **Light Users**: 1 instance per 100 concurrent users
- **Medium Users**: 1 instance per 50 concurrent users
- **Heavy Users**: 1 instance per 25 concurrent users
**Database Scaling:**
- **DocumentDB**: 1000 connections per db.t3.medium
- **Redis**: 65,000 connections per cache.t3.micro
- **Storage**: Plan 1GB per 1000 conversations
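Those guidelines translate directly into a sizing estimate. A throwaway helper that ceiling-divides expected concurrent users by the users-per-instance ratio (the ratios are the rough guidelines above, not measured limits):

```bash
# Estimate ECS task count from concurrent users (ceiling division).
# Ratios come from the capacity-planning guidelines above.
estimate_instances() {
  users=$1
  per_instance=$2
  echo $(( (users + per_instance - 1) / per_instance ))
}

estimate_instances 300 100   # light users  -> 3 instances
estimate_instances 500 50    # medium users -> 10 instances
estimate_instances 500 25    # heavy users  -> 20 instances (the default max)
```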
## Security Best Practices
### Network Security
- All databases in private subnets
- Security groups with minimal required access
- Optional NAT gateways or Transit Gateway for outbound internet access
- Flexible internet connectivity based on existing infrastructure
### Data Security
- Encryption at rest for all data stores
- Encryption in transit for Redis
- S3 bucket encryption and versioning
- Secrets Manager for sensitive data
### Access Control
- IAM roles with least privilege
- ECS task roles for service-specific permissions
- No hardcoded credentials
## Troubleshooting
### Common Issues
**Deployment Fails:**
```bash
# Check CloudFormation events
aws cloudformation describe-stack-events --stack-name librechat
# Check SAM logs
sam logs -n ECSService --stack-name librechat
```
**Service Won't Start:**
```bash
# Check ECS task logs
aws ecs describe-tasks --cluster librechat-cluster --tasks <task-arn>
# Check CloudWatch logs
aws logs tail /ecs/librechat --follow
```
**Database Connection Issues:**
- Verify security group rules
- Check DocumentDB cluster status
- Validate connection strings in Secrets Manager
### Performance Issues
- Monitor ECS service CPU/memory utilization
- Check DocumentDB performance insights
- Review ElastiCache Redis metrics
- Analyze ALB target group health
## Cleanup
To remove all resources:
```bash
aws cloudformation delete-stack --stack-name librechat
```
**Note:** This will delete all data. Ensure you have backups if needed.
## Cost Optimization & Estimation
### Cost Optimization Features
This deployment is optimized for cost efficiency while maintaining high availability:
**Fargate Spot Integration:**
- **80% Spot Instances**: Up to 70% cost savings on compute
- **20% On-Demand**: Ensures availability during Spot interruptions
- **Automatic Failover**: Seamless transition between Spot and On-Demand
**Right-Sizing Strategy:**
- **Auto Scaling**: Only pay for resources you need (2-20 instances)
- **Efficient Instance Types**: Optimized CPU/memory ratios
- **Database Optimization**: DocumentDB and Redis sized for typical workloads
**Storage Optimization:**
- **S3 Intelligent Tiering**: Automatic cost optimization for file storage
- **Lifecycle Policies**: Automatic cleanup of incomplete uploads
- **Compression**: Efficient storage of conversation data
### Monthly Cost Estimation (US-East-1)
**Base Infrastructure (Minimum 2 instances):**
- **ECS Fargate (2 instances)**: ~$30-50/month
- 80% Spot pricing: ~$24-40/month
- 20% On-Demand: ~$6-10/month
- **DocumentDB (2x db.t3.medium)**: ~$100-120/month
- **ElastiCache Redis (2x cache.t3.micro)**: ~$30-40/month
- **Application Load Balancer**: ~$20/month
- **NAT Gateway (2 AZs) - Optional**: ~$90/month
- **Base cost**: $45/month per NAT Gateway × 2 = $90/month
- **Data processing**: $0.045 per GB processed
- **High availability**: Automatic failover between AZs
- **Performance**: Up to 45 Gbps bandwidth per gateway
- **S3 Storage**: ~$5-25/month (depending on usage)
- **Data Transfer**: ~$10-30/month (depending on traffic)
**Total Monthly Cost Ranges:**
**With NAT Gateways (Standard AWS Pattern):**
- **Light Usage (2-3 instances)**: ~$285-335/month
- **Medium Usage (5-8 instances)**: ~$380-480/month
- **Heavy Usage (10-20 instances)**: ~$530-830/month
**Without NAT Gateways (Transit Gateway Pattern):**
- **Light Usage (2-3 instances)**: ~$195-245/month
- **Medium Usage (5-8 instances)**: ~$290-390/month
- **Heavy Usage (10-20 instances)**: ~$440-740/month
**NAT Gateway vs Transit Gateway Comparison:**
- **NAT Gateway Benefits**: 99.95% SLA, zero maintenance, 45 Gbps performance, built-in DDoS protection
- **Transit Gateway Benefits**: ~$90/month cost savings, leverages existing infrastructure, centralized routing
- **Cost Difference**: ~$90/month for NAT Gateway option
- **Performance**: NAT Gateway typically faster for internet access, Transit Gateway may have additional latency
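The cost delta above can be sanity-checked with a quick calculator built from the per-gateway ($45/month) and per-GB ($0.045) figures quoted earlier; verify both constants against current AWS pricing before relying on them:

```bash
# NAT Gateway monthly cost: $45/month per gateway + $0.045 per GB processed.
# Constants are the figures quoted in this README, not live pricing.
nat_monthly_cost() {
  awk -v g="$1" -v gb="$2" 'BEGIN { printf "%.2f\n", g * 45 + gb * 0.045 }'
}

nat_monthly_cost 2 0      # 2 AZs, no traffic  -> 90.00
nat_monthly_cost 2 1000   # 2 AZs, 1 TB/month  -> 135.00
```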
**Cost Comparison:**
- **Traditional EC2**: 40-60% more expensive
- **Managed Services**: 70-80% more expensive than self-managed
- **Multi-Cloud**: This deployment is 50-70% cheaper than equivalent GCP/Azure deployments
### Cost Monitoring & Alerts
- **AWS Cost Explorer**: Track spending by service
- **Billing Alerts**: Set up budget notifications
- **Resource Tagging**: Track costs by environment/team
- **Spot Instance Savings**: Monitor Spot vs On-Demand usage
### Additional Cost Optimization Tips
1. **Use Reserved Instances**: For DocumentDB if usage is predictable
2. **Enable S3 Intelligent Tiering**: Automatic storage class optimization
3. **Monitor Data Transfer**: Optimize between AZs and regions
4. **Regular Cleanup**: Remove unused resources and old backups
5. **Right-Size Databases**: Monitor and adjust instance types based on usage
## Support
For issues related to:
- **LibreChat**: [LibreChat GitHub](https://github.com/danny-avila/LibreChat)
- **AWS SAM**: [AWS SAM Documentation](https://docs.aws.amazon.com/serverless-application-model/)
- **This deployment**: Create an issue in this repository
## License
This deployment template is provided under the MIT License. LibreChat itself is licensed under the MIT License.

File diff suppressed because it is too large


@@ -0,0 +1,108 @@
# Minimal LibreChat config for AWS SAM deploy
# Copy this file to librechat.yaml and customize for your deployment.
# For full options, see: https://www.librechat.ai/docs/configuration/librechat_yaml
# Configuration version (required)
version: 1.2.8
# Cache settings
cache: true
# File storage configuration
fileStrategy: "s3"
# Transaction settings
transactions:
  enabled: true

interface:
  mcpServers:
    placeholder: "Select MCP Servers"
    use: true
    create: true
    share: true
    trustCheckbox:
      label: "I trust this server"
      subLabel: "Only enable servers you trust"
  privacyPolicy:
    externalUrl: "https://example.com/privacy"
    openNewTab: true
  termsOfService:
    externalUrl: "https://example.com/terms"
    openNewTab: true
    modalAcceptance: true
    modalTitle: "Terms of Service"
    modalContent: |
      # Terms of Service
      ## Introduction
      Welcome to LibreChat!
  modelSelect: true
  parameters: true
  sidePanel: true
  presets: false
  prompts: false
  bookmarks: false
  multiConvo: true
  agents: true
  customWelcome: "Welcome to LibreChat!"
  runCode: true
  webSearch: true
  fileSearch: true
  fileCitations: true

# MCP Servers Configuration (customize or add your own)
# Use env var placeholders for secrets, e.g. ${MCP_SOME_TOKEN}
mcpServers:
  # Example: third-party MCP
  # Deepwiki:
  #   url: "https://mcp.deepwiki.com/mcp"
  #   name: "DeepWiki"
  #   description: "DeepWiki MCP Server..."
  #   type: "streamable-http"
  # Example: your own MCP (replace with your API URL and token env var)
  # MyMcp:
  #   name: "My MCP Server"
  #   description: "Description of the server"
  #   url: "https://YOUR_API_ID.execute-api.YOUR_REGION.amazonaws.com/dev/mcp/your_mcp"
  #   type: "streamable-http"
  #   headers:
  #     Authorization: "Bearer ${MCP_MY_TOKEN}"

# Registration (optional)
# registration:
#   socialLogins: ['saml', 'github', 'google', 'openid', ...]
registration:
  socialLogins:
    - "saml"
    - "openid"
  # allowedDomains:
  #   - "example.edu"
  #   - "*.example.edu"

# Balance settings (optional)
balance:
  enabled: true
  startBalance: 650000
  autoRefillEnabled: true
  refillIntervalValue: 1440
  refillIntervalUnit: "minutes"
  refillAmount: 250000

# Custom endpoints (e.g. Bedrock)
endpoints:
  # bedrock:
  #   cache: true
  #   promptCache: true
  #   titleModel: "us.anthropic.claude-3-7-sonnet-20250219-v1:0"

# Model specs default model selection for new users
# modelSpecs:
#   prioritize: true
#   list:
#     - name: "my-default"
#       label: "My Default Model"
#       description: "Default model for new conversations"
#       default: true
#       preset:
#         endpoint: "bedrock"
#         model: "us.anthropic.claude-sonnet-4-5-20250929-v1:0"


@@ -0,0 +1,235 @@
# LibreChat Admin Scripts
This directory contains utility scripts for managing your LibreChat deployment.
## Managing Admin Users
### Grant Admin Permissions
To grant admin permissions to a user:
```bash
./scripts/make-admin.sh user@domain.edu
```
### Remove Admin Permissions
To remove admin permissions from a user (demote to regular user):
```bash
./scripts/make-admin.sh user@domain.edu --remove
```
### How It Works
The script:
1. Spins up a one-off ECS task using your existing task definition
2. Connects to MongoDB using the same credentials as your running application
3. Updates the user's role to ADMIN or USER
4. Waits for completion and reports success/failure
5. Automatically cleans up the task
The user will need to log out and log back in for changes to take effect.
## Managing User Balance
### Add Balance to a User
To add tokens to a user's balance:
```bash
./scripts/add-balance.sh user@domain.edu 1000
```
This will add 1000 tokens to the user's account.
### Requirements
- Balance must be enabled in `librechat.yaml`:
```yaml
balance:
  enabled: true
  startBalance: 600000
  autoRefillEnabled: true
  refillIntervalValue: 1440
  refillIntervalUnit: 'minutes'
  refillAmount: 100000
```
### How It Works
The script:
1. Validates that balance is enabled in your configuration
2. Finds the user by email
3. Creates a transaction record with the specified amount
4. Updates the user's balance
5. Reports the new balance
### Common Use Cases
```bash
# Give a new user initial credits
./scripts/add-balance.sh newuser@domain.edu 5000
# Top up a user who ran out
./scripts/add-balance.sh user@domain.edu 10000
# Grant bonus credits
./scripts/add-balance.sh poweruser@domain.edu 50000
```
## Manual AWS CLI Commands
If you prefer to run commands manually or need to troubleshoot:
### 1. Get your cluster and network configuration
```bash
# Load your deployment config
source .librechat-deploy-config
CLUSTER_NAME="${STACK_NAME}-cluster"
REGION="${REGION:-us-east-1}"
# Get network configuration from existing service
aws ecs describe-services \
--cluster "$CLUSTER_NAME" \
--services "${STACK_NAME}-service" \
--region "$REGION" \
--query 'services[0].networkConfiguration.awsvpcConfiguration'
```
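The helper scripts turn that JSON into the comma-separated lists `run-task` expects using two `jq` filters. The same transformation against a canned response (the subnet and security-group IDs are the placeholders used in step 2):

```bash
# Extract comma-separated subnet and security-group lists from the
# awsvpcConfiguration JSON returned in step 1 (canned example values).
SERVICE_INFO='{"subnets":["subnet-xxx","subnet-yyy"],"securityGroups":["sg-xxxxx"],"assignPublicIp":"DISABLED"}'
SUBNETS=$(echo "$SERVICE_INFO" | jq -r '.subnets | join(",")')
SECURITY_GROUPS=$(echo "$SERVICE_INFO" | jq -r '.securityGroups | join(",")')

echo "$SUBNETS"           # prints: subnet-xxx,subnet-yyy
echo "$SECURITY_GROUPS"   # prints: sg-xxxxx
```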
### 2. Run a one-off task to manage admin role
```bash
# Set the user email and action
USER_EMAIL="user@domain.edu"
TARGET_ROLE="ADMIN" # or "USER" to remove admin
# Get task definition
TASK_DEF=$(aws ecs describe-task-definition \
--task-definition "${STACK_NAME}-task" \
--region "$REGION" \
--query 'taskDefinition.taskDefinitionArn' \
--output text)
# Create the command
SHELL_CMD="cd /app/api && cat > manage-admin.js << 'EOFSCRIPT'
const path = require('path');
require('module-alias')({ base: path.resolve(__dirname) });
const mongoose = require('mongoose');
const { updateUser, findUser } = require('~/models');
(async () => {
try {
await mongoose.connect(process.env.MONGO_URI);
const user = await findUser({ email: '$USER_EMAIL' });
if (!user) {
console.error('User not found');
process.exit(1);
}
await updateUser(user._id, { role: '$TARGET_ROLE' });
console.log('User role updated to $TARGET_ROLE');
await mongoose.connection.close();
} catch (err) {
console.error('Error:', err.message);
process.exit(1);
}
})();
EOFSCRIPT
node manage-admin.js"
# Build JSON with jq
OVERRIDES=$(jq -n --arg cmd "$SHELL_CMD" '{
containerOverrides: [{
name: "librechat",
command: ["sh", "-c", $cmd]
}]
}')
# Run the task (replace SUBNETS and SECURITY_GROUPS with values from step 1)
aws ecs run-task \
--cluster "$CLUSTER_NAME" \
--task-definition "$TASK_DEF" \
--launch-type FARGATE \
--network-configuration "awsvpcConfiguration={subnets=[subnet-xxx,subnet-yyy],securityGroups=[sg-xxxxx],assignPublicIp=DISABLED}" \
--overrides "$OVERRIDES" \
--region "$REGION"
```
## Troubleshooting
### Task fails to start
- Check that your ECS service is running
- Verify network configuration (subnets, security groups)
- Check CloudWatch Logs: `/aws/ecs/${STACK_NAME}`
### User not found error
- Verify the email address is correct
- Check that the user has logged in at least once
- Email addresses are case-sensitive
### MongoDB connection fails
- Verify the MONGO_URI environment variable is set correctly in the task
- Check that the security group allows access to DocumentDB (port 27017)
- Ensure the task is running in the same VPC as DocumentDB
### Changes don't take effect
- User must log out and log back in for role changes to apply
- Check CloudWatch Logs to confirm the update was successful
- Verify the exit code was 0 (success)
### Balance not enabled error
- Ensure `balance.enabled: true` is set in `librechat.yaml`
- Restart your ECS service after updating the configuration
- Verify the config file is properly mounted in the container
### Invalid amount error
- Amount must be a positive integer
- Do not use decimals or negative numbers
- Example: `1000` not `1000.5` or `-1000`
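For reference, the amount check is a plain integer regex; a standalone sketch you can run locally:

```bash
# Same style of validation the scripts apply before running the ECS task:
# accept only unsigned integers (no signs, decimals, or letters).
validate_amount() {
  [[ "$1" =~ ^[0-9]+$ ]]
}

validate_amount 1000 && echo "1000 is valid"       # prints: 1000 is valid
validate_amount 1000.5 || echo "1000.5 rejected"   # prints: 1000.5 rejected
validate_amount -1000 || echo "-1000 rejected"     # prints: -1000 rejected
```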
## Security Notes
- These scripts use your existing task definition with all environment variables
- The MongoDB connection uses the same credentials as your running application
- Tasks run in your private subnets with no public IP
- All commands are logged to CloudWatch Logs
- One-off tasks automatically stop after completion
## Alternative: Use OpenID Groups (Recommended for Production)
Instead of manually managing admin users, consider using OpenID groups for automatic role assignment:
### Setup
1. **In AWS Cognito**, create a group called "admin"
2. **Add users** to that group through the Cognito console
3. **Configure LibreChat** (already done in `.env.local`):
```bash
OPENID_ADMIN_ROLE=admin
OPENID_ADMIN_ROLE_PARAMETER_PATH=cognito:groups
OPENID_ADMIN_ROLE_TOKEN_KIND=id_token
```
4. **Users automatically get admin permissions** on their next login
### Benefits
- No database access required
- Centralized user management in Cognito
- Automatic role assignment on login
- Easier to audit and manage at scale
- Role changes take effect immediately on next login
### When to Use the Script vs OpenID Groups
**Use the script when:**
- You need to quickly grant/revoke admin access
- You're troubleshooting or testing
- You have a one-time admin setup need
**Use OpenID groups when:**
- Managing multiple admins
- You want centralized access control
- You need audit trails through Cognito
- You want automatic role management


@@ -0,0 +1,208 @@
#!/bin/bash
# Script to add balance to a user by running a one-off ECS task
# Usage: ./scripts/add-balance.sh <user-email> <amount>
set -e
# Check if arguments are provided
if [ -z "$1" ] || [ -z "$2" ]; then
echo "Usage: $0 <user-email> <amount>"
echo ""
echo "Examples:"
echo " Add 1000 tokens: $0 user@domain.com 1000"
echo " Add 5000 tokens: $0 user@domain.com 5000"
echo ""
echo "Note: Balance must be enabled in librechat.yaml"
exit 1
fi
USER_EMAIL="$1"
AMOUNT="$2"
# Validate amount is a number
if ! [[ "$AMOUNT" =~ ^[0-9]+$ ]]; then
    echo "Error: Amount must be a positive integer (whole number, no sign or decimals)"
exit 1
fi
# Load configuration
if [ ! -f .librechat-deploy-config ]; then
echo "Error: .librechat-deploy-config not found"
exit 1
fi
source .librechat-deploy-config
# Set variables
CLUSTER_NAME="${STACK_NAME}-cluster"
TASK_FAMILY="${STACK_NAME}-task"
REGION="${REGION:-us-east-1}"
echo "=========================================="
echo "Adding balance to user: $USER_EMAIL"
echo "Amount: $AMOUNT tokens"
echo "Stack: $STACK_NAME"
echo "Region: $REGION"
echo "=========================================="
# Get VPC configuration from the existing service
echo "Getting network configuration from existing service..."
SERVICE_INFO=$(aws ecs describe-services \
--cluster "$CLUSTER_NAME" \
--services "${STACK_NAME}-service" \
--region "$REGION" \
--query 'services[0].networkConfiguration.awsvpcConfiguration' \
--output json)
SUBNETS=$(echo "$SERVICE_INFO" | jq -r '.subnets | join(",")')
SECURITY_GROUPS=$(echo "$SERVICE_INFO" | jq -r '.securityGroups | join(",")')
echo "Subnets: $SUBNETS"
echo "Security Groups: $SECURITY_GROUPS"
# Get the task definition
echo "Getting task definition..."
TASK_DEF=$(aws ecs describe-task-definition \
--task-definition "$TASK_FAMILY" \
--region "$REGION" \
--query 'taskDefinition.taskDefinitionArn' \
--output text)
echo "Task Definition: $TASK_DEF"
# Run the one-off task
echo "Starting ECS task to add balance..."
# Create a Node.js script that mimics the add-balance.js functionality
SHELL_CMD="cd /app/api && cat > add-balance-task.js << 'EOFSCRIPT'
// Setup module-alias like LibreChat does
const path = require('path');
require('module-alias')({ base: path.resolve(__dirname) });
const mongoose = require('mongoose');
const { getBalanceConfig } = require('@librechat/api');
const { User } = require('@librechat/data-schemas').createModels(mongoose);
const { createTransaction } = require('~/models/Transaction');
const { getAppConfig } = require('~/server/services/Config');
const email = '$USER_EMAIL';
const amount = $AMOUNT;
(async () => {
try {
// Connect to MongoDB
console.log('Connecting to MongoDB...');
await mongoose.connect(process.env.MONGO_URI);
console.log('Connected to MongoDB');
// Get app config and balance config
console.log('Loading configuration...');
const appConfig = await getAppConfig();
const balanceConfig = getBalanceConfig(appConfig);
if (!balanceConfig?.enabled) {
console.error('Error: Balance is not enabled. Use librechat.yaml to enable it');
await mongoose.connection.close();
process.exit(1);
}
// Find the user
console.log('Looking for user:', email);
const user = await User.findOne({ email }).lean();
if (!user) {
console.error('Error: No user with that email was found!');
await mongoose.connection.close();
process.exit(1);
}
console.log('Found user:', user.email);
// Create transaction and update balance
console.log('Creating transaction for', amount, 'tokens...');
const result = await createTransaction({
user: user._id,
tokenType: 'credits',
context: 'admin',
rawAmount: +amount,
balance: balanceConfig,
});
if (!result?.balance) {
console.error('Error: Something went wrong while updating the balance!');
await mongoose.connection.close();
process.exit(1);
}
// Success!
console.log('✅ Transaction created successfully!');
console.log('Amount added:', amount);
console.log('New balance:', result.balance);
await mongoose.connection.close();
process.exit(0);
} catch (err) {
console.error('Error:', err.message);
console.error(err.stack);
if (mongoose.connection.readyState === 1) {
await mongoose.connection.close();
}
process.exit(1);
}
})();
EOFSCRIPT
node add-balance-task.js"
# Build the overrides JSON using jq for proper escaping
OVERRIDES=$(jq -n \
--arg cmd "$SHELL_CMD" \
'{
containerOverrides: [{
name: "librechat",
command: ["sh", "-c", $cmd]
}]
}')
echo "Running command in container..."
TASK_ARN=$(aws ecs run-task \
--cluster "$CLUSTER_NAME" \
--task-definition "$TASK_DEF" \
--launch-type FARGATE \
--network-configuration "awsvpcConfiguration={subnets=[$SUBNETS],securityGroups=[$SECURITY_GROUPS],assignPublicIp=DISABLED}" \
--overrides "$OVERRIDES" \
--region "$REGION" \
--query 'tasks[0].taskArn' \
--output text)
echo "Task started: $TASK_ARN"
echo ""
echo "Waiting for task to complete..."
echo "You can monitor the task with:"
echo " aws ecs describe-tasks --cluster $CLUSTER_NAME --tasks $TASK_ARN --region $REGION"
echo ""
echo "Or view logs in CloudWatch Logs:"
echo " Log Group: /aws/ecs/${STACK_NAME}"
echo ""
# Wait for task to complete
aws ecs wait tasks-stopped \
--cluster "$CLUSTER_NAME" \
--tasks "$TASK_ARN" \
--region "$REGION"
# Check task exit code
EXIT_CODE=$(aws ecs describe-tasks \
--cluster "$CLUSTER_NAME" \
--tasks "$TASK_ARN" \
--region "$REGION" \
--query 'tasks[0].containers[0].exitCode' \
--output text)
if [ "$EXIT_CODE" = "0" ]; then
echo "✅ Success! Added $AMOUNT tokens to $USER_EMAIL"
echo "Check CloudWatch Logs for the new balance."
else
echo "❌ Task failed with exit code: $EXIT_CODE"
echo "Check CloudWatch Logs for details."
exit 1
fi


@@ -0,0 +1,201 @@
#!/bin/bash
# Script to flush Redis cache by running a one-off ECS task
# Usage: ./scripts/flush-redis-cache.sh
set -e
# Load configuration
if [ ! -f .librechat-deploy-config ]; then
echo "Error: .librechat-deploy-config not found"
exit 1
fi
source .librechat-deploy-config
# Set variables
CLUSTER_NAME="${STACK_NAME}-cluster"
TASK_FAMILY="${STACK_NAME}-task"
REGION="${REGION:-us-east-1}"
echo "=========================================="
echo "Flushing Redis Cache"
echo "Stack: $STACK_NAME"
echo "Region: $REGION"
echo "=========================================="
# Get VPC configuration from the existing service
echo "Getting network configuration from existing service..."
SERVICE_INFO=$(aws ecs describe-services \
--cluster "$CLUSTER_NAME" \
--services "${STACK_NAME}-service" \
--region "$REGION" \
--query 'services[0].networkConfiguration.awsvpcConfiguration' \
--output json)
SUBNETS=$(echo "$SERVICE_INFO" | jq -r '.subnets | join(",")')
SECURITY_GROUPS=$(echo "$SERVICE_INFO" | jq -r '.securityGroups | join(",")')
echo "Subnets: $SUBNETS"
echo "Security Groups: $SECURITY_GROUPS"
# Get the task definition
echo "Getting task definition..."
TASK_DEF=$(aws ecs describe-task-definition \
--task-definition "$TASK_FAMILY" \
--region "$REGION" \
--query 'taskDefinition.taskDefinitionArn' \
--output text)
echo "Task Definition: $TASK_DEF"
# Run the one-off task
echo "Starting ECS task to flush Redis cache..."
# Inline Node.js script to flush Redis cache
FLUSH_SCRIPT='
const IoRedis = require("ioredis");
const isEnabled = (value) => value === "true" || value === true;
async function flushRedis() {
try {
console.log("🔍 Connecting to Redis...");
const urls = (process.env.REDIS_URI || "").split(",").map((uri) => new URL(uri));
const username = urls[0]?.username || process.env.REDIS_USERNAME;
const password = urls[0]?.password || process.env.REDIS_PASSWORD;
const redisOptions = {
username: username,
password: password,
connectTimeout: 10000,
maxRetriesPerRequest: 3,
enableOfflineQueue: true,
lazyConnect: false,
};
const useCluster = urls.length > 1 || isEnabled(process.env.USE_REDIS_CLUSTER);
let redis;
if (useCluster) {
const clusterOptions = {
redisOptions,
enableOfflineQueue: true,
};
if (isEnabled(process.env.REDIS_USE_ALTERNATIVE_DNS_LOOKUP)) {
clusterOptions.dnsLookup = (address, callback) => callback(null, address);
}
redis = new IoRedis.Cluster(
urls.map((url) => ({ host: url.hostname, port: parseInt(url.port, 10) || 6379 })),
clusterOptions,
);
} else {
redis = new IoRedis(process.env.REDIS_URI, redisOptions);
}
await new Promise((resolve, reject) => {
const timeout = setTimeout(() => reject(new Error("Connection timeout")), 10000);
redis.once("ready", () => { clearTimeout(timeout); resolve(); });
redis.once("error", (err) => { clearTimeout(timeout); reject(err); });
});
console.log("✅ Connected to Redis");
let keyCount = 0;
try {
if (useCluster) {
const nodes = redis.nodes("master");
for (const node of nodes) {
const keys = await node.keys("*");
keyCount += keys.length;
}
} else {
const keys = await redis.keys("*");
keyCount = keys.length;
}
} catch (_error) {}
if (useCluster) {
const nodes = redis.nodes("master");
await Promise.all(nodes.map((node) => node.flushdb()));
console.log(`✅ Redis cluster cache flushed successfully (${nodes.length} master nodes)`);
} else {
await redis.flushdb();
console.log("✅ Redis cache flushed successfully");
}
if (keyCount > 0) {
console.log(` Deleted ${keyCount} keys`);
}
await redis.disconnect();
console.log("⚠️ Note: All users will need to re-authenticate");
process.exit(0);
} catch (error) {
console.error("❌ Error flushing Redis cache:", error.message);
process.exit(1);
}
}
flushRedis();
'
SHELL_CMD="cd /app && node -e '$FLUSH_SCRIPT'"
# Build the overrides JSON using jq for proper escaping
OVERRIDES=$(jq -n \
--arg cmd "$SHELL_CMD" \
'{
containerOverrides: [{
name: "librechat",
command: ["sh", "-c", $cmd]
}]
}')
echo "Running command in container..."
TASK_ARN=$(aws ecs run-task \
--cluster "$CLUSTER_NAME" \
--task-definition "$TASK_DEF" \
--launch-type FARGATE \
--network-configuration "awsvpcConfiguration={subnets=[$SUBNETS],securityGroups=[$SECURITY_GROUPS],assignPublicIp=DISABLED}" \
--overrides "$OVERRIDES" \
--region "$REGION" \
--query 'tasks[0].taskArn' \
--output text)
echo "Task started: $TASK_ARN"
echo ""
echo "Waiting for task to complete..."
echo "You can monitor the task with:"
echo " aws ecs describe-tasks --cluster $CLUSTER_NAME --tasks $TASK_ARN --region $REGION"
echo ""
echo "Or view logs in CloudWatch Logs:"
echo " Log Group: /ecs/${STACK_NAME}-task"
echo ""
# Wait for task to complete
aws ecs wait tasks-stopped \
--cluster "$CLUSTER_NAME" \
--tasks "$TASK_ARN" \
--region "$REGION"
# Check task exit code
EXIT_CODE=$(aws ecs describe-tasks \
--cluster "$CLUSTER_NAME" \
--tasks "$TASK_ARN" \
--region "$REGION" \
--query 'tasks[0].containers[0].exitCode' \
--output text)
if [ "$EXIT_CODE" = "0" ]; then
echo "✅ Success! Redis cache has been flushed."
echo ""
echo "⚠️ Note: All users will need to re-authenticate."
else
echo "❌ Task failed with exit code: $EXIT_CODE"
echo "Check CloudWatch Logs for details:"
echo " aws logs tail /ecs/${STACK_NAME}-task --follow --region $REGION"
exit 1
fi


@@ -0,0 +1,212 @@
#!/bin/bash
# Script to manage user admin role by running a one-off ECS task
# Usage: ./scripts/make-admin.sh <user-email> [--remove]
set -e
# Parse arguments
REMOVE_ADMIN=false
USER_EMAIL=""
while [[ $# -gt 0 ]]; do
case $1 in
--remove|-r)
REMOVE_ADMIN=true
shift
;;
*)
USER_EMAIL="$1"
shift
;;
esac
done
# Check if email is provided
if [ -z "$USER_EMAIL" ]; then
echo "Usage: $0 <user-email> [--remove]"
echo ""
echo "Examples:"
echo " Grant admin: $0 user@domain.com"
echo " Remove admin: $0 user@domain.com --remove"
exit 1
fi
# Load configuration
if [ ! -f .librechat-deploy-config ]; then
echo "Error: .librechat-deploy-config not found"
exit 1
fi
source .librechat-deploy-config
# Set variables
CLUSTER_NAME="${STACK_NAME}-cluster"
TASK_FAMILY="${STACK_NAME}-task"
REGION="${REGION:-us-east-1}"
if [ "$REMOVE_ADMIN" = true ]; then
ACTION="Removing admin role from"
TARGET_ROLE="USER"
else
ACTION="Granting admin role to"
TARGET_ROLE="ADMIN"
fi
echo "=========================================="
echo "$ACTION: $USER_EMAIL"
echo "Stack: $STACK_NAME"
echo "Region: $REGION"
echo "=========================================="
# Get VPC configuration from the existing service
echo "Getting network configuration from existing service..."
SERVICE_INFO=$(aws ecs describe-services \
--cluster "$CLUSTER_NAME" \
--services "${STACK_NAME}-service" \
--region "$REGION" \
--query 'services[0].networkConfiguration.awsvpcConfiguration' \
--output json)
SUBNETS=$(echo "$SERVICE_INFO" | jq -r '.subnets | join(",")')
SECURITY_GROUPS=$(echo "$SERVICE_INFO" | jq -r '.securityGroups | join(",")')
echo "Subnets: $SUBNETS"
echo "Security Groups: $SECURITY_GROUPS"
# Get the task definition
echo "Getting task definition..."
TASK_DEF=$(aws ecs describe-task-definition \
--task-definition "$TASK_FAMILY" \
--region "$REGION" \
--query 'taskDefinition.taskDefinitionArn' \
--output text)
echo "Task Definition: $TASK_DEF"
# Run the one-off task
echo "Starting ECS task to update user role..."
# Create a Node.js script that uses LibreChat's models with proper module-alias setup
SHELL_CMD="cd /app/api && cat > manage-admin.js << 'EOFSCRIPT'
// Setup module-alias like LibreChat does
const path = require('path');
require('module-alias')({ base: path.resolve(__dirname) });
const mongoose = require('mongoose');
const { updateUser, findUser } = require('~/models');
const targetRole = '$TARGET_ROLE';
(async () => {
try {
// Connect to MongoDB
console.log('Connecting to MongoDB...');
await mongoose.connect(process.env.MONGO_URI);
console.log('Connected to MongoDB');
// Find the user by email
console.log('Looking for user: $USER_EMAIL');
const user = await findUser({ email: '$USER_EMAIL' });
if (!user) {
console.error('User not found: $USER_EMAIL');
await mongoose.connection.close();
process.exit(1);
}
console.log('Found user:', user.email, 'Current role:', user.role);
// Check if already has target role
if (user.role === targetRole) {
console.log('User already has ' + targetRole + ' role');
await mongoose.connection.close();
process.exit(0);
}
// Update the user role
console.log('Updating user role to ' + targetRole + '...');
const result = await updateUser(user._id, { role: targetRole });
if (result) {
if (targetRole === 'ADMIN') {
console.log('✅ User $USER_EMAIL granted ADMIN role successfully');
} else {
console.log('✅ User $USER_EMAIL removed from ADMIN role successfully');
}
await mongoose.connection.close();
process.exit(0);
} else {
console.error('Failed to update user role');
await mongoose.connection.close();
process.exit(1);
}
} catch (err) {
console.error('Error:', err.message);
console.error(err.stack);
if (mongoose.connection.readyState === 1) {
await mongoose.connection.close();
}
process.exit(1);
}
})();
EOFSCRIPT
node manage-admin.js"
# Build the overrides JSON using jq for proper escaping
OVERRIDES=$(jq -n \
--arg cmd "$SHELL_CMD" \
'{
containerOverrides: [{
name: "librechat",
command: ["sh", "-c", $cmd]
}]
}')
echo "Running command in container..."
TASK_ARN=$(aws ecs run-task \
--cluster "$CLUSTER_NAME" \
--task-definition "$TASK_DEF" \
--launch-type FARGATE \
--network-configuration "awsvpcConfiguration={subnets=[$SUBNETS],securityGroups=[$SECURITY_GROUPS],assignPublicIp=DISABLED}" \
--overrides "$OVERRIDES" \
--region "$REGION" \
--query 'tasks[0].taskArn' \
--output text)
echo "Task started: $TASK_ARN"
echo ""
echo "Waiting for task to complete..."
echo "You can monitor the task with:"
echo " aws ecs describe-tasks --cluster $CLUSTER_NAME --tasks $TASK_ARN --region $REGION"
echo ""
echo "Or view logs in CloudWatch Logs:"
echo " Log Group: /aws/ecs/${STACK_NAME}"
echo ""
# Wait for task to complete
aws ecs wait tasks-stopped \
--cluster "$CLUSTER_NAME" \
--tasks "$TASK_ARN" \
--region "$REGION"
# Check task exit code
EXIT_CODE=$(aws ecs describe-tasks \
--cluster "$CLUSTER_NAME" \
--tasks "$TASK_ARN" \
--region "$REGION" \
--query 'tasks[0].containers[0].exitCode' \
--output text)
if [ "$EXIT_CODE" = "0" ]; then
if [ "$REMOVE_ADMIN" = true ]; then
echo "✅ Success! User $USER_EMAIL has been removed from admin role."
else
echo "✅ Success! User $USER_EMAIL has been granted admin permissions."
fi
echo "The user will need to log out and log back in for changes to take effect."
else
echo "❌ Task failed with exit code: $EXIT_CODE"
echo "Check CloudWatch Logs for details."
exit 1
fi


@@ -0,0 +1,153 @@
#!/bin/bash
# Script to manually scale LibreChat ECS service
# Usage: ./scale-service.sh [stack-name] [desired-count]
set -e
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
# Default values
STACK_NAME="${1:-librechat}"
DESIRED_COUNT="${2}"
REGION="${AWS_DEFAULT_REGION:-us-east-1}"
# Function to print colored output
print_status() {
echo -e "${BLUE}[INFO]${NC} $1"
}
print_success() {
echo -e "${GREEN}[SUCCESS]${NC} $1"
}
print_warning() {
echo -e "${YELLOW}[WARNING]${NC} $1"
}
print_error() {
echo -e "${RED}[ERROR]${NC} $1"
}
# Show usage if desired count not provided
if [[ -z "$DESIRED_COUNT" ]]; then
echo "Usage: $0 [stack-name] [desired-count]"
echo ""
echo "Examples:"
echo " $0 librechat 5 # Scale to 5 instances"
echo " $0 librechat-dev 1 # Scale dev environment to 1 instance"
exit 1
fi
# Validate desired count is a number
if ! [[ "$DESIRED_COUNT" =~ ^[0-9]+$ ]]; then
print_error "Desired count must be a number"
exit 1
fi
# Check if AWS CLI is available
if ! command -v aws &> /dev/null; then
print_error "AWS CLI is not installed"
exit 1
fi
# Check AWS credentials
if ! aws sts get-caller-identity &> /dev/null; then
print_error "AWS credentials not configured"
exit 1
fi
print_status "Scaling LibreChat service..."
print_status "Stack: $STACK_NAME"
print_status "Desired Count: $DESIRED_COUNT"
print_status "Region: $REGION"
# Get cluster and service names from CloudFormation
CLUSTER_NAME=$(aws cloudformation describe-stacks \
--stack-name "$STACK_NAME" \
--region "$REGION" \
--query 'Stacks[0].Outputs[?OutputKey==`ECSClusterName`].OutputValue' \
--output text)
SERVICE_NAME=$(aws cloudformation describe-stacks \
--stack-name "$STACK_NAME" \
--region "$REGION" \
--query 'Stacks[0].Outputs[?OutputKey==`ECSServiceName`].OutputValue' \
--output text)
if [[ -z "$CLUSTER_NAME" || -z "$SERVICE_NAME" ]]; then
print_error "Could not find ECS cluster or service in stack $STACK_NAME"
exit 1
fi
print_status "Cluster: $CLUSTER_NAME"
print_status "Service: $SERVICE_NAME"
# Get current service status
CURRENT_STATUS=$(aws ecs describe-services \
--cluster "$CLUSTER_NAME" \
--services "$SERVICE_NAME" \
--region "$REGION" \
--query 'services[0].{
RunningCount: runningCount,
PendingCount: pendingCount,
DesiredCount: desiredCount
}')
print_status "Current service status:"
echo "$CURRENT_STATUS" | jq .
CURRENT_DESIRED=$(echo "$CURRENT_STATUS" | jq -r '.DesiredCount')
if [[ "$CURRENT_DESIRED" == "$DESIRED_COUNT" ]]; then
print_warning "Service is already scaled to $DESIRED_COUNT instances"
exit 0
fi
# Update the service desired count
print_status "Scaling service from $CURRENT_DESIRED to $DESIRED_COUNT instances..."
aws ecs update-service \
--cluster "$CLUSTER_NAME" \
--service "$SERVICE_NAME" \
--desired-count "$DESIRED_COUNT" \
--region "$REGION" \
--query 'service.serviceName' \
--output text
# Wait for deployment to stabilize
print_status "Waiting for service to stabilize..."
aws ecs wait services-stable \
--cluster "$CLUSTER_NAME" \
--services "$SERVICE_NAME" \
--region "$REGION"
print_success "Service scaling completed successfully!"
# Show final service status
print_status "Final service status:"
aws ecs describe-services \
--cluster "$CLUSTER_NAME" \
--services "$SERVICE_NAME" \
--region "$REGION" \
--query 'services[0].{
ServiceName: serviceName,
Status: status,
RunningCount: runningCount,
PendingCount: pendingCount,
DesiredCount: desiredCount
}' \
--output table
# Show running tasks
print_status "Running tasks:"
aws ecs list-tasks \
--cluster "$CLUSTER_NAME" \
--service-name "$SERVICE_NAME" \
--region "$REGION" \
--query 'taskArns' \
--output table


@@ -0,0 +1,130 @@
#!/bin/bash
# Simple config upload script - replaces the complex Python approach
# Usage: ./simple-config-upload.sh <stack-name> [region] [config-file]
set -e
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
print_status() {
echo -e "${BLUE}[INFO]${NC} $1"
}
print_success() {
echo -e "${GREEN}[SUCCESS]${NC} $1"
}
print_error() {
echo -e "${RED}[ERROR]${NC} $1"
}
# Parameters
STACK_NAME="$1"
REGION="${2:-us-east-1}"
CONFIG_FILE="${3:-librechat.yaml}"
if [[ -z "$STACK_NAME" ]]; then
print_error "Usage: $0 <stack-name> [region] [config-file]"
exit 1
fi
print_status "Uploading config for stack: $STACK_NAME"
print_status "Region: $REGION"
print_status "Config file: $CONFIG_FILE"
# Get S3 bucket name from CloudFormation outputs
print_status "Getting S3 bucket name from stack outputs..."
BUCKET_NAME=$(aws cloudformation describe-stacks \
--stack-name "$STACK_NAME" \
--region "$REGION" \
--query 'Stacks[0].Outputs[?OutputKey==`S3BucketName`].OutputValue' \
--output text 2>/dev/null)
if [[ -z "$BUCKET_NAME" ]]; then
print_error "Could not find S3BucketName in stack outputs"
exit 1
fi
print_success "Found S3 bucket: $BUCKET_NAME"
# Upload config file to S3
if [[ -f "$CONFIG_FILE" ]]; then
print_status "Uploading $CONFIG_FILE to S3..."
    # Test the command directly: with `set -e`, a separate `$?` check is never reached on failure
    if aws s3 cp "$CONFIG_FILE" "s3://$BUCKET_NAME/configs/librechat.yaml" \
        --content-type "application/x-yaml" \
        --region "$REGION"; then
        print_success "Configuration uploaded to s3://$BUCKET_NAME/configs/librechat.yaml"
    else
        print_error "Failed to upload configuration to S3"
        exit 1
    fi
else
print_error "Config file not found: $CONFIG_FILE"
exit 1
fi
# Trigger Config Manager Lambda to copy S3 → EFS
LAMBDA_NAME="${STACK_NAME}-config-manager"
print_status "Triggering config manager Lambda: $LAMBDA_NAME"
# Test the command directly: with `set -e`, a separate `$?` check is never reached on failure.
# Note: AWS CLI v2 may additionally require --cli-binary-format raw-in-base64-out for --payload.
if aws lambda invoke \
    --function-name "$LAMBDA_NAME" \
    --region "$REGION" \
    --payload '{}' \
    /tmp/lambda-response.json >/dev/null 2>&1; then
    print_success "Config manager Lambda executed successfully"
    print_status "Configuration has been copied to EFS"
else
    print_error "Could not invoke config manager Lambda"
    exit 1
fi
# Force ECS service to restart containers
print_status "Getting ECS cluster and service information..."
CLUSTER_NAME=$(aws cloudformation describe-stacks \
--stack-name "$STACK_NAME" \
--region "$REGION" \
--query 'Stacks[0].Outputs[?OutputKey==`ECSClusterName`].OutputValue' \
--output text 2>/dev/null)
SERVICE_NAME=$(aws cloudformation describe-stacks \
--stack-name "$STACK_NAME" \
--region "$REGION" \
--query 'Stacks[0].Outputs[?OutputKey==`ECSServiceName`].OutputValue' \
--output text 2>/dev/null)
if [[ -n "$CLUSTER_NAME" && -n "$SERVICE_NAME" ]]; then
print_status "Restarting ECS containers to pick up new config..."
print_status "Cluster: $CLUSTER_NAME"
print_status "Service: $SERVICE_NAME"
    # Test the command directly: with `set -e`, a separate `$?` check is never reached on failure
    if aws ecs update-service \
        --cluster "$CLUSTER_NAME" \
        --service "$SERVICE_NAME" \
        --region "$REGION" \
        --force-new-deployment >/dev/null 2>&1; then
        print_success "ECS service restart initiated"
        print_status "Containers will restart with the new configuration"
        print_status "This may take a few minutes to complete"
    else
        print_error "Could not restart ECS service"
        exit 1
    fi
else
print_error "Could not find ECS cluster/service information"
exit 1
fi
print_success "Configuration update completed successfully!"


@@ -0,0 +1,142 @@
#!/bin/bash
# Script to update LibreChat ECS service with new image version
# Usage: ./update-service.sh [stack-name] [image-tag]
set -e
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
# Default values
STACK_NAME="${1:-librechat}"
IMAGE_TAG="${2:-latest}"
REGION="${AWS_DEFAULT_REGION:-us-east-1}"
# Function to print colored output
print_status() {
echo -e "${BLUE}[INFO]${NC} $1"
}
print_success() {
echo -e "${GREEN}[SUCCESS]${NC} $1"
}
print_warning() {
echo -e "${YELLOW}[WARNING]${NC} $1"
}
print_error() {
echo -e "${RED}[ERROR]${NC} $1"
}
# Check if AWS CLI is available
if ! command -v aws &> /dev/null; then
print_error "AWS CLI is not installed"
exit 1
fi
# Check AWS credentials
if ! aws sts get-caller-identity &> /dev/null; then
print_error "AWS credentials not configured"
exit 1
fi
print_status "Updating LibreChat service..."
print_status "Stack: $STACK_NAME"
print_status "Image Tag: $IMAGE_TAG"
print_status "Region: $REGION"
# Get cluster and service names from CloudFormation
CLUSTER_NAME=$(aws cloudformation describe-stacks \
--stack-name "$STACK_NAME" \
--region "$REGION" \
--query 'Stacks[0].Outputs[?OutputKey==`ECSClusterName`].OutputValue' \
--output text)
SERVICE_NAME=$(aws cloudformation describe-stacks \
--stack-name "$STACK_NAME" \
--region "$REGION" \
--query 'Stacks[0].Outputs[?OutputKey==`ECSServiceName`].OutputValue' \
--output text)
if [[ -z "$CLUSTER_NAME" || -z "$SERVICE_NAME" ]]; then
print_error "Could not find ECS cluster or service in stack $STACK_NAME"
exit 1
fi
print_status "Cluster: $CLUSTER_NAME"
print_status "Service: $SERVICE_NAME"
# Get current task definition
TASK_DEF_ARN=$(aws ecs describe-services \
--cluster "$CLUSTER_NAME" \
--services "$SERVICE_NAME" \
--region "$REGION" \
--query 'services[0].taskDefinition' \
--output text)
print_status "Current task definition: $TASK_DEF_ARN"
# Get task definition details
TASK_DEF=$(aws ecs describe-task-definition \
--task-definition "$TASK_DEF_ARN" \
--region "$REGION" \
--query 'taskDefinition')
# Update the image in the task definition
NEW_IMAGE="ghcr.io/danny-avila/librechat:$IMAGE_TAG"
UPDATED_TASK_DEF=$(echo "$TASK_DEF" | jq --arg image "$NEW_IMAGE" '
.containerDefinitions[0].image = $image |
del(.taskDefinitionArn, .revision, .status, .requiresAttributes, .placementConstraints, .compatibilities, .registeredAt, .registeredBy)
')
print_status "Updating image to: $NEW_IMAGE"
# Register new task definition
NEW_TASK_DEF_ARN=$(echo "$UPDATED_TASK_DEF" | aws ecs register-task-definition \
--region "$REGION" \
--cli-input-json file:///dev/stdin \
--query 'taskDefinition.taskDefinitionArn' \
--output text)
print_status "New task definition: $NEW_TASK_DEF_ARN"
# Update the service
print_status "Updating ECS service..."
aws ecs update-service \
--cluster "$CLUSTER_NAME" \
--service "$SERVICE_NAME" \
--task-definition "$NEW_TASK_DEF_ARN" \
--region "$REGION" \
--query 'service.serviceName' \
--output text
# Wait for deployment to complete
print_status "Waiting for deployment to complete..."
aws ecs wait services-stable \
--cluster "$CLUSTER_NAME" \
--services "$SERVICE_NAME" \
--region "$REGION"
print_success "Service update completed successfully!"
# Show service status
print_status "Service status:"
aws ecs describe-services \
--cluster "$CLUSTER_NAME" \
--services "$SERVICE_NAME" \
--region "$REGION" \
--query 'services[0].{
ServiceName: serviceName,
Status: status,
RunningCount: runningCount,
PendingCount: pendingCount,
DesiredCount: desiredCount,
TaskDefinition: taskDefinition
}' \
--output table


@@ -0,0 +1,239 @@
import json
import logging
import os
import boto3
import urllib3
from botocore.exceptions import ClientError
# Configure logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
# Initialize AWS clients
s3_client = boto3.client('s3')
def lambda_handler(event, context):
"""
Lambda function to copy configuration files from S3 to EFS.
Handles both CloudFormation custom resource lifecycle events and direct invocations.
"""
logger.info(f"Received event: {json.dumps(event, default=str)}")
# Check if this is a CloudFormation custom resource call or direct invocation
is_cloudformation = 'RequestType' in event and 'ResourceProperties' in event
if is_cloudformation:
# CloudFormation custom resource call
request_type = event.get('RequestType')
resource_properties = event.get('ResourceProperties', {})
s3_bucket = resource_properties.get('S3Bucket')
s3_key = resource_properties.get('S3Key', 'configs/librechat.yaml')
else:
# Direct invocation - get parameters from environment or event
logger.info("Direct invocation detected - processing config update")
request_type = 'Update' # Treat direct calls as updates
s3_bucket = event.get('S3Bucket') or get_s3_bucket_from_environment()
s3_key = event.get('S3Key', 'configs/librechat.yaml')
# Configuration
efs_mount_path = os.environ.get('EFS_MOUNT_PATH', '/mnt/efs')
efs_file_path = os.path.join(efs_mount_path, 'librechat.yaml')
response_data = {}
try:
if request_type in ['Create', 'Update']:
logger.info(f"Processing {request_type} request")
# Validate required parameters
if not s3_bucket:
raise ValueError("S3Bucket is required - either in ResourceProperties or environment")
# Ensure EFS mount directory exists
os.makedirs(efs_mount_path, exist_ok=True)
logger.info(f"EFS mount path ready: {efs_mount_path}")
# Download file from S3
logger.info(f"Downloading s3://{s3_bucket}/{s3_key}")
used_default_config = False
try:
s3_response = s3_client.get_object(Bucket=s3_bucket, Key=s3_key)
file_content = s3_response['Body'].read()
logger.info(f"Successfully downloaded {len(file_content)} bytes from S3")
except ClientError as e:
error_code = e.response['Error']['Code']
if error_code == 'NoSuchKey':
logger.warning(f"Configuration file not found: s3://{s3_bucket}/{s3_key}")
logger.info("Creating default configuration file on EFS")
used_default_config = True
# Create a minimal default config if the file doesn't exist
file_content = b"""# Default LibreChat Configuration
# This file was created automatically because no custom config was found
version: 1.2.8
cache: false
interface:
customWelcome: ""
"""
elif error_code == 'NoSuchBucket':
logger.warning(f"S3 bucket not found: {s3_bucket}")
logger.info("Creating default configuration file on EFS")
used_default_config = True
# Create a minimal default config if the bucket doesn't exist
file_content = b"""# Default LibreChat Configuration
# This file was created automatically because S3 bucket was not accessible
version: 1.2.8
cache: false
interface:
customWelcome: "Welcome to LibreChat! (Using Default Config - S3 Bucket Not Found)"
"""
elif error_code == 'AccessDenied':
logger.warning(f"Access denied to S3: s3://{s3_bucket}/{s3_key}")
logger.info("Creating default configuration file on EFS")
used_default_config = True
# Create a minimal default config if access is denied
file_content = b"""# Default LibreChat Configuration
# This file was created automatically because S3 access was denied
version: 1.2.8
cache: false
interface:
customWelcome: "Welcome to LibreChat! (Using Default Config - S3 Access Denied)"
"""
else:
raise ValueError(f"Failed to download from S3: {str(e)}")
# Write file to EFS
logger.info(f"Writing file to EFS: {efs_file_path}")
with open(efs_file_path, 'wb') as f:
f.write(file_content)
# Set appropriate file permissions (readable by all, writable by owner)
os.chmod(efs_file_path, 0o644)
logger.info(f"Set file permissions to 644 for {efs_file_path}")
# Verify file was written correctly
if os.path.exists(efs_file_path):
file_size = os.path.getsize(efs_file_path)
logger.info(f"File successfully written to EFS: {file_size} bytes")
response_data['FileSize'] = file_size
response_data['EFSPath'] = efs_file_path
response_data['UsedDefaultConfig'] = used_default_config
# For direct invocations, return success immediately
if not is_cloudformation:
logger.info("Direct invocation completed successfully")
return {
'statusCode': 200,
'body': json.dumps({
'message': 'Configuration updated successfully',
'fileSize': file_size,
'efsPath': efs_file_path,
'usedDefaultConfig': used_default_config
})
}
else:
raise RuntimeError("File was not created on EFS")
elif request_type == 'Delete':
logger.info("Processing Delete request")
# For delete operations, we could optionally remove the file
# but it's safer to leave it in place for potential rollbacks
if os.path.exists(efs_file_path):
logger.info(f"Configuration file exists at {efs_file_path} (leaving in place)")
else:
logger.info("Configuration file not found (already removed or never created)")
# Send success response to CloudFormation (only for CF calls)
if is_cloudformation:
send_response(event, context, 'SUCCESS', response_data)
except Exception as e:
logger.error(f"Error processing request: {str(e)}", exc_info=True)
# Handle errors differently for CF vs direct calls
if is_cloudformation:
send_response(event, context, 'FAILED', {'Error': str(e)})
else:
# For direct invocations, return error response
return {
'statusCode': 500,
'body': json.dumps({
'error': str(e),
'message': 'Configuration update failed'
})
}
raise
def get_s3_bucket_from_environment():
"""
Try to determine the S3 bucket name from the Lambda function's environment.
This is used for direct invocations when the bucket isn't provided in the event.
Prefers S3_BUCKET_NAME (set by the template) to avoid needing CloudFormation permissions.
"""
# Prefer environment variable (set by CloudFormation template; no extra IAM needed)
bucket_name = os.environ.get('S3_BUCKET_NAME')
if bucket_name:
logger.info(f"Found S3 bucket from environment: {bucket_name}")
return bucket_name
# Fallback: try to get from CloudFormation stack outputs (requires cloudformation:DescribeStacks)
function_name = os.environ.get('AWS_LAMBDA_FUNCTION_NAME', '')
if function_name.endswith('-config-manager'):
stack_name = function_name[:-15] # Remove '-config-manager'
try:
cf_client = boto3.client('cloudformation')
response = cf_client.describe_stacks(StackName=stack_name)
outputs = response['Stacks'][0].get('Outputs', [])
for output in outputs:
if output['OutputKey'] == 'S3BucketName':
bucket_name = output['OutputValue']
logger.info(f"Found S3 bucket from CloudFormation: {bucket_name}")
return bucket_name
except Exception as e:
logger.warning(f"Could not get S3 bucket from CloudFormation: {str(e)}")
logger.warning("Could not determine S3 bucket name")
return None
def send_response(event, context, response_status, response_data):
"""
Send response to CloudFormation custom resource.
"""
response_url = event.get('ResponseURL')
if not response_url:
logger.warning("No ResponseURL provided - this may be a test invocation")
return
# Prepare response payload
response_body = {
'Status': response_status,
'Reason': f'See CloudWatch Log Stream: {context.log_stream_name}',
'PhysicalResourceId': event.get('LogicalResourceId', 'ConfigManagerResource'),
'StackId': event.get('StackId'),
'RequestId': event.get('RequestId'),
'LogicalResourceId': event.get('LogicalResourceId'),
'Data': response_data
}
json_response_body = json.dumps(response_body)
logger.info(f"Sending response to CloudFormation: {response_status}")
logger.debug(f"Response body: {json_response_body}")
try:
# Send HTTP PUT request to CloudFormation
http = urllib3.PoolManager()
response = http.request(
'PUT',
response_url,
body=json_response_body,
headers={
'Content-Type': 'application/json',
'Content-Length': str(len(json_response_body))
}
)
logger.info(f"CloudFormation response status: {response.status}")
except Exception as e:
logger.error(f"Failed to send response to CloudFormation: {str(e)}")
raise
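The handler's dual-mode dispatch hinges on a single shape check: CloudFormation custom resource events carry both `RequestType` and `ResourceProperties`, while direct invocations do not. That detection can be exercised standalone; the sample events below are illustrative:

```python
def detect_invocation(event: dict) -> str:
    """Mirror the handler's check: an event with both RequestType and
    ResourceProperties is a CloudFormation custom resource call;
    anything else is treated as a direct invocation."""
    if 'RequestType' in event and 'ResourceProperties' in event:
        return 'cloudformation'
    return 'direct'

cf_event = {
    "RequestType": "Create",
    "ResourceProperties": {
        "S3Bucket": "example-config-bucket",
        "S3Key": "configs/librechat.yaml",
    },
}
direct_event = {"S3Bucket": "example-config-bucket"}

print(detect_invocation(cf_event))      # cloudformation
print(detect_invocation(direct_event))  # direct
```

Because direct calls are mapped to `Update`, an operator can push a new `librechat.yaml` to S3 and invoke the function with just an `S3Bucket` key (or nothing at all, falling back to the `S3_BUCKET_NAME` environment variable) to refresh the file on EFS.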


@@ -0,0 +1,2 @@
boto3>=1.26.0
urllib3>=1.26.0


@@ -0,0 +1,135 @@
import json
import logging
import boto3
import urllib3
import time
from botocore.exceptions import ClientError
# Configure logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
# Initialize AWS clients
efs_client = boto3.client('efs')
def lambda_handler(event, context):
"""
Lambda function to wait for EFS mount targets to be available.
This ensures mount targets are ready before other resources try to use them.
"""
logger.info(f"Received event: {json.dumps(event, default=str)}")
# Extract CloudFormation custom resource properties
request_type = event.get('RequestType')
resource_properties = event.get('ResourceProperties', {})
# Configuration
file_system_id = resource_properties.get('FileSystemId')
response_data = {}
try:
if request_type in ['Create', 'Update']:
logger.info(f"Processing {request_type} request")
# Validate required parameters
if not file_system_id:
raise ValueError("FileSystemId is required in ResourceProperties")
# Wait for mount targets to be available
logger.info(f"Waiting for mount targets to be available for EFS: {file_system_id}")
max_wait_time = 300 # 5 minutes
start_time = time.time()
while time.time() - start_time < max_wait_time:
try:
# Get mount targets for the file system
response = efs_client.describe_mount_targets(FileSystemId=file_system_id)
mount_targets = response.get('MountTargets', [])
if not mount_targets:
logger.info("No mount targets found yet, waiting...")
time.sleep(10)
continue
# Check if all mount targets are available
all_available = True
for mt in mount_targets:
state = mt.get('LifeCycleState')
logger.info(f"Mount target {mt.get('MountTargetId')} state: {state}")
if state != 'available':
all_available = False
break
if all_available:
logger.info("All mount targets are available!")
response_data['MountTargetsReady'] = True
response_data['MountTargetCount'] = len(mount_targets)
break
else:
logger.info("Some mount targets are not ready yet, waiting...")
time.sleep(10)
except ClientError as e:
logger.warning(f"Error checking mount targets: {e}")
time.sleep(10)
else:
# Timeout reached
raise RuntimeError(f"Mount targets did not become available within {max_wait_time} seconds")
elif request_type == 'Delete':
logger.info("Processing Delete request - nothing to do")
response_data['Status'] = 'Deleted'
# Send success response to CloudFormation
send_response(event, context, 'SUCCESS', response_data)
except Exception as e:
logger.error(f"Error processing request: {str(e)}", exc_info=True)
# Send failure response to CloudFormation
send_response(event, context, 'FAILED', {'Error': str(e)})
raise
def send_response(event, context, response_status, response_data):
"""
Send response to CloudFormation custom resource.
"""
response_url = event.get('ResponseURL')
if not response_url:
logger.warning("No ResponseURL provided - this may be a test invocation")
return
# Prepare response payload
response_body = {
'Status': response_status,
'Reason': f'See CloudWatch Log Stream: {context.log_stream_name}',
'PhysicalResourceId': event.get('LogicalResourceId', 'MountTargetWaiterResource'),
'StackId': event.get('StackId'),
'RequestId': event.get('RequestId'),
'LogicalResourceId': event.get('LogicalResourceId'),
'Data': response_data
}
json_response_body = json.dumps(response_body)
logger.info(f"Sending response to CloudFormation: {response_status}")
logger.debug(f"Response body: {json_response_body}")
try:
# Send HTTP PUT request to CloudFormation
http = urllib3.PoolManager()
response = http.request(
'PUT',
response_url,
body=json_response_body,
headers={
'Content-Type': 'application/json',
'Content-Length': str(len(json_response_body))
}
)
logger.info(f"CloudFormation response status: {response.status}")
except Exception as e:
logger.error(f"Failed to send response to CloudFormation: {str(e)}")
raise
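The waiter's loop is a poll-until-timeout pattern: probe, sleep, retry, and fail once a deadline passes (the `while`/`else` raises only when the loop exhausts without a `break`). A condensed, dependency-free sketch with the probe injected, so it runs without AWS:

```python
import time

def wait_until(probe, max_wait=300, interval=10):
    """Poll probe() until it returns a truthy value or max_wait seconds
    elapse, mirroring the mount-target loop above."""
    start = time.time()
    while time.time() - start < max_wait:
        if probe():
            return True
        time.sleep(interval)
    raise RuntimeError(f"Condition not met within {max_wait} seconds")

# Simulate mount targets reaching 'available' on the third poll
states = iter([["creating"], ["creating", "available"], ["available", "available"]])
probe = lambda: all(s == "available" for s in next(states))
print(wait_until(probe, interval=0))  # True
```

Injecting the probe keeps the timeout logic testable; the real handler's probe is the `describe_mount_targets` call plus the all-`available` check.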


@@ -0,0 +1,2 @@
boto3>=1.26.0
urllib3>=1.26.0


@@ -0,0 +1,97 @@
"""
CloudFormation custom resource: add this stack's ECS security group to an existing
Secrets Manager VPC endpoint's security group so ECS tasks can pull secrets.
Runs during stack create/update (after ECSSecurityGroup exists, before ECS Service).
"""
import json
import logging
import urllib3
import boto3
from botocore.exceptions import ClientError
logger = logging.getLogger()
logger.setLevel(logging.INFO)
ec2 = boto3.client('ec2')
def lambda_handler(event, context):
request_type = event.get('RequestType')
props = event.get('ResourceProperties', {})
endpoint_sg_id = (props.get('EndpointSecurityGroupId') or '').strip()
ecs_sg_id = (props.get('EcsSecurityGroupId') or '').strip()
response_data = {}
try:
if request_type in ('Create', 'Update'):
if endpoint_sg_id and ecs_sg_id:
logger.info(
"Adding ingress to endpoint SG %s: TCP 443 from ECS SG %s",
endpoint_sg_id, ecs_sg_id
)
try:
ec2.authorize_security_group_ingress(
GroupId=endpoint_sg_id,
IpPermissions=[{
'IpProtocol': 'tcp',
'FromPort': 443,
'ToPort': 443,
'UserIdGroupPairs': [{'GroupId': ecs_sg_id}],
}],
)
response_data['RuleAdded'] = 'true'
except ClientError as e:
if e.response['Error']['Code'] == 'InvalidPermission.Duplicate':
logger.info("Rule already exists, no change")
response_data['RuleAdded'] = 'already_exists'
else:
raise
else:
logger.info(
"EndpointSecurityGroupId or EcsSecurityGroupId empty; skipping (no-op)"
)
elif request_type == 'Delete':
if endpoint_sg_id and ecs_sg_id:
try:
ec2.revoke_security_group_ingress(
GroupId=endpoint_sg_id,
IpPermissions=[{
'IpProtocol': 'tcp',
'FromPort': 443,
'ToPort': 443,
'UserIdGroupPairs': [{'GroupId': ecs_sg_id}],
}],
)
response_data['RuleRevoked'] = 'true'
except ClientError as e:
if e.response['Error']['Code'] in (
'InvalidPermission.NotFound', 'InvalidGroup.NotFound'
):
logger.info("Rule or group already gone, ignoring")
else:
logger.warning("Revoke failed (non-fatal): %s", e)
send_response(event, context, 'SUCCESS', response_data)
except Exception as e:
logger.error("Error: %s", e, exc_info=True)
send_response(event, context, 'FAILED', {'Error': str(e)})
raise
def send_response(event, context, response_status, response_data):
response_url = event.get('ResponseURL')
if not response_url:
return
body = {
'Status': response_status,
'Reason': f'See CloudWatch Log Stream: {context.log_stream_name}',
'PhysicalResourceId': event.get('LogicalResourceId', 'SecretsManagerEndpointEcsAccess'),
'StackId': event.get('StackId'),
'RequestId': event.get('RequestId'),
'LogicalResourceId': event.get('LogicalResourceId'),
'Data': response_data,
}
http = urllib3.PoolManager()
http.request(
'PUT', response_url,
body=json.dumps(body),
headers={'Content-Type': 'application/json'},
)
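The authorize/revoke calls above achieve idempotency through error-code dispatch: a duplicate-rule error on create counts as success, and a missing-rule error on delete is ignored. A runnable sketch of the create-side pattern, using a stand-in for botocore's `ClientError` so it needs no AWS access (all names here are illustrative):

```python
class FakeClientError(Exception):
    """Stand-in for botocore.exceptions.ClientError, exposing the same
    e.response['Error']['Code'] shape the handler above inspects."""
    def __init__(self, code):
        super().__init__(code)
        self.response = {'Error': {'Code': code}}

def add_ingress_idempotently(authorize, group_id, peer_sg_id):
    """Call authorize() and treat a duplicate-rule error as success,
    mirroring the handler's Create/Update path."""
    try:
        authorize(
            GroupId=group_id,
            IpPermissions=[{
                'IpProtocol': 'tcp', 'FromPort': 443, 'ToPort': 443,
                'UserIdGroupPairs': [{'GroupId': peer_sg_id}],
            }],
        )
        return 'true'
    except FakeClientError as e:
        if e.response['Error']['Code'] == 'InvalidPermission.Duplicate':
            return 'already_exists'
        raise

def duplicate_rule(**kwargs):
    raise FakeClientError('InvalidPermission.Duplicate')

print(add_ingress_idempotently(duplicate_rule, 'sg-endpoint', 'sg-ecs'))      # already_exists
print(add_ingress_idempotently(lambda **kwargs: None, 'sg-endpoint', 'sg-ecs'))  # true
```

Swallowing only the specific duplicate/not-found codes, and re-raising everything else, is what makes repeated stack updates safe without masking genuine permission failures.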


@@ -0,0 +1,2 @@
boto3>=1.26.0
urllib3>=1.26.0

1401
deploy/aws-sam/template.yaml Normal file

File diff suppressed because it is too large