High-Performance RAG Application Development

Muhammad Talha

A production-ready Retrieval-Augmented Generation (RAG) application built with FastAPI, LangChain, and ChromaDB, developed on Replit. It supports 1000+ requests per minute with multi-user isolation and deploys to AWS.

🚀 Features

Smart Load Balancing: Automatically switches between local Ollama and cloud APIs (see the sketch after this list)
Multi-User Support: Isolated data and chat history per user
Multiple AI Providers: OpenAI GPT, Google Gemini, and local Ollama
Scalable Architecture: Auto-scaling on AWS with load balancing
Persistent Storage: EFS-based storage for documents and embeddings
Rate Limiting: Configurable request limits per user
Chat History: RAG-enhanced conversation memory
Admin Interface: User and file management APIs
Docker Support: Containerized deployment
Realistic Testing: Performance testing that accounts for hardware limits
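
The switching logic lives in backend/services.py and is not reproduced in this README; the sketch below shows only the idea, with illustrative names and thresholds (call_ollama, call_cloud, and OLLAMA_MAX_RPM are not from the project code): prefer local Ollama while a per-minute budget lasts, then overflow to a cloud API.

import time
from collections import deque

# Minimal sketch of hybrid routing. Names and the 10 RPM budget are
# illustrative (compare "Performance Expectations" below), not project code.
OLLAMA_MAX_RPM = 10
_local_calls: deque = deque()

def call_ollama(prompt: str) -> str:   # stand-in for the local model call
    return f"[ollama] {prompt}"

def call_cloud(prompt: str) -> str:    # stand-in for the OpenAI/Gemini call
    return f"[cloud] {prompt}"

def answer(prompt: str) -> str:
    """Prefer local Ollama while under budget; overflow to the cloud API."""
    now = time.time()
    while _local_calls and now - _local_calls[0] > 60:
        _local_calls.popleft()         # forget calls older than one minute
    if len(_local_calls) < OLLAMA_MAX_RPM:
        _local_calls.append(now)
        return call_ollama(prompt)
    return call_cloud(prompt)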

📋 Requirements

Python 3.11+
Docker (for deployment)
AWS Account (for cloud deployment)
OpenAI API key or Google Gemini API key

๐Ÿ—๏ธ Architecture

┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│  Load Balancer   │─────│   Auto Scaling   │─────│   EFS Storage    │
│      (ALB)       │     │      Group       │     │  (Documents +    │
└──────────────────┘     └──────────────────┘     │   ChromaDB)      │
         │                                        └──────────────────┘
         ▼
┌──────────────────┐
│  EC2 Instances   │
│  (FastAPI +      │
│   LangChain)     │
└──────────────────┘

🚀 Quick Start

Local Development

Clone and Setup
cd rag_project
python -m venv .venv
.venv\Scripts\activate         # Windows
source .venv/bin/activate      # macOS/Linux
pip install -r backend/requirements.txt
Configure Environment
# Edit backend/.env
OPENAI_API_KEY="your-openai-key"
# OR
GOOGLE_API_KEY="your-gemini-key"
Start Server
cd backend
python main.py
Test the API
# Create user and upload documents
python client/client.py user-add --admin-user admin --admin-pass admin --new-user user1 --new-pass pass123
python client/client.py upload --user admin --password admin --file "test_data/PDF4_AnnualReport.pdf"

# Query the system
python client/client.py query --user user1 --password pass123 --query "What is in the annual report?"

Using Admin Interface

# Interactive mode
python ui/admin_interface.py --interactive

# Command line mode
python ui/admin_interface.py --bulk-setup
python ui/admin_interface.py --add-user user1 pass123
python ui/admin_interface.py --upload-dir test_data shared

🧪 Performance Testing

Realistic Performance Test (Recommended)

# Gradual load test that finds your system's limits
python realistic_performance_test.py

Laptop-Friendly Testing

# Quick test for development laptops
python realistic_performance_test.py --quick

High-Load Testing (Requires Cloud APIs)

# Only works with OpenAI/Gemini API keys
python test_performance.py --users 300 --duration 10 --rpm 1000

Performance Expectations:

Local Ollama Only: 5-10 RPM max
Hybrid Mode: 50-200 RPM (switches to cloud)
Cloud APIs Only: 1000+ RPM (AWS deployment)
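
realistic_performance_test.py itself is not reproduced in this README; a minimal sketch of the gradual-ramp idea, assuming the httpx library, an existing user (the Quick Start's user1/pass123), and a server on localhost, could look like this:

import asyncio
import time

import httpx

# Illustrative ramp test: raise concurrency step by step and watch when
# latency or the error rate degrades. Endpoint and credentials are the
# Quick Start defaults, not hard requirements.
URL = "http://localhost:8000/query/"
AUTH = ("user1", "pass123")

async def one_request(client: httpx.AsyncClient) -> float:
    start = time.perf_counter()
    response = await client.post(URL, auth=AUTH, data={"query": "ping"})
    response.raise_for_status()
    return time.perf_counter() - start

async def ramp(max_concurrency: int = 16) -> None:
    async with httpx.AsyncClient(timeout=60) as client:
        for n in range(1, max_concurrency + 1):
            results = await asyncio.gather(
                *[one_request(client) for _ in range(n)],
                return_exceptions=True,
            )
            ok = [r for r in results if isinstance(r, float)]
            if ok:
                print(f"concurrency={n}: {len(ok)}/{n} ok, avg {sum(ok)/len(ok):.2f}s")
            else:
                print(f"concurrency={n}: all requests failed")

asyncio.run(ramp())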

โ˜๏ธ AWS Deployment

Prerequisites

# Install AWS CLI and configure
aws configure
aws sts get-caller-identity # Verify setup

Automated Setup

cd deploy
python aws_config.py # Generate AWS configuration
python docker_build.py # Build and push Docker image

Manual AWS Setup

1. Create Infrastructure (see DEPLOYMENT_GUIDE.md)
   - VPC with public/private subnets
   - EFS file system
   - ECR repository
   - Security groups
2. Deploy Application
   - Build and push Docker image
   - Create Launch Template
   - Set up Auto Scaling Group
   - Configure Load Balancer
3. Test Deployment
# Update client URL to your load balancer
python client/client.py query --user user1 --password pass123 --query "Test query"

# Run performance test
python test_performance.py --url "http://your-alb-dns" --users 300 --rpm 1000

📊 API Endpoints

Authentication

All endpoints use HTTP Basic Authentication.

User Management

POST /admin/users/add - Add new user
POST /admin/users/remove - Remove user

File Management

POST /admin/files/upload - Upload PDF file
POST /admin/files/remove - Remove file

Query

POST /query/ - Send query to RAG system

Example Usage

import requests

# Add user
response = requests.post(
    "http://localhost:8000/admin/users/add",
    auth=("admin", "admin"),
    data={"user_id": "newuser", "password": "newpass"},
)

# Upload file
with open("document.pdf", "rb") as f:
    response = requests.post(
        "http://localhost:8000/admin/files/upload",
        auth=("admin", "admin"),
        files={"file": f},
        data={"user_id_for_file": "newuser"},
    )

# Query
response = requests.post(
    "http://localhost:8000/query/",
    auth=("newuser", "newpass"),
    data={"query": "What is this document about?"},
)

🔧 Configuration

Environment Variables

# AI API Keys
OPENAI_API_KEY="your-openai-key"
GOOGLE_API_KEY="your-gemini-key"

# Model Configuration
OPENAI_MODEL="gpt-4o"
GEMINI_MODEL="gemini-pro"
OLLAMA_MODEL="gemma:2b"

# AWS Configuration
AWS_REGION="ap-south-1"
EFS_FILE_SYSTEM_ID="fs-xxxxxxxxx"
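
How backend/services.py consumes these variables is not shown here; as a minimal sketch, assuming the python-dotenv library and the variable names above:

import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv("backend/.env")     # assumed location, matching this README

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "gemma:2b")

if not (OPENAI_API_KEY or GOOGLE_API_KEY):
    # Without a cloud key, hybrid mode cannot overflow past local Ollama.
    print("Warning: no cloud API key set; running in local-only mode.")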

Rate Limiting

Default: 1500 requests per minute per IP. Adjust in backend/main.py:
@limiter.limit("1500/minute")  # Adjust as needed
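
The decorator syntax above matches the slowapi library; for orientation, a minimal sketch of wiring it into a FastAPI app (the project's actual setup is in backend/main.py and may differ):

from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)  # limits keyed per client IP
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/query/")
@limiter.limit("1500/minute")                   # adjust as needed
async def query(request: Request):
    # slowapi needs the Request argument to identify the calling client
    return {"status": "ok"}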

Scaling Configuration

Min instances: 1
Max instances: 10
Target CPU: 70%
Scale-out cooldown: 300s
Scale-in cooldown: 300s
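
deploy/autoscaling_template.json holds the project's actual policy; as a sketch only, the same targets could be set with boto3 (the group name rag-asg is hypothetical):

import boto3

autoscaling = boto3.client("autoscaling", region_name="ap-south-1")

# Bounds from the list above: 1 to 10 instances.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="rag-asg",  # hypothetical ASG name
    MinSize=1,
    MaxSize=10,
)

# Target tracking keeps average group CPU near 70%.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="rag-asg",
    PolicyName="rag-cpu-target-70",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 70.0,
    },
)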

๐Ÿ“ Project Structure

rag_project/
├── backend/                  # FastAPI application
│   ├── main.py               # Main application
│   ├── services.py           # Core services
│   ├── requirements.txt      # Python dependencies
│   ├── Dockerfile            # Container configuration
│   └── .env                  # Environment variables
├── client/                   # Client tools
│   └── client.py             # Command-line client
├── ui/                       # User interfaces
│   └── admin_interface.py    # Admin interface
├── deploy/                   # Deployment tools
│   ├── aws_config.py         # AWS configuration helper
│   ├── docker_build.py       # Docker build script
│   └── autoscaling_template.json
├── test_data/                # Test documents and questions
├── test_performance.py       # Performance testing tool
├── DEPLOYMENT_GUIDE.md       # Detailed deployment guide
└── README.md                 # This file

๐Ÿ” Monitoring and Troubleshooting

Health Check

# FastAPI serves its interactive API docs at /docs; an HTTP 200 means the app is up
curl http://your-server:8000/docs

Logs

# Docker logs
docker logs rag-app

# Performance test logs
tail -f performance_test.jsonl

Common Issues

High Response Times
- Increase instance size
- Add more instances
- Check EFS performance

Rate Limiting
- Increase rate limits
- Add more instances
- Implement caching (see the sketch after this list)

Memory Issues
- Use larger instances
- Optimize ChromaDB settings
- Implement cleanup
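
Caching is on the roadmap rather than in the code; as an illustration of the idea, a tiny in-process TTL cache keyed on (user, query) could absorb repeated questions before they reach the LLM:

import time

# Illustrative per-process TTL cache. A production setup would use the
# planned Redis layer so all instances behind the ALB share entries.
_CACHE: dict = {}
TTL_SECONDS = 300

def cached_answer(user_id: str, query: str, compute) -> str:
    """Return a cached answer if fresh; otherwise run the RAG pipeline."""
    key = (user_id, query)
    hit = _CACHE.get(key)
    if hit and time.time() - hit[0] < TTL_SECONDS:
        return hit[1]                      # fresh cache hit
    answer = compute(query)                # e.g. the real RAG call
    _CACHE[key] = (time.time(), answer)
    return answer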

💰 Cost Estimation

10-minute test (300 users, 1000 RPM):

EC2 instances (2x t3.medium): ~$0.20
EFS storage: ~$0.01
Load balancer: ~$0.05
Data transfer: ~$0.01
Total: ~$0.27

Monthly production (moderate load):

EC2 instances: ~$50-100
EFS storage: ~$10-20
Load balancer: ~$20
Total: ~$80-140/month

๐Ÿค Contributing

Fork the repository
Create a feature branch
Make your changes
Add tests
Submit a pull request

📄 License

This project is licensed under the MIT License.

🆘 Support

For issues and questions:
Review the troubleshooting section
Check CloudWatch logs for AWS deployments
Open an issue with detailed error information

🔮 Roadmap

Web UI for administration
Redis caching layer
Multi-region deployment
Advanced analytics
API versioning
Webhook support
Advanced security features