Enterprise RAG Evaluation Pipeline Project

Imaad Mahmood

🚀 Enterprise RAG Evaluation Pipeline (LLMOps)

Welcome to the Enterprise RAG Evaluation Pipeline! This project is designed to solve one of the biggest bottlenecks in modern AI: Trust.
Whether you are a recruiter looking at my architectural decisions, or a beginner trying to understand how to build AI systems that don't hallucinate, this guide will walk you through the entire process.

📌 The Business Problem (For Hiring Managers)

Retrieval-Augmented Generation (RAG) is the industry standard for querying private corporate documents. However, enterprises hesitate to deploy AI because of hallucinations and poor retrieval.
If an engineer changes the chunking strategy, or a new embedding model is swapped in, how do you know if the AI is still giving accurate answers?
You cannot manually test every prompt.

✅ The Solution

This project is an end-to-end, locally hosted RAG pipeline built with an Automated Evaluation Suite. It proves that AI responses can be programmatically tested for accuracy before being deployed to production.

🧠 Beginner's Tutorial: How This Actually Works

If you are new to AI engineering, here is exactly what is happening under the hood, broken down into three phases.

Phase 1: Data Ingestion (src/ingest.py)

AI models like ChatGPT or Llama don't magically know your company's private rules. We must provide the data.
Process:
Take a raw text file (example: corporate security policy).
Split it into smaller readable sections (chunks).
Convert chunks into numerical representations using an Embedding Model.
Store embeddings inside a Vector Database (ChromaDB).
This enables semantic search instead of keyword search.
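The steps above can be sketched in plain Python. This is a toy stand-in, not the project's src/ingest.py: a bag-of-words counter plays the role of the Nomic embedding model, and a list of (chunk, vector) pairs plays the role of the ChromaDB collection.

```python
# Toy sketch of the ingest flow: chunk -> embed -> store.
from collections import Counter

def chunk_text(text, chunk_size=50, overlap=10):
    """Split text into overlapping word-based chunks."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def embed(chunk):
    """Stand-in embedding: lowercase word counts
    (the real pipeline uses nomic-embed-text vectors)."""
    return Counter(chunk.lower().split())

def ingest(text):
    """Build an in-memory stand-in for the vector database:
    a list of (chunk, embedding) pairs."""
    return [(c, embed(c)) for c in chunk_text(text)]

policy = "All passwords must be rotated every 90 days. " * 20
store = ingest(policy)
print(len(store), "chunks indexed")  # 4 chunks indexed
```

The overlap between chunks is the key design choice: it keeps a sentence that straddles a chunk boundary retrievable from at least one side.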

Phase 2: Retrieval & Generation (src/query.py)

When a user asks:

"What is the password policy?"

The pipeline:
Embeds the question and searches the vector database for similar chunks.
Keeps only the most relevant policy sections.
Sends those sections as context to Llama 3.2.
Generates an answer, via a LangChain Expression Language (LCEL) chain, grounded strictly in the retrieved data.
Result: grounded, factual answers instead of hallucinations.
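The retrieve-then-generate step can be sketched with the standard library alone. Cosine similarity over word-count vectors stands in for the real embedding search, and the prompt string shows how retrieved chunks become the model's only allowed context; the chunk texts are illustrative.

```python
# Toy retrieve-then-generate step: rank stored chunks by similarity
# to the question, then pack the winners into a grounded prompt.
import math
import re
from collections import Counter

def embed(text):
    """Stand-in embedding: punctuation-free word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(question, context_chunks):
    """Ground the model: it may only use the supplied context."""
    context = "\n".join(context_chunks)
    return (f"Answer strictly from the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")

chunks = [
    "The password policy requires rotation every 90 days.",
    "Visitors must sign in at the front desk.",
    "The cafeteria opens at 8am.",
]
top = retrieve("What is the password policy?", chunks, k=1)
print(build_prompt("What is the password policy?", top))
```

In the real pipeline the same shape is expressed as an LCEL chain (retriever, prompt template, and model composed with the `|` operator).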

Phase 3: Automated Evaluation (tests/test_eval.py)

This is the LLMOps layer — and the core innovation of this project.
We created automated tests that behave like an AI quality inspector.
The evaluation suite:
Sends predefined questions
Validates answers using strict assert statements
Confirms required facts exist
Detects hallucinated responses
Prevents silent performance regressions
Example validation:
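The real suite lives in tests/test_eval.py and is not reproduced here; this hypothetical test shows the shape of one check. The name query_rag is a stand-in for the project's real query entry point, and the stub below fakes its output so the example runs on its own.

```python
# Hypothetical shape of one evaluation test.
def query_rag(question):
    # Stub: the real function retrieves from ChromaDB and calls Llama 3.2.
    return "Per policy, passwords must be rotated every 90 days."

def test_password_policy_is_grounded():
    answer = query_rag("What is the password policy?").lower()
    assert "90 days" in answer       # required fact is present
    assert "as an ai" not in answer  # no boilerplate/evasion marker

test_password_policy_is_grounded()
print("evaluation passed")
```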

If retrieval or generation quality drops, CI tests fail immediately.

🏗️ Architecture Overview
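The original diagram is not reproduced here; as plain text, the flow described in the three phases is roughly:

```
documents -> src/ingest.py -> ChromaDB (embeddings)
question  -> src/query.py  -> retrieve top chunks -> Llama 3.2 -> answer
tests/test_eval.py -> asserts required facts appear in the answer
```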


🛠️ Tech Stack

Component: Technology
LLM Engine: Local Llama 3.2 (Ollama)
Embeddings: Nomic-Embed-Text
Orchestration: LangChain Core (LCEL)
Vector Database: ChromaDB
Evaluation: Pytest
Deployment Style: Fully Local (Privacy-First)

Why Local Models?

✅ 100% data privacy
✅ Zero API cost
✅ Offline capability
✅ Enterprise-ready architecture

📂 Project Structure
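The full tree is not reproduced here; from the files referenced in this guide, the layout is at least:

```
.
├── src/
│   ├── ingest.py      # chunk, embed, and store documents in ChromaDB
│   └── query.py       # retrieve context and generate answers
└── tests/
    └── test_eval.py   # automated evaluation suite (pytest)
```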


🏃‍♂️ How to Run This Project Locally

1️⃣ Install Ollama (Local LLM Runtime)

Download Ollama for your platform from the official website.
Pull required models:
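The exact commands are not shown above; assuming the two models named in the tech stack, the standard `ollama pull` invocations would be:

```shell
ollama pull llama3.2          # local LLM engine
ollama pull nomic-embed-text  # embedding model
```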

2️⃣ Setup Python Environment
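The setup commands are not reproduced; a conventional virtual-environment setup (assuming a `requirements.txt` at the repository root) would look like:

```shell
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt  # assumed to list langchain, chromadb, etc.
```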


3️⃣ Build the Vector Database

Run ingestion:
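Assuming the script path referenced above:

```shell
python src/ingest.py
```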

This will:
Load documents
Chunk text
Generate embeddings
Store vectors in ChromaDB

4️⃣ Query the System
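The command is not shown; based on the file referenced above it would be:

```shell
python src/query.py
```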


Ask questions about the corporate policy and receive grounded answers.

5️⃣ Run Automated Evaluation (CI/CD Step)
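Using pytest, as listed in the tech stack:

```shell
pytest tests/test_eval.py -v
```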


You will see the system automatically test AI correctness in real time.

🔬 What This Project Demonstrates

This repository showcases production-level AI engineering skills:
Retrieval-Augmented Generation (RAG)
LLMOps evaluation pipelines
Deterministic AI testing
Local model deployment
LangChain LCEL architecture
AI reliability engineering

🎯 Why This Matters for Enterprises

Companies don't reject AI because models are weak.
They reject AI because they cannot measure trust.
This project introduces:
measurable correctness
automated regression testing
deployment confidence
AI systems become testable software systems.

🤝 Let's Connect

I built this project to demonstrate production-ready Machine Learning and LLMOps practices.
If you are looking for an engineer who understands how to move AI out of notebooks and into reliable production systems — let's talk.

⭐ If You Found This Useful

Consider giving the repository a star — it helps others discover practical LLMOps examples!

Posted Mar 17, 2026

Built a RAG pipeline with automated evaluation for enterprise AI deployment.