Production AI Data Platform: Safety, Eval, and Observability by Sergiu Nicoara

Production AI Data Platform
A production-grade AI data platform where reliability, observability, safety, and evaluation are first-class concerns, not afterthoughts. Most RAG implementations are built for demos: single retrieval backend, no evaluation, no safety layer, no observability. They break silently in production. This one doesn't.

Why this exists: 11 years of telecom discipline

After 11 years engineering telecom systems, the gap in AI engineering shocked me. In telecom, you don't ship without fault tolerance designed from day one, observable systems with real-time alerting, and defined SLOs before a single line of code is written. In AI engineering, many RAG pipelines are shipped and called "production" with no evaluation framework, no monitoring, and no way to detect silent degradation.
Three things telecom taught me that shaped this platform:
Silent failures are the most dangerous. A system that confidently returns wrong answers is worse than one that crashes. At least crashes get fixed.
You need a contract before you ship. How many teams actually define what "good retrieval" means before going live? If you can't define it, you can't maintain it.
Observability is not optional. If you can't measure degradation, you won't know until a user tells you. By then, the brand damage is done.

Retrieval pipeline

Hybrid dense + sparse retrieval (pgvector approximate nearest neighbour search + PostgreSQL full-text search), combined with Reciprocal Rank Fusion (RRF) and reranked with Maximal Marginal Relevance (MMR). An OpenSearch backend (BM25 + kNN) is available as a runtime-selectable alternative. Multimodal ingestion: PDF and image inputs are captioned with GPT-4o Vision and embedded into a unified vector space. P95 latency is gated at 800 ms.
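To make the fusion and reranking steps concrete, here is a minimal sketch of RRF followed by MMR in plain Python. It is an illustration under stated assumptions, not the platform's actual code: the function names, the constant k=60 (a commonly used RRF default), the lam trade-off value, and the toy embeddings are all hypothetical.

```python
import math
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr_rerank(query_vec, candidates, embeddings, top_n=3, lam=0.7):
    """Maximal Marginal Relevance: greedily pick documents that are relevant
    to the query but not redundant with documents already selected."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < top_n:
        def mmr_score(doc_id):
            relevance = cosine(query_vec, embeddings[doc_id])
            redundancy = max(
                (cosine(embeddings[doc_id], embeddings[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance - (1.0 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical result lists from the dense and sparse retrievers.
dense_hits = ["a", "b", "c"]
sparse_hits = ["b", "d", "a"]
fused = rrf_fuse([dense_hits, sparse_hits])  # ['b', 'a', 'd', 'c']

# Toy 2-d embeddings: "a" and "b" are near-duplicates, "d" is different.
toy_embeddings = {"a": [0.9, 0.436], "b": [0.89, 0.456], "d": [0.8, -0.6]}
diverse = mmr_rerank([1.0, 0.0], ["a", "b", "d"], toy_embeddings, top_n=2, lam=0.5)
# diverse == ['a', 'd']: the near-duplicate "b" is skipped.
```

RRF only needs ranks, not raw scores, which is why it suits combining vector distances with full-text scores that live on different scales. MMR then trades a little relevance for diversity in the final context.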

NL→SQL layer

Hand-written prompts are replaced by a natural-language-to-SQL module optimised with DSPy's BootstrapFewShot. Intent extraction is schema-aware and trained against a 20-example golden dataset. Generated queries are parameterised to resist injection, executed within a single workspace's scope, and fully audit-logged.
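The combination of parameterised queries, workspace scoping, and audit logging can be sketched as follows. This is a hedged illustration only: the `orders` table, the `Intent` dataclass, the column allow-list, and the helper names are hypothetical, and the standard-library `sqlite3` module stands in for PostgreSQL so the example runs on its own (a PostgreSQL driver would typically use `%s` placeholders instead of `?`).

```python
import sqlite3
from dataclasses import dataclass

# Hypothetical allow-list: identifiers cannot be bound as parameters,
# so column names are checked against the known schema instead.
ALLOWED_COLUMNS = {"status", "region"}


@dataclass
class Intent:
    """Structured intent extracted from the user's question (assumed shape)."""
    filter_column: str
    filter_value: str


def build_query(intent, workspace_id):
    if intent.filter_column not in ALLOWED_COLUMNS:
        raise ValueError(f"column not allowed: {intent.filter_column!r}")
    # Values are always bound as parameters, never interpolated into the SQL.
    sql = (
        "SELECT id, status, region FROM orders "
        f"WHERE workspace_id = ? AND {intent.filter_column} = ?"
    )
    return sql, (workspace_id, intent.filter_value)


def run_scoped(conn, intent, workspace_id, audit_log):
    sql, params = build_query(intent, workspace_id)
    # Record what is about to run, and for which workspace, before executing.
    audit_log.append({"workspace_id": workspace_id, "sql": sql, "params": params})
    return conn.execute(sql, params).fetchall()


# Demo data in an in-memory database (stand-in for the real store).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER, workspace_id TEXT, status TEXT, region TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "ws_a", "open", "eu"), (2, "ws_b", "open", "eu"), (3, "ws_a", "closed", "us")],
)

audit = []
rows = run_scoped(conn, Intent("status", "open"), "ws_a", audit)
# rows == [(1, 'open', 'eu')]: the ws_b row is never visible to ws_a.

injected = run_scoped(conn, Intent("status", "open' OR '1'='1"), "ws_a", audit)
# injected == []: the payload is compared as a literal string, not executed.
```

Because the workspace filter is added by the query builder rather than by the model, a badly generated intent can return the wrong rows within a workspace but cannot reach across workspaces.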

Safety and guardrails

Prompt injection detection across 5 taxonomies, PII redaction, toxicity filtering, structured audit events, and safe fallback behavior under failure or policy violations. Workspace-level token-bucket rate limiting.
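As an illustration of the workspace-level token-bucket idea, here is a minimal in-process sketch. The class names, capacity, and refill rate are assumptions made for the example, and a plain dictionary stands in for whatever shared store the platform actually uses.

```python
import time


class TokenBucket:
    """Bucket holding up to `capacity` tokens, refilled at `refill_per_sec`."""

    def __init__(self, capacity, refill_per_sec, now=None):
        self.capacity = float(capacity)
        self.refill_per_sec = float(refill_per_sec)
        self.tokens = float(capacity)  # start full
        self.updated = time.monotonic() if now is None else now

    def allow(self, cost=1.0, now=None):
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill for the time that has passed, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class WorkspaceLimiter:
    """One bucket per workspace, so a noisy workspace cannot starve the others."""

    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.buckets = {}

    def allow(self, workspace_id, now=None):
        bucket = self.buckets.get(workspace_id)
        if bucket is None:
            bucket = TokenBucket(self.capacity, self.refill_per_sec, now)
            self.buckets[workspace_id] = bucket
        return bucket.allow(now=now)


# The `now` argument is only there to make the example deterministic;
# production callers would omit it and rely on time.monotonic().
limiter = WorkspaceLimiter(capacity=2, refill_per_sec=1.0)
burst = [limiter.allow("ws_a", now=0.0) for _ in range(3)]  # [True, True, False]
other = limiter.allow("ws_b", now=0.0)                      # True
later = limiter.allow("ws_a", now=1.0)                      # True, one token refilled
```

A token bucket allows short bursts up to the capacity while holding the long-run rate to the refill rate, which fits interactive workloads better than a fixed per-second cap.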

Evaluation and observability

Offline evaluation with Recall@K, MRR, and groundedness gates (≥ 0.70). Rolling SLO window with EWMA anomaly detection and automated remediation. Prometheus + Grafana 15-panel dashboard with alerting. 184-test suite covering safety, reliability contracts, and SQL correctness.
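The metrics and the anomaly check named above can be sketched in a few lines. This is an assumption-laden illustration rather than the platform's implementation: the function and class names, the EWMA smoothing factor, the 3-sigma rule, and the warm-up length are hypothetical; only the 0.70 gate value comes from the description.

```python
import math


def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def mean_reciprocal_rank(results):
    """results: list of (retrieved_ids, relevant_ids) pairs, one per query."""
    total = 0.0
    for retrieved, relevant in results:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(results) if results else 0.0


def passes_gate(score, threshold=0.70):
    """Release gate: fail the evaluation run if the score drops below threshold."""
    return score >= threshold


class EwmaDetector:
    """Flags a sample that deviates from the exponentially weighted moving
    average by more than `threshold` exponentially weighted std deviations."""

    def __init__(self, alpha=0.2, threshold=3.0, warmup=5):
        self.alpha = alpha
        self.threshold = threshold
        self.warmup = warmup  # samples to observe before flagging anything
        self.mean = None
        self.var = 0.0
        self.n = 0

    def update(self, x):
        if self.mean is None:
            self.mean, self.n = float(x), 1
            return False
        deviation = x - self.mean
        std = math.sqrt(self.var)
        is_anomaly = (
            self.n >= self.warmup and std > 0 and abs(deviation) > self.threshold * std
        )
        # Incremental EWMA updates for the mean and the variance.
        self.mean += self.alpha * deviation
        self.var = (1.0 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        self.n += 1
        return is_anomaly


# Hypothetical latency samples in milliseconds; the last one is a spike.
detector = EwmaDetector()
flags = [detector.update(ms) for ms in [100, 102, 98, 101, 99, 100, 400]]
# flags == [False, False, False, False, False, False, True]
```

Note that this sketch folds the anomalous sample into the running statistics after flagging it; a stricter detector might skip the update for flagged samples so that a single spike does not widen the band.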

Stack: Python, FastAPI, PostgreSQL, pgvector, Redis, Docker, Prometheus, Grafana, GCP, DSPy.

Posted Jun 9, 2026
