
Posted Nov 22, 2025

**Embedding models**
- all-MiniLM-L6-v2 from Sentence Transformers: fast and performant for most tasks
- E5-large: more accurate, but slower
- MPNet: very robust across domains
- text-embedding-ada-002: if you want to go cloud

Test retrieval with a query against a source document like `-agency-intro.md`.

**Quantization**
- GPTQ, AWQ, or Ollama's built-in quantizers to load 4-bit weights with minimal performance hit
- llama.cpp with AVX2 acceleration for bare-metal CPU serving

**Evaluation**
- LangChain's QAEvalChain or LlamaIndex's `evaluate()` functions

**Deployment**
- Write a Dockerfile for your FastAPI server

**Vector stores**
- pgvector if you prefer Postgres
- Qdrant for metadata filtering
- Chroma if you want quick reloads

**Logging and tracing**
- loguru or the stdlib `logging` module for logs
- LangChain tracing or a `CallbackHandler` to trace the prompt + retrieval flow

**Rate limiting**
- Throttle the API (slowapi or NGINX)

**Local inference engines**
- llama.cpp: lightweight and great on CPU
- vLLM: high-throughput transformer inference for serving APIs
- LM Studio: GUI + playground for local models

**Citing sources**
- RetrievalQA can return the original documents; display titles, filenames, or snippets with the answer

**FAISS index choice**
- IndexFlatL2 for speed
- IVF or HNSW for larger corpora (tune for balance)
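The deployment advice mentions a Dockerfile for the FastAPI server. A minimal sketch (the module path `app.main:app` and the `requirements.txt` layout are assumptions about your project structure):

```dockerfile
FROM python:3.11-slim
WORKDIR /app
# Install dependencies first so Docker caches this layer between code changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Assumes your FastAPI instance is named `app` in app/main.py.
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```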
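Whichever embedding model you choose, retrieval boils down to ranking document vectors by similarity to the query vector. A dependency-free sketch of cosine ranking (the 3-dimensional vectors here are toy stand-ins for real model outputs, which are 384-dimensional for all-MiniLM-L6-v2):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank(query_vec, doc_vecs):
    # Return (index, score) pairs sorted by similarity to the query, best first.
    scores = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda s: s[1], reverse=True)

docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
query = [1.0, 0.05, 0.0]
print(rank(query, docs)[0][0])  # → 0 (index of the best-matching document)
```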
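QAEvalChain grades answers with an LLM; for quick smoke tests a hand-rolled harness is enough. A sketch with naive substring grading (`qa_pairs` and `answer_fn` are hypothetical placeholders for your eval set and pipeline):

```python
def evaluate(qa_pairs, answer_fn):
    # qa_pairs: list of (question, expected) tuples; answer_fn: your RAG pipeline.
    # Naive grading: an answer counts as correct if the expected string appears in it.
    results = []
    for question, expected in qa_pairs:
        answer = answer_fn(question)
        results.append({"question": question, "answer": answer,
                        "correct": expected.lower() in answer.lower()})
    accuracy = sum(r["correct"] for r in results) / len(results)
    return accuracy, results

# Usage with a stubbed pipeline:
pairs = [("What does the pipeline do?", "lead qualification")]
acc, _ = evaluate(pairs, lambda q: "It handles lead qualification and personalization.")
print(acc)  # → 1.0
```

Substring matching is a crude proxy; swap in an LLM grader (as QAEvalChain does) once the harness is wired up.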
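On the logging side, a stdlib `logging` sketch that records each query and which chunks were retrieved (loguru offers a similar API with less setup; the chunk IDs here are made up):

```python
import logging

logger = logging.getLogger("rag")
logger.setLevel(logging.INFO)
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)

def log_retrieval(query, chunk_ids, answer):
    # One structured line per request makes grepping the retrieval trace easy.
    logger.info("query=%r chunks=%s answer_len=%d", query, chunk_ids, len(answer))

log_retrieval("pricing for agency plan", ["intro-003", "pricing-001"], "Our agency plan...")
```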
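slowapi and NGINX both implement variants of token-bucket rate limiting; a self-contained sketch of the idea (buckets refill at `rate` tokens per second up to a burst `capacity`; the numbers are illustrative):

```python
import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=2)   # ~5 req/s, bursts of 2
print([bucket.allow() for _ in range(3)])  # → [True, True, False]
```

In production you would keep one bucket per client key (API token or IP), which is exactly what slowapi's decorators manage for you.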
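FAISS's IndexFlatL2 is exact brute-force search over squared L2 distances; a dependency-free sketch of what it computes (real corpora should use faiss itself, and IVF/HNSW once the scan gets too slow):

```python
def l2_search(index_vecs, query, k=2):
    # Exact nearest-neighbor search by squared Euclidean distance,
    # the same metric IndexFlatL2 uses: no approximation, scans every vector.
    dists = [(i, sum((a - b) ** 2 for a, b in zip(v, query)))
             for i, v in enumerate(index_vecs)]
    dists.sort(key=lambda d: d[1])
    return dists[:k]

vecs = [[0.0, 0.0], [1.0, 1.0], [0.2, 0.1]]
print(l2_search(vecs, [0.1, 0.1], k=2))  # nearest two: index 2, then index 0
```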
Built a local RAG pipeline for lead qualification and personalization.