BenevolentJoker-JohnL/FlockParser

John L

Hi guys!
I have decided to develop this application further and will be licensing some features. This is the free version! I'll be posting updates on the enhanced features, with demonstrations, when they're complete. Until then, enjoy!

FlockParse - AI-Powered Document Intelligence Platform

Overview

FlockParse is a fully local AI-powered document intelligence platform that:
✅ Extracts text from PDFs with multiple methods (PyPDF2 and pdftotext)
✅ Converts PDFs to multiple formats (TXT, Markdown, DOCX)
✅ Uses Ollama embeddings (mxbai-embed-large) for semantic search
✅ Enables AI-powered chat with your document knowledge base using llama3.1
✅ Works entirely offline with no data sent to external servers
✅ Preserves original document names in all converted files
The project offers two main interfaces:
flockparsecli.py - A command-line interface for personal document processing
flock_ai_api.py - A web server API for multi-user or application integration

🔧 Installation

1. Clone the Repository

git clone https://github.com/yourusername/flockparse.git
cd flockparse

2. Install Python Dependencies

pip install -r requirements.txt
Required dependencies:
fastapi
uvicorn
pdfplumber
pytesseract
PyPDF2
pypdf
chromadb
python-docx
ollama
numpy
Pillow
markdown

3. Install External Dependencies

For Better PDF Text Extraction:
Linux: sudo apt-get install poppler-utils
macOS: brew install poppler
Windows: Download from Poppler for Windows
For OCR Support:
Linux: sudo apt-get install tesseract-ocr
macOS: brew install tesseract
Windows: Install Tesseract OCR
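Before running FlockParse, it can help to confirm that the external tools above are actually on your PATH. The following is a minimal sketch, not part of FlockParse itself: the function name find_missing_tools is hypothetical, and it assumes the executables are named pdftotext (from Poppler) and tesseract (from Tesseract OCR).

```python
import shutil

# Assumed executable names: pdftotext ships with Poppler,
# tesseract ships with Tesseract OCR.
EXTERNAL_TOOLS = ("pdftotext", "tesseract")


def find_missing_tools(tools=EXTERNAL_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing external tools:", ", ".join(missing))
    else:
        print("All external tools found.")
```

The CLI's own check_deps command remains the supported way to check dependencies; this sketch only covers the two external executables.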

4. Install and Configure Ollama

Install Ollama from ollama.ai
Start the Ollama service:
ollama serve
Pull the required models:
ollama pull mxbai-embed-large
ollama pull llama3.1

📜 Usage

CLI Interface (flockparsecli.py)

Run the script:
python flockparsecli.py
Available commands:
📖 open_pdf <file>   → Process a single PDF file
📂 open_dir <dir> → Process all PDFs in a directory
💬 chat → Chat with processed PDFs
📊 list_docs → List all processed documents
🔍 check_deps → Check for required dependencies
❌ exit → Quit the program

Web Server API (flock_ai_api.py)

Start the API server:
python flock_ai_api.py
The server will run on http://0.0.0.0:8000 by default with the following endpoints:
| Endpoint | Method | Description |
|---|---|---|
| /upload/ | POST | Upload and process a PDF file |
| /summarize/{file_name} | GET | Get an AI-generated summary of a document |
| /search/?query=your_query | GET | Search for relevant documents |
Example API Usage:
Upload a document:
curl -X POST -F "file=@your_document.pdf" http://localhost:8000/upload/
Get a document summary:
curl http://localhost:8000/summarize/your_document.pdf
Search across documents:
curl "http://localhost:8000/search/?query=your%20search%20query"
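The same GET requests can be made from Python using only the standard library. This is a rough sketch under stated assumptions: the helper names (summarize_url, search_url, get_text) are hypothetical, the server is assumed to be reachable at http://localhost:8000, and the response is returned as raw text because the response format is not documented here.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # assumed local server address


def summarize_url(file_name, base_url=BASE_URL):
    """Build the /summarize/ URL, percent-encoding the file name."""
    return f"{base_url}/summarize/{quote(file_name)}"


def search_url(query, base_url=BASE_URL):
    """Build the /search/ URL, encoding spaces in the query as %20."""
    return f"{base_url}/search/?" + urlencode({"query": query}, quote_via=quote)


def get_text(url, timeout=30):
    """Send a GET request and return the response body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only builds the URLs; call get_text(...) once the API server is running.
    print(summarize_url("your_document.pdf"))
    print(search_url("your search query"))
```

Building the URL with urlencode avoids hand-escaping spaces and other special characters in the query.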

💡 Practical Use Cases

Knowledge Management

Create searchable archives of research papers, legal documents, and technical manuals
Generate summaries of lengthy documents for quick review
Chat with your document collection to find specific information without manual searching

Legal & Compliance

Process contract repositories for semantic search capabilities
Extract key terms and clauses from legal documents
Analyze regulatory documents for compliance requirements

Research & Academia

Process and convert academic papers for easier reference
Create a personal research assistant that can reference your document library
Generate summaries of complex research for presentations or reviews

Business Intelligence

Convert business reports into searchable formats
Extract insights from PDF-based market research
Make proprietary documents more accessible throughout an organization

🔄 Example Workflows

CLI Workflow: Research Paper Processing

Check Dependencies:
check_deps
Process a Directory of Research Papers:
open_dir <dir>
Chat with Your Research Collection:
chat

API Workflow: Document Processing Service

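Start the API Server:
python flock_ai_api.py
Upload Documents via API:
curl -X POST -F "file=@your_document.pdf" http://localhost:8000/upload/
Generate a Summary:
curl http://localhost:8000/summarize/your_document.pdf
Search Across Documents:
curl "http://localhost:8000/search/?query=your%20search%20query"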

🔧 Troubleshooting Guide

Ollama Connection Issues

Problem: Error messages about Ollama not being available or connection failures.
Solution:
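Verify Ollama is running: ps aux | grep ollama
Restart the Ollama service: stop the running process, then run ollama serve
Check that you've pulled the required models: ollama list
If models are missing: ollama pull mxbai-embed-large and ollama pull llama3.1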
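The model check can also be scripted against Ollama's local HTTP API. The sketch below assumes Ollama is listening on its default address (http://localhost:11434) and that its /api/tags endpoint returns a JSON object with a "models" list whose entries carry a "name" such as "llama3.1:latest". The function names are hypothetical and not part of FlockParse.

```python
import json
from urllib.request import urlopen

# Models named in this README.
REQUIRED_MODELS = ("mxbai-embed-large", "llama3.1")
# Assumed default address of a local Ollama instance.
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def missing_models(tags, required=REQUIRED_MODELS):
    """Return required model names absent from an /api/tags response.

    Tags such as ':latest' are ignored, so any variant of a model counts.
    """
    installed = {entry["name"].split(":")[0] for entry in tags.get("models", [])}
    return [name for name in required if name not in installed]


def fetch_tags(url=OLLAMA_TAGS_URL, timeout=5):
    """Fetch the list of locally available models from Ollama."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    try:
        print("Missing models:", missing_models(fetch_tags()) or "none")
    except OSError as error:
        print("Could not reach Ollama:", error)
```

A connection error here usually points to the service not running, while a non-empty result points to models that still need to be pulled.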

PDF Text Extraction Failures

Problem: No text extracted from certain PDFs.
Solution:
Check if the PDF is scanned/image-based:
Install OCR tools: sudo apt-get install tesseract-ocr (Linux)
For better scanned PDF handling: pip install ocrmypdf
Process with OCR: ocrmypdf input.pdf output.pdf
If the PDF has unusual fonts or formatting:
Install poppler-utils for better extraction
Try using the -layout option with pdftotext manually: pdftotext -layout input.pdf output.txt

Memory Issues with Large Documents

Problem: Application crashes with large PDFs or many documents.
Solution:
Process one document at a time for very large PDFs
Reduce the chunk size in the code (default is 512 characters)
Increase your system's available memory or use a swap file
For server deployments, consider using a machine with more RAM
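To illustrate what reducing the chunk size means, here is a minimal sketch of fixed-size character chunking. It is not FlockParse's actual chunking code, which may split on different boundaries; chunk_text is a hypothetical name, and 512 is the default mentioned above.

```python
def chunk_text(text, chunk_size=512):
    """Split `text` into consecutive pieces of at most `chunk_size` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


if __name__ == "__main__":
    sample = "a" * 1100
    # Smaller chunks mean more, shorter pieces to embed one at a time.
    print([len(chunk) for chunk in chunk_text(sample)])
    print(len(chunk_text(sample, chunk_size=256)))
```

Smaller chunks lower the memory needed per embedding call at the cost of producing more chunks per document.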

API Server Not Starting

Problem: Error when trying to start the API server.
Solution:
Check for port conflicts: lsof -i :8000
If another process is using port 8000, kill it or change the port
Verify FastAPI is installed: pip install fastapi uvicorn
Check for Python version compatibility (requires Python 3.7+)
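On systems without lsof, a port conflict can be checked from Python instead. This is a small sketch with a hypothetical function name, assuming the API server uses port 8000 on the local machine.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 when the connection succeeds.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    if port_in_use(8000):
        print("Port 8000 is already in use.")
    else:
        print("Port 8000 looks free.")
```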

💡 Features

| Feature | Description |
|---|---|
| Multi-method PDF Extraction | Uses both PyPDF2 and pdftotext for best results |
| Format Conversion | Converts PDFs to TXT, Markdown, and DOCX |
| Semantic Search | Uses vector embeddings to find relevant information |
| Interactive Chat | Discuss your documents with AI assistance |
| Local Privacy | 100% offline, no data sent to external servers |
| Filename Preservation | Maintains original document names in converted files |
| REST API | Web server for multi-user/application integration |
| Document Summarization | AI-generated summaries of uploaded documents |
| OCR Processing | Extract text from scanned documents using image recognition |
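As a rough illustration of how vector-based semantic search works in general, the sketch below ranks stored chunks by cosine similarity to a query vector. It uses tiny hand-written vectors in place of real mxbai-embed-large embeddings, and the function names are hypothetical. FlockParse's own search (for example through ChromaDB) may score and rank differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_chunks(query_vec, chunk_vecs, top_k=3):
    """Return the names of the `top_k` chunks most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name)
              for name, vec in chunk_vecs.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings"; real embeddings have many more dimensions.
    chunks = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    print(rank_chunks([1.0, 0.1], chunks, top_k=2))
```

In a real pipeline the query and every document chunk would be embedded with the same model before being compared.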

Comparing flockparsecli.py and flock_ai_api.py

| Feature | flockparsecli.py | flock_ai_api.py |
|---|---|---|
| Interface | Command line | REST API over HTTP |
| Use case | Personal document processing | Service/application integration |
| Document formats | Creates TXT, MD, DOCX | Stores extracted text only |
| Interaction | Interactive chat mode | Query/response via API |
| Multi-user | Single user | Multiple users/applications |
| Storage | Local file-based | ChromaDB vector database |

📁 Project Structure

/converted_files - Stores the converted document formats (flockparsecli.py)
/knowledge_base - Contains the vector database and document chunks (flockparsecli.py)
/uploads - Temporary storage for uploaded documents (flock_ai_api.py)
/chroma_db - ChromaDB vector database (flock_ai_api.py)

🚀 Future Additions

Web UI for easy document management
Advanced OCR Support for better handling of scanned documents
Multi-language Support for processing non-English documents
Authentication for the flock_ai_api.py API endpoints
Document versioning to track changes over time

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

Posted Apr 9, 2025

FlockParser is an AI-powered PDF parser that extracts, processes, and organizes text from PDFs. It allows users to have extremely high level conversations