PDF Scraper Using OCR

Safeer Abbas

Data Scraper

Automation Engineer

Software Engineer

This project is a Python-based PDF scraper that uses Optical Character Recognition (OCR) to extract information from PDF documents. It processes PDF files, extracts the relevant fields, and saves the results to an Excel spreadsheet. It is designed to handle several formats of property-related documents, which makes it useful for real estate professionals, researchers, and data analysts.

Features

OCR Processing: Uses ocrmypdf to convert scanned PDF documents into searchable PDFs.

Data Extraction: Extracts key information such as:

CFN (Case File Number)

Parcel ID

Property Address

Mailing Address

Company Name

Owner's Name

Violation Details

Penalty Costs

Dates of Violations and Compliance

Duplicate Removal: Cleans up mailing addresses by removing duplicate entries.

Excel Output: Saves the extracted data into an Excel file for easy access and analysis.

Logging: Provides detailed logging of the processing steps and any errors encountered.
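
The features listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the regular expressions, and the field labels they look for are all assumptions, since the real document layouts are not shown here. The only external tool referenced is the ocrmypdf command line with its --skip-text option; the command is only built, not executed.

```python
import logging
import re

logger = logging.getLogger("pdf_scraper")

# Hypothetical patterns: the labels and number formats below are assumptions
# made for illustration, not the layouts the project actually parses.
FIELD_PATTERNS = {
    "cfn": re.compile(r"CFN[:#\s]*([0-9]{4,})", re.IGNORECASE),
    "parcel_id": re.compile(r"Parcel\s*ID[:#\s]*([0-9][0-9\-]+)", re.IGNORECASE),
}


def build_ocr_command(input_pdf, output_pdf):
    """Return an ocrmypdf command line that adds a text layer to a scanned PDF."""
    # --skip-text leaves pages that already contain text untouched.
    return ["ocrmypdf", "--skip-text", input_pdf, output_pdf]


def extract_fields(text):
    """Pull each known field out of OCR text; missing fields map to None."""
    record = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match is None:
            # Log the gap so a failed extraction can be traced later.
            logger.warning("Field %r not found", name)
        record[name] = match.group(1) if match else None
    return record


def dedupe_addresses(addresses):
    """Drop repeated addresses, ignoring case and extra whitespace, keeping order."""
    seen = set()
    unique = []
    for address in addresses:
        normalized = " ".join(address.split())
        key = normalized.lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(normalized)
    return unique
```

In a real run, the list returned by build_ocr_command could be passed to subprocess.run(..., check=True), assuming the ocrmypdf command is installed, and the resulting records could then be written to a spreadsheet with whichever Excel library the project uses.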


Posted Dec 9, 2024
