This project is a Python-based PDF scraper that uses Optical Character Recognition (OCR) to extract information from PDF documents. It processes PDF files, extracts the relevant fields, and saves the results to an Excel spreadsheet. It is designed to handle several formats of property-related documents, which makes it useful for real estate professionals, researchers, and data analysts.
Features
OCR Processing: Uses ocrmypdf to convert scanned PDF documents into searchable PDFs.
Data Extraction: Extracts key information such as:
CFN (Case File Number)
Parcel ID
Property Address
Mailing Address
Company Name
Owner's Name
Violation Details
Penalty Costs
Dates of Violations and Compliance
Duplicate Removal: Cleans up mailing addresses by removing duplicate entries.
Excel Output: Saves the extracted data into an Excel file for easy access and analysis.
Logging: Provides detailed logging of the processing steps and any errors encountered.
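The data extraction and duplicate removal features can be illustrated with a minimal sketch. It is not the project's actual code: the field labels, the regular expressions, and the function names `extract_fields` and `dedupe_addresses` are all assumptions made for illustration, and real documents will likely need different patterns. The OCR and Excel steps are left out so that the example runs with the standard library alone.

```python
import re

# Hypothetical label patterns (assumptions, not taken from the project).
# Each pattern captures the value that follows a label in the OCR text.
FIELD_PATTERNS = {
    "CFN": re.compile(r"CFN[:\s]+(\S+)", re.IGNORECASE),
    "Parcel ID": re.compile(r"Parcel\s+ID[:\s]+([\d-]+)", re.IGNORECASE),
    "Penalty Costs": re.compile(
        r"Penalty[:\s]+\$?([\d,]+(?:\.\d{2})?)", re.IGNORECASE
    ),
}


def extract_fields(text):
    """Return a dict mapping each field name to its first match, or None."""
    record = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        record[name] = match.group(1) if match else None
    return record


def dedupe_addresses(addresses):
    """Remove repeated addresses, keeping the first occurrence of each.

    Two addresses count as duplicates when they are equal after
    collapsing whitespace and ignoring letter case.
    """
    seen = set()
    unique = []
    for address in addresses:
        normalized = " ".join(address.split())
        key = normalized.casefold()
        if key and key not in seen:
            seen.add(key)
            unique.append(normalized)
    return unique
```

In a full pipeline, `extract_fields` would be called on the text of each OCR-processed page, and the resulting dictionaries would become the rows of the spreadsheet. Fields that are not found are stored as `None` so that every row has the same columns.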
Posted Dec 9, 2024