AI Document Classifier

Ahsin Shabbir

Data Scientist
Software Engineer
Python
scikit-learn
TensorFlow

Background

To address the need to consistently categorize millions of pages of documents on a daily basis, the client hired me to come up with an AI solution. The client is in the mortgage industry and regularly receives documents with 200 pages, and thousands of documents of this size are processed daily. The need for the AI becomes clear at this scale. Prior to me being hired, the work was being performed entirely by humans. With the solution I provided, 70% of the work was done by the AI with 90% accuracy. This saved the company an astonishing $8million per year due to reduced headcount needed thanks to the automatic sorting by the AI.

Dataset

The data consists of documents in PDF format. I used Python to perform Optical Character Recognition (OCR) on the PDF documents to extract the text. Example dataset is shown below
example training data
example training data
Text: page content extracted by the OCR
Label: the category of this page
Page Number: page number relative to the entire document

Challenges

Since the documents are scanned and sent to the client, the text needs to be extracted using OCR
The scan quality is not guaranteed to be good, so the OCR can be imperfect, leading to a high chance of the model being given text that is not reliable
The model needs to understand 150 different document categories, some of which have very few training examples available
The model needs to be efficient enough to do a prediction in milliseconds for each page
The categories can change and this requires the model to be retrained to understand the new categorization

Solution

The solution that I developed was an AI model that understands the document content and outputs what it believes is the category of the document. I trained the model by sourcing data from the client by working with business analysts and senior managers. I organized the documents into a training dataset and an evaluation dataset. I determined that the model performs with 90% accuracy on the evaluation data. I setup the Python training and deployment code so that the client can continue to use the model long-term without any new code work needed. To retrain the model the client simply executes the script I wrote with specification of where the training and evaluation data are, and the acceptable accuracy for deployment, and the script takes care of the entire process. I trained the client's business analysts on how to use the script.

Conclusion

As you can see, there is a massive cost saving of using a machine learning AI model to perform document categorization. It is a one-time investment to hire me to develop the AI, after that, you have total ownership to retrain the model.

Skills

1) Exploratory Data Analysis
2) Data Preprocessing
3) Feature Engineering
4) One Hot Encoding
5) Base Models
6) Model Tuning
7) Natural Language Processing
8) Comparison of Final Models
9) Reporting
Partner With Ahsin
View Services

More Projects by Ahsin