Unleashing the Power of AI in Business Intelligence: Document Classification using OpenAI and ChromaDB

Uncover how OpenAI and ChromaDB revolutionized financial data extraction, streamlining document classification for efficiency.


As an engineer at Lexis, I would like to share some of my experience solving complex business intelligence problems. Recently, I was assigned the rather daunting task of extracting financial information from over half a million documents: financial reports from various companies. Though challenging, this task allowed me to dig deeper into recent AI advances, mainly OpenAI embeddings and vector databases, and how they apply to business intelligence problems. How, you ask? Let's take a journey through the process in this case study.

The problem

We were tasked with extracting key-value pairs from business documents with inconsistent structure and text. Because of that inconsistency, we couldn't write a simple rule-based algorithm and instead had to build a solution that leverages recent advances in LLMs. We used a cloud solution to train an AI model to extract the key-value pairs. The sheer volume of documents remained a challenge, however: not all documents held relevant information, and processing every page would have taken more time and cost significantly more. Manually sifting through the documents would have been even more time-consuming and economically unviable. The solution? A classification program that combines OpenAI embeddings with a ChromaDB vector database, using Pytesseract for optical character recognition (OCR).

Before we delve into how these technologies helped me overcome the challenge, let's understand what vector embeddings and vector databases are.

Vector embeddings

Vector embeddings are mathematical representations of objects, such as words or documents, in a high-dimensional space, capturing semantic meaning based on context. These vectors give algorithms a way to understand the content and context of documents.
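For instance, here is a minimal sketch, assuming the official openai Python client and a placeholder API key, that embeds two short strings and compares them with cosine similarity:

# A minimal sketch: embed two strings with OpenAI and compare them.
# The API key and the sample strings are placeholders for illustration.
import numpy as np
from openai import OpenAI

openai = OpenAI(api_key="your-api-key")

def embed(text):
    response = openai.embeddings.create(
        input=text,
        model="text-embedding-ada-002",
    )
    return np.array(response.data[0].embedding)

a = embed("consolidated balance sheet as of 31 december")
b = embed("statement of financial position")

# Cosine similarity close to 1 means the two texts are semantically related.
similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"cosine similarity: {similarity:.3f}")

The closer the similarity is to 1, the more related the two texts are, which is exactly the property the classification described below relies on.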

Vector databases

Vector databases, on the other hand, are databases designed to store and query these vector embeddings efficiently. They let us perform similarity search at scale, which is critical in tasks such as semantic search, relatedness search, or, in our case, classification.
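As a quick illustration, the sketch below uses ChromaDB's in-memory client with made-up three-dimensional vectors (real embeddings have hundreds or thousands of dimensions) to store a few labeled vectors and retrieve the one nearest to a query:

# A minimal sketch of storing and querying vectors with ChromaDB.
# The tiny example vectors and labels are made up purely for illustration.
import chromadb

client = chromadb.Client()  # in-memory client, nothing is persisted
collection = client.get_or_create_collection(
    name="demo",
    metadata={"hnsw:space": "cosine"},
)

collection.add(
    ids=["a", "b", "c"],
    embeddings=[[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]],
    metadatas=[{"type": "invoice"}, {"type": "balance_sheet"}, {"type": "none"}],
)

# Find the stored vector closest to the query vector.
result = collection.query(query_embeddings=[[0.85, 0.15, 0.0]], n_results=1)
print(result["metadatas"][0][0]["type"])  # -> invoice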

Implementation

This is where OpenAI and ChromaDB came into the picture. OpenAI provides a powerful tool to generate embeddings for our documents, while ChromaDB allows us to store and query these embeddings efficiently. By leveraging the power of vector embeddings and vector databases, we can classify documents based on how close their vector representations are, which will help us identify relevant pages in a document.

First, I split the PDF documents into individual pages. For each page, I created a MySQL record, applied OCR using Pytesseract, and generated an embedding using OpenAI. Using ChromaDB, I queried the page's embedding against the embeddings of pages I had already classified by hand, which were stored in the vector database. The query returned the most similar classified page, so I assigned its type to the page being processed and stored that type back in the database.
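The splitting and OCR step is not shown in the later examples, so here is a rough sketch of how it could look, assuming the pdf2image library (which requires poppler) for rendering pages; the hypothetical list of page records stands in for the MySQL inserts:

# A rough sketch of the per-page preparation step.
# Assumes pdf2image (which requires poppler) and pytesseract are installed;
# the returned list of dicts stands in for the MySQL records.
import os
from pdf2image import convert_from_path
import pytesseract as pt

def split_and_ocr(pdf_path, output_dir):
    pages = []
    for index, image in enumerate(convert_from_path(pdf_path)):
        # render each PDF page to a JPEG file
        page_path = os.path.join(output_dir, f"page_{index}.jpg")
        image.save(page_path, "JPEG")
        # run OCR on the rendered page
        ocr_text = pt.image_to_string(page_path)
        pages.append({"file_path": page_path, "ocr_text": ocr_text})
    return pages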

This process enabled me to filter out irrelevant pages, leaving only the ones that needed further processing. It streamlined the pipeline and cut the cost by about 80% compared to the original estimate.

A basic example

Below is a simple example of a Python program that does a similar job. To implement this solution, you'll need a MySQL database that holds the information for each page image and a ChromaDB vector database that you will query with the vector embeddings.

Here's a simple schema for the MySQL database:

CREATE TABLE documents (
    id BIGINT UNSIGNED PRIMARY KEY,
    name VARCHAR(255),
    file_path VARCHAR(255) # path of the document in the filesystem
);

CREATE TABLE document_pages (
    id BIGINT UNSIGNED PRIMARY KEY,
    document_id BIGINT UNSIGNED,
    type ENUM('invoice', 'balance_sheet', 'income_sheet', 'none'), # the type of the page once we classify it
    file_path VARCHAR(255),
    FOREIGN KEY (document_id) REFERENCES documents(id)
);

We need to manually classify at least a couple hundred images and save them in a folder. The assigned type of each image should be part of the file name, separated by a dash, for example: '[name]-invoice.jpg' or '[name]-none.jpg'. Then we embed the OCR text of these images into a ChromaDB vector database using the following example:

import glob
import os
import uuid

import chromadb
import pytesseract as pt
from openai import OpenAI

openai = OpenAI(api_key="your-api-key")
client = chromadb.PersistentClient(path="chromadb")

# create the classification chromadb collection
classification_collection = client.get_or_create_collection(
    name="classification_collection",
    metadata={"hnsw:space": "cosine"},
)

# the folder containing the manually classified images
classified_images = glob.glob("classified_images/*.jpg")

for classified_image in classified_images:
    # extract the type from the file name, e.g. 'report-invoice.jpg' -> 'invoice'
    image_type = os.path.basename(classified_image).split(".")[0].split("-")[-1]

    # get the ocr text
    ocr_text = pt.image_to_string(classified_image)

    embedding = openai.embeddings.create(
        input=ocr_text.lower(),
        model="text-embedding-ada-002",
    ).data[0].embedding

    # store the embedding and its type in the chromadb collection
    classification_collection.add(
        embeddings=[embedding],
        metadatas=[{"type": image_type}],
        ids=[str(uuid.uuid4())],
    )

After creating the classification database, you can use the following Python code to perform OCR, generate embeddings, and classify each document page:

from models import DocumentPage

import chromadb
import pytesseract as pt
from openai import OpenAI

openai = OpenAI(api_key="your-api-key")
client = chromadb.PersistentClient(path="chromadb")

# load the classification chromadb collection created earlier
classification_collection = client.get_collection(name="classification_collection")

for page in DocumentPage.all():
    # get the ocr text
    ocr_text = pt.image_to_string(page.file_path)

    embedding = openai.embeddings.create(
        input=ocr_text.lower(),
        model="text-embedding-ada-002",
    ).data[0].embedding

    # query the collection for the most similar classified page
    query = classification_collection.query(
        query_embeddings=[embedding],
        n_results=1,
    )

    # assign the type of the closest match and store it back in MySQL
    page.type = query["metadatas"][0][0]["type"]
    page.save()
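The models module imported above is not shown in the original pipeline. One way it could look is sketched here with the Peewee ORM against the MySQL schema defined earlier; the connection settings are placeholders, and the all() helper is an assumption added for symmetry with the loop above.

# A hedged sketch of what the models module might contain, written with
# the Peewee ORM against the MySQL schema defined earlier. The database
# credentials are placeholders, and the all() classmethod is an assumption
# added to match the loop above; page.save() is Peewee's built-in.
from peewee import MySQLDatabase, Model, CharField, ForeignKeyField

db = MySQLDatabase("documents_db", user="user", password="password", host="localhost")

class BaseModel(Model):
    class Meta:
        database = db

class Document(BaseModel):
    name = CharField()
    file_path = CharField()

class DocumentPage(BaseModel):
    document = ForeignKeyField(Document, backref="pages")  # maps to document_id
    type = CharField(null=True)  # 'invoice', 'balance_sheet', 'income_sheet' or 'none'
    file_path = CharField()

    @classmethod
    def all(cls):
        return cls.select()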

Conclusion

This simple example should illustrate the process of classifying pages using LLM embeddings and vector databases. The synergy of OpenAI embeddings and ChromaDB vector databases has revolutionized our approach to document classification, making it more efficient and cost-effective.

The article has given you a glimpse into the power of AI tools in business intelligence. As we continue to explore and experiment, we are excited about the endless possibilities that AI holds for us, and we will keep sharing our insights about how we can use them in real-world scenarios.
