Unleashing the Power of AI in Business Intelligence: Document Classification using OpenAI and ChromaDB

Uncover how OpenAI and ChromaDB revolutionized financial data extraction, streamlining document classification for efficiency.


As an engineer at Lexis, I would like to share some of my experience solving complex business intelligence problems. Recently, I was assigned the rather daunting task of extracting financial information from over half a million documents: financial reports from various companies. Though challenging, this task allowed me to dig deeper into recent AI advances, mainly OpenAI embeddings and vector databases, and how they apply to business intelligence problems. How, you ask? Let's take a journey through the process in this case study.

The problem

We were tasked with extracting key-value pairs from business documents with inconsistent structure and text. Because of that inconsistency, we couldn't write a simple rule-based algorithm and instead had to build a solution that leverages recent advances in LLMs. We used a cloud solution to train an AI model to extract the key-value pairs. The sheer volume of documents remained a challenge, however: not all documents held relevant information, and processing every page would have taken more time and cost significantly more. Manually sifting through the documents would have been even more time-consuming and economically unviable. The solution? A classification program that combines OpenAI embeddings with a ChromaDB vector database, using Pytesseract for optical character recognition (OCR).

Before we delve into how these technologies helped me overcome the challenge, let's understand what vector embeddings and vector databases are.

Vector embeddings

Vector embeddings are mathematical representations of objects, such as words or documents, in a high-dimensional space, capturing semantic meaning based on context. These vectors give algorithms a way to understand the content and context of documents.
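For instance, here is a minimal sketch, assuming the official openai Python client and a placeholder API key, that embeds two short strings and compares them with cosine similarity:

# A minimal sketch: embed two strings with OpenAI and compare them.
# The API key and the sample strings are placeholders for illustration.
import numpy as np
from openai import OpenAI

openai = OpenAI(api_key="your-api-key")

def embed(text):
    response = openai.embeddings.create(
        input=text,
        model="text-embedding-ada-002",
    )
    return np.array(response.data[0].embedding)

a = embed("consolidated balance sheet as of 31 december")
b = embed("statement of financial position")

# Cosine similarity close to 1 means the two texts are semantically related.
similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"cosine similarity: {similarity:.3f}")

The closer the similarity is to 1, the more related the two texts are, which is exactly the property the classification described below relies on.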

Vector databases

Vector databases, on the other hand, are databases designed to store and query these vector embeddings efficiently. They let us perform similarity search at scale, which is critical in tasks such as semantic search, relatedness search, or, in our case, classification.
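As a quick illustration, the sketch below uses ChromaDB's in-memory client with made-up three-dimensional vectors (real embeddings have hundreds or thousands of dimensions) to store a few labeled vectors and retrieve the one nearest to a query:

# A minimal sketch of storing and querying vectors with ChromaDB.
# The tiny example vectors and labels are made up purely for illustration.
import chromadb

client = chromadb.Client()  # in-memory client, nothing is persisted
collection = client.get_or_create_collection(
    name="demo",
    metadata={"hnsw:space": "cosine"},
)

collection.add(
    ids=["a", "b", "c"],
    embeddings=[[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]],
    metadatas=[{"type": "invoice"}, {"type": "balance_sheet"}, {"type": "none"}],
)

# Find the stored vector closest to the query vector.
result = collection.query(query_embeddings=[[0.85, 0.15, 0.0]], n_results=1)
print(result["metadatas"][0][0]["type"])  # -> invoice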

Implementation

This is where OpenAI and ChromaDB came into the picture. OpenAI provides a powerful tool to generate embeddings for our documents, while ChromaDB allows us to store and query these embeddings efficiently. By leveraging the power of vector embeddings and vector databases, we can classify documents based on how close their vector representations are, which will help us identify relevant pages in a document.

First, I split the PDF documents into individual pages. For each page, I created a MySQL record, applied OCR using Pytesseract, and generated an embedding using OpenAI. Using ChromaDB, I queried the page's embedding against the embeddings of pages I had already classified by hand, which were stored in the vector database. The query returned the most similar classified page, so I assigned its type to the page being processed and stored that type back in the database.
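The splitting and OCR step is not shown in the later examples, so here is a rough sketch of how it could look, assuming the pdf2image library (which requires poppler) for rendering pages; the hypothetical list of page records stands in for the MySQL inserts:

# A rough sketch of the per-page preparation step.
# Assumes pdf2image (which requires poppler) and pytesseract are installed;
# the returned list of dicts stands in for the MySQL records.
import os
from pdf2image import convert_from_path
import pytesseract as pt

def split_and_ocr(pdf_path, output_dir):
    pages = []
    for index, image in enumerate(convert_from_path(pdf_path)):
        # render each PDF page to a JPEG file
        page_path = os.path.join(output_dir, f"page_{index}.jpg")
        image.save(page_path, "JPEG")
        # run OCR on the rendered page
        ocr_text = pt.image_to_string(page_path)
        pages.append({"file_path": page_path, "ocr_text": ocr_text})
    return pages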

This process enabled me to filter out irrelevant pages, leaving only the ones that needed further processing. It streamlined the pipeline and cut the cost by about 80% compared to the original estimate.

A basic example

Below is a simple example of a Python program that does a similar job. To implement this solution, you'll need a MySQL database that holds the information for each page image and a ChromaDB vector database that you will query with the vector embeddings.

Here's a simple schema for the MySQL database:

CREATE TABLE documents (
    id BIGINT UNSIGNED PRIMARY KEY,
    name VARCHAR(255),
    file_path VARCHAR(255) # path of the document in the filesystem
);

CREATE TABLE document_pages (
    id BIGINT UNSIGNED PRIMARY KEY,
    document_id BIGINT UNSIGNED,
    type ENUM('invoice', 'balance_sheet', 'income_sheet', 'none'), # the type of the page once we classify it
    file_path VARCHAR(255),
    FOREIGN KEY (document_id) REFERENCES documents(id)
);

We need to manually classify at least a couple hundred images and save them in a folder. The assigned type of each image should be part of the file name, separated by a dash, for example: '[name]-invoice.jpg' or '[name]-none.jpg'. Then we embed the OCR text of these images into a ChromaDB vector database using the following example:

import glob
import os
import uuid

import chromadb
import pytesseract as pt
from openai import OpenAI

openai = OpenAI(api_key="your-api-key")
client = chromadb.PersistentClient(path="chromadb")

# create the classification chromadb collection
classification_collection = client.get_or_create_collection(
    name="classification_collection",
    metadata={"hnsw:space": "cosine"},
)

# the folder containing the manually classified images
classified_images = glob.glob("classified_images/*.jpg")

for classified_image in classified_images:
    # extract the type from the file name, e.g. 'report-invoice.jpg' -> 'invoice'
    image_type = os.path.basename(classified_image).split(".")[0].split("-")[-1]

    # get the ocr text
    ocr_text = pt.image_to_string(classified_image)

    embedding = openai.embeddings.create(
        input=ocr_text.lower(),
        model="text-embedding-ada-002",
    ).data[0].embedding

    # store the embedding and its type in the chromadb collection
    classification_collection.add(
        embeddings=[embedding],
        metadatas=[{"type": image_type}],
        ids=[str(uuid.uuid4())],
    )

After creating the classification database, you can use the following Python code to perform OCR, generate embeddings, and classify each document page:

from models import DocumentPage

import chromadb
import pytesseract as pt
from openai import OpenAI

openai = OpenAI(api_key="your-api-key")
client = chromadb.PersistentClient(path="chromadb")

# load the classification chromadb collection created earlier
classification_collection = client.get_collection(name="classification_collection")

for page in DocumentPage.all():
    # get the ocr text
    ocr_text = pt.image_to_string(page.file_path)

    embedding = openai.embeddings.create(
        input=ocr_text.lower(),
        model="text-embedding-ada-002",
    ).data[0].embedding

    # query the collection for the most similar classified page
    query = classification_collection.query(
        query_embeddings=[embedding],
        n_results=1,
    )

    # assign the type of the closest match and store it back in MySQL
    page.type = query["metadatas"][0][0]["type"]
    page.save()
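The models module imported above is not shown in the original pipeline. One way it could look is sketched here with the Peewee ORM against the MySQL schema defined earlier; the connection settings are placeholders, and the all() helper is an assumption added for symmetry with the loop above.

# A hedged sketch of what the models module might contain, written with
# the Peewee ORM against the MySQL schema defined earlier. The database
# credentials are placeholders, and the all() classmethod is an assumption
# added to match the loop above; page.save() is Peewee's built-in.
from peewee import MySQLDatabase, Model, CharField, ForeignKeyField

db = MySQLDatabase("documents_db", user="user", password="password", host="localhost")

class BaseModel(Model):
    class Meta:
        database = db

class Document(BaseModel):
    name = CharField()
    file_path = CharField()

class DocumentPage(BaseModel):
    document = ForeignKeyField(Document, backref="pages")  # maps to document_id
    type = CharField(null=True)  # 'invoice', 'balance_sheet', 'income_sheet' or 'none'
    file_path = CharField()

    @classmethod
    def all(cls):
        return cls.select()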

Conclusion

This simple example should illustrate the process of classifying pages using LLM embeddings and vector databases. The synergy of OpenAI embeddings and ChromaDB vector databases has revolutionized our approach to document classification, making it more efficient and cost-effective.

The article has given you a glimpse into the power of AI tools in business intelligence. As we continue to explore and experiment, we are excited about the endless possibilities that AI holds for us, and we will keep sharing our insights about how we can use them in real-world scenarios.
