Dev Log: Serializing PDFs for indexing, retrieval, and prompting LLMs

Aug 13, 2023

Introduction

Many compelling applications of Artificial Intelligence (AI) and Large Language Models (LLMs), such as reading scientific literature, interpreting manuals and documentation, and question answering over legal documents, books, test prep, etc., require serializing a PDF into a text stream. This text stream can then be chunked and indexed for vector retrieval, or passed directly to the model as part of a prompt.
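As a concrete sketch of the chunking step, here is a minimal fixed-size chunker; the window and overlap sizes are arbitrary illustrative values, not recommendations:

def chunk_text(text, chunk_size=1000, overlap=200):
    # Split a serialized text stream into overlapping chunks
    # suitable for embedding and vector indexing
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]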

Naive Solution in Python

Simple in theory, and simple to get to an MVP with just a few lines of code. For most use cases, though, this bare minimum is not useful enough. Below is an example MVP.

pip install pypdf2
import PyPDF2

def extract_text_from_pdf(pdf_path):
    # Open the PDF file in binary reading mode
    with open(pdf_path, 'rb') as file:
        # Create a PDF reader object (PdfReader replaced the
        # deprecated PdfFileReader in PyPDF2 2.0+)
        pdf_reader = PyPDF2.PdfReader(file)

        # Initialize text accumulator
        text = ""

        # Extract text from each page (extract_text can return
        # None for pages with no extractable text)
        for page in pdf_reader.pages:
            text += page.extract_text() or ""

    return text

pdf_path = 'path_to_your_pdf_file.pdf'
pdf_text = extract_text_from_pdf(pdf_path)
print(pdf_text)

The problem?

First, you have scanned PDFs, which either have no extractable text at all or text generated by Optical Character Recognition (OCR) that is usually mangled beyond belief. Second, even native text can come out completely discombobulated. Patents and scientific articles, with multi-column layouts and tables and figures interleaved, are good examples of this. So are documents with unusual layouts like problem sets, homework packets, or practice tests.

The PDF file format is impossible, and the ecosystem is tough to navigate. There are tons of libraries under varying licenses, with strong copyleft snuck in, and each library often has one or two "killer features," so you end up with something like four of them installed. If I were forced to choose, though, I would recommend pypdfium2 for working with PDFs in Python.
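For reference, a minimal text-extraction sketch with pypdfium2's documented API (the file path is a placeholder):

pip install pypdfium2
import pypdfium2 as pdfium

pdf = pdfium.PdfDocument("path_to_your_pdf_file.pdf")
text_parts = []
for i in range(len(pdf)):
    # Build a text page object and pull out all text on the page
    textpage = pdf[i].get_textpage()
    text_parts.append(textpage.get_text_range())
print("\n".join(text_parts))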

What's the easy solution?

Use computer vision. There is a rich literature on this. Useful published academic tools for applying computer vision to documents include LayoutParser (Allen Institute), LayoutLMv3 (Microsoft), and FormNet (Google).
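For instance, LayoutParser's quickstart looks roughly like this; the model config string and label map follow its documentation for the PubLayNet model, but treat the details as an assumption to verify:

pip install layoutparser  # plus detectron2, installed separately
import layoutparser as lp

# Load a pre-trained PubLayNet layout-detection model
model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
)

# `image` is a numpy array of a rendered page (e.g. from pdf2image)
layout = model.detect(image)
for block in layout:
    print(block.type, block.coordinates)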

There is an easier solution, though: use OCR with some math. [Figure: the layout detected by this approach, with OCR tokens in red boxes and dividing lines splitting the page cleanly into blocks.]

This does better than deep learning solutions

Compare that to the results from Google, where the blocks identified in blue encompass the body of two and a half articles, resulting in completely intermixed serialization. With many deep-learning-based document-layout-detection models, text from different paragraphs or sections often gets completely mixed together, making it much harder to use for downstream tasks like LLM prompting and vector retrieval.

How does it work?

First, we run optical character recognition on the document; the results are shown in red boxes. Second, the algorithm recursively looks for horizontal and vertical lines that pass through the entire document without intersecting any text. These lines divide the text into blocks. We use increasingly smaller thresholds for the dividing lines to make sure the layout is detected accurately.
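Here is a minimal sketch of that recursive whitespace-cut idea. It is not the demo's actual implementation: the (x0, y0, x1, y1, text) box format, the starting threshold, and the halving schedule are all assumptions for illustration.

def find_gaps(boxes, axis, min_gap):
    # Find coordinates along `axis` (0 = x, 1 = y) where a gap wider
    # than min_gap separates the boxes, i.e. candidate dividing lines
    # that cross the region without intersecting any text
    spans = sorted((b[axis], b[axis + 2]) for b in boxes)
    gaps = []
    cursor = spans[0][1]
    for start, end in spans[1:]:
        if start - cursor >= min_gap:
            gaps.append((cursor + start) / 2)  # midpoint of the gap
        cursor = max(cursor, end)
    return gaps

def split_blocks(boxes, min_gap=50):
    # Recursively split OCR boxes into layout blocks along whitespace
    # gaps, retrying with smaller thresholds when no gap is found
    if len(boxes) <= 1 or min_gap < 2:
        return [boxes]
    for axis in (1, 0):  # try horizontal cuts first, then vertical
        cuts = find_gaps(boxes, axis, min_gap)
        if cuts:
            groups = {}
            for b in boxes:
                # Bucket each box by how many cuts lie before it; boxes
                # never straddle a cut, since cuts avoid all text
                key = sum(c < b[axis] for c in cuts)
                groups.setdefault(key, []).append(b)
            blocks = []
            for key in sorted(groups):
                blocks.extend(split_blocks(groups[key], min_gap))
            return blocks
    # No dividing line at this threshold: try a smaller one
    return split_blocks(boxes, min_gap / 2)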

The final remaining task is to take every block and serialize its OCRed tokens top-to-bottom and left-to-right.
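Continuing the sketch above, the serialization step might look like this; the line-grouping tolerance is another assumed parameter:

def serialize_block(boxes, line_tol=10):
    # Order one block's OCR tokens top-to-bottom, then
    # left-to-right within each line, and join them
    boxes = sorted(boxes, key=lambda b: (b[1], b[0]))
    lines, current, current_y = [], [], None
    for b in boxes:
        if current_y is None or b[1] - current_y <= line_tol:
            current.append(b)
        else:
            lines.append(current)
            current = [b]
        current_y = b[1]
    if current:
        lines.append(current)
    return "\n".join(
        " ".join(tok[4] for tok in sorted(line, key=lambda t: t[0]))
        for line in lines
    )

# `blocks` is the output of split_blocks from the sketch above
document_text = "\n\n".join(serialize_block(block) for block in blocks)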

Live Demo

https://gallery.oloren.ai/apps/serializedocimg