Dev Log: Serializing PDFs for indexing, retrieval, and prompting LLMs
Aug 13, 2023

Introduction
Many compelling applications of Artificial Intelligence (AI) and Large Language Models (LLMs), such as reading scientific literature, interpreting manuals and documentation, and question answering over legal documents, books, test prep, etc., require serializing a PDF into a text stream. This text stream can then be chunked and indexed for vector retrieval, or passed directly to the model in a prompt.
Naive Solution in Python
Simple in theory, and simple to get an MVP working with just a few lines of code. For most use cases, though, this is the bare minimum and not useful enough. Below is an example MVP.
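The original snippet isn't reproduced here, so below is a minimal sketch of what such an MVP might look like, using pypdfium2 (the library recommended later in this post); the file name is just a placeholder.

```python
import pypdfium2 as pdfium  # pip install pypdfium2

def pdf_to_text(path: str) -> str:
    """Naive serialization: concatenate whatever text each page exposes."""
    pdf = pdfium.PdfDocument(path)
    pages = []
    for i in range(len(pdf)):
        textpage = pdf[i].get_textpage()
        pages.append(textpage.get_text_range())  # full text of the page
    return "\n\n".join(pages)

print(pdf_to_text("example.pdf"))  # placeholder path
```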
The problem?
First, there are scanned PDFs, which either have no extractable text at all or have text generated by Optical Character Recognition (OCR) that is usually mangled beyond belief. Second, even genuine embedded text can come out completely discombobulated. Patents and scientific articles, with their multi-column layouts and interleaved tables and figures, are good examples of this. So are documents with unusual layouts like problem sets, homework packets, or practice tests.
The PDF file format is impossibly complex, and the ecosystem is tough to navigate. There are tons of libraries under varying licenses, with strong copyleft occasionally snuck in, and each library often has one or two "killer features," so you end up with like four of them installed. Though if I were forced to choose, I would recommend pypdfium2 for working with PDFs in Python.
What's the easy solution?
Use computer vision. There is a rich body of literature on this. Useful published academic tools for applying computer vision to documents include layoutparser (Allen Institute), layoutlmv3 (Microsoft), and formnet (Google).
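For a sense of what the deep-learning route looks like in practice, here is a sketch using layoutparser with a PubLayNet-pretrained detector. This is my illustration, not from the original post, and it additionally requires a detectron2 backend to be installed.

```python
import layoutparser as lp  # pip install layoutparser, plus a detectron2 backend
from PIL import Image

# PubLayNet-pretrained layout detector; weights download on first use.
model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
)

layout = model.detect(Image.open("page.png"))  # placeholder page image
for block in layout:
    print(block.type, block.coordinates)  # e.g. "Text", (x1, y1, x2, y2)
```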
There is an easier solution, though: use OCR with some math to get something like this.

This does better than Deep Learning solutions
Compare that to the results from Google, where the blocks identified in blue encompass the body of two and a half articles, which results in completely intermixed serialization. With many deep-learning-based document-layout-detection models, text from different paragraphs or sections often gets completely mixed together, rendering it much more difficult to use for downstream tasks like LLM prompting and vector retrieval.

How does it work?
First, we run optical character recognition on the document; the results are shown in red boxes. Second, the algorithm recursively looks for horizontal and vertical lines that pass through the entire document without intersecting any text. These lines divide the text into blocks. We use increasingly smaller thresholds for the dividing lines to make sure the layout is detected accurately.
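The post doesn't include code for this, but the description matches a recursive XY-cut: find whitespace gutters that cross the whole region without hitting any text, split on them, and recurse. Here is a minimal sketch under that reading. The Token shape, the use of pytesseract for the OCR step, and all threshold values are my assumptions, not the author's.

```python
from typing import List, Tuple

import pytesseract
from PIL import Image

Token = Tuple[float, float, float, float, str]  # (x0, y0, x1, y1, text)

def ocr_tokens(image_path: str) -> List[Token]:
    """Step 1: run OCR and keep one bounding box per recognized word."""
    d = pytesseract.image_to_data(Image.open(image_path),
                                  output_type=pytesseract.Output.DICT)
    return [(d["left"][i], d["top"][i],
             d["left"][i] + d["width"][i], d["top"][i] + d["height"][i],
             d["text"][i])
            for i in range(len(d["text"])) if d["text"][i].strip()]

def _cuts(tokens: List[Token], axis: int, min_gap: float) -> List[float]:
    """Coordinates along `axis` (0 = x, 1 = y) where a straight line can
    cross the whole region without intersecting any token box."""
    spans = sorted((t[axis], t[axis + 2]) for t in tokens)
    cuts, frontier = [], spans[0][1]
    for start, end in spans[1:]:
        if start - frontier >= min_gap:   # a gutter wide enough to cut through
            cuts.append((frontier + start) / 2.0)
        frontier = max(frontier, end)
    return cuts

def xy_cut(tokens: List[Token], gap: float = 30.0,
           shrink: float = 0.5, floor: float = 5.0) -> List[List[Token]]:
    """Step 2: recursively divide tokens into blocks along whitespace lines,
    retrying with increasingly smaller gap thresholds before giving up."""
    if len(tokens) <= 1:
        return [tokens]
    while gap >= floor:
        for axis in (0, 1):               # vertical cuts first, then horizontal
            cuts = _cuts(tokens, axis, gap)
            if cuts:
                edges = [float("-inf")] + cuts + [float("inf")]
                blocks = []
                for lo, hi in zip(edges, edges[1:]):
                    part = [t for t in tokens
                            if lo <= (t[axis] + t[axis + 2]) / 2.0 < hi]
                    if part:
                        blocks.extend(xy_cut(part, gap, shrink, floor))
                return blocks
        gap *= shrink                     # no gutter found: lower the threshold
    return [tokens]                       # indivisible region = one block
```

Trying vertical cuts before horizontal ones means a multi-column page is split into columns first, so each column's content stays contiguous in the output.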

The final task is to take every block and serialize its OCRed tokens top-to-bottom and left-to-right.
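A sketch of this last step, reusing the Token shape from the sketch above; the line-grouping tolerance is an assumption on my part.

```python
from typing import List, Tuple

Token = Tuple[float, float, float, float, str]  # (x0, y0, x1, y1, text), as above

def serialize(blocks: List[List[Token]], line_tol: float = 8.0) -> str:
    """Step 3: within each block, group tokens into lines top-to-bottom, then
    read each line left-to-right. Blocks keep the order the recursive cut
    produced, which already walks the page in a natural reading order."""
    out = []
    for block in blocks:
        lines: List[List[Token]] = []
        for tok in sorted(block, key=lambda t: t[1]):       # top-to-bottom
            if lines and abs(tok[1] - lines[-1][-1][1]) <= line_tol:
                lines[-1].append(tok)                       # same text line
            else:
                lines.append([tok])                         # new text line
        out.append("\n".join(
            " ".join(t[4] for t in sorted(line, key=lambda t: t[0]))
            for line in lines))
    return "\n\n".join(out)

# End to end, using the names from the sketches above:
# text = serialize(xy_cut(ocr_tokens("page.png")))
```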