Julian Henry's Blog

📦 Vakyume: A Vacuum Theory PDF-to-C++ Pipeline

18 Feb 2025

View on GitHub

Inspired by my brief stint as a test engineering technician in Applied Materials’ Austin factory (building 32 FTW), the Vakyume project set out to produce software that would perform the following:

  1. OCR a PDF or scanned book to extract formulae
  2. convert said formulae to Python code using SymPy
  3. in the event SymPy were to fail, return a “SympyFailure”
  4. for every SympyFailure, have another script feed the equation header to an LLM to produce candidate code, and verify it until testing proves the library complete
  5. convert the Python into C++

I set out a while back to build an arbitrary textbook-to-Python-library library, and followed a reasonable regime: SymPy conversion of my hand-crafted notes into programmatically written Python classes; one-odd-out kwarg handling so the library automagically solves for whichever kwarg is missing; and, with industrial applications in mind, the Python product as an intermediate material for an ultimate transfer to C++.

The source material: the 1986 edition of Process Vacuum System Design and Operation by Ryans and Roper. A book I had sitting around from my Applied Materials days. If you are going to build a pipeline to convert textbooks into code, you might as well pick one you actually care about.

What emerged was an eight-stage orchestration. From textbook PDF to compiled C++ binary. The whole thing.

graph TD
    A[PDF Textbook] -->|Scrape| B(Equation Notes)
    B -->|SymPy| C{Initial Solvers}
    C -->|Fail/Inconsistent| D[LLM Repair]
    D -->|Verify| E{OOO Check}
    E -->|Pass| F[Certified Shards]
    F -->|Reconstruct| G[Modular Python Package]
    G -->|Transpile| H[C++ Library]

Let me walk you through the wreckage.

Extraction & Parsing

The scraper uses PyMuPDF and small local models (Phi-3, Llama-3) to identify numbered equations and variables from the PDF. Equations are stored in a human-readable Python format (lhs = rhs). I built an interactive wizard for chapter selection because I am a masochist who enjoys building CLI tools for projects that may or may not survive the week:

============================================================
  VAKYUME PDF Equation Scraper - Interactive Wizard
============================================================

  PDF: textbook.pdf

--- Model Selection ---
  Available Ollama models:
    [1] llama3:latest (default)
    [2] phi3:latest
    [3] llama3.2:latest

  Select model [1-3] (Enter for default):

--- Chapter Selection ---
  Found 28 chapters in PDF:

    [ 1] Introduction (12 pages)
    [ 2] The Scientific Method (8 pages) [auto-skip: non-equation content]
    [ 3] Kinematics (32 pages)
    ...

  Chapters to process: 10
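The text side of that extraction can be sketched roughly as below. This is my own toy reconstruction, not the actual Vakyume scraper: it assumes you already have raw page text (e.g. from PyMuPDF’s `page.get_text()`) and keeps lines that look like numbered `lhs = rhs` equations.

```python
import re

# Hypothetical heuristic, not the real Vakyume internals: a candidate
# equation is "name = expression", optionally followed by "(2.1)"-style
# equation numbering.
EQ_PATTERN = re.compile(
    r"^\s*(?P<lhs>[A-Za-z_]\w*)\s*=\s*(?P<rhs>\S.*?)\s*(?:\((?P<num>[\d.]+)\))?\s*$"
)

def extract_equation_lines(page_text: str) -> list[dict]:
    """Return candidate equations as {'lhs', 'rhs', 'num'} dicts."""
    found = []
    for line in page_text.splitlines():
        m = EQ_PATTERN.match(line)
        if m:
            found.append({k: m.group(k) for k in ("lhs", "rhs", "num")})
    return found
```

In the real pipeline the small local models do the heavy lifting of deciding what is and is not an equation; a regex like this only gets you the easy cases.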

The point was never to beat GPT-4o at reading a PDF. Typical multimodal models already read textbook PDFs into LaTeX or some similar medium perfectly well. The point was to own every stage of the pipeline, to understand what breaks when you try to go from a scanned page to a compiled binary with no human in the loop.

A lot breaks, by the way.

The Verification Problem

We now need to understand the verification in detail, because it was the crucial bottleneck.

The plan was to metaprogram the one-odd-out of every variable in the equation, yielding the ability to solve for any set of physical circumstances derivable from a single formula. For any equation (e.g., $PV=nRT$), Vakyume generates solvers for every variable ($P, V, n, R, T$).

To that end, I developed the kwasak library, which allows ablated solving, provided the underlying methods for ABCDE(a=…, b=…, c=…, …) are of the form ABCDE__a, ABCDE__b, ABCDE__c, and so on. Yet the verification is tricky.
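A minimal sketch of that kwasak-style dispatch looks something like this. To be clear, this is my own toy reconstruction, not the real kwasak code: the decorator inspects which kwarg is missing and routes the call to the matching `NAME__var` solver.

```python
import inspect

def kwasak(func):
    """Toy one-odd-out dispatcher (illustrative, not the real kwasak)."""
    params = list(inspect.signature(func).parameters)

    def wrapper(**kwargs):
        # Exactly one variable must be left unspecified...
        missing = [p for p in params if p not in kwargs]
        if len(missing) != 1:
            raise TypeError(f"need exactly one unknown, got {missing}")
        # ...and we dispatch to the solver generated for that variable.
        solver = globals()[f"{func.__name__}__{missing[0]}"]
        return solver(**kwargs)

    return wrapper

# Toy solver family for the ideal gas law P*V = n*R*T, with R = 8.314:
def ideal_gas__P(V, n, T):
    return n * 8.314 * T / V

def ideal_gas__V(P, n, T):
    return n * 8.314 * T / P

@kwasak
def ideal_gas(P, V, n, T):
    ...  # body never runs; the decorator dispatches to the __var solvers
```

Calling `ideal_gas(V=1.0, n=1.0, T=300.0)` then automagically solves for the one variable you left out, here P.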

So imagine a kwarg family, K, in which no equation was solved by SymPy. How to verify?

We do not know if the LLM output is correct.

K has a, b, c, … methods. So the original procedure was:

  1. Propose dummy args.
  2. Call the function solved with all variables but “a”. Retry with fresh dummy args if the inputs are malformed (e.g. 0/0).
  3. Take the output of the equation for “a” and store it as a new dummy arg. Plug “a” into all the other equations as confirmation that they are valid. If they reproduce the original dummy args for every other variable, then we are good, but with a caveat. Sometimes the equations have multiple solutions. We only conceived of positive solutions; for simplicity, and for physical vacuum systems, that usually has us down pat, but there are situations where all solutions, even imaginary ones, are crucial to know.
  4. So, we can say, take the output for “a”, even if it is a list, and try it against all the other combinations. We chose “a” from the floats, so we are not counting on getting lucky. The family of dummy args, while billed as a <str, float> hashmap, has become an ad hoc <str, list[float]>; oh boy!

I digress. You see, with this proposed verification system you still cannot propose dummy args for equations for which no solutions exist. My God!

This became formalized as the “One-Odd-Out” (OOO) check. Pick a random input, solve for one variable, use that result to solve for the others. If the results do not satisfy the original equation within a $1 \times 10^{-4}$ tolerance, the solver is flagged for repair. It is not pretty. It works.
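The OOO round trip can be sketched in a few lines. This is an illustrative reconstruction under my own naming, not the actual Vakyume check: solve for the odd variable from dummy args, then demand every sibling solver reproduce its own dummy value within the $1 \times 10^{-4}$ tolerance.

```python
TOL = 1e-4

def ooo_check(solvers, dummy, odd):
    """Toy One-Odd-Out check (illustrative names, not Vakyume's API).

    solvers: var name -> f(**other_vars); dummy: values for all vars but odd.
    """
    # Solve the odd-one-out from the other dummy args...
    full = dict(dummy)
    full[odd] = solvers[odd](**dummy)
    # ...then every sibling solver must round-trip to its own dummy value.
    for var, solve in solvers.items():
        rest = {k: v for k, v in full.items() if k != var}
        if abs(solve(**rest) - full[var]) > TOL:
            return False
    return True
```

For a toy family built around $a = bc$, a correct set of solvers passes and a corrupted sibling gets flagged, which is exactly the signal that routes a shard to repair.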

Sharding

We can isolate the equation on its own, manipulate the system calls to write scratch paper (with the correct Python libs), call the “shard” equation with dummy args, and capture its output, if it even compiles. I know you typed-lang nerds will say interpreted, but that’s three syllables and I haven’t got the time, buster.

So now you have the case of the K-family shards of unsolvable equations. You’d have to see the ouroboros of metaprogramming this begets. Say we verify shard 1 with dummy args 1 and get an answer; dummy args 1 calls shard 2, and shard 2 confirms shard 1, but shard 1 and shard 2 are both wrong. We now come into shard 3 with a family of arguments that is incorrect, when it should be something else entirely.

Now, you may get to shard 3 in the triad and lament: aha! 3 is incorrect. I knew it, try again. Now you are provably stuck. Either you try a new regime in which 2 and 3 are prioritized, and build a consensus on ablation until a prevailing head arises, or else you risk intellectual dishonesty.

Now you cook up something from shard 3 and feed it to shard 2’s child, another attempt, and find out shard 3 was correct, so shard 2’s child is a candidate. You finally try these dummy arguments on shard 1, and lo and behold, the thing is solved.

My word!

We have to find the so-called value chains, sequences in harmonious dummy-argument concordance, and let them dominate as the proposed shards.

You need a gosh darn consensus algorithm for formulae just because sympy fails!

It was non-trivial.
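A toy version of that value-chain consensus looks like the following. Everything here is my own illustration, not the Vakyume code: each shard has several candidate implementations, a “value chain” is one candidate per shard, and a chain wins if its members all round-trip against each other on the dummy args (the `agrees` predicate, supplied by the caller).

```python
from itertools import product

def find_consensus(candidates, agrees):
    """Toy value-chain search (hypothetical names, not Vakyume's API).

    candidates: var name -> list of candidate solver functions.
    agrees: picked-chain dict -> bool, True when the chain is concordant.
    Returns the first mutually-agreeing chain, or None.
    """
    names = list(candidates)
    # Brute-force every combination of one candidate per shard...
    for chain in product(*(candidates[n] for n in names)):
        picked = dict(zip(names, chain))
        # ...and keep the first chain whose members confirm each other.
        if agrees(picked):
            return picked
    return None
```

Brute force over the Cartesian product is obviously the naive version; the point is only that a bad shard can only be exposed by a chain of shards that agree with each other, never in isolation.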

LLM-Assisted Repair

When symbolic solvers (SymPy) fail on transcendental or complex engineering forms, Vakyume uses LLM-assisted repair. The LLM is given the equation, a working example shard from the same family, and concrete expected-vs-got test cases to produce a corrected solver function.
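Assembling that repair prompt might look roughly like this. The field names and wording are illustrative, not the actual Vakyume prompt: the model sees the failing equation, one already-verified sibling solver from the family, and the concrete expected-vs-got cases from the OOO check.

```python
def build_repair_prompt(equation, sibling_src, failures):
    """Sketch of a repair prompt builder (hypothetical, not Vakyume's)."""
    # Turn the OOO failures into concrete expected-vs-got evidence.
    cases = "\n".join(
        f"  inputs={f['inputs']} expected={f['expected']} got={f['got']}"
        for f in failures
    )
    return (
        "Write a Python function that solves this equation for the failing variable.\n"
        f"Equation: {equation}\n"
        f"Verified sibling solver from the same family:\n{sibling_src}\n"
        f"Failing test cases:\n{cases}\n"
        "Return only the function definition."
    )
```

The resulting string is what would go out through Ollama, and whatever comes back gets fed straight into the same OOO gauntlet as the SymPy-generated solvers.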

I revived the project with the aim of incorporating LLM calls through Ollama to metaprogram on the fly when SymPy fails. After riotously struggling to write a seamless, one-touch metaprogramming solution, I threw in the towel on the automagicality and manually copied the examples into the Claude/ChatGPT-4o of the day to heal the library’s failure points, the proverbial artisanal human in the loop. God only knows how much longer such an anachronism will exist.

Then the automated version came together. Same idea – feed the equation to a model, get back a solver – but through Ollama, with structured prompts, and it actually works most of the time. When it does not, the OOO check catches it. When the OOO check does not catch it, well. See above re: consensus algorithm.

Of the 100+ equations, by the way, only one, solving for k, lacked an algebraic solution. The Python code entered the stratosphere at 7,593 lines from those 100+ equations, roughly a 10x code-from-notes ratio.

Multi-Target Synthesis

The end of the pipeline, the part that makes the whole Rube Goldberg machine worth building:

  • Python: Generates a modular package (py/) with the @kwasak decorator for automatic variable dispatch.
  • C++: Transpiles Python AST to C++17, utilizing std::complex and custom LambertW implementations.
  • Documentation: Generates an Equation Certification Report (docs/) with LaTeX-rendered formulas and variable definitions for peer review.

projects/
└── VacuumTheory/
    ├── notes/           # Input: Equation definitions
    ├── shards/          # Intermediate: Individual solvers
    ├── reports/         # Analysis and verification logs
    ├── docs/            # Output: Equation Certification (LaTeX/MD)
    ├── py/              # Output: Modular Python package
    └── cpp/             # Output: C++ headers and source
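The Python-to-C++ step can be sketched as a tiny `ast`-based expression transpiler. This is my own far-smaller illustration of the idea, not the real transpiler: walk the Python AST and emit the equivalent C++ expression, mapping `**` to `std::pow`.

```python
import ast

# Illustrative operator table; the real transpiler covers far more.
OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}

def to_cpp(node) -> str:
    """Emit a C++ expression for a small subset of Python AST nodes."""
    if isinstance(node, ast.Expression):
        return to_cpp(node.body)
    if isinstance(node, ast.BinOp):
        if isinstance(node.op, ast.Pow):
            # Python's ** has no C++ operator; lower it to std::pow.
            return f"std::pow({to_cpp(node.left)}, {to_cpp(node.right)})"
        return f"({to_cpp(node.left)} {OPS[type(node.op)]} {to_cpp(node.right)})"
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return repr(node.value)
    raise NotImplementedError(ast.dump(node))

def transpile_expr(src: str) -> str:
    return to_cpp(ast.parse(src, mode="eval"))
```

Fully parenthesizing every binary operation sidesteps precedence questions entirely, at the cost of ugly output; the `std::complex` and LambertW plumbing is where the real work lives.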

Five commands. PDF to C++.

# 1. Scrape equations from a PDF
python3 vakyume.py scrape path/to/textbook.pdf -o projects/MyProject/notes

# 2. Run the pipeline: shard, verify, repair
python3 vakyume.py run projects/MyProject

# 3. Reconstruct verified shards into a Python package
python3 vakyume.py reconstruct projects/MyProject

# 4. Transpile to C++ and compile
python3 vakyume.py make-cpp projects/MyProject

# 5. Generate an Equation Certification Report
python3 vakyume.py make-docs projects/MyProject

That was the dream. That is the reality.

The Bitter Law

The Bitter Law states that data-driven neuron grooming always overtakes principles-first, rules-based approaches to optimization problems; in this case, the problem was optimizing a functor that can Hoover up a PDF and spit out C++ code. This project, while originally admirable, ended as a monument to stubbornness: its methodology was a Neanderthal amongst the big-data machine learning techniques of the present.

Today, a free tool like the Gemini CLI could vastly outshine this entire pipeline in a single prompt. The world moved fast, and Vakyume is a snapshot of what it took to metaprogram the hard way before it became easy.

That said, I learned a tremendous amount about coding and life through making this. Symbolic math, AST manipulation, C++ transpilation, and OOO verification. Too many failure modes, too much glue code, and too brittle to scale gracefully. But I built it. All five stages. The OCR. The sharding. The verification. The repair. The transpilation. From a 1986 vacuum textbook to compiled C++17.

What is even funnier?

I can just copy and paste the SympyFailure equations directly into ChatGPT-4o or Claude 3.5, verifying them myself and saving loads of time, so that is what I have done and what I propose to you, dear viewer. Cheap out!

Meticulosity is for chumps.

“Sorry, your response was filtered by the AI service”.

Yeah, don’t get too cocky. They can pull the plug any time.