Follow-up to Transcribing Grothendieck’s Handwriting with AI. How a real reference section from an archivist exposed a prompt problem the models had been hiding, and why that matters for anyone comparing frontier LLMs on structured-text tasks.


Where we left off

In January I wrote about a weekend experiment: can modern vision-language models read a page of Grothendieck’s handwritten archive? The short answer was yes, with a hierarchy: Gemini 3 Pro handled dense category theory impressively, Claude Opus 4.5 produced cleaner but less ambitious output, and ChatGPT 5.2 struggled. I ended the post with the honest warning that every transcription needs verification, because the pattern-matching that enables recognition also enables confident confabulation.

Then the story turned from a weekend experiment into a proper project. This post is the four-month write-up, focused on one specific finding, because it surprised me enough to be worth writing down carefully. All of it is reproducible from github.com/ivan-gentile/la-longue-marche.


A timeline, because the order matters

Late January. The post goes up. Within a week I get an email from Mateo Carmona, the archivist at the Centre for Grothendieckian Studies, who is manually transcribing Part I of La Longue Marche à travers la théorie de Galois. He offers to collaborate on Part II, the 976 handwritten pages still unread, and, crucially, to share the Part I LaTeX he already has. Olivia Caramello, director of the Istituto Grothendieck, agrees to host the output. We now have ground truth.

February. Pipeline design. Pilot benchmarks: 17 configurations on 4 pages. Gemini 3.1 Pro, medium thinking, with the previous page attached as visual context, wins.

March 6. First production run: 398 of 976 pages transcribed. About 3 euros in API costs. Email to Mateo with numbers and a dashboard.

March 9. Full 976-page corpus delivered as two compiled .tex files. Total project cost so far: under 10 euros.

March 20. Stakeholder update: notation normalization shipped, partial commutative-diagram rollout in 140-3 (84 pages of 119 converted from placeholder to real tikzcd), 140-4 diagrams still pending.

March 21. Mateo replies. He has done the best thing an archivist can do for a benchmark: he has sent two versions of the same subsection of Section 49. One is 49.1old.tex, the output our pipeline produced. The other is 49.1new.tex, the corrected version in the form he expects to publish. He also requests a detailed write-up of our prompting strategy, and suggests a second benchmark on a typed Grothendieck manuscript from the Bourbaki archives, a controlled setting where handwriting is removed from the variables.

April 17. The day this post is written. The 49.1 paired ground truth was the turning point. What follows is what that pair revealed.


The diagnostic

I wrote a small tool that categorizes every difference between 49.1old.tex and 49.1new.tex. The categories come from staring at both files side by side and noticing which kinds of divergence show up repeatedly. Five buckets:

  1. Raw pipeline residue. Placeholders the model left in ([unclear: ...], [MARGIN: ...], [DIAGRAM: ...]).
  2. Notation drift. Our output uses \widehat{\mathfrak{G}}_{0,3}^+, Mateo’s uses \mathfrak{S}^{+\wedge}_{0,3}. Our output uses \text{int} and \text{Norm}, his uses \operatorname{int} and \operatorname{Norm}. Our output uses \mathbb{Z}, his uses \mathbf{Z}. Lowercase g for a closed subgroup versus his \mathcal{G}.
  3. Publishable structure. His file has \chapter*, \addcontentsline{toc}{section}{...}, \label{sec:49}, equations numbered with \leqno{(N)} on the left, \footnote{...} for authorial asides, \begin{tikzpicture} for a schematic. Our file had none of these.
  4. Committed readings. 26 unresolved [unclear] markers in ours, zero in his. He had simply committed to the best mathematical reading in context. “Unclear” was usually decidable.
  5. Abbreviation expansion. He writes out sous-groupe where we had left ss-groupe, and homomorphisme where we had left hom.
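As a rough illustration of how buckets 1 and 2 can be turned into numbers, the sketch below counts placeholder residue and non-canonical notation per 1000 characters of LaTeX. It is not the code of the diagnostic tool itself: the pattern lists are taken from the examples above and are certainly incomplete, and the name density_per_kchar is an assumption made for this example.

```python
import re

# Placeholders named in bucket 1: [unclear: ...], [MARGIN: ...], [DIAGRAM: ...].
RESIDUE = re.compile(r"\[(?:unclear|MARGIN|DIAGRAM)\b[^\]]*\]")

# Non-canonical spellings from bucket 2. Illustrative only, not the tool's real table.
DRIFT = re.compile("|".join([
    r"\\widehat\{\\mathfrak\{G\}\}",   # canonical form uses \mathfrak{S}^{+\wedge}
    r"\\text\{(?:int|Norm)\}",         # canonical form uses \operatorname{...}
    r"\\mathbb\{Z\}",                  # canonical form uses \mathbf{Z}
]))


def density_per_kchar(pattern, tex):
    """Number of matches of `pattern` per 1000 characters of `tex`."""
    if not tex:
        return 0.0
    return 1000.0 * len(pattern.findall(tex)) / len(tex)
```

Normalizing by length matters because the models differ a lot in how much text they emit per page; a raw count would penalize the more verbose model.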

Turned into a composite [0, 1] quality score (1 means matching Mateo’s conventions exactly), the shipped output scored 0.113.

That number did not feel right. The pipeline read Grothendieck well: individual symbols were usually correct, the prose was mostly faithful, the mathematical content survived. But the output was not publishable LaTeX. It was notes, not a chapter.
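The "notes, not a chapter" gap is bucket 3, and one simple way to quantify it is a coverage fraction: of the structural commands the corrected file uses, how many does the candidate use at all? The sketch below is a guess at the shape of such a measure, not the tool's actual definition. The feature list is copied from bucket 3 and the function name structure_coverage is hypothetical.

```python
# Structural commands listed under "Publishable structure".
# Assumption: the real tool may track a different or longer list.
FEATURES = [
    r"\chapter*",
    r"\addcontentsline",
    r"\label{",
    r"\leqno",
    r"\footnote{",
    r"\begin{tikzpicture}",
]


def structure_coverage(candidate, reference, features=FEATURES):
    """Fraction of the structural commands used in `reference` that also occur in `candidate`."""
    expected = [f for f in features if f in reference]
    if not expected:
        return 1.0  # nothing to cover
    found = sum(1 for f in expected if f in candidate)
    return found / len(expected)
```

A presence check like this says nothing about whether a \footnote or \leqno sits in the right place, so it is only useful alongside the other buckets.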

Prompt refresh

I rewrote the prompt. Three additions:

  • A notation conventions block that lists the canonical symbols verbatim: “use \mathfrak{S}, not \widehat{\mathfrak{G}}; use \operatorname{Norm}, not \text{Norm}; use \mathbf{Z}, not \mathbb{Z}.”
  • A margin-handling rule: a marginal (N) next to an equation becomes \leqno{(N)}; an authorial remark in the margin becomes \footnote{...}; other annotations stay as \marginpar{...}. The old prompt let the model decide; the new one tells it.
  • A three-excerpt few-shot pool drawn from Sections 19, 25bis, and 31 of Mateo’s Part I. One prose-dense, one with a \leqno-numbered equation, one with a footnote. No extended explanation; the model sees the actual style.

I called this prompt variant mateo-canonical. It is about 6,000 characters of system prompt. Not small, but nothing exotic.
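To make the shape of those three additions concrete, here is a minimal sketch of how such a system prompt could be assembled from data. It is not the actual mateo-canonical prompt (that lives in the repository): the class name PromptSpec, the function build_system_prompt, and the exact wording of the headers are all assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PromptSpec:
    notation: list[tuple[str, str]]          # (canonical, non-canonical) pairs
    margin_rules: list[str]                  # one explicit rule per line
    few_shot: list[str] = field(default_factory=list)  # verbatim LaTeX excerpts


def build_system_prompt(spec):
    """Render the conventions, margin rules and examples as one system prompt string."""
    lines = ["Transcribe the page into publishable LaTeX.", "", "Notation conventions:"]
    for canonical, drifted in spec.notation:
        lines.append(f"- use {canonical}, not {drifted}")
    lines += ["", "Margin handling:"]
    lines += [f"- {rule}" for rule in spec.margin_rules]
    for i, excerpt in enumerate(spec.few_shot, 1):
        lines += ["", f"Example {i} (target style):", excerpt]
    return "\n".join(lines)


spec = PromptSpec(
    notation=[(r"\operatorname{Norm}", r"\text{Norm}"), (r"\mathbf{Z}", r"\mathbb{Z}")],
    margin_rules=[
        r"a marginal (N) next to an equation becomes \leqno{(N)}",
        r"an authorial remark in the margin becomes \footnote{...}",
        r"other annotations stay as \marginpar{...}",
    ],
)
prompt = build_system_prompt(spec)
```

Keeping the conventions as data rather than prose makes it easy to reuse the same pairs in a drift counter, so the prompt and the metric cannot silently disagree.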

Opus 4.7 vs Gemini 3.1 Pro, round 1 (before the prompt refresh)

This is the part that is worth writing down carefully, because it surprised me.

Before the mateo-canonical prompt existed, I ran both frontier models on twelve pages of Part II (five from Section 49 where we have ground truth, seven blind samples across the corpus), under the same shipped prompt, same previous-page context, same input PDFs.

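| Model           | Cost for 12 pages | Avg latency | Notation drift / 1000 chars | Raw pipeline residue / 1000 chars |
|-----------------|-------------------|-------------|-----------------------------|-----------------------------------|
| Gemini 3.1 Pro  | $0.13             | (batch)     | 7.10                        | 3.33                              |
| Claude Opus 4.7 | $2.06             | 24.5 s      | 4.05                        | 3.43                              |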

The obvious reading: Opus is about 16x more expensive ($2.06 vs $0.13) but does meaningfully better on notation. That would justify a “use Opus for the hard pages” policy.

Two caveats, though. One: Opus produced 40% more text per page, and some of that text was not on the current page. Opus occasionally included the previous page’s content in its output, because it interpreted “shown above for context” as “also include in the transcription”. This is a real failure mode that happened on several of the five ground-truth pages. Two: under a prompt that is ambiguous about what “transcribe into LaTeX” means, what we were actually measuring was “which model guesses the archivist’s preferences better when under-specified”, a question about priors, not about transcription quality.
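The first caveat can be checked mechanically once a transcription of the previous page exists. The sketch below flags an output that shares a long verbatim run with the previous page, using the standard-library difflib. It is one possible check, not something the pipeline is known to do; the function names and the 80-character threshold are assumptions.

```python
from difflib import SequenceMatcher


def longest_shared_run(prev_page, output):
    """Length in characters of the longest verbatim run common to both strings."""
    matcher = SequenceMatcher(None, prev_page, output, autojunk=False)
    match = matcher.find_longest_match(0, len(prev_page), 0, len(output))
    return match.size


def leaks_previous_page(prev_page, output, min_chars=80):
    """True if `output` appears to repeat a sizeable chunk of the previous page."""
    return longest_shared_run(prev_page, output) >= min_chars
```

Short shared runs are normal (recurring symbols, section titles), so the threshold has to sit well above the length of a typical repeated formula.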

Opus 4.7 vs Gemini 3.1 Pro, round 2 (after the prompt refresh)

I re-ran the five ground-truth pages through mateo-canonical on both models.

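| Variant                                                                  | Cost for 5 pages | Raw residue / kch | Notation drift / kch | Structure coverage | Composite quality |
|--------------------------------------------------------------------------|------------------|-------------------|----------------------|--------------------|-------------------|
| Shipped pipeline output                                                  | (already paid)   | 4.71              | 5.42                 | 7 %                | 0.113             |
| Claude Opus 4.7 + mateo-canonical                                        | $1.173           | 0.71              | 1.42                 | 71 %               | 0.661             |
| Gemini 3.1 Pro + mateo-canonical                                         | $0.074           | 0.80              | 0.90                 | 71 %               | 0.742             |
| Gemini 3.1 Pro + mateo-canonical, whole-document (10 pages in one call)  | $0.088           | 0.62              | 0.75                 | 50 %               | 0.714             |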

Three things happen at once:

  1. The composite score jumps 6.6x, from 0.113 to 0.742 on the same five pages. Most of what had looked like a quality ceiling was a prompt floor. The models were capable of producing publishable LaTeX; no one had asked them to.

  2. The Opus-versus-Gemini ranking inverts. Under the old prompt, Opus’s notation drift was 43% lower than Gemini’s (4.05 vs 7.10). Under mateo-canonical, Gemini’s is 37% lower than Opus’s (0.90 vs 1.42). Gemini also beats Opus on composite (0.742 vs 0.661). And Gemini is 16x cheaper per page.

  3. Whole-document mode is competitive. Gemini 3.1 Pro over ten pages in a single call reaches 0.714, slightly below per-page but with 40% lower cost per page and 5x shorter wall-clock time. For typed text or lightly handwritten prose this is now the recommended path; for dense handwriting with cross-page references the per-page visual context still helps.
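For anyone who wants to check the arithmetic behind those three points, the ratios follow directly from the reported numbers. The snippet is only a worked calculation; the helper name pct_lower is mine.

```python
def pct_lower(value, reference):
    """How much lower `value` is than `reference`, in percent."""
    return 100.0 * (reference - value) / reference


# Composite quality: shipped output vs Gemini 3.1 Pro + mateo-canonical.
composite_gain = 0.742 / 0.113

# Notation drift per 1000 chars under the old prompt: Opus 4.05 vs Gemini 7.10.
opus_advantage_old = pct_lower(4.05, 7.10)

# Notation drift per 1000 chars under mateo-canonical: Gemini 0.90 vs Opus 1.42.
gemini_advantage_new = pct_lower(0.90, 1.42)

# Cost for the same five pages: Opus $1.173 vs Gemini Pro $0.074.
cost_ratio = 1.173 / 0.074

# Cost per page: whole-document ($0.088 / 10 pages) vs per-page ($0.074 / 5 pages).
whole_doc_saving = pct_lower(0.088 / 10, 0.074 / 5)

print(f"composite gain           {composite_gain:.1f}x")
print(f"Opus drift advantage     {opus_advantage_old:.0f}% (old prompt)")
print(f"Gemini drift advantage   {gemini_advantage_new:.0f}% (mateo-canonical)")
print(f"Opus / Gemini cost       {cost_ratio:.0f}x")
print(f"whole-document saving    {whole_doc_saving:.1f}% per page")
```

The percentages are computed relative to the worse of the two values in each pair, which is why the same pair of models can show a 43% gap in one direction and a 37% gap in the other.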

What had looked like a model-quality gap was a prompt-ambiguity gap. Opus was not “better at mathematical notation”; it was “better at guessing the right style when the prompt was under-specified”. Once we tell both models the style explicitly, the convergence is immediate, and the remaining comparison is on cost and speed, where Gemini wins cleanly.

This is, to me, the most important finding of the project so far, because it suggests that many publicly reported benchmarks of frontier LLMs on under-specified tasks are not measuring what they appear to measure.

Round 3: Flash-Lite enters the chat

The Pro-tier Gemini API has a hard ceiling of 250 requests per model per day on the quota I was using, which I discovered the hard way after firing off a full-corpus re-run with mateo-canonical. The re-run would need four calendar days at that rate to finish. Annoying.

While the quota reset was pending, I tried something a little perverse: the same five ground-truth pages through Gemini 3.1 Flash-Lite, the cheapest, smallest, fastest Gemini model, under the same mateo-canonical prompt.

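| Variant                                 | Cost for 5 pages | Avg latency | Raw residue / kch | Notation drift / kch | Structure coverage | Composite quality |
|-----------------------------------------|------------------|-------------|-------------------|----------------------|--------------------|-------------------|
| Claude Opus 4.7 + mateo-canonical       | $1.173           | 28.6 s      | 0.71              | 1.42                 | 71 %               | 0.661             |
| Gemini 3.1 Pro + mateo-canonical        | $0.074           | 67.8 s      | 0.80              | 0.90                 | 71 %               | 0.742             |
| Gemini 3.1 Flash-Lite + mateo-canonical | $0.008           | 7.4 s       | 0.12              | 0.85                 | 50 %               | 0.777             |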

Flash-Lite beat both Pro and Opus on composite quality, at roughly 9× lower cost than Pro and 150× lower cost than Opus, with a 9× speedup over Pro and 4× over Opus. It lost some structure coverage (50 % vs 71 %) but had a much lower raw-residue density (0.12 per 1000 characters, against 0.80 for Pro and 0.71 for Opus).

The output sample reads like this, exactly as you’d want it:

49. Homomorphismes de $M_{0,3}$, les groupes $M_{g,0}$ généralisés

I) Considérons un sous-groupe $G$ de $M_{0,3}$, contenant
$\mathfrak{S}^{+\wedge}_{0,3}$, d'image inverse d'un sous-groupe fermé
de $\mathfrak{T}_{0,3}$. [...]

$$G \cap \mathfrak{S}^{+\wedge}_{0,3} \subset M_{0,3}, \quad
  u \in G \implies \mu(u) \equiv 1 \pmod 3 \leqno{(1)}$$

The canonical \mathfrak{S}^{+\wedge}_{0,3} symbol, \operatorname{int}, \leqno{(1)}, \defeq, \marginpar: all present in the output (the excerpt above shows only the first and the third). The tokens are right. The model is small, but the prompt is carrying the specification.

I tested the same hypothesis on the typed-text control described in the next section (a typed manuscript on schemes from the Bourbaki archives, 5 pages; the Gemini runs use whole-document mode):

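| Variant                               | Cost   | Latency |
|---------------------------------------|--------|---------|
| Claude Opus 4.7, page-by-page         | $0.632 | ~100 s  |
| Gemini 3.1 Pro, whole-document        | $0.057 | 65 s    |
| Gemini 3.1 Flash-Lite, whole-document | $0.007 | 13 s    |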

Flash-Lite’s output quality is essentially indistinguishable from Pro’s: \mathfrak{p} ideals, \emph{} definitions, tikzcd commutative diagrams, all present. Flash-Lite skipped the archival marginalia that Pro captured, but everything a mathematician would want is there.

Full corpus re-run cost projection: ~€3 (down from €10), wall-clock ~3-4 hours (down from 10-16 hours with Pro, or about four calendar days under the Pro daily quota of 250 requests). That re-run is in flight as of publication; the corpus will be a Flash-Lite-with-mateo-canonical output by the time you read this.

The pattern again: with an explicit prompt, the size and cost of the model matter much less than a lot of recent benchmarks suggest. A model one to two orders of magnitude cheaper can match or beat a more expensive one on a specification-heavy task, because most of the apparent capability gap was being spent on compensating for prompt under-specification.

The typed-text control

Mateo’s second suggestion, to factor out the handwriting variable, was to run the pipeline on an early, typed Grothendieck manuscript on the theory of schemes. Five pages, two ways:

  • Claude Opus 4.7, page-by-page: $0.632.
  • Gemini 3.1 Pro, whole-document (all five pages in one call): $0.057, 11x cheaper.

Both produce publishable LaTeX: \mathfrak{p} for prime ideals, \emph{} for introduced definitions, a real tikzcd for the commutative diagram on page 2. Gemini additionally captures the archival marginalia (“Archives Grothendieck sept. 59”, “n° 326 bis”) that Claude silently dropped.

%% ===== Page 1 =====
[Archives Grothendieck sept. 59]
\begin{flushright}
n$^\circ$ 326 bis
\end{flushright}

\begin{center} CHAPITRE 0 \\ PRÉLIMINAIRES. \end{center}
\begin{center} §1. \emph{Algèbre commutative.} \end{center}

Nous rappelons ... Un \emph{idéal premier} $\mathfrak{p}$ d'un anneau $A$
est un idéal tel que $A/\mathfrak{p}$ soit intègre ...

On typed input the pipeline produces essentially final-form LaTeX, confirming that the remaining gap on Part II is a prompt-and-post-processing problem, not an OCR problem.

The four-tier evaluation

No single quality metric has survived contact with this project. String similarity, LLM-as-judge, and structural diff against a corrected gold file all disagree in specific ways, and each catches something the others miss.

  • String similarity (SequenceMatcher against a typeset reference) is fine for ranking configurations against each other, misleading as an absolute score. Page alignment, LaTeX variance, and typesetting differences push similarity into the 12 to 18% range regardless of actual quality.
  • LLM-as-judge (Gemini Flash-Lite rating five dimensions per page) is cheap enough for the full corpus and correlates reasonably with human reviewers, but rewards confident-looking output, precisely the failure mode flagged in the January post.
  • Categorized structural diff (the diagnose_49_1.py approach) is by far the most honest signal, but requires a corrected gold version, which costs expert hours.
  • Human A/B preference on blind pairs is the complement: cheap for the reviewer, targeted, tractable once you have around 50 pairs.

When any two disagree, we look at the specific pages. That has caught several things that would otherwise have shipped.
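The first tier is the easiest to reproduce. Below is a minimal sketch of string similarity with difflib.SequenceMatcher, used the way described above: for ranking configurations against each other, not as an absolute score. The whitespace normalization and the names similarity and rank_configs are assumptions for this example; the repository's scoring code may differ.

```python
import re
from difflib import SequenceMatcher


def normalize(tex):
    """Collapse runs of whitespace so line wrapping does not count as a difference."""
    return re.sub(r"\s+", " ", tex).strip()


def similarity(candidate, reference):
    """SequenceMatcher ratio in [0, 1] between two LaTeX strings."""
    matcher = SequenceMatcher(None, normalize(candidate), normalize(reference), autojunk=False)
    return matcher.ratio()


def rank_configs(outputs, reference):
    """Configuration names sorted best-first by similarity to the reference."""
    return sorted(outputs, key=lambda name: similarity(outputs[name], reference), reverse=True)
```

Because the score is sensitive to equivalent LaTeX spellings and to page alignment, only the ordering returned by rank_configs should be trusted, and only among outputs for the same pages.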

What I got wrong in January

  1. “Model choice matters enormously.” It does, but less than prompt design. The jump from 0.113 to 0.742 came from rewriting the prompt, not from changing models.
  2. “Gemini 3 Pro handles this impressively.” It does, but without explicit structural instructions it quietly produces unpublishable LaTeX: inline (1) equation numbers instead of \leqno{(1)}, [MARGIN: ...] placeholders instead of footnotes, a)/b) inline instead of \begin{enumerate}. The content is mostly right; the LaTeX is not.
  3. “The bottleneck isn’t AI, it’s verification.” I still believe the verification part, but I underweighted how much of what comes before verification is a prompt-engineering problem. The models are not struggling to read Grothendieck; they are struggling to know what “transcribe into LaTeX” is supposed to mean when the answer has a particular publication format.

What’s open

  • Full-corpus re-run with mateo-canonical on Flash-Lite. Running at publication time. ~3-4 hours to complete, ~€3 total. Will replace the production corpus once spot-checks pass.
  • Hand-drawn geometric figures. A handful of pages contain actual drawings, not commutative diagrams. Neither prompt handles these well; human help will be needed.
  • Judge calibration against expert ratings. The benchmark viewer at benchmark_opus_vs_gemini.html collects blind A/B preferences on seven pages; Mateo’s votes will calibrate the LLM-as-judge scores directly.

Credits

The collaboration is with Mateo Carmona (CSG) and Olivia Caramello (Istituto Grothendieck). Every useful finding in this project is downstream of Mateo sharing his Part I transcription and writing 49.1new.tex. The typed-text benchmark was his idea.

Repository: github.com/ivan-gentile/la-longue-marche. Pipeline documentation aimed at Mateo specifically: PIPELINE.md. Both tex_output/*.tex files are checked in.

If you work on handwritten OCR, mathematical manuscript transcription, or evaluation of frontier models under structured-text tasks, I would like to hear from you, especially on failure modes the four evaluation tiers above still miss. Reach out on LinkedIn.


Co-written with Claude Opus 4.7, which, in a small twist of irony, is also one of the two models I was benchmarking.