I downloaded and hashed 4.6 million arXiv PDFs. Then the hashes changed.

By Rio Achuzia

It was supposed to be a weekend project. The goal was simple: build a plugin to demonstrate annotation retrieval for DorsalHub, my shiny new API.

The plugin would showcase fetching public file annotations, using only the file's SHA-256 content hash as the query. In other words: point at a file on your computer, and get back some structured information about that file from the API.

graph LR
    A@{ shape: doc, label: "Some File" } --> B(Dorsal + Plugin)
    B -- "SHA-256" --> C[(DorsalHub API)]
    C -- "Annotation" --> B

My first consideration was the dataset. I was already aware of the Kaggle-hosted arXiv dataset, and this felt like a nice opportunity to play with it.

arXiv.org is a huge open-access repository for scientific papers across a number of disciplines, and the arXiv Dataset includes structured metadata about each of them. That's millions of papers dating back to the early 90s.

[Image: arXiv Dataset UI] The arXiv dataset is updated regularly.
[Image: Parsed Data] An example record from its Data Card.

Looking at the dataset, it's easy to see how great a fit it is. The metadata is clean and up to date, containing almost anything a person would want to know about an arXiv paper.

I planned to use it to create annotations for millions of PDF papers. Each annotation would be a structured record of key information about the paper: Authors, Title, Abstract etc. After linking each annotation to the PDF's hash, I'd have the data to power a plugin to retrieve a structured annotation for any arXiv paper, simply by providing the content hash of the PDF.

Here's how I imagined it would look when it was finished:

dorsal run dorsalhub/arxiv-pdf ./1706.03762v7.pdf
Arxiv-id: 1706.03762
Title: Attention Is All You Need
Abstract: The dominant sequence transduction models are...

Under the hood, all the plugin had to do was to hash the file content and send that hash to the DorsalHub API. The API would then check for an arxiv annotation and, if found, would return it to the client.

This felt like something worth building.

As a demo, it's easy to explain. It showcases annotation retrieval in a way that anyone can understand. Bonus points if it's actually useful to anyone.

So I worked out a plan in two stages.

  1. Backfill the PDF file hashes and arXiv annotations to DorsalHub. This becomes the queryable data.

  2. Code a user-facing plugin to retrieve the arXiv annotation for any given document.

It was a Friday evening when I started, and I had the whole weekend ahead of me. I was feeling optimistic.

The Backfill

In order to retrieve an annotation from an API, first the annotation has to exist.

The steps to achieve this were straightforward:

  • Download the back catalog of arXiv papers (yes, all of them)
  • Parse the arXiv dataset metadata into structured annotations
  • Hash and process each PDF, linking the correct annotation to its file hash
  • Push it all up to the DorsalHub API

The Download

Acquiring the entire collection of arXiv PDFs isn't as difficult as it might sound. While crawling arXiv itself is out of the question (their web rate limiter is famously strict), they provide bulk access to papers in two ways:

  1. Amazon S3: s3://arxiv/pdf

    This bucket holds the complete set of documents, bundled in .tar files. Amazon S3 is the official bulk download method documented on arxiv.org.

    Note that this is a Requester Pays bucket, so downloading the PDFs this way will cost you hundreds of dollars.

  2. Google Cloud Storage: gs://arxiv-dataset

    This bucket is part of the arXiv Dataset, and is maintained alongside the metadata records. It's also free to access.

rclone lsf :gcs:arxiv-dataset/arxiv/ --gcs-anonymous --max-depth 2
arxiv-dataset/
└── arxiv/
    ├── pdf/       <-- Post-2007 papers are sorted by month and year
    │   ├── 0704/  <-- April 2007
    │   │   ├── 0704.0001v1.pdf 
    │   │   ├── 0704.0001v2.pdf  
    │   │   └── 0704.0002v1.pdf
    │   ├── 0705/
    │   └── ...
    ├── math/      <-- Pre-2007 papers are organized by subject e.g. math
    │   ├── pdf/
    │   │   ├── 9201/  <-- January 1992
    │   │   │   └── 9201201v1.pdf
    │   │   └── ...
    │   └── ...
    ├── cs/
    │   ├── pdf/
    │   └── ...
    └── (more subject categories...)

A note on arXiv IDs

Every arXiv submission is assigned a unique ID, e.g. 0704.0001 or cs/9301101.

  • Pre-2007: The ID combines the subject with its submission date and sequence number. For example, cs/9301101 was submitted to the Computer Science (cs) category in January 1993 (9301), with sequence number 101 within CS for that month.

  • 2007 and newer: The ID is globally unique, dropping the subject. For example, 0704.0001 is simply the first (0001) paper submitted to arXiv in April 2007 (0704).

  • Versions: arXiv IDs optionally include a version number suffix (v1, v2). When papers are re-submitted, arXiv publishes a corrected paper with an incremented version number (0704.0001v2). IDs with the version omitted (0704.0001) implicitly refer to the latest existing version.

For more information on arXiv identifiers, see: http://info.arxiv.org/help/arxiv_identifier.html
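
To make the two ID styles concrete, here is a minimal Python sketch that parses both and maps a versioned ID onto the mirror layout shown in the tree above. The names ArxivId, parse_arxiv_id and mirror_path are made up for illustration, and the regular expressions are a simplification of the full identifier rules rather than a complete validator.

```python
import re
from typing import NamedTuple, Optional


class ArxivId(NamedTuple):
    archive: Optional[str]   # e.g. "cs" for old-style IDs, None for new-style
    yymm: str                # two-digit year + two-digit month, e.g. "0704"
    number: str              # sequence number within that month
    version: Optional[int]   # None when the version suffix is omitted


# Old style: cs/9301101, optionally with a subject class (math.GT/0309136).
OLD_STYLE = re.compile(r"^([a-z-]+)(?:\.[A-Z]{2})?/(\d{4})(\d{3})(?:v(\d+))?$")
# New style: 0704.0001 (four-digit sequence) or 1706.03762 (five-digit sequence).
NEW_STYLE = re.compile(r"^(\d{4})\.(\d{4,5})(?:v(\d+))?$")


def parse_arxiv_id(raw: str) -> ArxivId:
    """Split an arXiv ID into its parts (simplified, illustrative rules)."""
    if m := NEW_STYLE.match(raw):
        yymm, number, version = m.groups()
        return ArxivId(None, yymm, number, int(version) if version else None)
    if m := OLD_STYLE.match(raw):
        archive, yymm, number, version = m.groups()
        return ArxivId(archive, yymm, number, int(version) if version else None)
    raise ValueError(f"not a recognised arXiv ID: {raw!r}")


def mirror_path(arxiv_id: ArxivId) -> str:
    """Relative path under arxiv-dataset/arxiv/, following the tree above."""
    if arxiv_id.version is None:
        raise ValueError("a version is needed to name a specific PDF")
    if arxiv_id.archive is None:
        name = f"{arxiv_id.yymm}.{arxiv_id.number}v{arxiv_id.version}.pdf"
        return f"pdf/{arxiv_id.yymm}/{name}"
    name = f"{arxiv_id.yymm}{arxiv_id.number}v{arxiv_id.version}.pdf"
    return f"{arxiv_id.archive}/pdf/{arxiv_id.yymm}/{name}"
```

For example, mirror_path(parse_arxiv_id("0704.0001v1")) gives pdf/0704/0704.0001v1.pdf, matching the first file in the listing.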

After making sure I had a few terabytes spare on my local NAS, I opened a terminal window, logged into it and used Rclone to begin the process of mirroring the Google Cloud Storage (GCS) bucket. Even with my reasonably fast connection, this was going to take a while.

The biggest pain point is that the GCS bucket holds the PDFs as loose files in a deeply nested structure. This means that each tiny PDF has to be downloaded and written to disk individually. Even using Rclone with multiple workers, the overhead from writing so many tiny files to the multi-disk array meant progress was slow. I tried increasing the worker count to up the pace, but this slowed down progress even more, as the disks couldn't keep up and began thrashing.

After a few false starts, I managed to find a balance which kept the NAS volume utilization (a metric for how much work the disks are doing) at around 80 to 90%. This prevented disk thrashing and kept the download rate steady.

Here's the command I used to mirror the large post-2007 arxiv-dataset/arxiv/pdf/ path, which contains the vast majority of the PDFs. I chose to run Rclone via Docker to keep the process isolated and easier to manage for what I anticipated would be a long task:

docker run -d \
  --name mirror \
  --restart always \
  -v /volume1/arXiv/arxiv_mirror/pdf:/data \
  rclone/rclone:latest \
  copy :gcs:arxiv-dataset/arxiv/pdf/ /data/ \
  --gcs-anonymous \
  --transfers 25 \
  --checkers 25 \
  --progress \
  --create-empty-src-dirs \
  --include "*.pdf"

With the NAS making reassuringly busy noises, all I had to do was wait.

The Metadata

On Saturday morning, I checked the download progress. About 10 to 15 PDFs were landing on my NAS every second, but most of the file tree was still empty. A little under 1 TB had downloaded so far, but I didn't have a great idea of how much remained.

Unlike a hard drive on your computer, where an index table helps do things like quickly count how many files are in a directory, GCS buckets are object storage. This means that there aren't really any directories per se.

Checking how many files were left to download would involve iterating over every object in the bucket, filtering for the files I cared about, and summing up their reported sizes. For a bucket containing millions and millions of files, this would take hours.

Seriously, don't bother executing this command:

gsutil du -sh "gs://arxiv-dataset/arxiv/**/*.pdf"

Since I didn't have the patience for counting the PDFs in the GCS bucket, I tried my best to estimate based on reported historic stats about the AWS bucket:

  • The total size of the bucket was 5.6 TB in 2023, growing to 9.2 TB in April 2025
  • Roughly half of the total is PDF files, the rest being occasional HTML submissions, LaTeX and other source files

I was doing this backfill in February 2026. Factoring in accelerating monthly submission growth, I estimated that when it was complete I'd have between 5 TB and 6 TB of PDF files to process.

This was not going to be a weekend project.

While Rclone continued in the background, I had some time on my hands and began digging into the arXiv Dataset.

Each record in the dataset is already a well defined JSON object, and contains a great deal of information about the submission. Here's an example:

{
  "id": "1706.03762",
  "submitter": "Llion Jones",
  "authors": "Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion\n  Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin",
  "title": "Attention Is All You Need",
  "comments": "15 pages, 5 figures",
  "journal-ref": null,
  "doi": null,
  "report-no": null,
  "categories": "cs.CL cs.LG",
  "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
  "abstract": "  The dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks in an encoder-decoder configuration. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer, based\nsolely on attention mechanisms, dispensing with recurrence and convolutions\nentirely. Experiments on two machine translation tasks show these models to be\nsuperior in quality while being more parallelizable and requiring significantly\nless time to train. Our model achieves 28.4 BLEU on the WMT 2014\nEnglish-to-German translation task, improving over the existing best results,\nincluding ensembles by over 2 BLEU. On the WMT 2014 English-to-French\ntranslation task, our model establishes a new single-model state-of-the-art\nBLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction\nof the training costs of the best models from the literature. We show that the\nTransformer generalizes well to other tasks by applying it successfully to\nEnglish constituency parsing both with large and limited training data.\n",
  "versions": [
    {"version": "v1", "created": "Mon, 12 Jun 2017 17:57:34 GMT"},
    {"version": "v2","created": "Mon, 19 Jun 2017 16:49:45 GMT"},
    {"version": "v3","created": "Tue, 20 Jun 2017 05:20:02 GMT"},
    {"version": "v4","created": "Fri, 30 Jun 2017 17:29:30 GMT"},
    {"version": "v5","created": "Wed, 6 Dec 2017 03:30:32 GMT"},
    {"version": "v6","created": "Mon, 24 Jul 2023 00:48:54 GMT"},
    {"version": "v7","created": "Wed, 2 Aug 2023 00:41:18 GMT"}
  ],
  "update_date": "2023-08-03",
  "authors_parsed": [
    ["Vaswani", "Ashish", ""],
    ["Shazeer", "Noam", ""],
    ["Parmar", "Niki", ""],
    ["Uszkoreit", "Jakob", ""],
    ["Jones", "Llion", ""],
    ["Gomez", "Aidan N.",""],
    ["Kaiser", "Lukasz", ""],
    ["Polosukhin", "Illia", ""]
  ]
}

This is very clean JSON, from a well-maintained dataset with great coverage. There wasn't much work to do, apart from choosing which fields to keep.

All annotations on DorsalHub must conform to a known schema. I already had it in the back of my mind that I would bundle an adapter with the plugin, allowing automatic export to standard citation formats (e.g. BibTeX or CSL-JSON). This meant the schema I built would have to capture at least those fields needed to build a citation.

In the end I constructed the dorsal/arxiv schema. This JSON schema defines the shape of a single annotation, and when applied to the same record above, results in the following:

{
  "arxiv_id": "1706.03762",
  "title": "Attention Is All You Need",
  "abstract": "The dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks in an encoder-decoder configuration. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer, based\nsolely on attention mechanisms, dispensing with recurrence and convolutions\nentirely. Experiments on two machine translation tasks show these models to be\nsuperior in quality while being more parallelizable and requiring significantly\nless time to train. Our model achieves 28.4 BLEU on the WMT 2014\nEnglish-to-German translation task, improving over the existing best results,\nincluding ensembles by over 2 BLEU. On the WMT 2014 English-to-French\ntranslation task, our model establishes a new single-model state-of-the-art\nBLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction\nof the training costs of the best models from the literature. We show that the\nTransformer generalizes well to other tasks by applying it successfully to\nEnglish constituency parsing both with large and limited training data.",
  "authors": [
    "Ashish Vaswani",
    "Noam Shazeer",
    "Niki Parmar",
    "Jakob Uszkoreit",
    "Llion Jones",
    "Aidan N. Gomez",
    "Lukasz Kaiser",
    "Illia Polosukhin"
  ],
  "categories": [
    "cs.CL",
    "cs.LG"
  ],
  "doi": null,
  "journal_ref": null,
  "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
  "version": "v7",
  "url": "https://arxiv.org/abs/1706.03762"
}

This new validated record retains all of the information needed to build a citation.

For example, a user could ask the Dorsal CLI for a BibTeX reference by passing an --export argument:

dorsal run dorsalhub/arxiv-pdf ./1706.03762v7.pdf --export=bibtex
@misc{1706_03762,
  title = {Attention Is All You Need},
  author = {Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin},
  eprint = {1706.03762},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL},
  url = {https://arxiv.org/abs/1706.03762},
  year = {2017},
  month = {6}
}
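
To make the mapping from raw dataset record to dorsal/arxiv annotation concrete, here is a rough Python sketch. The function name to_annotation is hypothetical and this is not the code Dorsal itself uses; it only reproduces the field choices visible in the two JSON examples above. The author splitting is deliberately naive and assumes a plain comma-separated list.

```python
import re


def to_annotation(record: dict) -> dict:
    """Map a raw arXiv Dataset record onto the dorsal/arxiv shape (illustrative)."""
    # The raw "authors" string may contain line breaks followed by indentation.
    authors_flat = re.sub(r"\s+", " ", record["authors"]).strip()
    # Naive split: assumes commas separate authors, with no " and " joiners.
    authors = [name.strip() for name in authors_flat.split(",") if name.strip()]
    return {
        "arxiv_id": record["id"],
        "title": re.sub(r"\s+", " ", record["title"]).strip(),
        # Leading and trailing whitespace is dropped; inner line breaks are kept.
        "abstract": record["abstract"].strip(),
        "authors": authors,
        "categories": record["categories"].split(),
        "doi": record.get("doi"),
        "journal_ref": record.get("journal-ref"),
        "license": record.get("license"),
        # The last entry in "versions" is the most recent one.
        "version": record["versions"][-1]["version"],
        "url": f"https://arxiv.org/abs/{record['id']}",
    }
```

A real schema would also need validation on top of this (for example with Pydantic or a JSON Schema validator), which the sketch leaves out.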

The Test

By this point I'd resigned myself to the fact that the download would take days to finish. Until it was complete, I couldn't fully move on to the next stage: actually processing the documents. But what I could do was test the workflow end to end, and validate my assumptions.

So I trusted Rclone to do its job, and in the meantime I copied over a few months' worth of PDFs and started to test things out.

Processing a PDF means extracting its core metadata into a structured record, with fields like size, media_type, and pdf.page_count, that can be synced with DorsalHub.

I processed a few hundred files, linking each to the schema-validated annotations parsed out of the arXiv dataset, and pushed them to my testing (dev) instance of the DorsalHub API.

Now I had queryable data for a sample of documents. This meant I could:

  • Point at a file
  • Calculate its SHA-256 hash
  • Query the API with that hash
  • Get an annotation back
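
In code, those four steps are small. The sketch below uses a plain dictionary as a stand-in for the API, since the real DorsalHub client calls are not shown here; sha256_of and lookup are hypothetical helper names.

```python
import hashlib
from pathlib import Path
from typing import Optional


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's content."""
    # Reads the whole file at once, which is fine for a single small PDF.
    return hashlib.sha256(path.read_bytes()).hexdigest()


def lookup(path: Path, index: dict[str, dict]) -> Optional[dict]:
    """Hash the file and return its annotation, or None if the hash is unknown.

    `index` maps hex digests to annotations and stands in for the remote API.
    """
    return index.get(sha256_of(path))
```

Note that the lookup is all-or-nothing: a file whose content differs in any way produces a different key and therefore a miss.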

I could test this in two steps, using the Dorsal CLI.

  • First, use dorsal id to get the file record from the API:
dorsal id /mnt/b/arxiv_mirror/pdf/1706/1706.03762v7.pdf
🔎 Identifying file 1706.03762v7.pdf...
╭────────────────────────────────  File Identified ──────────────────────────────────╮
│                                                                                    │
│  Hashes                                                                            │
│       SHA-256:  b7d72988fd8107d07f7d278bf0ba6621adb6ed47df74be4014fa4a01f03aff6a   │
│                                                                                    │
│  File Info                                                                         │
│          Name:  1706.03762v7.pdf                                                   │
│          Size:  2 MiB                                                              │
│    Media Type:  application/pdf                                                    │
│                                                                                    │
│  Tags                                                                              │
│        No tags found.                                                              │
│                                                                                    │
│  Pdf Info → file/pdf                                                               │
│          producer:  pdfTeX-1.40.25                                                 │
│           version:  1.5                                                            │
│        page_count:  15                                                             │
│     creation_date:  2023-08-03T00:07:29Z                                           │
│     modified_date:  2023-08-03T00:07:29Z                                           │
│                                                                                    │
│  Arxiv Info → dorsal/arxiv                                                         │
│       Source:  Model (arXiv Dataset)                                               │
│     Modified:  2026-02-14 10:00                                                    │
│           ID:  77fc8be9-c53e-4461-b5c2-e015d8682aea                                │
│                                                                                    │
│                                                                                    │
╰────────────────────────────────────────────────────────────────────────────────────╯
  • Then, with the dorsal/arxiv annotation ID copied from the record above, run dorsal annotation get:
dorsal annotation get 77fc8be9-c53e-4461-b5c2-e015d8682aea
╭───────────────────────────────── ArXiv Record Result ─────────────────────────────────╮
│ dorsal/arxiv                                 ID: 77fc8be9-c53e-4461-b5c2-e015d8682aea │
│ ───────────────────────────────────────────────────────────────────────────────────── │
│                                                                                       │
│ Attention Is All You Need                                                             │
│ 1706.03762 (v7)                                                                       │
│ Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N.     │
│ Gomez, Lukasz Kaiser, Illia Polosukhin                                                │
│                                                                                       │
│ ╭──────────────────────────────────── Abstract ─────────────────────────────────────╮ │
│ │ The dominant sequence transduction models are based on complex recurrent or       │ │
│ │ convolutional neural networks in an encoder-decoder configuration. The best       │ │
│ │ performing models also connect the encoder and decoder through an attention       │ │
│ │ mechanism. We propose a new simple network architecture, the Transformer, based   │ │
│ │ solely on attention mechanisms, dispensing with recurrence and convolutions       │ │
│ │ entirely. Experiments on two machine translation tasks show these models to be    │ │
│ │ superior in quality while being more parallelizable and requiring significantly   │ │
│ │ less time to train. Our model achieves 28.4 BLEU on the WMT 2014                  │ │
│ │ English-to-German translation task, improving over the existing best results,     │ │
│ │ including ensembles by over 2 BLEU. On the WMT 2014 English-to-French             │ │
│ │ translation task, our model establishes a new single-model state-of-the-art       │ │
│ │ BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction    │ │
│ │ of the training costs of the best models from the literature. We show that the    │ │
│ │ Transformer generalizes well to other tasks by applying it successfully to        │ │
│ │ English constituency parsing both with large and limited training data.           │ │
│ ╰───────────────────────────────────────────────────────────────────────────────────╯ │
│                                                                                       │
│ URL         https://arxiv.org/abs/1706.03762                                          │
│ Categories  cs.CL, cs.LG                                                              │
│ License     http://arxiv.org/licenses/nonexclusive-distrib/1.0/                       │
╰───────────────────────────────────────────────────────────────────────────────────────╯

All the plugin had to do was streamline that two-step process into a single step.

Everything was working exactly as it should.

Right up until the moment it wasn't.

You see, people don't typically download these papers from the Google Cloud Storage bucket. Most people (myself included) go to the arxiv.org page, click "View PDF", and either read it in their web browser or save it to their computer to read later.

[Image: A screenshot of the arXiv website, showing a single paper, its abstract and some links.] The main page for a paper, including submission history.
[Image: A screenshot of an academic paper being viewed in a web browser.] Viewing a paper via arxiv.org.

The first time I tested the workflow using a newly downloaded PDF, I noticed something strange: it wasn't working.

The hash was different.

shasum -a 256 /mnt/b/arxiv_mirror/pdf/1706/1706.03762v7.pdf | awk '{print $1}'
b7d72988fd8107d07f7d278bf0ba6621adb6ed47df74be4014fa4a01f03aff6a
shasum -a 256 /mnt/c/Users/Rio/Downloads/1706.03762v7.pdf | awk '{print $1}'
bdfaa68d8984f0dc02beaca527b76f207d99b666d31d1da728ee0728182df697

Same paper. Same version. Different hash.

So I downloaded it again, and checked again. Maybe it was corrupted: even a single byte of difference is enough to produce an entirely different SHA-256 hash. But the outcome was identical: same paper, same version, same file size, but a different hash.

So I selected a different paper. I downloaded it from arxiv.org and checked my dev API for its hash. This time it worked flawlessly. The API returned the annotation because the file hash was the same. The file hash was the same because the file was byte-identical.

I did this a few more times and the results were mixed: sometimes the copy from arxiv.org was byte-identical to the copy in my mirror, but more often than not, the SHA-256 hashes did not match.

The Comparison

A Note on File Hashes

A file hash is a long sequence of letters and numbers (usually a hexadecimal representation) which can be used to identify a file.

The most important feature of a cryptographic file hash (like SHA-256) is that it is, for all practical purposes, unique to the file's content: if two files differ by even a single byte, their hashes will be completely different. Collisions are possible in theory, but none has ever been found for SHA-256.
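
A quick way to see this for yourself, using only Python's standard library. The byte string is an arbitrary placeholder, not real PDF content.

```python
import hashlib

original = b"%PDF-1.5 example content"
altered = bytearray(original)
altered[-1] ^= 0x01  # flip a single bit in the last byte

digest_a = hashlib.sha256(original).hexdigest()
digest_b = hashlib.sha256(bytes(altered)).hexdigest()

# Count how many of the 64 hex characters differ between the two digests.
differing = sum(a != b for a, b in zip(digest_a, digest_b))
print(digest_a)
print(digest_b)
print(f"{differing} of {len(digest_a)} hex characters differ")
```

The two digests have nothing visibly in common, even though the inputs differ by one bit.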

This file was downloaded from the GCS bucket:

dorsal file scan /mnt/b/arxiv_mirror/pdf/1706/1706.03762v7.pdf
📄 Scanning metadata for 1706.03762v7.pdf
╭────────────────────────── File Record: 1706.03762v7.pdf ───────────────────────────╮
│                                                                                    │
│  Hashes                                                                            │
│       SHA-256:  b7d72988fd8107d07f7d278bf0ba6621adb6ed47df74be4014fa4a01f03aff6a   │
│                                                                                    │
│  File Info                                                                         │
│     Full Path:  /mnt/b/arxiv_mirror/pdf/1706/1706.03762v7.pdf                      │
│      Modified:  2023-08-03 01:07:33                                                │
│          Name:  1706.03762v7.pdf                                                   │
│          Size:  2 MiB                                                              │
│    Media Type:  application/pdf                                                    │
│                                                                                    │
│  Tags                                                                              │
│        No tags found.                                                              │
│                                                                                    │
│  Pdf Info → file/pdf                                                               │
│           creator:  LaTeX with hyperref                                            │
│          producer:  pdfTeX-1.40.25                                                 │
│           version:  1.5                                                            │
│        page_count:  15                                                             │
│     creation_date:  2023-08-03T00:07:29Z                                           │
│     modified_date:  2023-08-03T00:07:29Z                                           │
│                                                                                    │
│                                                                                    │
╰────────────────────────────────────────────────────────────────────────────────────╯

This file was downloaded from arxiv.org:

dorsal file scan /mnt/c/Users/Rio/Downloads/1706.03762v7.pdf
📄 Scanning metadata for 1706.03762v7.pdf
╭────────────────────────── File Record: 1706.03762v7.pdf ───────────────────────────╮
│                                                                                    │
│  Hashes                                                                            │
│       SHA-256:  bdfaa68d8984f0dc02beaca527b76f207d99b666d31d1da728ee0728182df697   │
│                                                                                    │
│  File Info                                                                         │
│     Full Path:  /mnt/c/Users/Rio/Downloads/1706.03762v7.pdf                        │
│      Modified:  2026-02-07 13:43:07                                                │
│          Name:  1706.03762v7.pdf                                                   │
│          Size:  2 MiB                                                              │
│    Media Type:  application/pdf                                                    │
│                                                                                    │
│  Tags                                                                              │
│        No tags found.                                                              │
│                                                                                    │
│  Pdf Info → file/pdf                                                               │
│           creator:  LaTeX with hyperref                                            │
│          producer:  pdfTeX-1.40.25                                                 │
│           version:  1.5                                                            │
│        page_count:  15                                                             │
│     creation_date:  2024-04-10T21:11:43Z                                           │
│     modified_date:  2024-04-10T21:11:43Z                                           │
│                                                                                    │
│                                                                                    │
╰────────────────────────────────────────────────────────────────────────────────────╯

  • Both files are v7 of the paper Attention Is All You Need (v1 was submitted in June 2017, while v7 was submitted in August 2023).
  • Both files are exactly 2,215,244 bytes (approx. 2 MiB).
  • Both files were compiled with version 1.40.25 of pdfTeX.

But there is one crucial difference: the one downloaded from GCS was compiled in August 2023, while the one downloaded from the web was compiled more recently, in April 2024. We can see this from the creation_date field in the outputs above (note: creation_date is simply how Dorsal labels the PDF core metadata field /CreationDate).

In fact, when we compare the byte-content of the two files side-by-side, they are 99.9935% identical.

Running cmp, we can see just 145 of the file's 2,215,244 bytes are different:

cmp -l \
  /mnt/b/arxiv_mirror/pdf/1706/1706.03762v7.pdf \
  /mnt/c/Users/Rio/Downloads/1706.03762v7.pdf | wc -l
145

Those 145 bytes of difference are more than enough to ensure each file has a unique SHA-256 hash.
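
The same count, and the percentage quoted above, can be reproduced in a few lines of Python. This is a sketch of what cmp -l piped to wc -l measures for two files of equal size; the function names are illustrative.

```python
def count_differing_bytes(a: bytes, b: bytes) -> int:
    """Count positions where two equal-length byte strings differ."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares inputs of equal size")
    return sum(x != y for x, y in zip(a, b))


def percent_identical(differing: int, total: int) -> float:
    """Share of byte positions that are the same, as a percentage."""
    return 100.0 * (total - differing) / total


# The figures reported above: 145 differing bytes out of 2,215,244.
print(f"{percent_identical(145, 2_215_244):.4f}%")  # 99.9935%
```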

Let's diff the text content in each PDF:

diff --color -u \
  <(strings /mnt/b/arxiv_mirror/pdf/1706/1706.03762v7.pdf) \
  <(strings /mnt/c/Users/Rio/Downloads/1706.03762v7.pdf)

Running this confirms that just three values differ between the documents:

Downloaded from the GCS bucket in February 2026:

CreationDate: 2023-08-03T00:07:29Z
ModDate:      2023-08-03T00:07:29Z
ID:           <45ed92c40015149e90332d9c2e25aa60>

Downloaded from arxiv.org in February 2026:

CreationDate: 2024-04-10T21:11:43Z
ModDate:      2024-04-10T21:11:43Z
ID:           <ff3e15dfc6c8c63548b1c64bc2982fdb>
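
The dates above are shown in ISO form. Inside a PDF, /CreationDate and /ModDate are stored in the PDF date format, which looks like D:YYYYMMDDHHmmSS followed by an optional timezone. Below is a minimal sketch of a parser for that format. The raw input strings in the usage note are inferred from the normalized values above rather than copied from the files, and treating a missing timezone as UTC is an assumption of this sketch.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYY[MM[DD[HH[mm[SS]]]]] followed by Z, or +HH'mm' / -HH'mm'.
PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)


def parse_pdf_date(raw: str) -> datetime:
    """Parse a PDF date string into a timezone-aware datetime."""
    m = PDF_DATE.match(raw)
    if m is None:
        raise ValueError(f"not a PDF date string: {raw!r}")
    year, month, day, hour, minute, second, _z, sign, off_h, off_m = m.groups()
    tz = timezone.utc  # assumption: no timezone given means UTC
    if sign:
        offset = timedelta(hours=int(off_h), minutes=int(off_m or 0))
        tz = timezone(offset if sign == "+" else -offset)
    return datetime(
        int(year), int(month or 1), int(day or 1),
        int(hour or 0), int(minute or 0), int(second or 0),
        tzinfo=tz,
    )
```

With this, parse_pdf_date("D:20240410211143Z") - parse_pdf_date("D:20230803000729Z") gives the gap between the two compilations as a timedelta.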

I repeated this comparison multiple times: I downloaded a PDF from arxiv.org and compared it to its GCS mirror doppelganger.

The results were all over the place:

  • Sometimes the hashes matched, sometimes they didn't
  • Sometimes the file size was the same, sometimes it wasn't
  • Occasionally (like in the 1706.03762v7.pdf example) the only difference was the compilation date

A Note on arXiv Submissions

When someone submits a paper to arXiv, they don't typically upload a PDF directly. The vast majority of papers are submitted as a "raw" bundle of LaTeX files, figures and images which, when compiled by a toolkit such as pdfTeX or Ghostscript, form a publication-quality PDF.

From the sample I took, in all cases where the hash was not identical, the two PDFs were compiled at different times, often with different versions of Ghostscript or pdfTeX. This was more than enough to ensure those document pairs had different SHA-256 hashes.

This was an interesting, but deeply frustrating finding. It completely broke my mental model of the arXiv frontend as a static cache of PDF papers, as arXiv was seemingly willing to switch to serving a newer variant of a particular version of a paper.

In short: there was no guarantee that the PDF in the GCS bucket was the same PDF that you would download from the website.

I wanted to use the GCS PDFs to backfill data for a hash-powered inference plugin, but if a single version of a PDF paper could have more than one variant, each with a different hash, what hope was there?

At this point, I started to think about walking away. Maybe I could find a nice, stable set of files to work with. But I couldn't bring myself to. I was in too deep. I actually wanted to know more about what was going on with these papers:

  • What proportion of the GCS PDFs have different hashes from the web PDFs?
  • Are there any patterns in when the PDFs were built?
  • How close could I get to a working plugin using the GCS data?

To answer those questions, I needed to complete the backfill.

Processing those documents and running my own analysis of the file hashes was the only way to get a full picture of what's going on. I logged off for the day. I wanted to focus on something else, anything else, while the NAS rumbled along in the background.

The Mirror

The NAS finally went quiet on Wednesday afternoon. Rclone had successfully finished building the arXiv mirror: 4,653,819 PDFs had landed since it started on Friday evening. 8.31 TB in total.

[Image: Screenshot of the Synology NAS operating system, with the Control Panel open. It shows the final size: 8.31 TB.] Synology Control Panel confirming the final size.
[Image: Screenshot of the Synology NAS operating system, with a file explorer open. It shows a directory listing.] The structure is deeply nested.

I now had everything I needed to start the next phase: processing the documents at scale.

The Process

Looking back, I was surprisingly unfazed at the prospect of processing 4.6 million documents. The biggest bottleneck, to be clear, is hashing. To securely hash a file, you have to read every single byte into memory. This makes file processing an I/O bound operation, which means that you can only process files as quickly as you can read them into the computer's memory.
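
Every byte has to pass through memory, but not all at once. A common pattern, sketched here with the standard library, is to stream the file through the hash in fixed-size chunks. The 1 MiB chunk size is an arbitrary choice, and this is not necessarily how Dorsal does it.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file by streaming it through SHA-256 one chunk at a time."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # read() returns b"" at end of file, which ends the loop.
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```

Memory use stays flat regardless of file size, so throughput is bounded by how fast the disk can deliver bytes.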

I decided to tackle it one directory at a time. I would batch documents as they appeared in my file tree, starting with categories (acc-phys, adap-org, ...) and ending with the PDFs organized by date (from 0704 to 2602).

In a Jupyter notebook, I loaded the complete set of arXiv annotations into a dictionary in memory, and began processing the files with Dorsal. I serialized Dorsal's Pydantic outputs to disk for inspection later, and pushed the annotation-linked metadata to the API in batches.

As the script ran I observed the processing rate. It was getting through maybe 8 to 10 documents per second on average. This may sound respectable, but for a weekend project now well into its sixth day, this meant I was looking at another week at least of my computer just extracting the PDFs.

I had to speed things along.

The obvious answer was to use multi-threading. Multi-threading is an approach where processing is split across the numerous execution channels which are baked into modern CPUs. I was currently using just one of my CPU's 12 logical cores, so threading would make it possible for the other cores to help out, and increase my extraction rate.

There was just one problem: the PDF extraction logic in Dorsal has a hard dependency on pdfium, and pdfium is not thread-safe. Using non-thread-safe code in a threaded environment is the ultimate "at your own risk" strategy.

A Note on Thread Safety

When you run a multi-threaded program, each of those separate threads shares the same memory space.

Thread safety is a guarantee that code can run in a multi-threaded environment without unintended interactions between threads. This means using things like locks (mutexes) to make sure only one thread can modify a data structure at a given time.

If code is not thread safe, it makes no such guarantees. Certain objects or classes in memory may not be safe to access by more than one thread, and may lead to all kinds of problems including memory corruption, crashes or data loss.
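As a small illustration of the mutex idea, here is a generic Python sketch (it has nothing to do with pdfium's internals). Four threads update one shared counter, and a lock makes sure only one of them touches it at a time.

```python
import threading

counter = 0
counter_lock = threading.Lock()


def add_many(times: int) -> None:
    """Increment the shared counter, holding the lock for each update."""
    global counter
    for _ in range(times):
        # Only one thread at a time can be inside this block.
        with counter_lock:
            counter += 1


threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(counter)  # 40000: with the lock, no increments are lost
```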

Naturally, I decided to try my luck. Dorsal's use of pdfium is limited to reading a few pieces of core metadata from the document. So I crafted my wings and attached them with wax: I imported Python's ThreadPoolExecutor and set my max_workers to 6. I submitted extraction tasks to each worker. And it worked flawlessly. At least at first.
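The general shape of that approach looks something like the sketch below. The extract_metadata function is a hypothetical stand-in for the real per-file extraction step; only ThreadPoolExecutor and max_workers=6 come from the description above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def extract_metadata(path: Path) -> dict:
    """Hypothetical stand-in for the real per-file extraction step."""
    return {"name": path.name, "suffix": path.suffix}


def extract_all(paths: list[Path], max_workers: int = 6) -> list[dict]:
    """Run extract_metadata over paths using a pool of worker threads."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(extract_metadata, path) for path in paths]
        for future in as_completed(futures):
            # result() re-raises any exception raised inside the worker.
            results.append(future.result())
    return results
```

Nothing in this pattern protects a library that isn't thread-safe: the pool simply calls it from several threads at once.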

With six threads, I was chewing through maybe 40 documents per second. A week of processing had turned into a task that could finish by tomorrow. Then I checked the annotations as they landed on the API. There was a problem.

The metadata record for each file I was processing is composed of three separate "annotations":

  • file/base: a generic file annotation. It has attributes size, name, media_type and so on.
  • file/pdf: which contains core PDF information: title, producer, page_count, etc.
  • dorsal/arxiv: which contains the arXiv metadata: authors, abstract, url, etc.

In most of the records I checked, the file/pdf annotation was entirely missing. In the payload that landed on the server it was null. I checked the debug logs, and they confirmed the problem: an uncaught "File access error" for the majority of documents.

Threading wasn't going to work.

When multi-threading isn't an option, it's often worth looking into multi-processing. Unlike threads, which share memory, Python's ProcessPoolExecutor spins up completely isolated processes. Each process gets its own Python interpreter and separately imports dorsal and pdfium. No shared memory means no thread safety concerns.

But there was a catch. Because the processes are isolated, the huge dictionary of arXiv annotations I had loaded into memory had to be shared somehow. With pooled processes, Python has to constantly "pickle" and "unpickle" (serialize and deserialize) data as it passes between processes. Because each process was assigned a task which it completed in a fraction of a second, the CPU was spending more time packaging and unpackaging keys and values than actually doing the work of processing. That overhead was enough to cancel out the benefits of multi-processing, and I barely peaked at 12 documents per second.
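To get a feel for that cost, here is a rough sketch that measures how many bytes would need to be pickled per task. The toy dictionary is a hypothetical stand-in for the real annotations, and the sketch assumes the dictionary travels along with each task's arguments.

```python
import pickle

# Hypothetical stand-in for the annotations dictionary (the real one was over 4 GB).
annotations = {
    f"0704.{i:04d}": {"title": f"Paper {i}", "abstract": f"Abstract of paper {i}"}
    for i in range(10_000)
}

# Payload if every task carries the whole dictionary with it.
heavy_task = pickle.dumps(("0704.0001v1.pdf", annotations))

# Payload if a task carries only the one record it needs.
light_task = pickle.dumps(("0704.0001v1.pdf", annotations["0704.0001"]))

print(f"whole dictionary per task: {len(heavy_task):,} bytes")
print(f"single record per task:    {len(light_task):,} bytes")
```

When each task finishes in a fraction of a second, serialization on this scale can easily dominate the useful work.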

The alternative wasn't much better. I could bypass the executor pool entirely and just run multiple scripts in different terminals or notebooks. That would avoid the pickling slowdown, but there was a hole in that plan too: the dictionary of arXiv annotations was over 4 GB. My desktop machine only has 64 GB of RAM. That might sound like plenty, but once you spin up multiple Python processes, each carrying the weight of a 4 GB+ dictionary, you will quickly run out.

graph LR    
    subgraph "Hitting RAM limits"
        NAS1[(NAS Drive)]

        subgraph W1["Process 2"]
            D1[(4 GB arXiv<br>Annotations)]
        end
        subgraph W2["Process 3"]
            D2[(4 GB arXiv<br>Annotations)]
        end
        subgraph W3["Process ..."]
            D3[(4 GB arXiv<br>Annotations)]
        end

        subgraph W4["Process 1"]
            D4[(4 GB arXiv<br>Annotations)]
        end

        NAS1 --PDFs--> W1 & W2 & W3 & W4
    end

I needed something better.

The Hack

I went back to my serial processing script. I modified it to process just one directory of files, and wrote a separate "launcher" which I could run from the command line. The launcher would open a new console window for each directory of files: the console would iterate over each PDF, serialize the result and push to DorsalHub in batches.

I estimated I'd want to process eight to ten directories at once. I didn't have the RAM for that if I had to load a 4 GB+ dictionary into each process. But what if they could share?

Redis is an in-memory key-value store which I use in production every day. For a Python process, accessing a key from a shared Redis instance isn't much harder than grabbing a value from a local dictionary, and is more than fast enough for the task at hand. I loaded the arXiv annotations into Redis, keyed on their arXiv ID, and updated my script to retrieve each annotation from Redis. That way I could have as many console windows as I wanted, each processing documents, all using the same in-memory store of annotations.
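The lookup pattern can be sketched as follows. To keep the example runnable without a server, it uses a small in-memory class that mimics the get and set calls of a redis-py client. The "arxiv:" key prefix, the JSON encoding and the function names are all assumptions made for illustration.

```python
import json


class InMemoryClient:
    """Hypothetical stand-in exposing the same get/set calls as a redis-py client."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def set(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str):
        return self._data.get(key)


def store_annotation(client, arxiv_id: str, annotation: dict) -> None:
    """Serialize one annotation to JSON and store it under its arXiv ID."""
    client.set(f"arxiv:{arxiv_id}", json.dumps(annotation))


def fetch_annotation(client, arxiv_id: str):
    """Return the annotation for an arXiv ID, or None if it is not in the store."""
    raw = client.get(f"arxiv:{arxiv_id}")
    return json.loads(raw) if raw is not None else None


# With redis-py installed and a server running, this could instead be
# redis.Redis(host="localhost", port=6379, decode_responses=True).
client = InMemoryClient()
store_annotation(client, "1706.03762", {"title": "Attention Is All You Need"})
print(fetch_annotation(client, "1706.03762"))
```

Because every worker talks to the same store, the annotations live in memory once rather than once per process.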

graph LR
    subgraph "Shared Redis"
        NAS2[(NAS Drive)] --PDFs--> C1[Console 1] & C2[Console 2] & C3[Console 3] & C4[Console ...]
        C1 & C2 & C3 & C4 <--> |Annotation lookup| R[(Shared Redis)]
    end

It was beautiful. With the script running, my desktop looked like something out of a cheesy 90s hacker movie. The launcher would handle the spawning of console windows, and as soon as one finished and closed, another was opened. I tinkered with the spawn count until I reached a balance which optimized both the global processing rate and the NAS volume utilization.
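A launcher along these lines can be sketched with the standard library's subprocess module. This is an illustration of the "keep N children running" idea rather than the actual launcher; the function name and polling interval are assumptions.

```python
import subprocess
import time


def run_bounded(commands: list[list[str]], max_parallel: int) -> list[int]:
    """Run commands as child processes, never more than max_parallel at a time."""
    pending = list(commands)
    running: list[subprocess.Popen] = []
    exit_codes: list[int] = []
    while pending or running:
        # Top up the pool whenever there is a free slot.
        while pending and len(running) < max_parallel:
            running.append(subprocess.Popen(pending.pop(0)))
        # Collect any children that have finished, then keep the rest.
        still_running = []
        for child in running:
            code = child.poll()
            if code is None:
                still_running.append(child)
            else:
                exit_codes.append(code)
        running = still_running
        time.sleep(0.05)
    return exit_codes
```

Each command would be something like a Python interpreter, a per-directory script and a directory path. On Windows, passing creationflags=subprocess.CREATE_NEW_CONSOLE to Popen should give each child its own console window.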

Screenshot of the 'Redis Insight' GUI tool, showing a single annotation.
Redis Insight showing the in-memory arXiv annotation db.
A Windows desktop screenshot, with some old-style Windows command windows layered over each other.
My desktop processing PDFs to publish to the API.

Over the next two days, the script worked its way through the 4.6 million PDF mirror. Console windows would pop up, briefly interrupting my work; a reassuring reminder that another batch had finished.

By Friday afternoon the backfill was complete.

The Autopsy

Let's look again at 1706.03762v7.pdf from the GCS arXiv bucket:

dorsal file scan /mnt/b/arxiv_mirror/pdf/1706/1706.03762v7.pdf
📄 Scanning metadata for 1706.03762v7.pdf
╭────────────────────────── File Record: 1706.03762v7.pdf ───────────────────────────╮
│                                                                                    │
│  Hashes                                                                            │
│       SHA-256:  b7d72988fd8107d07f7d278bf0ba6621adb6ed47df74be4014fa4a01f03aff6a   │
│                                                                                    │
│  File Info                                                                         │
│     Full Path:  /mnt/b/arxiv_mirror/pdf/1706/1706.03762v7.pdf                      │
│      Modified:  2023-08-03 01:07:33                                                │
│          Name:  1706.03762v7.pdf                                                   │
│          Size:  2 MiB                                                              │
│    Media Type:  application/pdf                                                    │
│                                                                                    │
│  Tags                                                                              │
│        No tags found.                                                              │
│                                                                                    │
│  Pdf Info → file/pdf                                                               │
│           creator:  LaTeX with hyperref                                            │
│          producer:  pdfTeX-1.40.25                                                 │
│           version:  1.5                                                            │
│        page_count:  15                                                             │
│     creation_date:  2023-08-03T00:07:29Z                                           │
│     modified_date:  2023-08-03T00:07:29Z                                           │
│                                                                                    │
│                                                                                    │
╰────────────────────────────────────────────────────────────────────────────────────╯

The creation_date field under the PDF Info annotation is 2023-08-03T00:07:29Z. If we compare this timestamp to the submission date recorded on arXiv for that version of the paper (2023-08-03T00:41:18Z) we can see that they closely line up. The document was compiled by arXiv's backend when it was submitted.

This is something we might expect, and it aligns with my earlier mental model of arXiv as a static cache for PDFs. Under this mental model:

  1. A paper is submitted to arXiv - typically as a LaTeX bundle.
  2. arXiv compiles it to a PDF (in this example using version 1.40.25 of pdfTeX).
  3. Later, when someone visits the PDF download URL, they are served the compiled PDF.

With that in mind, let's look once more at another 1706.03762v7.pdf, this one downloaded from the arXiv website in February 2026:

dorsal file scan /mnt/c/Users/Rio/Downloads/1706.03762v7.pdf
📄 Scanning metadata for 1706.03762v7.pdf
╭────────────────────────── File Record: 1706.03762v7.pdf ───────────────────────────╮
│                                                                                    │
│  Hashes                                                                            │
│       SHA-256:  bdfaa68d8984f0dc02beaca527b76f207d99b666d31d1da728ee0728182df697   │
│                                                                                    │
│  File Info                                                                         │
│     Full Path:  /mnt/c/Users/Rio/Downloads/1706.03762v7.pdf                        │
│      Modified:  2026-02-07 13:43:07                                                │
│          Name:  1706.03762v7.pdf                                                   │
│          Size:  2 MiB                                                              │
│    Media Type:  application/pdf                                                    │
│                                                                                    │
│  Tags                                                                              │
│        No tags found.                                                              │
│                                                                                    │
│  Pdf Info → file/pdf                                                               │
│           creator:  LaTeX with hyperref                                            │
│          producer:  pdfTeX-1.40.25                                                 │
│           version:  1.5                                                            │
│        page_count:  15                                                             │
│     creation_date:  2024-04-10T21:11:43Z                                           │
│     modified_date:  2024-04-10T21:11:43Z                                           │
│                                                                                    │
│                                                                                    │
╰────────────────────────────────────────────────────────────────────────────────────╯

According to the creation_date field, this variant of the PDF was compiled from its LaTeX source a full 8 months after its submission date.
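The gap is easy to check with the standard library. The sketch below uses the three timestamps quoted above; the helper names and the 31-day reading of "near submission" are assumptions.

```python
from datetime import datetime, timedelta


def parse_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def compiled_near_submission(created: str, submitted: str, window_days: int = 31) -> bool:
    """True if the PDF was created within window_days of the submission time."""
    gap = abs(parse_utc(created) - parse_utc(submitted))
    return gap <= timedelta(days=window_days)


submitted = "2023-08-03T00:41:18Z"
print(compiled_near_submission("2023-08-03T00:07:29Z", submitted))  # True (GCS variant)
print(compiled_near_submission("2024-04-10T21:11:43Z", submitted))  # False (web variant)
print((parse_utc("2024-04-10T21:11:43Z") - parse_utc(submitted)).days)  # 251
```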

This demonstrates that arXiv does not fit neatly into the static cache model. Under a static cache model, after a document is submitted to arXiv, it is compiled once. Then that compiled document is served forever.

Instead we should probably see arXiv as a PDF generation system, as it is still able and willing to generate PDFs from source much later than submission.

But how much later?

The Bucket

To get a fuller picture, I tracked the creation_date field across the entire set of 4.6 million PDFs in the GCS bucket.1

| Year | Total Submitted | Compiled Near Submission | % Compiled Near Submission |
|---|---|---|---|
| 1993 | 5,099 | 0 | 0.0% |
| 1994 | 7,593 | 0 | 0.0% |
| 1995 | 9,929 | 0 | 0.0% |
| 1996 | 15,761 | 0 | 0.0% |
| 1997 | 19,508 | 0 | 0.0% |
| 1998 | 23,858 | 1 | 0.0% |
| 1999 | 27,206 | 2 | 0.0% |
| 2000 | 29,904 | 10 | 0.0% |
| 2001 | 32,161 | 149 | 0.5% |
| 2002 | 35,201 | 436 | 1.2% |
| 2003 | 38,522 | 855 | 2.2% |
| 2004 | 42,805 | 1,362 | 3.2% |
| 2005 | 45,855 | 1,757 | 3.8% |
| 2006 | 49,326 | 2,216 | 4.5% |
| 2007 | 54,114 | 2,547 | 4.7% |
| 2008 | 56,144 | 11,737 | 20.9% |
| 2009 | 60,274 | 17,699 | 29.4% |
| 2010 | 64,768 | 25,986 | 40.1% |
| 2011 | 71,329 | 31,211 | 43.8% |
| 2012 | 79,395 | 42,823 | 53.9% |
| 2013 | 87,943 | 65,404 | 74.4% |
| 2014 | 93,249 | 90,591 | 97.1% |
| 2015 | 101,722 | 99,647 | 98.0% |
| 2016 | 110,207 | 108,099 | 98.1% |
| 2017 | 120,837 | 118,590 | 98.1% |
| 2018 | 138,243 | 135,150 | 97.8% |
| 2019 | 153,544 | 150,332 | 97.9% |
| 2020 | 174,997 | 171,500 | 98.0% |
| 2021 | 179,665 | 176,045 | 98.0% |
| 2022 | 183,957 | 180,818 | 98.3% |
| 2023 | 206,485 | 203,170 | 98.4% |
| 2024 | 241,833 | 237,576 | 98.2% |
| 2025 | 125,701 | 121,700 | 96.8% |
A chart with the title 'How Many arXiv GCS PDFs Were Compiled Near Submission Date? (+1 Month)'. Its x axis is 'Submission Date' from 1992 to 2025. It has two y axes: the left from 0 to 35,000 documents; the right from 0 to 100%. It has three plots and the general trend for all three is upward.
A chart, showing how many arXiv PDFs from the Google Cloud Storage bucket were compiled near to their submission date.

How to Read this Chart

  • The Gray Area: Shows the total arXiv submissions over time. This closely tracks the trend seen in arXiv.org's official monthly submission data.

  • The Blue Area: Shows the absolute count of how many documents have a creation_date value which is within one month of submission.

  • The Red Line: Shows the Blue area as a percentage of the Gray area, i.e. the share of submissions whose PDF was compiled within one month of submission.

Some observations:

  1. Since 2014, the PDFs added to the bucket were almost all compiled around their submission date.

    • Roughly 97% or more of each year's documents have a creation_date field which is within one month of their submission.

    • This suggests that the documents are pushed to the GCS bucket soon after submission, and are not modified at a later date.

  2. Prior to 2012, the majority of documents have a creation_date more than a month after their submission date

    • The proportion of documents which were compiled close to their submission date drops dramatically prior to 2012.

    • Since the GCS bucket did not exist (or at least was not public) until 2020, this indicates regular or occasional re-compiling of documents was well established prior to the GCS bucket launching.

  3. There is some evidence of systematic recompiling of documents

    • As we go further back in time, the red line does not show a gradual tapering off. Instead we see dramatic drops in the number of documents compiled near submission date.

    • The values change most swiftly around 2008 and 2014. This may be evidence of batch PDF recompilation events

To get a clearer idea of the trends in the creation_date field, I subtracted the total papers submitted each month from the number compiled:

| Year | Papers Submitted | PDFs Created | Net PDFs Created | Cumulative Difference |
|---|---|---|---|---|
| 1991 | 259 | 0 | -259 | -259 |
| 1992 | 2540 | 0 | -2540 | -2799 |
| 1993 | 5099 | 0 | -5099 | -7898 |
| 1994 | 7593 | 0 | -7593 | -15491 |
| 1995 | 9929 | 0 | -9929 | -25420 |
| 1996 | 15761 | 6 | -15755 | -41175 |
| 1997 | 19508 | 10 | -19498 | -60673 |
| 1998 | 23858 | 14 | -23844 | -84517 |
| 1999 | 27206 | 23 | -27183 | -111700 |
| 2000 | 29904 | 54 | -29850 | -141550 |
| 2001 | 32161 | 225 | -31936 | -173486 |
| 2002 | 35201 | 631 | -34570 | -208056 |
| 2003 | 38522 | 1165 | -37357 | -245413 |
| 2004 | 42805 | 1988 | -40817 | -286230 |
| 2005 | 45855 | 2591 | -43264 | -329494 |
| 2006 | 49326 | 3041 | -46285 | -375779 |
| 2007 | 54114 | 3419 | -50695 | -426474 |
| 2008 | 56144 | 136911 | 80767 | -345707 |
| 2009 | 60274 | 19493 | -40781 | -386488 |
| 2010 | 64768 | 27747 | -37021 | -423509 |
| 2011 | 71329 | 41806 | -29523 | -453032 |
| 2012 | 79395 | 44800 | -34595 | -487627 |
| 2013 | 87943 | 241387 | 153444 | -334183 |
| 2014 | 93249 | 418734 | 325485 | -8698 |
| 2015 | 101722 | 110201 | 8479 | -219 |
| 2016 | 110207 | 111369 | 1162 | 943 |
| 2017 | 120837 | 121572 | 735 | 1678 |
| 2018 | 138243 | 137874 | -369 | 1309 |
| 2019 | 153544 | 153642 | 98 | 1407 |
| 2020 | 174997 | 174600 | -397 | 1010 |
| 2021 | 179665 | 179933 | 268 | 1278 |
| 2022 | 183957 | 183791 | -166 | 1112 |
| 2023 | 206485 | 206692 | 207 | 1319 |
| 2024 | 241833 | 241843 | 10 | 1329 |
| 2025 | 125701 | 124458 | -1243 | 86 |
A residual plot chart with the title 'Net PDF Compilation for GCS Bucket'. Its x axis is 'Date' from 2000 to 2025. The y axis shows Net PDFs compiled and is a scale from 0 to 120,000. The chart is a bar chart, with negligible activity aside from a huge spike in 2008 up to 120,000 labeled '2008 Recompile' and a flurry of activity throughout 2014 spiking at around 117,000 in a single month
A chart, showing the net PDFs created, by month, in the GCS bucket.
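The net and cumulative columns in the table above are simple to reproduce. This sketch uses the first six rows (1991 to 1996) as input; the variable names are illustrative.

```python
from itertools import accumulate

# First six rows of the table above: papers submitted and PDFs created per year.
submitted = {1991: 259, 1992: 2540, 1993: 5099, 1994: 7593, 1995: 9929, 1996: 15761}
created = {1991: 0, 1992: 0, 1993: 0, 1994: 0, 1995: 0, 1996: 6}

years = sorted(submitted)
# Net PDFs created: PDFs whose creation_date falls in the year, minus that year's submissions.
net = [created[year] - submitted[year] for year in years]
# Running total of the net values.
cumulative = list(accumulate(net))

for year, n, c in zip(years, net, cumulative):
    print(year, n, c)
```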

How to Read this Chart

  • Each bar represents the net PDFs created for one month (e.g. February 2014).

  • Blue bars are negative values, meaning the bucket contains fewer documents whose creation_date matches that month than the month's total submissions.

  • Red bars are positive values, meaning the bucket contains more documents whose creation_date matches that month than the month's total submissions.

Observations:

  1. There is strong evidence of batch PDF compilation "events"

    • Massive positive (red) spikes in 2008 and across 2013 and 2014 show hundreds of thousands of documents being compiled in a relatively short space of time.

    • While we can't be sure why these documents were rebuilt, it is evidence of systematic, bulk recompilation of PDFs.

  2. The original PDFs for older papers no longer exist in the bucket.

    • The vast majority of older documents in the bucket were created long after their submission date.

  3. For papers submitted after 2014, the GCS bucket retains the 'first pressing'.

    • From 2015, the net activity flatlines to near zero. The compilation dates reflect the submission dates.

    • This tells us that once a PDF lands in the bucket, it is not updated. Even if the arXiv frontend regenerates documents from its source, the GCS bucket acts as a time capsule, preserving the "first pressing" of each document.

I could have spent longer profiling the documents in the GCS bucket, but fundamentally, the big question I wanted to answer was this: could I build an annotation retrieval model using the data from the GCS bucket?

To answer this, I needed to actually compare the GCS bucket data with papers downloaded fresh from arXiv.

The Sample

I took a stratified sample of PDFs from the arXiv backfill - 50 per year - and I re-downloaded all of them from the web. I was careful to preserve the full ID for each, and I then used a script to compare the hashes of each to the hashes of the equivalent (same paper, same version) document from the GCS bucket. 1991 to 2026. 36 years. 1800 documents in total.2

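| Submission Year | Total Sampled | Exact Matches | Match Rate |
|---|---|---|---|
| 1991 | 50 | 0 | 0.0% |
| 1992 | 50 | 0 | 0.0% |
| 1993 | 50 | 0 | 0.0% |
| 1994 | 50 | 0 | 0.0% |
| 1995 | 50 | 0 | 0.0% |
| 1996 | 50 | 0 | 0.0% |
| 1997 | 50 | 0 | 0.0% |
| 1998 | 50 | 1 | 2.0% |
| 1999 | 50 | 0 | 0.0% |
| 2000 | 50 | 1 | 2.0% |
| 2001 | 50 | 0 | 0.0% |
| 2002 | 49 | 2 | 4.1% |
| 2003 | 50 | 3 | 6.0% |
| 2004 | 50 | 1 | 2.0% |
| 2005 | 50 | 5 | 10.0% |
| 2006 | 48 | 2 | 4.2% |
| 2007 | 50 | 3 | 6.0% |
| 2008 | 50 | 3 | 6.0% |
| 2009 | 50 | 5 | 10.0% |
| 2010 | 50 | 5 | 10.0% |
| 2011 | 49 | 3 | 6.1% |
| 2012 | 50 | 9 | 18.0% |
| 2013 | 50 | 9 | 18.0% |
| 2014 | 50 | 9 | 18.0% |
| 2015 | 50 | 5 | 10.0% |
| 2016 | 50 | 9 | 18.0% |
| 2017 | 50 | 15 | 30.0% |
| 2018 | 50 | 21 | 42.0% |
| 2019 | 50 | 29 | 58.0% |
| 2020 | 50 | 30 | 60.0% |
| 2021 | 50 | 30 | 60.0% |
| 2022 | 50 | 49 | 98.0% |
| 2023 | 50 | 48 | 96.0% |
| 2024 | 50 | 50 | 100.0% |
| 2025 | 50 | 44 | 88.0% |
| 2026 | 50 | 49 | 98.0% |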
A bar chart, colored from red (lower) through orange and yellow up to green (taller). The X axis (Submission Year) shows 1991 to 2026. The Y axis (SHA-256 Hash Match Rate %) shows 0 to 100%. There are no bars from 1991 to 1997. 1998 to 2016 are red. 2017 to 2021 are orange through to pale green. 2022 to 2026 are all solid green.
A chart showing exact match rates between arXiv web PDFs and GCS PDFs

How to Read this Chart

  • Each bar represents one year and shows the percentage of PDFs that were exact byte-for-byte matches with their GCS bucket counterparts (i.e. same paper, same version).
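A per-year match rate like this can be computed with a few lines of Python. This is a sketch of the idea, not the comparison script itself; the function name and the input shape (year, GCS hash, web hash) are assumptions.

```python
from collections import defaultdict


def match_rate_by_year(pairs):
    """pairs: iterable of (year, gcs_sha256, web_sha256). Returns {year: percent matching}."""
    totals = defaultdict(int)
    matches = defaultdict(int)
    for year, gcs_hash, web_hash in pairs:
        totals[year] += 1
        # Hex digests can differ in letter case, so compare case-insensitively.
        if gcs_hash.lower() == web_hash.lower():
            matches[year] += 1
    return {year: round(100 * matches[year] / totals[year], 1) for year in sorted(totals)}


# The two 1706.03762v7.pdf hashes shown earlier: same paper and version, no match.
pairs = [
    (
        2017,
        "b7d72988fd8107d07f7d278bf0ba6621adb6ed47df74be4014fa4a01f03aff6a",
        "bdfaa68d8984f0dc02beaca527b76f207d99b666d31d1da728ee0728182df697",
    ),
]
print(match_rate_by_year(pairs))  # {2017: 0.0}
```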

All of my doubts were confirmed. The distribution in this bar chart closely mirrors what we saw earlier, only in much starker terms. This was undeniable evidence of documents being systematically rebuilt over time. The further back in time a paper was submitted, the less likely the live PDF was to be identical to the one in the GCS bucket.

It seemed clear to me that the GCS papers were not the ideal fit for my defined task.

Armed with this knowledge, there was only one thing left to do: build the plugin.

The Plugin

Plugins (AKA Dorsal Annotation Models) extend Dorsal's ability to do metadata extraction tasks. Think of them as functions whose input is a file path, and whose output is validated JSON.

The goal of this plugin: output an arXiv annotation which summarizes a PDF.

Despite the bumps along the way, I figured it was worth seeing this through to the end.

So I continued. Maybe it would make a good write-up. A cautionary tale.

The first version of the model was pretty straightforward:

  1. Hash the file bytes to produce a SHA-256 hash, e.g. e481db0333b3e7011406ecd6932d54bcc2829f0da4ffbc87e5552bf07d812985
  2. Query DorsalHub API for that exact hash
  3. If a record is found, parse out the dorsal/arxiv annotation and return it

I sampled a fresh batch of documents from arXiv to verify the model.

Its performance was pretty much what I expected:

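| Submission Year | Sample Size | Success | Fail | Recall (%) |
|---|---|---|---|---|
| 1991 | 50 | 0 | 50 | 0% |
| 1992 | 50 | 0 | 50 | 0% |
| 1993 | 50 | 0 | 50 | 0% |
| 1994 | 50 | 0 | 50 | 0% |
| 1995 | 50 | 0 | 50 | 0% |
| 1996 | 50 | 0 | 50 | 0% |
| 1997 | 50 | 0 | 50 | 0% |
| 1998 | 50 | 0 | 50 | 0% |
| 1999 | 50 | 0 | 50 | 0% |
| 2000 | 50 | 1 | 49 | 2% |
| 2001 | 50 | 0 | 50 | 0% |
| 2002 | 50 | 1 | 49 | 2% |
| 2003 | 50 | 2 | 48 | 4% |
| 2004 | 50 | 5 | 45 | 10% |
| 2005 | 50 | 2 | 48 | 4% |
| 2006 | 50 | 3 | 47 | 6% |
| 2007 | 50 | 2 | 48 | 4% |
| 2008 | 50 | 2 | 48 | 4% |
| 2009 | 50 | 2 | 48 | 4% |
| 2010 | 50 | 5 | 45 | 10% |
| 2011 | 50 | 4 | 46 | 8% |
| 2012 | 50 | 8 | 42 | 16% |
| 2013 | 50 | 9 | 41 | 18% |
| 2014 | 50 | 8 | 42 | 16% |
| 2015 | 50 | 8 | 42 | 16% |
| 2016 | 50 | 8 | 42 | 16% |
| 2017 | 50 | 13 | 37 | 26% |
| 2018 | 50 | 26 | 24 | 52% |
| 2019 | 50 | 27 | 23 | 54% |
| 2020 | 50 | 34 | 16 | 68% |
| 2021 | 50 | 38 | 12 | 76% |
| 2022 | 50 | 46 | 4 | 92% |
| 2023 | 50 | 50 | 0 | 100% |
| 2024 | 50 | 49 | 1 | 98% |
| 2025 | 50 | 46 | 4 | 92% |
| 2026 | 50 | 50 | 0 | 100% |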
A stacked bar chart showing mostly fully red bars. The X axis (Submission Year) shows 1991 to 2026. The Y axis (Number of PDFs) is 0 to 50. There is an exponential distribution of green for the bars, starting from 2000 and peaking in 2026. The chart is 25% green and 75% red.
A chart showing the document-level recall of version 1 of the arXiv PDF model.

How to Read this Chart

  • Each stacked bar represents a sample of 50 PDFs submitted within a single year.

  • The green portion represents files where the model could fetch the metadata

  • The red portion represents files where the model failed to fetch the metadata

Some findings:

  • The model's average Recall was 24.9%, meaning it could return the annotation for just 1 in 4 documents sampled.
  • The recall tapers off to 0% as we go back in time, mirroring earlier findings
  • Conversely, recent documents have the highest recall, suggesting arXiv has not recompiled most of those PDFs from their "first pressing" yet.

While analysing the failures, I spotted a pattern in many of the documents. Here's an example:

dorsal file scan /mnt/c/arxiv_batch/9206023v1.pdf
📄 Scanning metadata for 9206023v1.pdf
╭───────────────────── File Record: 9206023v1.pdf (from cache) ──────────────────────╮
│                                                                                    │
│  Hashes                                                                            │
│       SHA-256:  e481db0333b3e7011406ecd6932d54bcc2829f0da4ffbc87e5552bf07d812985   │
│        BLAKE3:  efe99342b87a09d58e52ad97b99bf25ced1bb94256d3b4d99fb700f424bca20e   │
│                                                                                    │
│  File Info                                                                         │
│     Full Path:  /mnt/c/arxiv_batch/9206023v1.pdf                                   │
│      Modified:  2026-03-01 09:19:24                                                │
│          Name:  9206023v1.pdf                                                      │
│          Size:  271 KiB                                                            │
│    Media Type:  application/pdf                                                    │
│                                                                                    │
│  Tags                                                                              │
│        No tags found.                                                              │
│                                                                                    │
│  Pdf Info → file/pdf                                                               │
│             title:  arXiv:hep-th/9206023v1  4 Jun 1992                             │
│           creator:  dvips(k) 5.86 Copyright 1999 Radical Eye Software              │
│          producer:  GPL Ghostscript GIT PRERELEASE 9.22                            │
│           version:  1.4                                                            │
│        page_count:  33                                                             │
│     creation_date:  2018-10-25T21:30:06-04:00                                      │
│     modified_date:  2018-10-25T21:30:06-04:00                                      │
│                                                                                    │
│                                                                                    │
╰────────────────────────────────────────────────────────────────────────────────────╯

Under the Pdf Info annotation, we can clearly see the title field is already populated with the arXiv ID hep-th/9206023v1.

This document is telling us what it is. Right there, in a standard core PDF metadata field:

strings /mnt/c/arxiv_batch/9206023v1.pdf | grep /Title
/Title(arXiv:hep-th/9206023v1  4 Jun 1992)>>endobj

This provided an easy win for boosting the recall. For each document I could look it up in two ways: first by the content hash and then, if that failed, by its arXiv ID.

The embedded ID wasn't available in all of the failures I inspected, but it was in enough of them to make it worth pursuing.
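Pulling the ID out of a title like that is a job for a regular expression. The pattern below is a sketch that assumes the "arXiv:" prefix seen in the example, and drops the version suffix; it is not necessarily the pattern the plugin uses.

```python
import re

# Old-style IDs look like "hep-th/9206023"; new-style IDs look like "1706.03762".
ARXIV_ID = re.compile(
    r"arXiv:(?P<id>[a-z-]+(?:\.[A-Z]{2})?/\d{7}|\d{4}\.\d{4,5})(?:v\d+)?"
)


def arxiv_id_from_title(title):
    """Return the arXiv ID (without version) found in a PDF title, or None."""
    if not title:
        return None
    match = ARXIV_ID.search(title)
    return match.group("id") if match else None


print(arxiv_id_from_title("arXiv:hep-th/9206023v1  4 Jun 1992"))  # hep-th/9206023
print(arxiv_id_from_title("Attention Is All You Need"))  # None
```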

So I decided to re-index the annotations to DorsalHub. This time, each annotation would be linked to a different hash: a hash which represents the arXiv ID. All I had to do was create a tiny text file for each arXiv ID, containing that ID as a string.

Example: hep-th_9206023.txt whose content is the string hep-th/9206023.
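In code, the mapping from an arXiv ID to its file name and hash could look like this. The sketch assumes the ID is encoded as UTF-8 with no trailing newline, which is a guess at the exact bytes used; the function names are illustrative.

```python
import hashlib


def arxiv_id_filename(arxiv_id: str) -> str:
    """Name of the tiny text file for an ID, e.g. hep-th/9206023 -> hep-th_9206023.txt."""
    return arxiv_id.replace("/", "_") + ".txt"


def arxiv_id_hash(arxiv_id: str) -> str:
    """SHA-256 of the ID string, assuming UTF-8 bytes and no trailing newline."""
    return hashlib.sha256(arxiv_id.encode("utf-8")).hexdigest()


print(arxiv_id_filename("hep-th/9206023"))  # hep-th_9206023.txt
print(arxiv_id_hash("hep-th/9206023"))
```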

graph LR
    B(File Record)
    C[hep-th_9206023.txt] --> B
    A[dorsal/arxiv Annotation] ---> B
    B -- "Publish" --> D[(DorsalHub API)]

Generating hashes for millions of short strings is a blissfully fast task. I linked each arXiv annotation and published to the DorsalHub API. The entire backfill was complete within a couple of hours.

I now had a secondary queryable dataset, to power a lookup process based entirely on the arXiv ID:

flowchart LR
    A@{ shape: doc, label: "9206023.pdf" }

    subgraph PluginBox[Plugin: ArXivPDF]
        B(Hash and Lookup)
    end

    A -- "arXiv ID: hep-th/9206023" --> B
    B -- "076c051fae7b61c757a..." --> C[(DorsalHub API)]
    C -- "Annotation: dorsal/arxiv" --> PluginBox

I then built a Version 2 of the ArXivPDF model, which made use of this new data.

Version 2 of the model adds 3 extra steps over Version 1:

  1. Hash the file bytes to produce a SHA-256 hash,
  2. Query DorsalHub API for that exact hash
  3. If a record is found, parse out the dorsal/arxiv annotation, return it and exit
  4. Attempt to retrieve the arXiv ID from the PDF metadata title field
  5. If the arXiv ID is found, convert it to a SHA-256 hash, and query the DorsalHub API
  6. If a record is found, parse out the dorsal/arxiv annotation, return it and exit.

This turned out to be worth the extra effort. It boosted the recall of the model significantly:

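| Submission Year | Sample Size | Success | Fail | Recall (%) |
|---|---|---|---|---|
| 1991 | 50 | 49 | 1 | 98% |
| 1992 | 50 | 49 | 1 | 98% |
| 1993 | 50 | 39 | 11 | 78% |
| 1994 | 50 | 26 | 24 | 52% |
| 1995 | 50 | 30 | 20 | 60% |
| 1996 | 50 | 35 | 15 | 70% |
| 1997 | 50 | 44 | 6 | 88% |
| 1998 | 50 | 46 | 4 | 92% |
| 1999 | 50 | 47 | 3 | 94% |
| 2000 | 50 | 45 | 5 | 90% |
| 2001 | 50 | 43 | 7 | 86% |
| 2002 | 50 | 44 | 6 | 88% |
| 2003 | 50 | 21 | 29 | 42% |
| 2004 | 50 | 19 | 31 | 38% |
| 2005 | 50 | 12 | 38 | 24% |
| 2006 | 50 | 16 | 34 | 32% |
| 2007 | 50 | 7 | 43 | 14% |
| 2008 | 50 | 7 | 43 | 14% |
| 2009 | 50 | 4 | 46 | 8% |
| 2010 | 50 | 8 | 42 | 16% |
| 2011 | 50 | 10 | 40 | 20% |
| 2012 | 50 | 8 | 42 | 16% |
| 2013 | 50 | 11 | 39 | 22% |
| 2014 | 50 | 12 | 38 | 24% |
| 2015 | 50 | 10 | 40 | 20% |
| 2016 | 50 | 10 | 40 | 20% |
| 2017 | 50 | 13 | 37 | 26% |
| 2018 | 50 | 26 | 24 | 52% |
| 2019 | 50 | 27 | 23 | 54% |
| 2020 | 50 | 34 | 16 | 68% |
| 2021 | 50 | 38 | 12 | 76% |
| 2022 | 50 | 46 | 4 | 92% |
| 2023 | 50 | 50 | 0 | 100% |
| 2024 | 50 | 49 | 1 | 98% |
| 2025 | 50 | 46 | 4 | 92% |
| 2026 | 50 | 50 | 0 | 100% |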
A stacked bar chart of green and red bars. All bars have some green on the bottom and almost all are capped with red. The X axis (Submission Year) shows 1991 to 2026. The Y axis (Number of PDFs) is 0 to 50. The bars are mostly green from 1991 to 2000 and 2018 to 2026, and are mostly red for all remaining years. The chart is 57% green and 43% red.
A chart showing the document-level recall of version 2 of the arXiv PDF model.

Some findings:

  • The model's average Recall was now 57.3%, meaning it could return the annotation for the majority of documents sampled.
  • The recall no longer tapers off to 0% as we go back in time: papers from 1991 to 2002 now see some of the highest recall. It seems likely that the batch recompiling of those 90s documents with Ghostscript included embedding a title in a standard format, e.g. arXiv:hep-th/9206023v1 4 Jun 1992.
  • The recall was lowest among documents published in the 2000s and early 2010s, most of which had not been stamped with a standard title containing a crisp arXiv ID.

Now that I had queryable data where I could provide an arXiv ID and get back an arXiv annotation, one further improvement for the model was staring me in the face: the file name.

When you download a paper from arXiv, most of the time its ID is included in the file name. This is always the case for papers submitted under the current ID scheme, which starts at 0704 (April 2007).

For pre-2007 papers, only a partial identifier is included in the filename. For example, the papers gr-qc/9407013 and chao-dyn/9407013 both download with the exact same default filename: 9407013v1.pdf. This means there's no clean way to map back from the file name to a pre-2007 arXiv ID. For that reason, I had to exclude pre-2007 papers from my filename check logic when building Version 3.
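A filename check along these lines only needs to recognise the newer ID format and reject everything else. The sketch below is one way to do that; the function name and the exact pattern are assumptions.

```python
import re
from pathlib import Path

# New-style IDs: four digits (YYMM), a dot, then four or five digits, plus an optional version.
NEW_STYLE_NAME = re.compile(r"^(?P<id>\d{4}\.\d{4,5})(?:v\d+)?$")


def arxiv_id_from_filename(path):
    """Return the new-style arXiv ID encoded in a PDF's file name, or None."""
    match = NEW_STYLE_NAME.match(Path(path).stem)
    return match.group("id") if match else None


print(arxiv_id_from_filename("1706.03762v7.pdf"))  # 1706.03762
print(arxiv_id_from_filename("9407013v1.pdf"))  # None (old-style name, archive unknown)
```

A renamed file, or a browser-deduplicated name such as "1706.03762v7 (1).pdf", would not match this strict pattern.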

Here's how Version 3 tackles the problem (Steps 7 to 9 are new):

  1. Hash the file bytes to produce a SHA-256 hash.
  2. Query the DorsalHub API for that exact hash.
  3. If a record is found, parse out the dorsal/arxiv annotation, return it and exit.
  4. Attempt to retrieve the arXiv ID from the PDF metadata title field.
  5. If the arXiv ID is found, convert it to a SHA-256 hash, and query the DorsalHub API.
  6. If a record is found, parse out the dorsal/arxiv annotation, return it and exit.
  7. If the arXiv ID was not found, parse it from the filename (post-2007 format only).
  8. Convert the arXiv ID to a SHA-256 hash, and query the DorsalHub API.
  9. If a record is found, parse out the dorsal/arxiv annotation, return it and exit.
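
The nine steps above can be sketched as a single fallback chain. This is an illustration under stated assumptions rather than the plugin's real code: `lookup`, `id_from_title` and `id_from_filename` are hypothetical callables standing in for the DorsalHub API query and the two ID parsers, and hashing the ID as UTF-8 text is my guess at what "convert it to a SHA-256 hash" means.

```python
import hashlib


def sha256_file(path, chunk_size=1 << 20):
    """Step 1: hash the file bytes, reading in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sha256_text(text):
    """ASSUMPTION: the arXiv ID is hashed as its UTF-8 encoded string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_annotation(path, lookup, id_from_title, id_from_filename):
    """Return (annotation, source) or None.

    lookup(hash) is a hypothetical stand-in for the API query and returns
    an annotation or None. The two id_from_* callables take the file path
    and return an arXiv ID string or None.
    """
    # Steps 1-3: exact content-hash match.
    annotation = lookup(sha256_file(path))
    if annotation is not None:
        return annotation, "content-hash"

    # Step 4: try the PDF metadata title field first.
    arxiv_id, source = id_from_title(path), "title"

    # Step 7: only fall back to the filename when the title gave no ID.
    if arxiv_id is None:
        arxiv_id, source = id_from_filename(path), "filename"
    if arxiv_id is None:
        return None

    # Steps 5-6 and 8-9: look up the hash of the ID itself.
    annotation = lookup(sha256_text(arxiv_id))
    return (annotation, source) if annotation is not None else None
```

Passing the lookup and the parsers in as arguments keeps the sketch testable without network access; a dictionary's `.get` method is enough to stand in for the API.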
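
Version 3 recall by submission year:

| Submission Year | Sample Size | Success | Fail | Recall |
|---|---|---|---|---|
| 1991 | 50 | 49 | 1 | 98% |
| 1992 | 50 | 49 | 1 | 98% |
| 1993 | 50 | 39 | 11 | 78% |
| 1994 | 50 | 26 | 24 | 52% |
| 1995 | 50 | 30 | 20 | 60% |
| 1996 | 50 | 35 | 15 | 70% |
| 1997 | 50 | 44 | 6 | 88% |
| 1998 | 50 | 46 | 4 | 92% |
| 1999 | 50 | 47 | 3 | 94% |
| 2000 | 50 | 45 | 5 | 90% |
| 2001 | 50 | 43 | 7 | 86% |
| 2002 | 50 | 44 | 6 | 88% |
| 2003 | 50 | 21 | 29 | 42% |
| 2004 | 50 | 19 | 31 | 38% |
| 2005 | 50 | 12 | 38 | 24% |
| 2006 | 50 | 16 | 34 | 32% |
| 2007 | 50 | 37 | 13 | 74% |
| 2008 | 50 | 50 | 0 | 100% |
| 2009 | 50 | 50 | 0 | 100% |
| 2010 | 50 | 50 | 0 | 100% |
| 2011 | 50 | 50 | 0 | 100% |
| 2012 | 50 | 50 | 0 | 100% |
| 2013 | 50 | 50 | 0 | 100% |
| 2014 | 50 | 50 | 0 | 100% |
| 2015 | 50 | 50 | 0 | 100% |
| 2016 | 50 | 50 | 0 | 100% |
| 2017 | 50 | 50 | 0 | 100% |
| 2018 | 50 | 50 | 0 | 100% |
| 2019 | 50 | 50 | 0 | 100% |
| 2020 | 50 | 50 | 0 | 100% |
| 2021 | 50 | 50 | 0 | 100% |
| 2022 | 50 | 50 | 0 | 100% |
| 2023 | 50 | 50 | 0 | 100% |
| 2024 | 50 | 50 | 0 | 100% |
| 2025 | 50 | 50 | 0 | 100% |
| 2026 | 50 | 50 | 0 | 100% |
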
A stacked bar chart showing mostly fully green bars. The X axis (Submission Year) shows 1991 to 2026. The Y axis (Number of PDFs) is 0 to 50. There is a W-shaped distribution of red caps for the bars from 1991 to 2007. The chart is 86% green and 14% red.
A chart showing the document-level recall of version 3 of the arXiv PDF model.

Some findings:

  • This model's average Recall is 86.2%. We have perfect recall after 2007, meaning all files sampled which could leverage the new filename-based approach were successful.

  • This is a testament to how complete the original arXiv Dataset is.
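
For anyone who wants to check the headline figure, here is a small sketch that recomputes recall from the per-year success counts in the Version 3 table. The names are my own, and it assumes the table's sample size of 50 PDFs per year.

```python
# Successes per submission year, copied from the Version 3 table.
SUCCESSES = {
    1991: 49, 1992: 49, 1993: 39, 1994: 26, 1995: 30, 1996: 35,
    1997: 44, 1998: 46, 1999: 47, 2000: 45, 2001: 43, 2002: 44,
    2003: 21, 2004: 19, 2005: 12, 2006: 16, 2007: 37,
}
# Every year from 2008 to 2026 scored 50 out of 50.
SUCCESSES.update({year: 50 for year in range(2008, 2027)})

SAMPLE_PER_YEAR = 50  # assumption: the table's per-year sample size


def recall(years):
    """Document-level recall: retrieved annotations / sampled PDFs."""
    years = list(years)
    return sum(SUCCESSES[y] for y in years) / (SAMPLE_PER_YEAR * len(years))


print(f"overall:   {recall(SUCCESSES):.1%}")            # 86.2%
print(f"1991-2007: {recall(range(1991, 2008)):.1%}")    # 70.8%
print(f"2008-2026: {recall(range(2008, 2027)):.1%}")    # 100.0%
```

Splitting at 2008 shows that every remaining miss sits in the years where the filename fallback cannot be used.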

By this point, my weekend project had gone on for close to three weeks. While I could probably tinker around the edges and improve it here or there, the model does what I set out to do: it demonstrates annotation retrieval.

You can find the final model here: https://dorsalhub.com/models/dorsalhub/arxiv-pdf

Feel free to try it out! To date I've only backfilled arXiv annotations up to the end of January (hashes) and February (IDs), but you should expect reasonable performance for records before that.

And if you work (or play) with file metadata, in any capacity, please try out Dorsal.


  1. For these visuals, the data was filtered to only include v1 of each document, though there's no indication that other versions exhibit different behavior. The submission figures also omit documents where the Creation Date of the PDF is not reported by the PDF compilation tool, including a significant number of documents compiled with GenPDF in 2025. 

  2. The working sample was 1796. From the sample I generated, four of the 1800 PDFs failed to download. One had been withdrawn; for the remaining three, arXiv reported "Our automated source to PDF conversion system has failed to produce PDF for the paper" (example). This was interesting in and of itself, so I chose not to resample.