About this project

About & Method

A navigable edition of the Encyclopædia Britannica as printed in Edinburgh, 1771–1860: ten edition-units, 163 volumes, roughly 127,000 pages — 211,481 article pages in 41,842 topic lineages.

The corpus

Page images come from the National Library of Scotland’s Data Foundry, which digitised the early Britannica and released it openly. This project uses one copy of each edition: the 1st (1771), 2nd (1778–83), 3rd (1797), Supplement to the 3rd (1801), 4th (1810), 5th (1815), 6th (1823), Supplement to the 4th–6th (1824), 7th (1842) and 8th (1853–60).

Text extraction

The NLS released OCR text with its scans, made with the best technology of the day, but eighteenth-century print (the long s, dense double columns, worn stereotype plates) held it back. The corpus was re-read for this project with Chandra 2, a vision-language OCR model, running on H100 GPUs at the Digital Research Alliance of Canada. The page stream was then segmented into individual articles and cross-references by a rule-based parser, with repair passes for entries the compositors or the OCR welded together and for headwords swallowed by column banners. Every article page links back to its page scan in the NLS viewer, so any reading here can be checked against the print.

Lineages

Articles are grouped across editions by normalised headword into topic lineages. Consecutive versions are compared by Jaccard similarity J, estimated from bottom-k sketches of 3-gram word shingles: J ≥ 0.9 reads as a reprint, 0.6 ≤ J < 0.9 as revised, 0.2 ≤ J < 0.6 as substantially revised, and J < 0.2 as rewritten. The corpus is strongly bimodal (articles were mostly reprinted or torn up, rarely half-revised), which is itself a finding about how the Britannica was made. Verdicts compare consecutive main editions only; the supplements are separate works. Where distinct topics share one headword (ABERNETHY the minister, the surgeon, and the Perthshire town), members are threaded by text overlap and marked ① ② on the lineage page.
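
The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function names, the tokenisation rule, the hash function and the sketch size k = 256 are all assumptions made for the example. Only the 3-gram word shingles, the bottom-k estimate and the four thresholds come from the description above.

```python
import hashlib
import re


def shingles(text, n=3):
    """Set of n-word shingles from lower-cased text (tokenisation is an assumption)."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash64(s):
    """Stable 64-bit hash of a shingle (the choice of BLAKE2b is illustrative)."""
    digest = hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_k(shingle_set, k=256):
    """Bottom-k sketch: the k smallest hash values of the set."""
    return sorted(_hash64(s) for s in shingle_set)[:k]


def jaccard_estimate(sketch_a, sketch_b, k=256):
    """Estimate Jaccard similarity from two bottom-k sketches.

    The k smallest hashes of the merged sketches form a sample of the
    union; the share of that sample present in both sketches estimates J.
    """
    a, b = set(sketch_a), set(sketch_b)
    sample = sorted(a | b)[:k]
    if not sample:
        return 0.0  # two empty articles: treated as no overlap (an assumption)
    return sum(1 for h in sample if h in a and h in b) / len(sample)


def verdict(j):
    """Map a similarity score to the four verdicts named in the text."""
    if j >= 0.9:
        return "reprint"
    if j >= 0.6:
        return "revised"
    if j >= 0.2:
        return "substantially revised"
    return "rewritten"


def compare(text_a, text_b, k=256):
    """Convenience wrapper: similarity and verdict for two article texts."""
    j = jaccard_estimate(bottom_k(shingles(text_a), k), bottom_k(shingles(text_b), k), k)
    return j, verdict(j)
```

When an article has fewer than k distinct shingles the sketch holds the whole set, so the estimate equals the exact Jaccard similarity; for longer articles it is an approximation whose error shrinks as k grows.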

The map

Gazetteer entries close with printed coordinates (“W. Long. 69. 48. N. Lat. 46. 55.”). These were extracted for 12,328 places and plotted as printed: the world as the Britannica measured it, not as we do. Suspect coordinates (where editions disagree with each other, or where the printed figure disagrees with the place’s actual location) were adjudicated against Wikidata. Verified misprints are plotted at their true position with the printed figure preserved in the popup, while period-typical offsets and colonial homonyms stay where the compositor put them.
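
As an illustration of the extraction step, the sketch below parses the printed form quoted above into signed decimal degrees. It is a hypothetical example rather than the project's parser: the regular expression and the function name are assumptions, it handles only the degrees-and-minutes pattern shown in the quotation, and it takes the printed longitude at face value without correcting for whichever prime meridian the compositor used.

```python
import re

# Matches e.g. "W. Long. 69. 48." or "N. Lat. 46. 55."
# (hemisphere letter, Long/Lat, degrees, minutes). Illustrative pattern only;
# real entries vary and OCR noise will defeat it in places.
COORD = re.compile(r"\b([NSEW])\.\s*(Long|Lat)\.\s*(\d{1,3})\.\s*(\d{1,2})\.")


def parse_printed_coords(text):
    """Return (lat, lon) in signed decimal degrees, or None if either is missing."""
    lat = lon = None
    for hemi, kind, deg, minutes in COORD.findall(text):
        value = int(deg) + int(minutes) / 60
        if hemi in "SW":
            value = -value  # south and west are negative
        if kind == "Lat" and hemi in "NS":
            lat = value
        elif kind == "Long" and hemi in "EW":
            lon = value
    if lat is None or lon is None:
        return None
    return lat, lon
```

Applied to the quoted example, this gives a latitude of 46 + 55/60 degrees north and a longitude of 69 + 48/60 degrees west, returned as a positive and a negative number respectively.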

Tools

The site is static HTML built with Python; search is Pagefind, the map Leaflet. Large-scale passes — OCR repair, coordinate adjudication, entity grounding — were run with Anthropic’s Claude models, as fleets of agents checking suspect data against Wikidata. Errors that remain are this project’s own: OCR noise survives in the text and segmentation is imperfect. Corrections are welcome — file an issue.

Credits

Page images digitised by the National Library of Scotland. Text extraction, article segmentation, lineage analysis, and the map are this project’s own. Jim Clifford, 2026.