Category: Digging into Data

London’s Text-Mined Hinterlands for the Social Science History Association

The map below visualizes the text-mined data produced by the Trading Consequences project. We queried the database to identify all the commodities with a strong relationship to London and then found every other location where the text mining pipeline identified a relationship with those commodities at least 10 times in a given year. This results in 111,977 rows of data, each representing between 10 and 2,841 commodity-place relationships. I will present this data visualization to the Social Science History Association meeting in Toronto this November.
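The two-step query described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual query: the row layout, the sample rows, the function name `london_linked_rows`, and the `hub_min` cut-off for a "strong" relationship with London are all assumptions made for the example. Only the threshold of 10 relationships per year comes from the description above.

```python
from collections import defaultdict

# Hypothetical rows: (commodity, place, year, number of mined relationships).
ROWS = [
    ("sugar", "London", 1850, 120),
    ("sugar", "Cuba", 1850, 45),
    ("sugar", "Calais", 1850, 4),
    ("leather", "London", 1822, 60),
    ("leather", "Calais", 1822, 12),
    ("guano", "Lima", 1850, 30),
]


def london_linked_rows(rows, hub="London", hub_min=50, place_min=10):
    """Keep rows for commodities tied to the hub, at other places, above a threshold."""
    # Step 1: total the relationships each commodity has with the hub city.
    hub_totals = defaultdict(int)
    for commodity, place, _year, count in rows:
        if place == hub:
            hub_totals[commodity] += count
    # hub_min is an assumed definition of a "strong" relationship.
    strong = {c for c, total in hub_totals.items() if total >= hub_min}

    # Step 2: keep every other place where one of those commodities
    # appears at least place_min times in a given year.
    return [
        row for row in rows
        if row[0] in strong and row[1] != hub and row[3] >= place_min
    ]


for row in london_linked_rows(ROWS):
    print(row)
```

In this toy data, guano drops out because it has no relationship with London, and the sugar-Calais row drops out because it falls below the yearly threshold.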

The map above uses CartoDB’s Torque Cat animation to visualize the data as it changes over time. It only distinguishes 10 different commodities, which is already too many to really follow, and displays the remaining commodities in the Other category. The word cloud below shows all of the commodities and ranks them by the number of places and number of years they met the 10 relationships threshold (i.e. the words are bigger if a commodity had a lot of mined relationships with different places and these relationships remained consistent across the whole century).

It is also possible to look at all of the data from the whole of the nineteenth century to see the locations with a high intensity of relationships with numerous commodities that also have a strong relationship with London.


[This map looks better when you zoom in.]
I should note that this data does not confirm a direct relationship with London, and not all of these locations are a part of the city’s increasingly global hinterlands. Some locations would be competing markets sourcing the same materials or producing the same goods as London. British ports were also waystations where goods from the world were transhipped and sent on to other European centres. The text mining identified when a commodity term, like sugar, was in the same sentence as a place name. The text mining shows a strong correlation between London and sugar and a strong correlation between Cuba and sugar. In this case, I know from other sources that Cuba was among the numerous suppliers of sugar to London. We cannot simply assume, however, that the strong correlation between leather and Calais in 1822 meant the French port supplied London with leather in that year. Calais could have been a market for London’s leather or a competitor. To focus the map on London’s hinterlands exclusively, I would need to filter out results based on additional research and an extensive ground-truthing exercise. It would probably be more accurate to say these maps help illuminate the geography of commodities related to London in the nineteenth century, but this data and the visualizations remain a starting point for further research (like the research I’m doing with Andrew Watson on leather).
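The limitation described above follows from how sentence-level co-occurrence works. The sketch below is a simplified stand-in for the real text mining pipeline, which is far more sophisticated: the term lists, the sample text, and the function name `cooccurrences` are hypothetical. It counts a commodity-place pair whenever both appear in the same sentence, and nothing more.

```python
import re
from collections import Counter

# Hypothetical, tiny term lists; the real pipeline uses much larger lexicons.
COMMODITIES = {"sugar", "leather"}
PLACES = {"London", "Cuba", "Calais"}

SAMPLE = (
    "Sugar arrived in London from Cuba. "
    "Leather was sold at Calais. "
    "The weather in London was poor."
)


def cooccurrences(text):
    """Count (commodity, place) pairs that share a sentence."""
    counts = Counter()
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = set(re.findall(r"[A-Za-z]+", sentence))
        lowered = {w.lower() for w in words}
        for commodity in COMMODITIES:
            if commodity not in lowered:
                continue
            for place in PLACES:
                if place in words:
                    counts[(commodity, place)] += 1
    return counts


for pair, n in sorted(cooccurrences(SAMPLE).items()):
    print(pair, n)
```

The output links sugar to both London and Cuba, and leather to Calais, but it records nothing about the direction of trade or about any link between the places themselves. That gap is what additional research and ground-truthing would have to fill.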

You can download the data as a CSV file with this link.

Here is the abstract for the SSHA paper I’m co-authoring with Bea Alex and Uta Hinrichs:
Visualizing Text Mined Geospatial Results: Exploring the Trading Consequences Database.

Trading Consequences Cinchona Data in Voyant Tools

I am working on an abstract for the ESEH in France next summer. I plan to focus on the role of an industrialist, J.E. Howard, in supporting the efforts of British government officials and economic botanists to establish cinchona plantations in Asia. I’ve done a lot of archival research on this topic, but I thought it would be interesting to see what I could find in the Trading Consequences database. The Location Cloud Visualization clearly shows the geographic transfer of cinchona to India and Ceylon, but I needed to dig down past our web visualizations to see what the database has to say about a particular person. To do this, I extracted every sentence that mentions the commodity cinchona in the Trading Consequences corpus, ordered them by their year and exported a text file from the database. This yields a file with 3762 sentences that mention cinchona.
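An extraction like the one described above might look something like this. It is a sketch under stated assumptions: the Trading Consequences schema is not described here, so the `sentences` table, its `year` and `text` columns, and the function name `export_sentences` are hypothetical, and an in-memory SQLite database with invented rows stands in for the project database.

```python
import io
import sqlite3


def export_sentences(conn, term, out):
    """Write every sentence containing `term` to `out`, ordered by year."""
    query = "SELECT year, text FROM sentences WHERE text LIKE ? ORDER BY year"
    written = 0
    for year, text in conn.execute(query, (f"%{term}%",)):
        out.write(f"{year}\t{text}\n")
        written += 1
    return written


# Hypothetical stand-in for the project database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sentences (year INTEGER, text TEXT)")
conn.executemany(
    "INSERT INTO sentences VALUES (?, ?)",
    [
        (1880, "Cinchona bark from Ceylon reached the London market."),
        (1861, "Seeds of cinchona were sent to India."),
        (1870, "Tea exports rose sharply."),
    ],
)

buffer = io.StringIO()  # swap for open("cinchona.txt", "w") to write a file
written = export_sentences(conn, "cinchona", buffer)
print(written)
print(buffer.getvalue())
```

Keeping the year at the start of each line preserves the chronological order when the file is loaded into a tool that reads the text from top to bottom.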

Uploading this data into Voyant Tools makes it easy to explore some of the patterns in the text as it changes over the course of the nineteenth century. For example, we can see the initial importance of India (which would include mentions of the East India Company) and the growing significance of Ceylon and Java as the century went on. It is also notable that Peru and Peruvian were relatively less significant locations in these British government documents.

Using the same tool, we can see the rise and decline in popularity of an alternative spelling of cinchona, “chinchona”, during the middle of the 19th century.

Howard & Son’s factory at City Mills in West Ham

More to the point, we can search for the last names of five of the key individuals involved in the transfer of cinchona: Clements Markham, Richard Spruce, the father and son William and Joseph Hooker, and John Eliot Howard. Markham was an India Office geographer who led an expedition to Peru to steal cinchona seeds. Spruce, a botanist, collected further seeds from New Granada. The Hookers were both directors of Kew Gardens, with Joseph taking over from his father in 1865. Howard was one of the sons in the Howard & Sons company, which produced much of the quinine manufactured in Britain. In addition to his expertise as a manufacturer, Howard was a leading expert on the botany of cinchona. The visualization below shows that while Markham, Spruce and William Hooker were key figures in the initial planning and transfers of the early 1860s, Howard gains significance in the corpus in the decades that follow.

The real power of Voyant is that once you identify an interesting trend in the data, it is possible to click on the spike for Howard in the chart above and update some of the other visualizations. Below you can see “Howard” as a keyword in context during the spike, and further down you can see the actual sentences where Howard is mentioned. With a little more work I could have included the URL for the original document page.

Database and Visualisations Launched

From the Trading Consequences Blog: Today we are delighted to officially announce the launch of Trading Consequences! Over the course of the last two years the project team have been hard at work to use text mining, traditional and innovative historical research methods, and visualization techniques, to turn digitized nineteenth century papers and trading records (and their OCR’d text) into a unique database of commodities and engaging visualization and search interfaces to explore that data. Today we launch the database, searches and visualization tools alongside the Trading Consequences White Paper, which charts our work on the project including technical approaches, some of the challenges we faced, and what and how we have achieved during the project. The White Paper also discusses, in detail, how we built the tools we are launching today and is therefore an essential point of reference for those wanting to better understand how data is presented in our interfaces, how these interfaces came to be, and how you might best use and interpret the data shared in these resources in your own historical research.

 

Text Mining 19th Century Place Names

By Jim Clifford

Nineteenth-century place names are a major challenge for the Trading Consequences project. The Edinburgh Geoparser uses the Geonames Gazetteer to supply crucial geographic information, including the place names themselves, their longitudes and latitudes, and population data that helps the algorithms determine which “Toronto” is most likely mentioned in the text (there are a lot of Torontos). Based on the first results from our tests, the Geoparser using Geonames works remarkably well. However, it often fails for historic place names that are not in the Geonames Gazetteer. Where is “Lower Canada” or the “Republic of New Granada”? What about all of the colonies created during the Scramble for Africa, but renamed after decolonization? Some of these terms are in Geonames, while others, such as Ceylon and Oil Rivers Protectorate, are not. Geonames also lacks many of the regional terms often used in historical documents, such as “West Africa” or “Western Canada”.

To help reduce the number of missed place names or errors in our text-mined results, we asked David Zylberberg, who did great work annotating our test samples, to help us solve many of the problems he identified. A draft of his new Gazetteer of missing 19th century place names is displayed above. Some of these are place names David found in the 150-page test sample that the prototype system missed. This includes some common OCR errors and a few longer forms of place names that are found in Geonames, which don’t totally fit within the 19th century place name gazetteer, but will still be helpful for our project. He also expanded beyond the place names he found in the annotation by identifying trends. Because our project focuses on commodities in the 19th century British world, he worked to identify abandoned mining towns in Canada and Australia. He also did a lot of work in identifying key place names in Africa, as he noticed that the system seemed to work in South Asia a lot better than it did in Africa. Finally, he worked on Eastern Europe, where many German place names changed in the aftermath of the Second World War. Unfortunately, some of these locations were already alternate names in Geonames, and we solved this problem by changing the geoparser settings, which made David’s work on Eastern Europe and a few other locations redundant. Nonetheless, we now have the beginnings of a database of place names and region names missing from the standard gazetteers. We plan to publish this database in the near future and invite others to use and add to it. This work is at an early stage, so we’d be very interested to hear from others about how they’ve dealt with similar issues related to text-mining historical documents.
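The idea of a supplementary gazetteer can be sketched as a simple fallback lookup. This is an illustration only and does not reflect how the Edinburgh Geoparser is implemented: the dictionaries, the `Place` type, and the `resolve` function are hypothetical, the coordinates and populations are rough placeholder values rather than Geonames records, and population is treated here as the only disambiguation signal, which is a simplification.

```python
from typing import NamedTuple, Optional


class Place(NamedTuple):
    name: str
    lat: float
    lon: float
    population: int


# Hypothetical stand-in for a modern gazetteer; values are rough placeholders.
MODERN = {
    "toronto": [
        Place("Toronto, Ontario", 43.7, -79.4, 2_700_000),
        Place("Toronto, Ohio", 40.5, -80.6, 5_000),
    ],
}

# Hypothetical supplementary gazetteer of historic names (placeholder values).
HISTORIC = {
    "lower canada": Place("Lower Canada", 47.0, -71.0, 0),
}


def resolve(name: str) -> Optional[Place]:
    """Prefer the most populous modern match, then fall back to historic names."""
    key = name.strip().lower()
    candidates = MODERN.get(key)
    if candidates:
        return max(candidates, key=lambda p: p.population)
    return HISTORIC.get(key)


for query in ("Toronto", "Lower Canada", "Atlantis"):
    print(query, "->", resolve(query))
```

Checking the modern gazetteer first means the historic list only needs to hold names that the standard gazetteer lacks, which matches the scope of the draft described above.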

Plant Diseases in the 19th Century

A word cloud of diseases found in The Diseases of Tropical Plants by Melville Thurston Cook

During the 19th century British industrialists and botanists searched the world for economically useful plants. They moved seeds and plants between continents and developed networks of trade and plantations to supply British industries and consumers. This global network also spread diseases. Stuart McCook is working on the history of Coffee Rust (Hemileia vastatrix) and there are a few books that examine the diseases that prevented Brazil from developing rubber plantations. Building on this work, we’re using the Trading Consequences text mining pipeline to try to explore the wider trends of plant diseases as they spread through the trade and plantation network.

We need a list of diseases with both the scientific and common names from the time period. The Internet Archive provides a number of textbooks from the end of the 19th and start of the 20th century. They were written by American botanists, but one book in particular attempts a global survey of tropical plant diseases (The Diseases of Tropical Plants). Because these books are organized in an encyclopedic fashion, it is relatively easy to have a student go through and create a list of plant diseases. We’re working on expanding our list from other sources over the next few weeks. Once the list is complete we’ll add it to our pipeline and extract relationships between mentions of these diseases, locations, dates and commodities in our corpus of 19th century documents. This should allow us to track Sooty Mould, Black Rot, Fleshy Fungi, Coffee Leaf Rust and hundreds of other diseases at points in time when they became enough of a problem to appear in our document collection.

 

How to Build a Macroscope

Timothy Bristow, a digital humanities librarian and Trading Consequences team member, and I are hosting a one-day workshop on text mining in the humanities in the library at York University:

A macroscope is designed to capture the bigger picture, to render visible vastly complex systems. Large-scale text mining offers researchers the promise of such perspective, while posing distinct challenges around data access, licensing, dissemination, and preservation, digital infrastructure, project management, and project costs. Join our panel of researchers, librarians, and technologists as they discuss not only the operational demands of text mining the humanities, but also how Ontario institutions can better support this work.