20. Accessing data from digitised resources

Trove’s digitised resources are delivered in a number of different ways depending on their format and arrangement. See Digitised content viewers and Downloading data from the Trove web interface for hints on using digitised resources through Trove’s web interface.

Access to machine-readable data is even more complicated. The Trove API provides limited information about digitised resources, necessitating a variety of hacks and workarounds. Nonetheless, there are some methods of accessing data from digitised resources that work reliably across multiple formats. These are described below.

For more specific information relating to particular formats see:

20.1. Metadata

There are three main sources of machine-readable metadata describing digitised resources:

  • search results delivered by the Trove API

  • individual work/version records delivered by the Trove API

  • JSON embedded in the digitised resource viewer

20.1.1. Search results delivered by the Trove API

Finding digitised resources is not straightforward, so it might take some experimentation to build a query that meets your needs. Once you’ve constructed your search you can harvest the complete set of results using the Trove API. However, because of the way digitised resources are arranged and described, a simple harvest of work records is likely to miss some digitised resources and include duplicate records for others. To construct a dataset of digitised resources that is as complete as possible and yet contains no duplicates, you need to join a number of different processing steps together. This strategy is described in detail in HOW TO: Harvest data relating to digitised resources.

20.1.2. Work/version records delivered by the Trove API

You can request information about an individual work using the Trove API’s /work endpoint. For example, the pamphlet The gold-finder of Australia : how he went, how he fared, how he made his fortune has the work identifier 9453675. You can request metadata about it using the url:

https://api.trove.nla.gov.au/v3/work/9453675?encoding=json&reclevel=full


The link to view a digitised item in one of Trove’s digitised resource viewers is contained in the identifier field. You need to loop through the entries in identifier looking for one that has linktype set to fulltext and a value (the url) that contains "nla.obj". For example:

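import requests

# Replace with your own Trove API key
YOUR_API_KEY = "YOUR_API_KEY"
headers = {"X-API-KEY": YOUR_API_KEY}

response = requests.get(
    "https://api.trove.nla.gov.au/v3/work/9453675?encoding=json&reclevel=full",
    headers=headers,
)
data = response.json()

# Print the first fulltext link that points to a digitised object.
# Nothing is printed if the record has no matching link.
for link in data["identifier"]:
    if link.get("linktype") == "fulltext" and "nla.obj" in link.get("value", ""):
        print(link["value"])
        break

http://nla.gov.au/nla.obj-248742150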

Work records can combine metadata from digitised and non-digitised versions, so the information in the top-level record might not accurately represent what’s been digitised. For example, the API response for The gold-finder of Australia gives the date of the publication as 1853-1973, munging together the original publication date and the date of a later reproduction. For this reason, you will probably want to access the individual version records for any work that includes digitised resources. You do this by setting the include parameter to workVersions:

https://api.trove.nla.gov.au/v3/work/9453675?encoding=json&reclevel=full&include=workVersions

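As a minimal sketch, the two request urls above can be built from the same set of parameters. The helper names `work_url` and `get_work` are hypothetical, and the example uses the standard library's urllib rather than requests so that it has no dependencies.

```python
import json
import urllib.request
from urllib.parse import urlencode

API_BASE = "https://api.trove.nla.gov.au/v3"


def work_url(work_id, include_versions=False):
    """Build the url for a /work request, optionally asking for version records."""
    params = {"encoding": "json", "reclevel": "full"}
    if include_versions:
        params["include"] = "workVersions"
    return f"{API_BASE}/work/{work_id}?{urlencode(params)}"


def get_work(work_id, api_key, include_versions=False):
    """Fetch a work record as a dictionary (this makes a network request)."""
    request = urllib.request.Request(
        work_url(work_id, include_versions), headers={"X-API-KEY": api_key}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `get_work("9453675", YOUR_API_KEY, include_versions=True)` should return the work record with its versions included. The layout of the version records is not described here, so inspect the returned dictionary before relying on particular keys.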

What if you only have the nla.obj identifier rather than the work identifier? The API has no endpoint for looking up a digitised resource’s metadata directly from its nla.obj identifier. To find a corresponding work record, you have to search for the digital object identifier using the /result endpoint. This is not an exact search: it will match the identifier wherever it appears in a record, so a search might return multiple results that need manual checking. Setting l-availability to y should help narrow things down. For example, you can run an API search for The gold-finder of Australia using its identifier "nla.obj-248742150".

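A search url of this kind might be constructed as in the sketch below. Only the /result endpoint and the l-availability parameter come from the text above. The `q` and `category` parameter names, the `all` category value, and the use of quotation marks around the identifier are assumptions about the API, and `search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def search_url(nla_obj_id):
    """Build a /result search url for a digital object identifier."""
    params = {
        "q": f'"{nla_obj_id}"',  # ASSUMPTION: query parameter name and quoting
        "category": "all",  # ASSUMPTION: search across every category
        "l-availability": "y",  # from the text: helps narrow things down
        "encoding": "json",
    }
    return f"https://api.trove.nla.gov.au/v3/result?{urlencode(params)}"


print(search_url("nla.obj-248742150"))
```

As with other API requests, the url needs to be requested with your API key in the X-API-KEY header, and any results should be checked to confirm they describe the right resource.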

20.1.3. JSON embedded in the digitised resource viewer

Trove’s digitised resource viewers display limited metadata about each item. But there’s more useful metadata embedded as a JSON string in the HTML code of the page. Methods for accessing and using this metadata are fully documented in HOW TO: Extract additional metadata from the digitised resource viewer, but here’s a quick summary.

To access the embedded metadata you need to load the digitised viewer and then scrape the JSON string from the HTML code. The actual metadata available depends on the format of the resource, but can include:

  • lists of pages in digitised books and periodicals, including individual page identifiers

  • lists of articles in a periodical issue

  • details of digitised images, including pixel dimensions

  • complete MARC records from the NLA catalogue

This metadata can be used to enrich and expand records provided by the Trove API, but it also opens up a number of new possibilities. For example, by accessing information about pages in a book or periodical you can automate the download of OCRd text or images.
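The scraping step described above could look something like the following sketch. It assumes the JSON is assigned to a JavaScript variable named `work` using `JSON.parse(JSON.stringify(...))`. That pattern is an assumption about the viewer's HTML rather than something confirmed here, so check the page source and adjust the regular expression if needed. The sample HTML and its keys are made up for illustration.

```python
import json
import re

# ASSUMPTION: the viewer embeds its metadata in a script using this pattern
WORK_PATTERN = re.compile(r"var work = JSON\.parse\(JSON\.stringify\((\{.*\})")


def extract_embedded_json(html):
    """Return the embedded metadata as a dictionary, or {} if it isn't found."""
    match = WORK_PATTERN.search(html)
    return json.loads(match.group(1)) if match else {}


# Made-up HTML fragment standing in for a viewer page
sample_html = (
    '<script>var work = JSON.parse(JSON.stringify('
    '{"title": "Example", "pages": 2}));</script>'
)
metadata = extract_embedded_json(sample_html)
print(metadata)
```

In practice you would pass in the HTML of a digitised resource viewer page, retrieved from a url such as http://nla.gov.au/nla.obj-248742150.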

20.2. Collections

The NLA’s digitised resources are often presented as ‘collections’. A collection could be the volumes in a multi-volume work, the issues of a periodical, a map series, an album of photographs, or a manuscript collection. While you can use the magazine/title API endpoint to get a list of issues from a periodical, there’s no way to get the contents of other types of collections from the Trove API.

To get machine-readable information about the members of a digitised collection you need to extract information from the browse window of Trove’s digitised collection viewer. This method is fully documented in HOW TO: Get a list of items from a digitised collection.

20.3. Text

Digitised publications like books, pamphlets, and periodicals usually make their contents available as plain text, extracted from the digitised pages using Optical Character Recognition (OCR). There are two main ways of accessing OCRd text computationally:

  • construct download links for a complete publication or range of pages

  • download OCR data for a single page

20.3.2. Download OCR data for a single page

This method is fully documented in HOW TO: Get and use OCR data from a book or periodical page, but here’s a quick summary.

If you know the nla.obj identifier of a specific page in a digitised publication, you can access machine-readable information about the OCR process by simply adding /ocr to the identifier url. For example, this page in Pacific Islands Monthly has the identifier nla.obj-326405522. To retrieve the OCR data you just add /ocr to the identifier:

http://nla.gov.au/nla.obj-326405522/ocr

To find the nla.obj identifiers for all the pages in a publication, you can access the metadata embedded in the digitised book and journal viewer and then extract the page identifiers from the page list.
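Once you have a list of page identifiers, building the OCR urls is a simple loop. This sketch uses hypothetical helpers named `ocr_url` and `get_ocr`, and assumes the /ocr url returns JSON.

```python
import json
import urllib.request


def ocr_url(page_id):
    """Add the /ocr suffix to a page's nla.obj identifier url."""
    return f"http://nla.gov.au/{page_id}/ocr"


def get_ocr(page_id):
    """Fetch OCR data for one page (network request; assumes a JSON response)."""
    with urllib.request.urlopen(ocr_url(page_id)) as response:
        return json.load(response)


# In practice this list would come from the viewer's embedded page list
page_ids = ["nla.obj-326405522"]
urls = [ocr_url(page_id) for page_id in page_ids]
print(urls)
```

If you are requesting OCR data for every page in a long publication, consider pausing between requests.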

The OCR data is quite complex. It contains information about the position of every word on the page. To extract just the text you have to find all the text blocks, then loop through each line and word, stitching them back together as a plain text document. If all you want is the text, the method described above is probably more efficient, but if you’re interested in the layout as well as the content of a page, this method opens up some new possibilities.
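The stitching process can be illustrated with a nested loop. The key names used here (`blocks`, `lines`, `words`, `text`) are placeholders chosen for readability and are not the real field names in Trove's OCR data. Inspect an actual /ocr response and substitute the keys you find.

```python
def ocr_to_text(ocr_data):
    """Join words into lines, and lines into a plain text document."""
    lines = []
    # PLACEHOLDER KEYS: replace with the field names in the real OCR data
    for block in ocr_data.get("blocks", []):
        for line in block.get("lines", []):
            words = [word["text"] for word in line.get("words", [])]
            lines.append(" ".join(words))
    return "\n".join(lines)


# Made-up data with the same general shape: blocks contain lines contain words
sample = {
    "blocks": [
        {
            "lines": [
                {"words": [{"text": "Pacific"}, {"text": "Islands"}]},
                {"words": [{"text": "Monthly"}]},
            ]
        }
    ]
}
text = ocr_to_text(sample)
print(text)
```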

20.4. Images and PDFs

Most digitised resources include images you can download. Images can be digitised versions of visual material such as photographs, maps, or artworks, but they can also be scanned copies of pages in a publication or manuscript collection. There are two main methods for accessing digitised images computationally:

  • construct download links for a range of images

  • construct image urls using nla.obj identifiers

In addition, it’s possible to extract illustrations from pages of digitised books and periodicals by using data generated through the OCR process.

20.4.2. Constructing image urls using nla.obj identifiers

This method is fully documented in HOW TO: Create download links for images using nla.obj identifiers, but here’s a quick summary.

If you know the nla.obj identifier for a page or image, you can download it simply by adding an /image suffix to the identifier url. For example, this photograph of a group of school children with gardening tools has the identifier nla.obj-141828112. To create a direct link to the image, you just add /image to the identifier url:

https://nla.gov.au/nla.obj-141828112/image
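In code, this amounts to a one-line url builder plus a download step. The helper names `image_url` and `download_image` are hypothetical, and the output file name is left to the caller because the image format is not specified here.

```python
import urllib.request


def image_url(identifier):
    """Add the /image suffix to an nla.obj identifier url."""
    return f"https://nla.gov.au/{identifier}/image"


def download_image(identifier, path):
    """Save the image to the given path (this makes a network request)."""
    urllib.request.urlretrieve(image_url(identifier), path)
    return path


print(image_url("nla.obj-141828112"))
```

For example, `download_image("nla.obj-141828112", "school-children")` would save the photograph to a file named school-children in the current directory.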

20.4.3. Extract illustrations from pages of digitised books and periodicals

This method is fully documented in Crop images from pages using the OCR coordinates, but here’s a quick summary.

As described above, if you know the nla.obj identifier of a specific page in a digitised publication, you can access machine-readable information about the OCR process by simply adding /ocr to the identifier url.

Within the OCR data there are zs blocks describing the position of each illustration. You can loop through each of these blocks and use the coordinates to crop the illustrations from the full page image. However, the coordinates in the OCR data are sometimes derived from higher resolution versions of the page images than you can download. To work around this, you can access the metadata embedded in the digitised book and journal viewer, extract the dimensions of the high-resolution version of the page, and then convert the coordinates to work with the downloadable version.
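The coordinate conversion is a matter of scaling each value by the ratio between the two image sizes. In this sketch the `scale_box` helper and the (left, top, right, bottom) ordering are assumptions made for illustration. The way coordinates are actually stored in the OCR data may differ.

```python
def scale_box(box, source_size, target_size):
    """Convert a bounding box between two versions of the same page image.

    box: (left, top, right, bottom) in the high-resolution image (assumed order)
    source_size: (width, height) of the high-resolution image
    target_size: (width, height) of the downloadable image
    """
    x_scale = target_size[0] / source_size[0]
    y_scale = target_size[1] / source_size[1]
    left, top, right, bottom = box
    return (
        round(left * x_scale),
        round(top * y_scale),
        round(right * x_scale),
        round(bottom * y_scale),
    )


# Made-up numbers: the downloadable image is half the size of the original
scaled = scale_box((1000, 2000, 3000, 4000), (4000, 6000), (2000, 3000))
print(scaled)  # (500, 1000, 1500, 2000)
```

A box in this form could then be passed to an image library's crop function, such as Pillow's `Image.crop`, to cut the illustration from the downloaded page image.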

Fig. 20.1 Sample from a collection of cat photos harvested from a search for articles with cat or kitten in their title using the GLAM Workbench