25.4. HOW TO: Get text, images, and PDFs using Trove’s download link

25.4.1. Background

You can download text, images, and PDFs from individual digitised items using the Trove web interface. But only the text of periodical articles is available for machine access through the Trove API. This makes it difficult to assemble datasets, or build processing pipelines involving digitised resources.

This page documents a workaround developed by reverse-engineering the download link used by the Trove web interface. You can use it to automate the download of text, images, and PDFs from many digitised resources.

25.4.2. Understanding Trove’s download link

Trove’s digitised object viewers include a ‘Download’ tab that provides options for downloading the current item.

Fig. 25.2 Download options for a digitised book

When you click on the Start download button, your browser actually fires off a request to Trove that looks like this:

https://nla.gov.au/nla.obj-33685055/download?downloadOption=ocr&firstPage=0&lastPage=101

The URL has a fixed structure with a few key parameters:

https://nla.gov.au/nla.obj-[ID]/download?downloadOption=[FORMAT]&firstPage=[FIRST]&lastPage=[LAST]

| parameter | description |
|---|---|
| [ID] | the numeric part of the NLA identifier for the current item or collection, for example 3199043190 from nla.obj-3199043190 |
| downloadOption | the desired format of the download: one of ocr, pdf, zip, or tif (the available options depend on the type of resource) |
| firstPage and lastPage | numbers that define the range of pages or images you want to download. Ranges start at 0, so if a book had fifty pages you’d set firstPage to 0 and lastPage to 49 |

Note that ‘pages’ in this context actually refers to the number of images in the digitised version, rather than the number of printed pages in the original work. This is because the digitised version will typically include images of book covers and endpapers as well as printed pages.
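The URL structure described above can be captured in a small helper. This is only a sketch: the function name build_download_url is made up for illustration, and it takes the full identifier (including the nla.obj- prefix) as its first argument.

```python
from urllib.parse import urlencode


def build_download_url(obj_id, download_option, first_page, last_page):
    """Assemble a download URL (hypothetical helper, not part of Trove)."""
    # Page ranges are zero-based and inclusive
    if first_page < 0 or last_page < first_page:
        raise ValueError("expected 0 <= first_page <= last_page")
    params = urlencode(
        {
            "downloadOption": download_option,
            "firstPage": first_page,
            "lastPage": last_page,
        }
    )
    return f"https://nla.gov.au/{obj_id}/download?{params}"


# Reproduces the request fired by the Start download button in the example above
print(build_download_url("nla.obj-33685055", "ocr", 0, 101))
```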

25.4.3. DIY download links

See also

This method is used in a number of GLAM Workbench notebooks to download the OCRd text of books and periodicals. See, for example, Harvesting the text of digitised books (and ephemera) and Get OCRd text from a digitised journal in Trove.

Once you understand the structure of the download URLs, you can create your own without using the web interface. All you need to know is a resource’s NLA identifier and the number of pages/images it contains.

For example, The gold finder of Australia : how he went, how he fared, how he made his fortune is a pamphlet published in 1853. Its NLA identifier is nla.obj-248742150 and it has 80 pages. To download all of the OCRd text from this pamphlet, you’d insert the identifier and set lastPage to 79 (80 minus 1):

https://nla.gov.au/nla.obj-248742150/download?downloadOption=ocr&firstPage=0&lastPage=79

Or perhaps you’d like the whole pamphlet as a PDF? Just change downloadOption from ocr to pdf:

https://nla.gov.au/nla.obj-248742150/download?downloadOption=pdf&firstPage=0&lastPage=79
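If you’d rather do this from a script, something like the following might work. It’s a minimal sketch using only Python’s standard library. The function names whole_item_url and save_download are made up for illustration, and the output filename is arbitrary. The download call itself is left commented out because it makes a network request, and I haven’t confirmed here how Trove responds to scripted requests.

```python
import shutil
import urllib.request


def whole_item_url(obj_id, num_pages, download_option="ocr"):
    """Build a URL covering every page of an item (hypothetical helper)."""
    # Ranges start at 0, so the last page is the page count minus 1
    last_page = num_pages - 1
    return (
        f"https://nla.gov.au/{obj_id}/download"
        f"?downloadOption={download_option}&firstPage=0&lastPage={last_page}"
    )


def save_download(url, path, timeout=60):
    """Stream the response to a file without loading it all into memory."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with open(path, "wb") as out_file:
            shutil.copyfileobj(response, out_file)


url = whole_item_url("nla.obj-248742150", 80, "pdf")
print(url)

# Uncomment to fetch the PDF (makes a network request)
# save_download(url, "gold-finder.pdf")
```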

Zip files can be big!

By setting downloadOption to zip you can download all the digitised images that make up a resource in a single zip file. But beware! Pages of digitised books or journals can weigh in at a few megabytes each, so zipping them all together can create one very large file. The gold finder of Australia only has 80 pages, but a zip file containing every page ends up at 282 MB! If you want to download all the images from a collection of resources, it’s probably best to take a more patient approach and download each image individually.

25.4.4. Download selected pages

See also

The GLAM Workbench notebook Get covers (or any other pages) from a digitised journal in Trove uses this method to download the cover from every issue of a digitised periodical.

You don’t have to ask for everything at once. By adjusting the firstPage and lastPage values, you can download a specific range of pages. If you wanted the first ten pages of The gold finder of Australia as zipped images, you’d set firstPage to 0 and lastPage to 9:

https://nla.gov.au/nla.obj-248742150/download?downloadOption=zip&firstPage=0&lastPage=9

To request a single page, set firstPage and lastPage to the same value. For example, if you wanted an image of the cover, you’d set both firstPage and lastPage to 0:

https://nla.gov.au/nla.obj-248742150/download?downloadOption=zip&firstPage=0&lastPage=0
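The single-page trick also gives you a way to split a large request into small ones. The sketch below generates one URL per page by setting firstPage and lastPage to the same value. The generator name single_page_urls is hypothetical.

```python
def single_page_urls(obj_id, first_page, last_page, download_option="zip"):
    """Yield one download URL per page in an inclusive, zero-based range."""
    for page in range(first_page, last_page + 1):
        # Same value for firstPage and lastPage requests a single page
        yield (
            f"https://nla.gov.au/{obj_id}/download"
            f"?downloadOption={download_option}&firstPage={page}&lastPage={page}"
        )


# URLs for the cover and the two images that follow it
for url in single_page_urls("nla.obj-248742150", 0, 2):
    print(url)
```

If you loop over these URLs to download files, it’s polite to pause between requests.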

25.4.5. How do you know the number of pages?

The value of this method lies in the fact that it can be used programmatically to download large collections of digitised items without any manual intervention. But to do that you need some way of finding out the number of pages or images in each item so you can set the lastPage value. This information isn’t included in the metadata from the API, and so requires a little extra effort to extract.

The approach varies depending on the type of resource.

If you’re downloading a resource that is made up of pages, such as a book or periodical, you need to:

  • extract the metadata embedded in the digitised book or journal viewer

  • get the length of the page array from the metadata
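Here’s a rough sketch of those two steps. It rests on assumptions that aren’t documented on this page: that the viewer’s HTML embeds the metadata as a JSON object assigned to a JavaScript variable named work, and that the page array sits under children and then page. Check both against HOW TO: Extract additional metadata from the digitised resource viewer before relying on them. The sample HTML and its identifiers are invented so the example runs without a network request.

```python
import json
import re

# ASSUMPTION: the viewer embeds metadata as
#   var work = JSON.parse(JSON.stringify({...}));
WORK_PATTERN = re.compile(r"var work = JSON\.parse\(JSON\.stringify\((\{.*\})\)\)")


def count_pages(viewer_html):
    """Return the length of the embedded page array, or 0 if it isn't found."""
    match = WORK_PATTERN.search(viewer_html)
    if match is None:
        return 0
    work = json.loads(match.group(1))
    # ASSUMPTION: pages are listed under work["children"]["page"]
    return len(work.get("children", {}).get("page", []))


# Invented stand-in for the HTML of a digitised book viewer
sample_html = """
<script>
var work = JSON.parse(JSON.stringify({"pid": "nla.obj-0000", "children": {"page": [{"pid": "a"}, {"pid": "b"}, {"pid": "c"}]}}));
</script>
"""

num_pages = count_pages(sample_html)
print(num_pages, "pages, so lastPage =", num_pages - 1)
```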

If you’re downloading a collection of images, manuscripts, or maps, you can look for the value of maxNumOfChildDownloads embedded in the HTML of the digitised image viewer. Here’s an example that finds the number of photographs in the collection of B.A.N.Z. Antarctic Research Expedition photographs (nla.obj-141170265).

import requests
import re

obj_id = "nla.obj-141170265"

# Get the page from the digitised image viewer
response = requests.get(f"https://nla.gov.au/{obj_id}")

# Find lines where this variable is referenced
matches = re.findall(r"maxNumOfChildDownloads = (\d+)", response.text)

# This variable can be referenced multiple times - we want to get the highest value
num_items = max(int(m) for m in matches)
print(num_items)
# Output: 151
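Once you have the count, you can plug it into a download URL by subtracting 1 to get lastPage. The sketch below wraps the steps above in two functions with hypothetical names. It runs against an invented HTML snippet rather than the live viewer, and it assumes the zip option is available for this type of collection.

```python
import re


def count_child_downloads(viewer_html):
    """Return the highest maxNumOfChildDownloads value found in the HTML."""
    matches = re.findall(r"maxNumOfChildDownloads = (\d+)", viewer_html)
    if not matches:
        raise ValueError("maxNumOfChildDownloads not found in viewer HTML")
    return max(int(m) for m in matches)


def collection_zip_url(obj_id, num_items):
    """Build a zip download URL covering every item in a collection."""
    # Ranges start at 0, so the last item is the count minus 1
    return (
        f"https://nla.gov.au/{obj_id}/download"
        f"?downloadOption=zip&firstPage=0&lastPage={num_items - 1}"
    )


# Invented stand-in for the viewer HTML, with the variable set twice
sample_html = "var maxNumOfChildDownloads = 20;\nvar maxNumOfChildDownloads = 151;"

num_items = count_child_downloads(sample_html)
print(collection_zip_url("nla.obj-141170265", num_items))
```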

25.4.6. Limitations and alternatives

This method works really well if you want to get all the OCRd text out of books or periodical issues. It’s also handy if you want to download selected pages or images, such as the front covers of periodicals.

However, if your aim is to download all the images from a collection of items then there are two potential problems. The first is that the download link doesn’t always provide images at their highest available resolution. This particularly seems to be the case with manuscript and photographic collections.

The other problem is that the zip files can become very large if you request collections that contain a significant number of pages or images. This makes them slow to download and can cause errors. Of course you also have to add in a step to unzip the zips!
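The unzipping step is easy to automate with Python’s zipfile module. This sketch builds a tiny zip in memory so it runs without downloading anything. The file name page-0.jpg is invented, and the names inside a real Trove zip may follow a different pattern.

```python
import io
import tempfile
import zipfile


def extract_zip_bytes(zip_bytes, dest_dir):
    """Unpack a downloaded zip into dest_dir and return the member names."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


# Build a stand-in zip in memory (the contents are not a real image)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("page-0.jpg", b"not a real image")

with tempfile.TemporaryDirectory() as tmp_dir:
    names = extract_zip_bytes(buffer.getvalue(), tmp_dir)
    print(names)
```

In practice you’d pass in the bytes of a downloaded zip file and a directory you want to keep, rather than a temporary one.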

If you’re downloading lots of images, or the quality of the images is important to you, I’d suggest you try the alternative approach of downloading images one at a time. This method is fully documented in HOW TO: Create download links for images using nla.obj identifiers.

By Tim Sherratt

© Copyright 2024 Australian Research Data Commons.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Version: v1.0-beta.16 (07 November 2024)

The Trove Data Guide received investment from the Australian Research Data Commons (ARDC). The ARDC is funded by the National Collaborative Research Infrastructure Strategy (NCRIS).