11. Downloading data from the Trove web interface

11.1. Downloading images, PDFs, text, and audio

Items that have been digitised by the NLA and made available through one of Trove’s digitised item viewers can usually be downloaded in a variety of formats. This includes newspapers, books, journals, images, maps, manuscripts, and oral histories.

The official Trove Help includes a page on Downloading that describes the options available in the various Trove categories for downloading images, PDFs, text, and audio. Different formats have different viewers, but generally speaking you just need to find the download tab in the sidebar, select a format, and click the button.

../_images/web-download-example.png

Fig. 11.1 Example of the download tab in the digitised magazines and newsletters viewer

Items that are arranged in hierarchical structures, such as some images, maps, and manuscripts, might have an option to download a ‘collection’. If so, a Download button will appear on the collection page. This option isn’t always available, and there can be limits on how many items in a collection you can download at once. To find the ‘collection’ page, try using the breadcrumb links to move up the record hierarchy.

While many of the same download options are available across different Trove categories, they don’t always mean the same thing! For example, the ‘text’ you get from newspapers is not the same as the ‘text’ you get from books. This table summarises what’s available and describes some of these oddities.

Table 11.1 Available download options

| Trove category | Item type | Download option | Note |
| --- | --- | --- | --- |
| Newspapers & gazettes | article | image | The ‘image’ option actually delivers an HTML page with embedded images. Long articles will often be sliced up in unfortunate ways to ‘fit’ an A4 page. To get the images themselves you need to extract them from the HTML and try to reassemble them. |
| Newspapers & gazettes | article | PDF | |
| Newspapers & gazettes | article | text | The ‘text’ option actually delivers an HTML page that includes the publication details as well as the article text. If you just want the plain OCRd text you’d need to extract it from the HTML and remove the publication details. |
| Newspapers & gazettes | page | PDF | |
| Newspapers & gazettes | issue | PDF | |
| Books & libraries | single page, range of pages, or complete book | image | Images (single or multiple) are packaged in a zip file with an additional page of copyright information. |
| Books & libraries | single page, range of pages, or complete book | PDF | |
| Books & libraries | single page, range of pages, or complete book | text | Unlike the newspapers, this is plain text with no formatting. |
| Magazines & newsletters | single page, range of pages, or complete issue | PDF | |
| Magazines & newsletters | single page, range of pages, or complete issue | text | Unlike the newspapers, this is plain text with no formatting. |
| Images, maps, & artefacts | single item, range of items | image | Images are packaged in a zip file with an additional page of copyright information. As well as the standard JPEG format, some maps include an option to download high-resolution TIFF files. |
| Images, maps, & artefacts | single item, range of items | PDF | |
| Images, maps, & artefacts | collection | image | A maximum of 20 images can be downloaded at one time. |
| Images, maps, & artefacts | collection | PDF | A maximum of 20 images can be included in the PDF. |
| Diaries, letters & archives | single page, range of pages | image | Images are packaged in a zip file with an additional page of copyright information. |
| Diaries, letters & archives | single page, range of pages | PDF | |
| Diaries, letters & archives | collection | image | Images are packaged in a zip file with an additional page of copyright information. Depending on where you are in the collection hierarchy, you might only get the top-level image. |
| Diaries, letters & archives | collection | PDF | |
| Music, audio & video | oral history interview transcript | text | |
| Music, audio & video | oral history interview transcript | PDF | |
| Music, audio & video | oral history interview | audio recording | MP3 files available at a variety of bitrates (the higher the bitrate, the larger the file), eg: 48kbps, 128kbps, and 256kbps |
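The newspaper ‘image’ and ‘text’ downloads are both HTML pages, so a first step in working with them is to pull out the embedded image references and the visible text. Here is a minimal sketch using only Python’s standard library. It assumes nothing about how Trove structures these pages beyond the presence of ordinary `<img>` tags and text nodes; the class name, function name, and sample HTML are all hypothetical. You would still need to inspect the output to work out which text chunks are publication details and how the image slices fit together.

```python
from html.parser import HTMLParser


class DownloadPageParser(HTMLParser):
    """Collect <img> sources and visible text chunks from an HTML page."""

    def __init__(self):
        super().__init__()
        self.image_sources = []
        self.text_chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_sources.append(src)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside <script> or <style>
        if not self._skip_depth and data.strip():
            self.text_chunks.append(data.strip())


def parse_download(html):
    """Return (image sources, text chunks) in document order."""
    parser = DownloadPageParser()
    parser.feed(html)
    parser.close()
    return parser.image_sources, parser.text_chunks


# Hypothetical sample, NOT the real structure of a Trove download
sample_html = """<html><head><style>p {margin: 0}</style></head>
<body><h1>Hypothetical Gazette, 1 January 1900, page 1</h1>
<img src="part-1.jpg"><img src="part-2.jpg">
<p>First paragraph of the article.</p></body></html>"""

images, text = parse_download(sample_html)
print(images)
print(text)
```

The image sources come back in document order, which is a reasonable starting point for reassembling a sliced-up article, though the slices may still need to be checked by eye.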

Some download options you might expect to find are not actually available. These are listed in the table below.

Table 11.2 Missing download formats

| Trove category | Item type | Download format | Note |
| --- | --- | --- | --- |
| Newspapers & gazettes | page | image | There’s no option to download a page as an image, just a page image embedded in a PDF. |
| Magazines & newsletters | article | any | There’s no option to download individual articles as images, PDFs, or text. While you can search for individual articles, the viewer only presents pages. This is different to the newspapers, where the viewer presents individual articles. |

What about image resolutions?

One confusing, and often frustrating, aspect of image downloads is their resolution (or size). You can use the Trove image viewer to zoom in close to many photographs and manuscripts, enabling you to pick up fine details. But if you download the same image, you could find the resolution is much lower, which limits how you can use the downloaded image. The available resolutions vary across categories and formats, and you really don’t know what you’ll get until you download it. Many manuscripts, in particular, seem to have low-resolution downloads, which doesn’t help you much when you’re trying to decipher someone’s handwriting! But never fear, there are a few hacks and workarounds you can try to get higher-resolution versions.

11.2. Download metadata using citations and BibTeX

BibTeX is a file format used to save structured information about references, and is used by many tools to manage citations and build bibliographies. You can download item metadata in BibTeX format using Trove’s ‘Citation’ tab.

In the main search interface, the ‘Citation’ tab includes a BibTeX option. You can copy or download the BibTeX record.

../_images/trove-citation-tab.png

Fig. 11.2 Example of the Citation tab with the BibTeX option selected

In the digitised newspaper viewer, the ‘Citation’ tab includes a button to download a BibTeX record.

../_images/newspaper-citation-options.png

Fig. 11.3 Options for downloading newspaper citations

The Trove viewers for digitised books, journals, images, and maps don’t include a BibTeX option.

This is a simple way of capturing metadata in a structured format, but the BibTeX records don’t always include the full range of metadata available in Trove.
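To see which fields a downloaded record actually contains, you can read it into a dictionary and check it against the fields you need. The sketch below is a deliberately simple regex-based reader, not a full BibTeX parser: it assumes one entry per string and single-line values in braces or double quotes. The sample record and its field names are made up for illustration and don’t represent what Trove delivers.

```python
import re

# One "name = {value}" or 'name = "value"' pair per line
FIELD_PATTERN = re.compile(
    r'(\w+)\s*=\s*(?:\{(.*?)\}|"(.*?)")\s*,?\s*$', re.MULTILINE
)


def parse_bibtex_record(record):
    """Return the entry type, citation key, and fields of one BibTeX entry."""
    header = re.search(r"@(\w+)\s*\{\s*([^,\s]+)\s*,", record)
    if header is None:
        raise ValueError("No BibTeX entry header found")
    fields = {
        "entry_type": header.group(1).lower(),
        "citation_key": header.group(2),
    }
    for match in FIELD_PATTERN.finditer(record):
        name = match.group(1).lower()
        value = match.group(2) if match.group(2) is not None else match.group(3)
        fields[name] = value.strip()
    return fields


# Hypothetical record for illustration only
sample_record = """@article{example123,
  author = {A. Writer},
  title = {An Example Article},
  journal = {Hypothetical Gazette},
  year = {1900},
  url = {https://example.org/article/123}
}"""

fields = parse_bibtex_record(sample_record)

# List any fields you care about that the record does not supply
wanted = ["author", "title", "year", "pages"]
missing = [name for name in wanted if name not in fields]
print(fields)
print("Missing:", missing)
```

For anything beyond a quick check, such as multi-line values or files with many entries, a dedicated BibTeX library would be a safer choice.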

11.3. Downloading lists

Trove Lists include a button to ‘Download this List’. Once you click the button you can choose your desired output format: CSV, JSON, XML, or a list of citations.

The metadata provided by the List download option is quite limited. In particular, newspaper articles are missing information about titles, and the dates are not formatted according to the ISO 8601 standard. You can retrieve more and better metadata from Lists by using the Trove API.
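If you do work with a downloaded List, one way to deal with the date problem is to try a handful of likely formats and convert whatever matches to ISO 8601 (`YYYY-MM-DD`). This is only a sketch: the candidate formats, the CSV column names, and the sample rows are all assumptions, so check them against the file you actually download. Values that match none of the formats are returned as `None` rather than guessed at.

```python
import csv
import io
from datetime import datetime

# Assumed formats; adjust to match the dates in your own download
CANDIDATE_FORMATS = ("%Y-%m-%d", "%d %B %Y", "%d %b %Y", "%d/%m/%Y")


def to_iso_date(value):
    """Return the date as YYYY-MM-DD, or None if no format matches."""
    cleaned = value.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None


# Hypothetical CSV content; real column names may differ
sample_csv = "title,date\nExample article,7 January 1925\nAnother article,unknown\n"

rows = list(csv.DictReader(io.StringIO(sample_csv)))
for row in rows:
    row["iso_date"] = to_iso_date(row["date"])
    print(row["title"], row["iso_date"])
```

Month names are matched using the current locale, which is English by default in Python unless you change it.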

11.4. Bulk export

Trove’s new Bulk Export feature makes it easy to save the results of a search. But it has a number of limitations:

  • Number of results limited to one million

  • Version information is not included with work records

  • Text is not included with newspaper articles

For many research uses you’ll be better off using the Trove API or a tool like the Trove Newspaper Harvester.
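As a rough illustration of the API alternative, the sketch below builds (but doesn’t send) a newspaper search request that asks for article text. The endpoint, parameter names, and header reflect my understanding of version 3 of the Trove API and should be checked against the official API documentation; the function name, query, and API key are placeholders.

```python
from urllib.parse import urlencode

# Assumed Trove API v3 search endpoint; verify against the API documentation
API_ENDPOINT = "https://api.trove.nla.gov.au/v3/result"


def build_search_request(query, api_key, cursor="*", page_size=100):
    """Return the URL and headers for one page of newspaper search results."""
    params = {
        "category": "newspaper",
        "q": query,
        "encoding": "json",
        "n": page_size,  # results per page
        "s": cursor,  # "*" for the first page, then the cursor from each response
        "bulkHarvest": "true",  # assumed to give a stable order for harvesting
        "include": "articleText",  # assumed to add the article text to each record
    }
    url = f"{API_ENDPOINT}?{urlencode(params)}"
    headers = {"X-API-KEY": api_key}
    return url, headers


url, headers = build_search_request("wragge weather", "YOUR_API_KEY")
print(url)
```

To harvest a complete result set you would send this request, save the records, and repeat with the cursor value supplied in each response until none is returned. The Trove Newspaper Harvester handles that loop for you.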