21.5. HOW TO: Scrape metadata from the Trove audio player

Trove’s audio player displays some metadata that isn’t included in the API records. This can include:

  • a descriptive note with the date and place of the recording

  • a note about the availability of a transcript

  • roles of the people involved – i.e. ‘interviewer’ and ‘interviewee’

The scrape_metadata() function below retrieves the audio player page from an oral history’s digital object url and uses BeautifulSoup to find and extract the metadata.

For example, the ‘Listen’ link from this oral history record goes to the url https://nla.gov.au/nla.obj-220905784 which opens the audio player. In the API, the digital object url will be in the identifier field of the work or version, with a linktype of “fulltext”.

    "type": "url",
    "linktype": "fulltext",
    "value": "https://nla.gov.au/nla.obj-220905784"
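As a minimal sketch of how that url might be pulled out of a record, the snippet below filters the identifier field for entries with a linktype of "fulltext". The function name get_fulltext_urls() and the cut-down sample_record are hypothetical and only for illustration; a real API record will contain many more fields, and the check for a single dictionary is a defensive assumption rather than documented behaviour.

```python
def get_fulltext_urls(record):
    """Return the values of all 'fulltext' identifiers in a work or version record."""
    identifiers = record.get("identifier", [])
    # Defensive assumption: tolerate a single identifier supplied as a dict
    if isinstance(identifiers, dict):
        identifiers = [identifiers]
    return [
        identifier["value"]
        for identifier in identifiers
        if identifier.get("linktype") == "fulltext" and "value" in identifier
    ]


# Hypothetical, cut-down record for illustration only
sample_record = {
    "identifier": [
        {
            "type": "url",
            "linktype": "fulltext",
            "value": "https://nla.gov.au/nla.obj-220905784",
        },
        {
            "type": "url",
            "linktype": "thumbnail",
            "value": "https://example.org/thumbnail.jpg",
        },
    ]
}

print(get_fulltext_urls(sample_record))
```

Each url returned this way could then be passed to scrape_metadata().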

Try it!

Just give the scrape_metadata() function the digital object url and it will return the metadata as a Python dictionary with the following fields:

  • catalogue_url – url of the NLA catalogue record for this oral history

  • identifier – NLA identifier for this oral history

  • description

  • extent

  • notes

  • contributor

The fields description, extent, and contributor can have multiple values and are returned as lists.

import re

import requests
from bs4 import BeautifulSoup

def scrape_metadata(url):
    """Scrape metadata about an oral history from the audio player page."""
    response = requests.get(url)
    # If this is a collection page you'll get a 404
    if response.status_code != 200:
        return {}
    # Name the parser explicitly to avoid BeautifulSoup's parser-guessing warning
    soup = BeautifulSoup(response.text, "html.parser")
    # Get the metadata container
    details = soup.find("div", class_="workdetails")
    if not details:
        return {}
    # Get link to NLA catalogue
    catalogue = details.find("section", class_="catalogue")
    catalogue_link = catalogue.find("a", href=re.compile("nla.cat-vn"))["href"]
    # Get oral history id
    oral_history_id = ""
    for string in catalogue.stripped_strings:
        if string.startswith("ORAL TRC"):
            oral_history_id = string
    # Get extent, description and notes
    extent = []
    description = []
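    for section in details.find_all("section", class_="extent"):
        # Sections starting with "Recorded" hold the date and place of recording
        if section.string.startswith("Recorded"):
            description.append(section.string)
        else:
            extent.append(section.string)
    # Not every record has a notes section
    try:
        notes = details.find("section", class_="notes").string
    except AttributeError:
        notes = ""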
    # Get contributors and role
    contributors = []
    for div in details.find_all("div", class_="contributor"):
        role = div.find("span", class_="role")
        if role:
            contributors.append(f"{list(div.stripped_strings)[0]} {role.string}")
    return {
        "catalogue_url": catalogue_link,
        "identifier": oral_history_id,
        "description": description,
        "extent": extent,
        "notes": notes,
        "contributor": contributors,
    }


scrape_metadata("https://nla.gov.au/nla.obj-220905784")

{'catalogue_url': 'http://nla.gov.au/nla.cat-vn4979244',
 'identifier': 'ORAL TRC 6233',
 'description': ['Recorded on 8 October 2010 in Sydney, N.S.W.'],
 'extent': ['2 sound files (ca. 167 min.)'],
 'notes': "Professor Bryan Gaensler is a Professor of Physics at the University of Sydney. In 2011 he will not only take up the position of Director of the ARC Centre of Excellence for All-sky Astrophysics but will also be an Australian Research Council Laureate Fellow. He has previously held positions at MIT, the Smithsonian Institute and Harvard University. He has made a number of ground breaking discoveries in fields such as, astrophysical magnetic fields, supernova explosions, the Magellanic Clouds, astrophysical shocks and the structure of the Milky Way. He hopes to use the Australian Square Kilometre Array Pathfinder to carry out the Polarisation Sky Survey of the Universe's Magnetism which will transform our understanding of magnetic fields in the cosmos.",
 'contributor': ['Bhathal, Ragbir (Ragbir Singh) (interviewer)',
  'Gaensler, Bryan, 1973- (interviewee)']}
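
Once you have the dictionary, you might want to separate each contributor's name from their role, or join the list fields so a record fits in a single row of a CSV file. The sketch below shows one possible approach using values copied from the output above. The helper names split_contributor() and flatten_metadata() and the "|" separator are assumptions made for this example, not part of the original function, and the role pattern assumes the role is always the final bracketed term.

```python
import re


def split_contributor(contributor):
    """Split 'Name (role)' into (name, role), using the final bracketed term as the role."""
    match = re.match(r"^(.*)\s+\(([^()]*)\)$", contributor)
    if match:
        return match.group(1), match.group(2)
    # No trailing bracketed term, so return the whole string as the name
    return contributor, ""


def flatten_metadata(metadata, separator="|"):
    """Join list values into single strings so the record fits in one CSV row."""
    return {
        key: separator.join(value) if isinstance(value, list) else value
        for key, value in metadata.items()
    }


# Values copied from the example output above (other fields omitted)
sample = {
    "identifier": "ORAL TRC 6233",
    "extent": ["2 sound files (ca. 167 min.)"],
    "contributor": [
        "Bhathal, Ragbir (Ragbir Singh) (interviewer)",
        "Gaensler, Bryan, 1973- (interviewee)",
    ],
}

for contributor in sample["contributor"]:
    print(split_contributor(contributor))

print(flatten_metadata(sample)["contributor"])
```

Matching only the last bracketed term keeps qualifiers such as "(Ragbir Singh)" as part of the name rather than treating them as a role.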