25.7. HOW TO: Scrape metadata from the Trove audio player
Trove’s audio player displays some metadata that isn’t included in the API records. This can include:
a descriptive note with the date and place of the recording
a note about the availability of a transcript
roles of the people involved, such as ‘interviewer’ and ‘interviewee’
The scrape_metadata() function below retrieves the audio player page from an oral history’s digital object url and uses BeautifulSoup to find and extract the metadata.
For example, the ‘Listen’ link from this oral history record goes to the url https://nla.gov.au/nla.obj-220905784 which opens the audio player. In the API, the digital object url will be in the identifier field of the work or version, with a linktype of “fulltext”.
"type": "url",
"linktype": "fulltext",
"value": "https://nla.gov.au/nla.obj-220905784"
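The lookup described above might be sketched as follows. This is only an illustration: it assumes you already have a work or version record parsed from the API's JSON into a Python dictionary, and that its identifier field holds a list of dictionaries shaped like the fragment above. The function name get_fulltext_urls is hypothetical and not part of any Trove library.

```python
def get_fulltext_urls(record):
    """
    Return the values of all identifiers whose linktype is "fulltext".
    Assumes `record` is a dict parsed from the API's JSON response.
    """
    urls = []
    for identifier in record.get("identifier", []):
        if identifier.get("linktype") == "fulltext" and identifier.get("value"):
            urls.append(identifier["value"])
    return urls


# Example record containing only the field used here
record = {
    "identifier": [
        {
            "type": "url",
            "linktype": "fulltext",
            "value": "https://nla.gov.au/nla.obj-220905784",
        }
    ]
}
print(get_fulltext_urls(record))
```

Returning a list rather than a single value allows for records that have no fulltext link, or more than one.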
Just give the scrape_metadata() function the digital object url and it will return the metadata as a Python dictionary with the following fields:
catalogue_url – url of the NLA catalogue record for this oral history
identifier – NLA identifier for this oral history
description
extent
notes
contributor
The fields description, extent, and contributor can have multiple values and are returned as lists.
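Because three of the fields are lists, you may want to flatten them before saving results to a CSV file or spreadsheet. The following is a minimal sketch, not part of the original function: the helper name flatten_metadata and the " | " separator are assumptions chosen for illustration, and the sample dictionary simply follows the shape described above with the notes left empty for brevity.

```python
import csv
import io

# The fields that scrape_metadata() returns as lists
LIST_FIELDS = ["description", "extent", "contributor"]


def flatten_metadata(metadata, separator=" | "):
    """
    Return a copy of the metadata with list fields joined into single strings.
    """
    flat = dict(metadata)
    for field in LIST_FIELDS:
        values = metadata.get(field, [])
        # Skip empty values so they don't produce stray separators
        flat[field] = separator.join(str(value) for value in values if value)
    return flat


# Sample dictionary in the shape described above
sample = {
    "catalogue_url": "http://nla.gov.au/nla.cat-vn4979244",
    "identifier": "ORAL TRC 6233",
    "description": ["Recorded on 8 October 2010 in Sydney, N.S.W."],
    "extent": ["2 sound files (ca. 167 min.)"],
    "notes": "",
    "contributor": [
        "Bhathal, Ragbir (Ragbir Singh) (interviewer)",
        "Gaensler, Bryan, 1973- (interviewee)",
    ],
}

row = flatten_metadata(sample)

# Write the flattened row as CSV to an in-memory buffer
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=list(row.keys()))
writer.writeheader()
writer.writerow(row)
print(output.getvalue())
```

Pick a separator that is unlikely to appear in the values themselves, so the lists can be split apart again later.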
import re

import requests
from bs4 import BeautifulSoup


def scrape_metadata(url):
    """
    Scrape metadata about an oral history from the audio player page.
    """
    response = requests.get(url)
    # If this is a collection page you'll get a 404
    if response.status_code != 200:
        return {}
    # Name the parser explicitly to avoid a warning and get consistent results
    soup = BeautifulSoup(response.text, "html.parser")
    # Get the metadata container
    details = soup.find("div", class_="workdetails")
    if not details:
        return {}
    # Get link to NLA catalogue
    catalogue = details.find("section", class_="catalogue")
    catalogue_link = catalogue.find("a", href=re.compile(r"nla\.cat-vn"))["href"]
    # Get oral history id
    oral_history_id = ""
    for string in catalogue.stripped_strings:
        if string.startswith("ORAL TRC"):
            oral_history_id = string
    # Get extent, description and notes
    extent = []
    description = []
    for section in details.find_all("section", class_="extent"):
        # get_text() avoids an error when a section contains nested tags
        text = section.get_text(strip=True)
        if text.startswith("Recorded"):
            description.append(text)
        else:
            extent.append(text)
    try:
        notes = details.find("section", class_="notes").string
    except AttributeError:
        notes = ""
    # Get contributors and role
    contributors = []
    for div in details.find_all("div", class_="contributor"):
        role = div.find("span", class_="role")
        if role:
            contributors.append(f"{list(div.stripped_strings)[0]} {role.string}")
        else:
            contributors.append(f"{list(div.stripped_strings)[0]}")
    return {
        "catalogue_url": catalogue_link,
        "identifier": oral_history_id,
        "description": description,
        "extent": extent,
        "notes": notes,
        "contributor": contributors,
    }

scrape_metadata("https://nla.gov.au/nla.obj-220905784")
{'catalogue_url': 'http://nla.gov.au/nla.cat-vn4979244',
'identifier': 'ORAL TRC 6233',
'description': ['Recorded on 8 October 2010 in Sydney, N.S.W.'],
'extent': ['2 sound files (ca. 167 min.)'],
'notes': "Professor Bryan Gaensler is a Professor of Physics at the University of Sydney. In 2011 he will not only take up the position of Director of the ARC Centre of Excellence for All-sky Astrophysics but will also be an Australian Research Council Laureate Fellow. He has previously held positions at MIT, the Smithsonian Institute and Harvard University. He has made a number of ground breaking discoveries in fields such as, astrophysical magnetic fields, supernova explosions, the Magellanic Clouds, astrophysical shocks and the structure of the Milky Way. He hopes to use the Australian Square Kilometre Array Pathfinder to carry out the Polarisation Sky Survey of the Universe's Magnetism which will transform our understanding of magnetic fields in the cosmos.",
'contributor': ['Bhathal, Ragbir (Ragbir Singh) (interviewer)',
'Gaensler, Bryan, 1973- (interviewee)']}
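In the output above, each contributor's role appears in parentheses at the end of the string. If you want names and roles as separate values, something like the following sketch could work. The helper name split_contributor is hypothetical, and the approach assumes the role is always the final parenthesised chunk. A contributor with no role whose name ends in parentheses would be split incorrectly, so check the results against your own data.

```python
import re

# Matches a trailing parenthesised role such as "(interviewer)"
ROLE_PATTERN = re.compile(r"^(?P<name>.*\S)\s+\((?P<role>[^()]+)\)$")


def split_contributor(contributor):
    """
    Split a contributor string into a name and a role.
    The role is an empty string if no trailing parenthesised chunk is found.
    """
    match = ROLE_PATTERN.match(contributor)
    if match:
        return {"name": match.group("name"), "role": match.group("role")}
    return {"name": contributor, "role": ""}


contributors = [
    "Bhathal, Ragbir (Ragbir Singh) (interviewer)",
    "Gaensler, Bryan, 1973- (interviewee)",
]
for contributor in contributors:
    print(split_contributor(contributor))
```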