7. Understanding search

Learn about some of the limits, complexities, and challenges of using search in Trove.

7.1. The limits of search

Search is such a fundamental part of our online lives that we rarely think about it. We just type words into the box, press Enter, and start wading through the results. But search indexes have biases, they embed politics, and they make assumptions about who we are and what we want. This is as true of Trove as of any search tool.

In particular, search interfaces like Trove are not good at communicating their own limits. They tell us what they’ve found, but not what is findable. This has important implications for researchers trying to interpret their search results. What does it mean if your Trove query returns nothing? Have the relevant resources been preserved, catalogued, digitised, indexed? Is there an issue with copyright? Are you searching the full text or just metadata? Is the metadata complete and consistent?

On the other hand, search interfaces can sometimes be too helpful, returning results of limited relevance just in case they might be of interest. Trove builds a certain amount of fuzziness into its results for this reason. But if you’re using Trove to assemble and analyse large-scale datasets, you need to understand and control this fuzziness – to understand the limits of your dataset.

Trove’s digitised newspapers can help you find and track changes in language and terminology. For example, at some point Australians started referring to ‘the Great War’ as ‘World War I’, but when? You can explore this question by using QueryPic to visualise newspaper searches for ‘the Great War’, ‘World War I’, and ‘World War II’.

Fig. 7.1 Number of digitised newspaper articles matched by the queries ‘the great war’, ‘world war i’, and ‘world war ii’.
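
A comparison like this rests on counting matching articles per year for each query. The following is a minimal sketch of how such requests might be constructed. It assumes the Trove API v3 search endpoint and its `q`, `category`, `l-decade`, `facet`, `n`, and `encoding` parameters, and it assumes that the year facet needs a decade limit. The helper name `year_facet_params` is hypothetical, and wrapping each query in double quotes to force a phrase search is an illustrative choice, not necessarily how the charts above were made. The sketch only builds URLs; sending them would also require an API key.

```python
from urllib.parse import urlencode

# Assumption: the Trove API v3 search endpoint.
API_URL = "https://api.trove.nla.gov.au/v3/result"


def year_facet_params(query, decade):
    """Build parameters asking for per-year article counts within one decade.

    `decade` is the first three digits of the years of interest,
    for example 194 for the 1940s (assumed format of the l-decade limit).
    """
    return {
        "q": query,
        "category": "newspaper",
        "l-decade": str(decade),
        "facet": "year",  # ask for counts grouped by year
        "n": 0,  # no article records needed, only the facet counts
        "encoding": "json",
    }


# Assumption: phrases are quoted so that the words must appear together.
queries = ['"the great war"', '"world war i"', '"world war ii"']

for query in queries:
    # Print the request URL for the 1940s; repeat per decade to cover a longer span.
    print(f"{API_URL}?{urlencode(year_facet_params(query, 194))}")
```

Looping over decades and collecting the year facet from each response would give the raw numbers behind a chart of this kind.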

Not surprisingly, it seems that the shift in language occurred in the 1940s, but if you look closely at the ‘world war i’ results you notice something odd – there’s a noticeable peak from 1914 to 1919. This is not because people at the time envisaged that the current war would be the first in a series. It’s because Trove users have been adding the tag ‘World War I’ to articles from the period, and Trove searches user tags and comments by default.

Fig. 7.2 Number of digitised newspaper articles matched by the query ‘world war i’.

Trove’s default behaviour helps people find relevant articles by leveraging the knowledge of Trove users. But it also pollutes the data, mixing up the provenance of keyword matches. The main problem is that it’s just not obvious that this is happening: Trove does little to alert you to the scope of its searches. Nor is there an obvious workaround. You can exclude matches in tags and comments by adding the text: prefix to your keywords, but this also switches off word stemming, so the results are not exactly the same.
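
One way to see how much the default search scope affects a dataset is to run the same query with and without the text: prefix and compare the yearly counts. The sketch below is illustrative only: the helpers `text_only` and `count_gap` are hypothetical names, the prefix is assumed to apply to a quoted phrase in the same way as to a single keyword, and the yearly counts are assumed to have been collected already as dictionaries mapping year to number of articles.

```python
def text_only(query):
    """Prefix a keyword or quoted phrase with text: to skip tags and comments."""
    return f"text:{query}"


def count_gap(default_counts, text_counts):
    """Return, per year, default-search matches minus text:-only matches.

    The gap is only a rough indicator of matches that come from tags and
    comments, because text: also switches off word stemming.
    """
    return {
        year: total - text_counts.get(year, 0)
        for year, total in default_counts.items()
    }


default_query = '"world war i"'
restricted_query = text_only(default_query)

print(default_query)
print(restricted_query)
```

A large gap in the years 1914 to 1919 would be consistent with the tag effect described above, but because stemming changes too, the difference should be read as a hint and not as a precise measure.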

If you’re going to use Trove data in your research, you need to think beyond the search box, to question your own assumptions, and think critically about the systems that deliver the data to you.

7.2. Search is a research method

If you’re working in a physical archive you don’t expect to just submit a query to the person on the desk and have every relevant record delivered to you. You have to learn about the provenance of the records and the way they’re arranged and described. It takes time, but it’s a key part of the research process.

You should treat Trove the same way. Each search is an opportunity to learn a little more about the way Trove works. If you don’t find what you’re looking for, consult the documentation and experiment with alternative queries. Think about what works and what doesn’t. It’s an iterative process.

  • Understand the technical context — How does it work? Consult the documentation (and this Guide) to understand your options.

  • Be creative and strategic — Solve your puzzle by experimenting and looking for clues in the results.

  • Stay critical — Always assume that Trove isn’t telling you everything.

With so many rich cultural heritage collections now available online, constructing and interpreting search queries has become an important research method for researchers in the humanities, arts, and social sciences (HASS).

By Tim Sherratt

© Copyright 2024 Australian Research Data Commons.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Version: v1.0-beta.16 (07 November 2024)

The Trove Data Guide received investment from the Australian Research Data Commons (ARDC). The ARDC is funded by the National Collaborative Research Infrastructure Strategy (NCRIS).