We're going to look at our first dataset today. Specifically, this is the dataset that undergirds the public display of a specific library's collections.
This will let us build a conceptual model of the link between what people see when they explore an archive and the structured data that must be in place for them to find what they're looking for.
Let's look at how the Library of Congress presents its collections in human and machine-readable formats.
- Public-facing webpages
  - Open up the Library of Congress collections portal
  - What do we see here, structurally and descriptively?
  - Click through to a specific collection
  - What do we see here?
- API -- the same thing in a different format! Just add `?fo=json&at=results` to the URL
  - What do we see in the main collections view?
  - What do we see in the view of the specific collection?
Log into PythonAnywhere.
Open up a console: click the "New console" button and choose Python 3.10.
In the Python console, enter the following, one line at a time:
```python
import requests

r = requests.get('https://www.loc.gov/collections/?fo=json&at=results')
r.status_code  # 200 means the request succeeded

import json

j = json.loads(r.text)
print(json.dumps(j, indent=2))  # pretty-print the whole response
print(j.keys())                 # just the top-level keys
```
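Once the request succeeds, the same session can be extended to pull a few fields out of the response. A minimal sketch, assuming the `at=results` response contains a `results` list whose items carry a `title` field (check the pretty-printed output above to confirm):

```python
import requests

# Same request as above, but with the query string passed as a
# params dict instead of pasted into the URL.
r = requests.get("https://www.loc.gov/collections/",
                 params={"fo": "json", "at": "results"})
data = r.json()

# Print the first few collection titles to get a feel for the data.
for item in data["results"][:5]:
    print(item.get("title"))
```

Passing `params=` lets Requests handle the URL encoding for you, which matters once query values contain spaces or punctuation.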
What do we see here?
We'll continue our exercise solo.
Create a markdown file named with your username (as in the last round).
Pick one of the sub-collections, like "10th-16th Century Liturgical Chants".
Helpful references:
- The Python `requests` module
- The Python `json` module
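One convenience worth knowing from the Requests documentation: a response object can parse its own JSON body, so `r.json()` does the same work as `json.loads(r.text)` when the server returns valid JSON:

```python
import json
import requests

r = requests.get("https://www.loc.gov/collections/",
                 params={"fo": "json", "at": "results"})

# Response.json() parses the body for you -- equivalent to
# json.loads(r.text) for a well-formed JSON response.
assert r.json() == json.loads(r.text)
print(sorted(r.json().keys()))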
Poke around in your collection in each of:
- The JSON format
- The normal webpage view
- The command-line view
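To poke around from the command line, the same pattern works for a single sub-collection. The slug below is a guess at the URL form for the liturgical-chants collection, not a verified path -- copy the slug from your own browser's address bar instead:

```python
import requests

# NOTE: this slug is a hypothetical example; replace it with the
# actual path segment from your chosen collection's URL.
url = "https://www.loc.gov/collections/10th-16th-century-liturgical-chants/"
r = requests.get(url, params={"fo": "json", "at": "results"})

if r.status_code == 200:
    results = r.json().get("results", [])
    print(len(results), "items on the first page")
    for item in results[:3]:
        print("-", item.get("title"))
else:
    print("Request failed:", r.status_code)
```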
What is interesting about these collections and these views? What is difficult, or doesn't make sense? How might you want to make use of this data?