This repository contains the configuration for concertcloud.live, a website that helps you find concerts worldwide. The data is gathered with goskyr, a configurable command-line scraper written in Go. Twice a week, goskyr runs through a GitHub Action using the configuration files in config/ and updates the concert database.
Although other websites offer an overview of concerts and events (such as https://www.songkick.com/, https://www.jambase.com/, etc.), I found none that are complete and/or include smaller event locations that might only be known to locals. That's why I came up with this idea.
The idea of making this repository public is to enable others to contribute to the configuration file and hence expand the data available on concertcloud.live. I might know the concert locations in my home town well, but other people (you?) know other towns better. Anyone with a basic understanding of programming and HTML/CSS should be able to extend the configuration file so that the scraper includes more concert locations.
If you know a concert venue that you'd like to add to concertcloud.live, just fork this repository, add a new config snippet and open a pull request to merge the newly added snippet into the main branch. Have a look at the README of goskyr to familiarize yourself with the configuration syntax. Looking at the existing configurations might also give you some hints about how to write your own.
We'll demonstrate the process for the location "Konzerthaus Schüür" with the URL https://www.schuur.ch/programm/
- Install goskyr
Download the latest prebuilt binary from the releases page and unpack it into this directory. For other options, check out the install section of goskyr's README.
- Generate initial config snippet
Since v0.2.5, goskyr can automatically generate a config snippet for a given URL. We're going to rely on this feature to generate an initial version of the configuration. Additionally, we're going to use the machine learning feature, available since goskyr v0.4.0, to give us a first prediction of the field names. Unfortunately, goskyr still lacks the ability to generate the entire configuration, so we'll have to make some modifications afterwards.
  - Run goskyr with the -g and --model flags
In your terminal run
./goskyr -g https://www.schuur.ch/programm/ --model concert-20230509-mod
You'll be presented with a table that shows different fields from the website with corresponding examples. If you don't see the fields you'd expect from looking at the website, there are a couple of things you can try. The -m option sets the minimum number of occurrences of the extracted fields (default is 20). To filter out noise, only fields that occur at least this many times are added to the table. A list of items on a website may be shorter than that, though, in which case you may want to decrease the number accordingly. A second thing you could try is the -d flag, which renders JavaScript. Note that Chrome needs to be installed for this to work.
  - Select fields
With the ↑ and ↓ arrow keys you can navigate through the rows, and with the return key you can select or de-select a row (i.e. a field). If there are many fields to select from, the color coding can help: fields that are close to each other in the HTML tree get a similar color. In our example case we can ignore the colors. Once you have selected the fields that you want to extract from the website (in our example we select all fields except the one with values such as Party/Konzert), press the tab key to navigate to the button below the table and press return to generate the configuration. Note that the predicted field names are not always correct, but they should still help you finalize the configuration more quickly.
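To see why lowering the minimum-occurrence threshold (-m) can help on short listings, consider the following toy Python sketch of threshold filtering. It is not goskyr's implementation: the function name and the selector strings are made up for illustration, and only the default of 20 comes from the text above.

```python
from collections import Counter


def frequent_fields(candidates, min_occurrences=20):
    """Keep candidate fields seen at least min_occurrences times."""
    counts = Counter(candidates)
    return sorted(name for name, n in counts.items() if n >= min_occurrences)


# Hypothetical page with 12 events and 3 footer links (made-up selectors).
candidates = (
    ["div.event > h2"] * 12
    + ["div.event > span.date"] * 12
    + ["footer > a"] * 3
)

print(frequent_fields(candidates))      # nothing reaches the default of 20
print(frequent_fields(candidates, 10))  # the two event fields remain
```

With only 12 events on the page, the default threshold hides every field, while a threshold of 10 keeps the event fields and still drops the footer noise.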
- Update the generated configuration accordingly
First, to get a feeling for what data the previously generated configuration extracts, run ./goskyr. In our case, this should print a number of JSON items containing concert info. Now that you have an idea of what data is scraped with the current configuration, we need to adapt a few things for this specific use case.
  - Field names
Most of the fields should already be named correctly thanks to the machine learning. Still, you might have to do some renaming. In this specific case of 'Schüür' it sometimes happens that there are two title fields. Rename one to comment. If possible, you'll want the following dynamic fields:
    - title: mandatory
    - url: mandatory
    - comment: optional
    - genresText: optional, new. The API uses this text field to try to extract genres, so it should contain some genre names.
  - Date field
In the ideal case (as should be the case in our example) nothing needs to be done for the date field configuration. However, the automatic extraction algorithm sometimes makes mistakes, in which case we need to correct the configuration. To better understand the date extraction, read the section on "Key: type" under Dynamic Fields.
  - Additional fields
Right now we have only configured dynamic fields, i.e. fields whose values change depending on the scraped website. For concertcloud.live we also need a couple of static fields: city, location, type and sourceUrl. Check the config files in config/ to find out how to configure those. If the mandatory fields are not present, the scraper won't be able to send the data to the API that feeds the website. Also note that type should be a static field with 'concert' as value and city should be a static field with the city name in English.
  - Check the output
Run ./goskyr again and check whether the output makes sense. If it does, change the value of the name field to the name of the location and copy the config snippet to the corresponding file in the config directory. The example location has already been added to the corresponding configuration file, so you can see the final version there.
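Checking the output by eye gets tedious for long listings. The following Python sketch checks that every scraped item carries the fields named in this walkthrough. It assumes the JSON printed by goskyr is available as a list of objects keyed by field name; the helper name find_problems and the sample item are illustrative assumptions and not part of goskyr or concertcloud.live.

```python
import json

# Field names taken from the walkthrough above.
REQUIRED_FIELDS = ("title", "url", "city", "location", "type", "sourceUrl")


def find_problems(items):
    """Return a list of human-readable problems found in scraped items."""
    problems = []
    for index, item in enumerate(items):
        for field in REQUIRED_FIELDS:
            value = item.get(field)
            # Treat missing keys, null values and blank strings alike.
            if value is None or not str(value).strip():
                problems.append(f"item {index}: missing or empty '{field}'")
        type_value = item.get("type")
        if type_value and type_value != "concert":
            problems.append(f"item {index}: type should be 'concert'")
    return problems


# Made-up sample standing in for real scraper output (assumed shape).
sample = json.loads("""
[
  {"title": "Some Band", "url": "https://www.schuur.ch/programm/",
   "city": "Lucerne", "location": "Schüür", "type": "concert"}
]
""")

for problem in find_problems(sample):
    print(problem)
```

The sample item deliberately lacks sourceUrl, so the check reports it. To run the check on real data, save the scraper output to a file and pass the parsed list to find_problems instead of the sample.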
- Make a pull request
There are still a few limitations that will be addressed in the future. They include, but might not be limited to, the following:
- a field (type text & url) can only have one selector across all items of one concert location.
If you encounter any other bugs or limitations, feel free to open an issue in the goskyr repo.