Form for entering corpus definitions (first draft) #982
Labels
corpus
changes to corpus definitions or new corpora
enhancement
improvements to user functionality
major
major changes to functionality and/or the code base
needs-mockup
this suggestion could use a picture before it is implemented
The interface should have a menu to enter corpus definitions.
This form will be rather complicated. Our current concept is to realise this as a process with multiple steps, which guides the curator through the stages of adding a corpus. See the detailed proposal below.
Step 0: initialising a corpus
The curator opens the corpus creation menu. They are asked to provide a unique name for the corpus.
The backend adds the corpus to the SQL database and links it to the curator's user ID.
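A minimal sketch of what step 0 could look like on the backend, using Python's built-in sqlite3 module purely for illustration. The table layout, the column names and the `create_corpus` helper are assumptions made for this example and are not the actual I-analyzer schema.

```python
import sqlite3


def init_db(conn):
    # Hypothetical table: the UNIQUE constraint enforces unique corpus names.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS corpus (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE,
            owner_id INTEGER NOT NULL,
            availability TEXT NOT NULL DEFAULT 'private'
        )
        """
    )


def create_corpus(conn, name, owner_id):
    """Insert a new corpus and return its id, or None if the name is taken."""
    try:
        with conn:  # commits on success, rolls back on error
            cursor = conn.execute(
                "INSERT INTO corpus (name, owner_id) VALUES (?, ?)",
                (name, owner_id),
            )
        return cursor.lastrowid
    except sqlite3.IntegrityError:
        return None
```

Leaving the uniqueness check to the database, rather than querying first and inserting afterwards, avoids a race when two curators submit the same name at the same time.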
Step 1: defining data extraction
When the curator has confirmed their choice, they are taken to step 1 of the form, where they will fill in how the data should be extracted.
To start, they are required to upload an example CSV file. (The upload can have a size limit, since it really should be a small sample.)
The backend reads the provided CSV file and assembles a list of the available column names. The rest of the form now unlocks and the curator can start filling things in.
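Reading the column names could be sketched as follows with the standard csv module. The `list_columns` function, the 1 MB limit and the UTF-8 assumption are illustrative choices, not decisions made in this proposal.

```python
import csv
import io

MAX_SAMPLE_BYTES = 1_000_000  # hypothetical size limit for the example file


def list_columns(raw: bytes, max_bytes: int = MAX_SAMPLE_BYTES) -> list[str]:
    """Return the column names from the header row of an uploaded CSV sample."""
    if len(raw) > max_bytes:
        raise ValueError("Example file is too large; please upload a small sample.")
    # Assumes UTF-8; "utf-8-sig" also strips a byte order mark if present.
    text = raw.decode("utf-8-sig")
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader, None)
    if not header:
        raise ValueError("The file has no header row.")
    return [name.strip() for name in header]
```

For example, `list_columns(b"date,title,content\n2020-01-01,Hello,Text\n")` yields the three names `date`, `title` and `content`, which the form could then offer in the column dropdown.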
The most important step here is to add fields to the corpus. For each field, the curator fills in the name, type, and picks from a dropdown which column from the CSV should be used to extract the value. (Some advanced options may also be available here.)
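As a sketch of what a field definition from this form might look like once it reaches the backend, the example below validates a list of fields against the columns found in the example file. The `FieldDefinition` class, the set of field types and the error messages are all hypothetical; the real set of types and advanced options is still to be decided.

```python
from dataclasses import dataclass

# Hypothetical set of field types offered in the form.
FIELD_TYPES = {"text", "keyword", "date", "integer"}


@dataclass
class FieldDefinition:
    name: str    # name of the field in the corpus
    type: str    # one of FIELD_TYPES
    column: str  # CSV column the value is extracted from


def validate_fields(fields, columns):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    seen = set()
    for field in fields:
        if not field.name:
            errors.append("A field has no name.")
        elif field.name in seen:
            errors.append(f"Duplicate field name: {field.name}")
        seen.add(field.name)
        if field.type not in FIELD_TYPES:
            errors.append(f"Unknown type for {field.name}: {field.type}")
        if field.column not in columns:
            errors.append(f"Column not found for {field.name}: {field.column}")
    return errors
```

Returning all problems at once, rather than stopping at the first one, lets the form highlight every field that needs attention in a single round trip.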
Step 2: verifying extraction
The curator has confirmed their definition and moves to the next menu, where the backend will try to extract the data from the example file and the curator can see if everything works as expected.
The backend runs the documents() function of the corpus, resulting in the JSON data that would be sent to elasticsearch during indexing. Of course, any error messages or warnings will be shown here. The extracted JSON is sent to the frontend and shown to the curator. They can review the JSON and download it to run tests, if they want.
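To illustrate the kind of output this step would show, here is a simplified stand-in for the extraction. It is not the actual documents() implementation; the `extract_documents` function, the dictionary-based field format and the handling of the `integer` type are assumptions made for this sketch.

```python
import csv
import io
import json


def extract_documents(csv_text, fields):
    """Build one JSON-serialisable document per CSV row and collect problems.

    `fields` is a list of dicts with the hypothetical keys
    "name", "type" and "column".
    """
    documents, problems = [], []
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    for row_number, row in enumerate(reader, start=1):
        doc = {}
        for field in fields:
            name = field["name"]
            raw = row.get(field["column"])
            if raw is None or raw == "":
                problems.append(f"row {row_number}: no value for field {name}")
                doc[name] = None
            elif field["type"] == "integer":
                try:
                    doc[name] = int(raw)
                except ValueError:
                    problems.append(f"row {row_number}: {raw!r} is not an integer ({name})")
                    doc[name] = None
            else:
                doc[name] = raw
        documents.append(doc)
    return documents, problems


def preview_json(documents):
    # Pretty-printed JSON that could be shown to the curator or offered as a download.
    return json.dumps(documents, indent=2, ensure_ascii=False)
```

Collecting problems per row instead of aborting on the first failure means the curator sees every issue in the sample before going back to step 1.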
At this point, the curator can go back to step 1 and make some more edits to the form. If there were no errors and the curator is satisfied, they can go to step 3.
Step 3: indexing
The curator is happy with their choice and wants to index the corpus. There are several options for realising the uploading and indexing; these are described in a separate issue. It is possible to start with a draft version where this step is 'contact a developer and ask them to index the corpus'.
Step 4: interface settings
We now have an indexed corpus in I-analyzer. The corpus availability is still set to 'private', so the curator can see it but regular users can't.
The curator can view the corpus in the I-analyzer interface. They may have reason to clear the index and go back to step 2. If not, they use this step of the form to fine-tune the interface.
Here, the curator chooses interface settings that don't affect the elasticsearch index. For example, they can pick which fields are shown as filters, or choose the image and description for the corpus.
As will be explained in the form, the settings in this step can be changed at any time. Changing the settings from earlier steps would require re-indexing, but everything here can be changed and takes effect immediately.
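One way to make this distinction explicit in code is to label each setting as either index-related or interface-only and check which kind has changed. The setting names below are made up for illustration; which settings fall into which group is an open design question.

```python
# Hypothetical setting names; the real corpus definition may look different.
INDEX_SETTINGS = {"fields", "source_columns"}
INTERFACE_SETTINGS = {"description", "image", "filter_fields"}


def requires_reindex(old, new):
    """Return True if any changed setting affects the elasticsearch index."""
    changed = {
        key
        for key in old.keys() | new.keys()
        if old.get(key) != new.get(key)
    }
    return bool(changed & INDEX_SETTINGS)
```

With this check, a change to `description` alone would be saved directly, while a change to `fields` would send the curator back through extraction and indexing.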
Step 5: publish the corpus
The corpus is ready and the curator sets its availability to 'public'. The corpus is now available to regular users. It probably makes sense to let the curator also pick which user roles have access at this point.
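The overall flow of steps 0 to 5, including the points where the curator can go back, could be modelled as a small state machine. This is only a sketch: the `Step` names and the `can_move` helper are hypothetical, and the transition from publishing back to the interface settings is an assumption based on the statement that those settings can be changed at any time.

```python
from enum import Enum


class Step(Enum):
    INITIALISE = 0
    DEFINE_EXTRACTION = 1
    VERIFY_EXTRACTION = 2
    INDEX = 3
    INTERFACE_SETTINGS = 4
    PUBLISH = 5


# Transitions described in the proposal, including the two ways back.
ALLOWED = {
    Step.INITIALISE: {Step.DEFINE_EXTRACTION},
    Step.DEFINE_EXTRACTION: {Step.VERIFY_EXTRACTION},
    Step.VERIFY_EXTRACTION: {Step.DEFINE_EXTRACTION, Step.INDEX},
    Step.INDEX: {Step.INTERFACE_SETTINGS},
    # Going back to verification implies clearing the index first.
    Step.INTERFACE_SETTINGS: {Step.VERIFY_EXTRACTION, Step.PUBLISH},
    # Assumption: interface settings stay editable after publishing.
    Step.PUBLISH: {Step.INTERFACE_SETTINGS},
}


def can_move(current, target, extraction_ok=True):
    """Return True if the curator may move from `current` to `target`."""
    if target not in ALLOWED[current]:
        return False
    # Indexing is only allowed when the test extraction produced no errors.
    if current is Step.VERIFY_EXTRACTION and target is Step.INDEX:
        return extraction_ok
    return True
```

Keeping the transitions in a single table makes it easy for the frontend to decide which navigation buttons to enable at each stage.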