Form for entering corpus definitions (first draft) #982

lukavdplas · 2022-11-25T09:26:28Z

The interface should have a menu to enter corpus definitions.

This form will be rather complicated. Our current concept is to realise this as a process with multiple steps, which guides the curator through the stages of adding a corpus. See the detailed proposal below.

Step 0: initialising a corpus

The curator opens the corpus creation menu. They are asked to provide a unique name for the corpus.

The backend adds the corpus to the SQL database and links it to the curator's user ID.
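A rough sketch of what step 0 could look like on the backend, using the standard-library `sqlite3` module for illustration only. The table name, the column names and the `init_corpus` helper are all hypothetical; the real implementation would use whatever models the backend already has.

```python
import sqlite3


def init_corpus(conn: sqlite3.Connection, name: str, curator_id: int) -> int:
    """Create an empty corpus record owned by the curator (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS corpus ("
        "id INTEGER PRIMARY KEY, "
        "name TEXT UNIQUE NOT NULL, "       # the name must be unique
        "curator_id INTEGER NOT NULL, "     # link to the curator's user ID
        "availability TEXT NOT NULL DEFAULT 'private')"
    )
    try:
        with conn:  # commits on success, rolls back on error
            cursor = conn.execute(
                "INSERT INTO corpus (name, curator_id) VALUES (?, ?)",
                (name, curator_id),
            )
    except sqlite3.IntegrityError:
        raise ValueError(f"corpus name already in use: {name!r}") from None
    return cursor.lastrowid
```

The uniqueness check is left to the database constraint, so two curators submitting the same name at the same time cannot both succeed.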

Step 1: defining data extraction

When the curator has confirmed their choice, they are taken to step 1 of the form, where they will fill in how the data should be extracted.

To start, they are required to upload an example CSV file. (The upload can have a size limit, since it really should be a small sample.)

The backend reads the provided CSV file and assembles a list of the available column names. The rest of the form now unlocks and the curator can start filling things in.
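Reading the column names could be as simple as the sketch below. It assumes the file has a header row and is UTF-8 encoded; the `column_names` helper and the 1 MB limit are placeholders, not decisions.

```python
import csv
import io

MAX_SAMPLE_BYTES = 1_000_000  # hypothetical size limit for the example file


def column_names(raw: bytes, max_bytes: int = MAX_SAMPLE_BYTES) -> list[str]:
    """Return the header of an uploaded CSV sample as a list of column names."""
    if len(raw) > max_bytes:
        raise ValueError("sample file is too large")
    text = raw.decode("utf-8-sig")  # also accepts files that start with a BOM
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader, None)
    if not header:
        raise ValueError("CSV file has no header row")
    return [name.strip() for name in header]
```

The returned list is what would populate the column dropdown in the rest of the form.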

The most important step here is to add fields to the corpus. For each field, the curator fills in the name, type, and picks from a dropdown which column from the CSV should be used to extract the value. (Some advanced options may also be available here.)
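To make the shape of a field definition concrete, here is a minimal sketch. The `FieldDefinition` class, the list of field types and the `validate_fields` helper are assumptions for illustration; the actual set of types and options is still to be decided.

```python
from dataclasses import dataclass

FIELD_TYPES = {"text", "date", "integer"}  # hypothetical list of supported types


@dataclass(frozen=True)
class FieldDefinition:
    name: str    # name of the field in the corpus
    type: str    # one of FIELD_TYPES
    column: str  # CSV column the value is extracted from


def validate_fields(fields: list[FieldDefinition], columns: list[str]) -> list[str]:
    """Return human-readable problems; an empty list means the form is valid."""
    errors = []
    seen = set()
    for field in fields:
        if field.name in seen:
            errors.append(f"duplicate field name: {field.name}")
        seen.add(field.name)
        if field.type not in FIELD_TYPES:
            errors.append(f"{field.name}: unknown type {field.type}")
        if field.column not in columns:
            errors.append(f"{field.name}: column {field.column} not in CSV")
    return errors
```

Returning a list of messages rather than raising on the first problem lets the form show every issue at once.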

Step 2: verifying extraction

The curator has confirmed their definition and moves to the next menu, where the backend will try to extract the data from the example file and the curator can see if everything works as expected.

The backend runs the documents() function of the corpus, resulting in the JSON data that would be sent to elasticsearch during indexing. Of course, any error messages or warnings will be shown here.
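The sketch below is a simplified stand-in for this step, not the existing `documents()` implementation. It assumes field definitions are plain dicts with `name`, `type` and `column` keys and that types were already validated in step 1; `CONVERTERS` and `extract_preview` are hypothetical names.

```python
import csv
import io
from datetime import date

# Hypothetical converters from CSV strings to JSON-compatible values.
CONVERTERS = {
    "text": str,
    "integer": int,
    "date": lambda value: date.fromisoformat(value).isoformat(),
}


def extract_preview(csv_text: str, fields: list[dict]) -> tuple[list[dict], list[str]]:
    """Extract one document per CSV row and collect problems instead of failing."""
    documents, problems = [], []
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    for row_number, row in enumerate(reader, start=1):
        document = {}
        for field in fields:
            name = field["name"]
            raw = row.get(field["column"])
            if raw is None:
                document[name] = None
                problems.append(f"row {row_number}, field {name}: value missing")
                continue
            try:
                document[name] = CONVERTERS[field["type"]](raw)
            except ValueError as error:
                document[name] = None
                problems.append(f"row {row_number}, field {name}: {error}")
        documents.append(document)
    return documents, problems
```

Both return values would go to the frontend: the documents can be serialised with `json.dumps` for review and download, and the problems list is what the curator sees as warnings.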

The extracted JSON is sent to the frontend and shown to the curator. They can review the JSON and download it to run tests, if they want.

At this point, the curator can go back to step 1 and make some more edits to the form. If there were no errors and the curator is satisfied, they can go to step 3.

Step 3: indexing

The curator is happy with their choice and wants to index the corpus. There are several options for implementing the upload and indexing; these are discussed in a separate issue. It is possible to start with a draft version where this step is 'contact a developer and ask them to index the corpus'.

Step 4: interface settings

We now have an indexed corpus in I-analyzer. The corpus availability is still set to 'private', so the curator can see it but regular users can't.

The curator can view the corpus in the I-analyzer interface. They may have reason to clear the index and go back to step 2. If not, they use this step of the form to fine-tune the interface.

Here, the curator chooses interface settings that don't affect the elasticsearch index. For example, they can pick which fields are shown as filters, or choose the image and description for the corpus.

As will be explained in the form, the settings in this step can be changed at any time. Changing the settings from earlier steps would require re-indexing, but everything here can be changed freely and takes effect immediately.
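One way to enforce this split is to classify each setting and check which kind has changed. This is only a sketch: the setting names in both sets and the `requires_reindex` helper are made up for illustration.

```python
# Hypothetical grouping of settings; the real keys are still to be decided.
INDEX_SETTINGS = {"fields", "source_columns"}              # steps 1-3
INTERFACE_SETTINGS = {"filters", "image", "description"}   # step 4


def requires_reindex(old: dict, new: dict) -> bool:
    """Return True if the change touches anything that affects the index."""
    changed = {key for key in old.keys() | new.keys() if old.get(key) != new.get(key)}
    unknown = changed - INDEX_SETTINGS - INTERFACE_SETTINGS
    if unknown:
        raise ValueError(f"unknown settings: {sorted(unknown)}")
    return bool(changed & INDEX_SETTINGS)
```

With a check like this, the form can save interface changes directly and warn the curator before saving anything that needs a new index.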

Step 5: publish the corpus

The corpus is ready and the curator sets its availability to 'public'. The corpus is now available to regular users. It probably makes sense to let the curator also pick which user roles have access at this point.
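A sketch of how availability and roles could combine when deciding who sees a corpus. The `Corpus` class, `publish` and `can_view` are hypothetical, and treating an empty role list as "all users" is an assumption, not something decided here.

```python
from dataclasses import dataclass, field


@dataclass
class Corpus:
    name: str
    curator_id: int
    availability: str = "private"  # corpora start out private
    allowed_roles: set = field(default_factory=set)


def publish(corpus: Corpus, roles=()) -> None:
    """Make the corpus public, optionally restricted to some user roles."""
    corpus.availability = "public"
    corpus.allowed_roles = set(roles)


def can_view(corpus: Corpus, user_id: int, user_roles=()) -> bool:
    if user_id == corpus.curator_id:
        return True  # the curator can always see their own corpus
    if corpus.availability != "public":
        return False
    # Assumption: no roles selected means every user has access.
    return not corpus.allowed_roles or bool(corpus.allowed_roles & set(user_roles))
```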


This was referenced Nov 25, 2022

Upload CSV example file #983

Open

Documentation for corpus definition form #984

Open

Index corpus from interface #985

Closed

lukavdplas added the enhancement (improvements to user functionality) and major (major changes to functionality and/or the code base) labels Nov 25, 2022

lukavdplas mentioned this issue Nov 28, 2022

Index version management from interface #1007

Open

lukavdplas added the corpus (changes to corpus definitions or new corpora) label Dec 8, 2022

lukavdplas mentioned this issue May 15, 2023

Compatibility between old and new corpus definitions #979

Closed

lukavdplas mentioned this issue Nov 24, 2023

Corpus validation after all fields are added #1333

Closed

This was referenced Feb 1, 2024

Infer visualisations from field definition #637

Closed

Schema for JSON/YAML corpus definition #1410

Closed

lukavdplas added the needs-mockup (this suggestion could use a picture before it is implemented) label Feb 27, 2024

lukavdplas mentioned this issue May 8, 2024

Feature/json corpus api #1569

Merged

lukavdplas mentioned this issue May 22, 2024

Automate / infer decisions for corpus definitions #987

Closed

lukavdplas assigned JeltevanBoheemen Jul 30, 2024

lukavdplas linked a pull request Dec 13, 2024 that will close this issue

Feature/corpus form #1659

Open
