Compatibility between old and new corpus definitions #979

Closed
lukavdplas opened this issue Nov 25, 2022 · 4 comments
Labels
backend (changes to the django backend), corpus (changes to corpus definitions or new corpora)

Comments

@lukavdplas (Contributor)

Following #978, we now have both a new and a legacy format for saving corpus definitions. Depending on the implementation we prefer, we have two options:

  • Converting old corpus definitions to the new database-only format
  • Allowing the system to draw corpus definitions from both the database and the python module.
@lukavdplas added the backend and corpus labels Nov 25, 2022
@lukavdplas (Contributor Author)

For both old and new corpora, we have the issue that our source data may consist of complicated XML files, scraped HTML data, or anything else that requires a tailored script to process. Storing corpora as database-only definitions would not support this.

The design philosophy for new corpora, especially for the idea of researchers adding their own corpora, is that the python script to read complex XML files can be detached from the web application. A researcher could write their own code to save their data as uncomplicated CSV files, and import those into I-analyzer.

The same can be said for our current corpora: you can separate the functionality to parse the source data from the functionality to make a corpus object in I-analyzer and index it in elasticsearch.

My proposal would be to make a separate repository, e.g. ianalyzer-extractors, which would more or less branch off from our current addcorpus and corpora modules. It would take a directory of source files for a corpus (e.g. the XML files for the Times), extract the documents, and output everything in an index-ready format, e.g. some neatly formatted CSV or JSON files.

These formatted files can then be used as source data for I-analyzer, which adds everything to an elasticsearch index etc.
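
As a very rough sketch, such an extractor class could look something like the following. The class name, XML tags, and field names are made up for illustration; the real Times definition is more involved.

```python
# Hypothetical sketch of an ianalyzer-extractors class; all names
# here are assumptions for illustration, not existing code.
import csv
import glob
import os
from xml.etree import ElementTree


class TimesExtractor:
    '''Turn a directory of Times XML files into one index-ready CSV file.'''

    fields = ['date', 'title', 'content']

    def sources(self, data_directory):
        # yield the path of every XML file in the corpus directory
        pattern = os.path.join(data_directory, '**', '*.xml')
        yield from glob.iglob(pattern, recursive=True)

    def documents(self, data_directory):
        # extract one dict per article; the tag names are assumptions
        for filename in self.sources(data_directory):
            tree = ElementTree.parse(filename)
            for article in tree.iter('article'):
                yield {
                    'date': article.findtext('date'),
                    'title': article.findtext('title'),
                    'content': article.findtext('text'),
                }

    def extract(self, data_directory, output_file):
        # write everything to a neatly formatted CSV
        with open(output_file, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=self.fields)
            writer.writeheader()
            writer.writerows(self.documents(data_directory))
```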

(N.B.: the ianalyzer-extractors package could also be combined with some of our code for harvesting or scraping.)

This means that existing corpus definitions essentially get separated into:

  • A python class in ianalyzer-extractors, which mainly consists of the sources function and the extractor for each field.
  • A YAML definition that can be imported into I-analyzer, describing the elasticsearch index and interface options (sketched below).
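
For illustration, the YAML half of such a split definition might look something like this. The schema is entirely hypothetical; it just shows the kind of information the interface needs:

```yaml
# Hypothetical schema, for illustration only
name: times
title: Times
description: Newspaper articles from the Times
es_index: times
fields:
  - name: date
    display_name: Date
    type: date
    search_filter: date_range
  - name: content
    display_name: Content
    type: text
    searchable: true
```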

I think there are some advantages to this separation, but it does mean that we won't have a single corpus definition anymore.

@lukavdplas (Contributor Author)

I've been giving this some more thought. An alternative way of going about this is to follow the second option I describe above - letting two methods of adding corpora exist side by side - at least initially.

Roughly speaking, we could take the following approach:

  1. Expand the Corpus model significantly
  2. Add a second method of adding a Corpus object
  3. Gradually transition existing corpora to being database-only, or leave things as is.

In more detail, this would look like the following.

1. Expand corpus model

  • Expand the current Corpus model so it includes everything that is currently serialised to the frontend. Since these are the properties we are already serialising, finding a database representation is definitely possible.
  • Add a save method to the corpus definition class, which saves the corpus to the database. This will replace the current serialize method. (See the sketch after this list.)
  • Adjust the CorpusSerializer class: since all information is now present in the database, it can just serialise from there, without looking at the python class.
  • Note that the data model is still a subset of the python classes, but we can now start to adjust other backend functions. They should use the data model when they can, and only load the python class when they need to (e.g. for document image requests).
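
A minimal sketch of what the expanded model and the save method could look like, assuming Django. The fields would follow whatever we currently serialise, so the names below are illustrative, not a final schema:

```python
# Sketch only: fields are assumed, not the actual serialised properties
from django.db import models


class Corpus(models.Model):
    name = models.SlugField(max_length=126, unique=True)
    title = models.CharField(max_length=256)
    description = models.TextField(blank=True)
    es_index = models.CharField(max_length=126)
    min_date = models.DateField(null=True)
    max_date = models.DateField(null=True)
    languages = models.JSONField(default=list)


class CorpusDefinition:
    '''Abridged stand-in for the existing python definition class.'''
    name = 'times'
    title = 'Times'
    description = 'Newspaper articles from the Times'
    es_index = 'times'
    min_date = None
    max_date = None
    languages = ['en']

    def save(self):
        # replaces the current `serialize` method: write everything
        # the frontend needs into the database instead
        corpus, _ = Corpus.objects.update_or_create(
            name=self.name,
            defaults={
                'title': self.title,
                'description': self.description,
                'es_index': self.es_index,
                'min_date': self.min_date,
                'max_date': self.max_date,
                'languages': self.languages,
            },
        )
        return corpus
```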

We should now have reached a point where most corpus functionality during runtime only consults the data model. The bits that do require the actual python class should satisfy one of the following:

  • The function is only evaluated during index time (not while the application is active)
  • The function can be compartmentalised as an optional, advanced feature - such as presenting document images.

This means that we can introduce other methods of adding corpora to the database, without breaking core functionality.

2. Alternative method of adding corpora

At this point, we can add a new method of adding corpora to a database. (#981 or #982, though the former may be easier to start with.)

As an intermediate step, this method can just provide a representation of everything in the data model. This should be sufficient for connecting to an existing index.

Of course, the method only becomes useful if we include some extra information about extracting data, and functionality for indexing, so that is the next step. At this point, the indexing script needs to distinguish between the different types of corpora: should it load a python class or use database representations?
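
To illustrate the kind of dispatch I mean, a sketch; every name here (has_python_definition, the helper functions) is a hypothetical placeholder, not existing code:

```python
# Hypothetical sketch: all names are assumptions, not existing code.

def load_corpus_definition(name):
    '''Stand-in for importing the python class of a legacy corpus.'''
    raise NotImplementedError

def read_formatted_sources(corpus):
    '''Stand-in for reading index-ready CSV/JSON source files.'''
    raise NotImplementedError

def bulk_index(index_name, documents):
    '''Stand-in for sending documents to elasticsearch.'''
    raise NotImplementedError

def index_corpus(corpus):
    # the indexing script distinguishes the two corpus types here
    if corpus.has_python_definition:
        # legacy route: the python class extracts documents from
        # the raw source files itself
        definition = load_corpus_definition(corpus.name)
        documents = definition.documents()
    else:
        # database-only route: read the index-ready files described
        # by the stored definition
        documents = read_formatted_sources(corpus)
    bulk_index(corpus.es_index, documents)
```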

3. ?

At this point, we can either:

  • Accept that we have two methods of adding corpora with different use cases
  • Gradually work towards making the python corpora database-only. This can be done by a) incrementally expanding the JSON method where needed, and b) factoring out data extraction into a preprocessing package like the one I describe above.

@lukavdplas (Contributor Author)

Summary of where we're at:

For now, we'll stick with having two methods of entering corpus definitions. If both of those are saved in the database, it doesn't really create issues during runtime.

That means that we'll also accept that, for the time being, python-based definitions may support some features that won't be supported in JSON-based definitions.

@lukavdplas (Contributor Author)

I'm closing this since there is no particular work that needs to be done for this. The work to create a shared database model (which is then serialised for the API) was effectively finished with #1226. Python corpora don't need further adjustments; JSON-based corpora can be added as an additional option. For the latter, #978 covers the necessary database expansions, and #1410 covers defining and parsing the JSON model.
