Compatibility between old and new corpus definitions #979

Closed
lukavdplas opened this issue Nov 25, 2022 · 4 comments
Labels
backend (changes to the django backend), corpus (changes to corpus definitions or new corpora)

Comments

@lukavdplas (Contributor)

Following #978, we now have both a new and a legacy format for saving corpus definitions. Depending on the implementation we prefer, we have two options:

  • Converting old corpus definitions to the new database-only format
  • Allowing the system to draw corpus definitions from both the database and the python module.
@lukavdplas added the backend and corpus labels Nov 25, 2022
@lukavdplas (Contributor Author)

For both old and new corpora, we have the issue that our source data may consist of complicated XML files, scraped HTML data, or anything else that requires a tailored script to process. Storing corpora as database-only definitions would not support this.

The design philosophy for new corpora, especially for the idea of researchers adding their own corpora, is that the python script to read complex XML files can be detached from the web application. A researcher could write their own code to save their data as uncomplicated CSV files, and import those into I-analyzer.

The same can be said for our current corpora: you can separate the functionality to parse the source data from the functionality to make a corpus object in I-analyzer and index it in elasticsearch.

My proposal would be to make a separate repository, e.g. ianalyzer-extractors, which would more or less branch off from our current addcorpus and corpora modules. It would take a directory of source files for a corpus (e.g. the XML files for the Times), extract the documents, and output everything in an index-ready format, e.g. some neatly formatted CSV or JSON files.

These formatted files can then be used as source data for I-analyzer, which adds everything to an elasticsearch index etc.
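
As a very rough sketch, such an extractor class could look something like the following. The class name, XML tags, and field names are made up for illustration; the real Times definition is more involved.

```python
# Hypothetical sketch of an ianalyzer-extractors class; all names
# here are assumptions for illustration, not existing code.
import csv
import glob
import os
from xml.etree import ElementTree


class TimesExtractor:
    '''Turn a directory of Times XML files into one index-ready CSV file.'''

    fields = ['date', 'title', 'content']

    def sources(self, data_directory):
        # yield the path of every XML file in the corpus directory
        pattern = os.path.join(data_directory, '**', '*.xml')
        yield from glob.iglob(pattern, recursive=True)

    def documents(self, data_directory):
        # extract one dict per article; the tag names are assumptions
        for filename in self.sources(data_directory):
            tree = ElementTree.parse(filename)
            for article in tree.iter('article'):
                yield {
                    'date': article.findtext('date'),
                    'title': article.findtext('title'),
                    'content': article.findtext('text'),
                }

    def extract(self, data_directory, output_file):
        # write everything to a neatly formatted CSV
        with open(output_file, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=self.fields)
            writer.writeheader()
            writer.writerows(self.documents(data_directory))
```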

(N.B.: the ianalyzer-extractors package could also be combined with some of our code for harvesting or scraping.)

This means that existing corpus definitions essentially get separated into:

  • A python class in ianalyzer-extractors, which mainly consists of the sources function and the extractor for each field.
  • A YAML definition that can be imported into I-analyzer, describing the elasticsearch index and interface options (sketched below).
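
For illustration, the YAML half of such a split definition might look something like this. The schema is entirely hypothetical; it just shows the kind of information the interface needs:

```yaml
# Hypothetical schema, for illustration only
name: times
title: Times
description: Newspaper articles from the Times
es_index: times
fields:
  - name: date
    display_name: Date
    type: date
    search_filter: date_range
  - name: content
    display_name: Content
    type: text
    searchable: true
```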

I think there are some advantages to this separation, but it does mean that we won't have a single corpus definition anymore.

@lukavdplas (Contributor Author)

I've been giving this some more thought. An alternative way of going about this is to follow the second option I describe above - letting two methods of adding corpora exist side by side - at least initially.

Roughly speaking, we could take the following approach:

  1. Expand the Corpus model significantly
  2. Add a second method of adding a Corpus object
  3. Gradually transition existing corpora to being database-only, or leave things as is.

In more detail, this would look like the following.

1. Expand corpus model

  • Expand the current Corpus model so it includes everything that is currently serialised to the frontend. Since these are the properties we are already serialising, finding a database representation is definitely possible.
  • Add a save method to the corpus definition class, which saves the corpus to the database. This will replace the current serialize method. (See the sketch after this list.)
  • Adjust the CorpusSerializer class: since all information is now present in the database, it can just serialise from there, without looking at the python class.
  • Note that the data model is still a subset of the python classes, but we can now start to adjust other backend functions. They should use the data model when they can, and only load the python class when they need to (e.g. for document image requests).
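
A minimal sketch of what the expanded model and the save method could look like, assuming Django. The fields would follow whatever we currently serialise, so the names below are illustrative, not a final schema:

```python
# Sketch only: fields are assumed, not the actual serialised properties
from django.db import models


class Corpus(models.Model):
    name = models.SlugField(max_length=126, unique=True)
    title = models.CharField(max_length=256)
    description = models.TextField(blank=True)
    es_index = models.CharField(max_length=126)
    min_date = models.DateField(null=True)
    max_date = models.DateField(null=True)
    languages = models.JSONField(default=list)


class CorpusDefinition:
    '''Abridged stand-in for the existing python definition class.'''
    name = 'times'
    title = 'Times'
    description = 'Newspaper articles from the Times'
    es_index = 'times'
    min_date = None
    max_date = None
    languages = ['en']

    def save(self):
        # replaces the current `serialize` method: write everything
        # the frontend needs into the database instead
        corpus, _ = Corpus.objects.update_or_create(
            name=self.name,
            defaults={
                'title': self.title,
                'description': self.description,
                'es_index': self.es_index,
                'min_date': self.min_date,
                'max_date': self.max_date,
                'languages': self.languages,
            },
        )
        return corpus
```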

We should now have reached a point where most corpus functionality during runtime only consults the data model. The bits that do require the actual python class should satisfy one of the following:

  • The function is only evaluated during index time (not while the application is active)
  • The function can be compartmentalised as an optional, advanced feature - such as presenting document images.

This means that we can introduce other methods of adding corpora to the database, without breaking core functionality.

2. Alternative method of adding corpora

At this point, we can add a new method of adding corpora to a database. (#981 or #982, though the former may be easier to start with.)

As an intermediate step, this method can just provide a representation of everything in the data model. This should be sufficient for connecting to an existing index.

Of course, the method only becomes useful if we include some extra information about extracting data, and functionality for indexing, so that is the next step. At this point, the indexing script needs to distinguish between the different types of corpora: should it load a python class or use database representations?
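
To illustrate the kind of dispatch I mean, a sketch; every name here (has_python_definition, the helper functions) is a hypothetical placeholder, not existing code:

```python
# Hypothetical sketch: all names are assumptions, not existing code.

def load_corpus_definition(name):
    '''Stand-in for importing the python class of a legacy corpus.'''
    raise NotImplementedError

def read_formatted_sources(corpus):
    '''Stand-in for reading index-ready CSV/JSON source files.'''
    raise NotImplementedError

def bulk_index(index_name, documents):
    '''Stand-in for sending documents to elasticsearch.'''
    raise NotImplementedError

def index_corpus(corpus):
    # the indexing script distinguishes the two corpus types here
    if corpus.has_python_definition:
        # legacy route: the python class extracts documents from
        # the raw source files itself
        definition = load_corpus_definition(corpus.name)
        documents = definition.documents()
    else:
        # database-only route: read the index-ready files described
        # by the stored definition
        documents = read_formatted_sources(corpus)
    bulk_index(corpus.es_index, documents)
```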

3. ?

At this point, we can either:

  • Accept that we have two methods of adding corpora with different use cases
  • Gradually work towards making the python corpora database-only. This can be done by a) incrementally expanding the JSON method where needed, and b) factoring out data extraction into a preprocessing package like the one I describe above.

@lukavdplas (Contributor Author)

Summary of where we're at:

For now, we'll stick with having two methods of entering corpus definitions. If both of those are saved in the database, it doesn't really create issues during runtime.

That means that we'll also accept that, for the time being, python-based definitions may support some features that won't be supported in JSON-based definitions.

@lukavdplas (Contributor Author)

I'm closing this since there is no particular work that needs to be done for this. The work to create a shared database model (which is then serialised for the API) was effectively finished with #1226. Python corpora don't need further adjustments; JSON-based corpora can be added as an additional option. For the latter, #978 covers the necessary database expansions, and #1410 covers defining and parsing the JSON model.
