Compatibility between old and new corpus definitions #979
For both old and new corpora, we have the issue that our source data may consist of complicated XML files, scraped HTML data, or anything else that requires a tailored script to process; storing corpora as database-only would not support this.

The design philosophy for new corpora, especially for the idea of researchers adding their own corpora, is that the Python script to read complex XML files can be detached from the web application. A researcher could write their own code to save their data as uncomplicated CSV files, and import those into I-analyzer.

The same can be said for our current corpora: we can separate the functionality to parse the source data from the functionality to make a corpus object in I-analyzer and index it in Elasticsearch. My proposal would be to make a separate repository, e.g. … These formatted files can then be used as source data for I-analyzer, which adds everything to an Elasticsearch index, etc.

(N.B.: The …)

This means that existing corpus definitions essentially get separated in …
I think there are some advantages to this separation, but it does mean that we won't have a single corpus definition anymore.
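To make the separation concrete, here is a minimal sketch of the kind of standalone script a researcher might write to flatten a complex XML source into a CSV that can then serve as source data. The XML tags and CSV field names here are invented for illustration; they are not taken from any actual corpus.

```python
import csv
import xml.etree.ElementTree as ET

def xml_to_csv(xml_path, csv_path):
    """Flatten one XML file of <document> elements into a simple CSV.

    Purely illustrative: the tag names (date, title, text) are hypothetical.
    """
    tree = ET.parse(xml_path)
    rows = []
    for doc in tree.getroot().iter("document"):
        rows.append({
            "date": doc.findtext("date", default=""),
            "title": doc.findtext("title", default=""),
            "content": doc.findtext("text", default=""),
        })
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["date", "title", "content"])
        writer.writeheader()
        writer.writerows(rows)
```

The point is that this script has no dependency on the web application at all; its output is a plain CSV that I-analyzer can ingest.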
I've been giving this some more thought. An alternative way of going about this is to follow the second method I describe above: have two methods of adding corpora exist side by side, at least initially. Roughly speaking, we could take the following approach:
In more detail, this would look like the following.

**1. Expand corpus model**
We should now have reached a point where most corpus functionality at runtime only consults the datamodel. The bits that do still require the actual Python class should be one of the following:
This means that we can introduce other methods of adding corpora to the database without breaking core functionality.

**2. Alternative method of adding corpora**

At this point, we can add a new method of adding corpora to a database. (#981 or #982, though the former may be easier to start with.) As an intermediate step, this method can just provide a representation of everything in the datamodel. That should be sufficient for connecting to an existing index. Of course, the method only becomes useful if we include some extra information about extracting data, plus functionality for indexing, so that is the next step. At this point, the indexing script needs to distinguish between the different types of corpora: should it load a Python class or use the database representation?

**3. ?**

At this point, we can either:
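The dispatch step mentioned above could be sketched roughly as follows. `Corpus`, `DatabaseCorpusDefinition`, and the attribute names are hypothetical stand-ins for whatever the datamodel actually provides, not the real I-analyzer API.

```python
from dataclasses import dataclass
from typing import Optional

class DatabaseCorpusDefinition:
    """Sketch: extraction/indexing settings read purely from the datamodel."""
    def __init__(self, corpus):
        self.corpus = corpus

@dataclass
class Corpus:
    name: str
    # Set only for legacy, python-based corpora; hypothetical attribute name.
    python_class: Optional[type] = None

def load_corpus_definition(corpus: Corpus):
    """Pick the right definition type for the indexing script."""
    if corpus.python_class is not None:
        # Legacy corpus: instantiate the python definition class.
        return corpus.python_class()
    # New-style corpus: the database representation is all we have (and need).
    return DatabaseCorpusDefinition(corpus)
```

The rest of the indexing script can then work against a single interface, regardless of which branch produced the definition object.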
Summary of where we're at: for now, we'll stick with having two methods of entering corpus definitions. If both of those are saved in the database, this doesn't really create issues during runtime. It also means that we accept that, for the time being, Python-based definitions may support some features that won't be supported in JSON-based definitions.
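For illustration, a JSON-based definition could look something like the fragment below. The exact schema is still to be defined, so every field name here is hypothetical; the point is just that the definition is pure data, with no Python code attached.

```json
{
  "name": "example-corpus",
  "es_index": "example-corpus",
  "source_data": "csv",
  "fields": [
    {"name": "date", "type": "date", "csv_column": "date"},
    {"name": "content", "type": "text", "csv_column": "content"}
  ]
}
```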
I'm closing this, since there is no particular work that needs to be done for it. The work to create a shared database model (which is then serialised for the API) was effectively finished with #1226. Python corpora don't need further adjustments; JSON-based corpora can be added as an additional option. As for that, #978 covers the necessary database expansions, and #1410 covers defining and parsing the JSON model.
Following #978, we now have a new and a legacy format for saving corpus definitions. Depending on the implementation we prefer, we have two options: