Data processing lifecycle of EDTFField? #17

Open
jmurty opened this issue Jun 5, 2017 · 1 comment

jmurty commented Jun 5, 2017

We need to give some thought to the lifecycle of data going into the EDTFField.

Initial questions are:

  1. when should the EDTF field be updated based on the value of the associated natural_text_field?

    1. if the natural_text_field is cleared, presumably the EDTF and other derived field values should also be cleared?
    2. if the natural_text_field value is invalid or cannot be turned into an EDTF, should the existing EDTF and other derived field values be cleared?
    3. since parsing of natural_text_field is very slow, could we somehow cache the before-and-after values and only re-parse it when it changes?
  2. can/should we permit direct interaction with the EDTFField?

    1. the field is intended to be used only via automatic processing of the natural_text_field source field, but this is both slow and likely to lead to less precise data if there is already a known EDTF value that could be set directly (e.g. via the GLAMkit Collections API)
    2. if the EDTF field is set directly and the natural_text_field is not set, should this be reverse-generated from the EDTF value?
    3. what if the EDTF field is set directly, then a natural_text_field value is set subsequently? Should the EDTF value always be replaced by the parsed natural_text_field value?
    4. should we permit, and do we need to detect or handle, the other derived fields like lower_fuzzy_field being set directly instead of via the EDTFField's processing?
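
As a rough illustration of question 1.3 (re-parse only when the source text changes), here is a minimal sketch. `DerivedDateCache` and `fake_parse` are hypothetical names used only for this example; they are not part of the library, and clearing the derived value when the text is empty is an assumption taken from question 1.1, not a settled decision.

```python
class DerivedDateCache:
    """Remember the last source text and re-parse only when it changes."""

    def __init__(self, parse):
        self._parse = parse          # the slow parsing callable (assumed)
        self._last_text = None
        self._last_result = None
        self.parse_calls = 0         # counts slow-path runs, for illustration

    def get(self, text):
        text = (text or "").strip()
        if not text:
            # Assumption: cleared source text clears the derived value.
            self._last_text, self._last_result = "", None
            return None
        if text != self._last_text:
            # Source text changed, so pay for one parse and remember it.
            self.parse_calls += 1
            self._last_result = self._parse(text)
            self._last_text = text
        return self._last_result


def fake_parse(text):
    # Placeholder for the real parser; returns a recognisable dummy value.
    return "parsed:" + text
```

Calling `get()` twice with the same text runs the parser once; the second call returns the remembered result.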

cogat commented Jun 6, 2017

Great points @jmurty

I think the upshot is we need to know for an EDTFField whether it was automatically or manually populated, a bit like Occurrences for events.

I envisage this as an associated checkbox called "update automatically", default True. Given that, my answers to your questions are:

  1. if unchecked, never. If checked, whenever the natural_text_field is changed.
    i. yes
    ii. yes
    iii. actually, parsing of natural text into an EDTF string is fast; parsing of the EDTF string into a Python object is slow. I think the best bang-for-buck is to speed up the parser. I implemented the BNF grammar, and it has a lot of very pedantic definitions of which strings of digits can specify a month, etc. I think relaxing these low-level constraints and validating them in Python will speed things up heaps. In answer to the question, comparing before-and-after EDTF strings seems reasonable. It would probably even be a speed-up to retrieve the object from the DB at pre-save and check that.

  2. If unchecked, yes. If checked, then perhaps yes, provided the checkbox becomes unchecked when you interact directly with the field; otherwise no.
    i. agreed
    ii. in future it should be possible to derive a natural text description of an EDTF field. This is likely to be locale/client-specific though. I don't see a great need to store the description, and especially not in natural_text_field, to avoid feedback loops.
    iii. checkbox should remove this possibility
    iv. the intention is that those fields are always derived from the EDTF field. Some collections systems have their own lower/upper bound dates. I doubt any such collections systems ALSO handle EDTF, so we can in those cases construct an EDTF date from the lower/upper bounds, which will get us the same derived dates. The difficulty arises when the lower/upper and natural dates contradict, which should be resolved by the above.
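
To make the "update automatically" proposal concrete, here is a minimal sketch of the decision it implies at save time. Everything in it is hypothetical: `DateRecord`, `resolve_edtf` and `text_to_edtf_stub` are names invented for this example, and the stub only recognises a bare four-digit year so that the snippet stays self-contained.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DateRecord:
    natural_text: str = ""
    edtf: Optional[str] = None
    update_automatically: bool = True  # the proposed checkbox, default True


def text_to_edtf_stub(text: str) -> Optional[str]:
    # Hypothetical stand-in: accepts only a bare four-digit year.
    text = text.strip()
    return text if len(text) == 4 and text.isdigit() else None


def resolve_edtf(record, previous_text, text_to_edtf=text_to_edtf_stub):
    """Return the EDTF string the record should hold after a save."""
    if not record.update_automatically:
        return record.edtf   # manually populated: never overwrite
    if record.natural_text == previous_text:
        return record.edtf   # source text unchanged: skip parsing
    # Cleared or unparseable text clears the derived value (answers i and ii).
    return text_to_edtf(record.natural_text)
```

Here `previous_text` stands for the natural text stored before the save, for example the value read back from the DB at pre-save as suggested in 1.iii.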

How does that sound?

jmurty added a commit that referenced this issue Jun 6, 2017
- use faster pyparsing grammar constructs and
  arrangements to significantly speed up parsing
- enable skipped parsing unit tests now that they
  are not infeasibly slow
- add testing requirements to setup.py

Anecdotal speed increase is from about 30 seconds
to run the `test_date_values` tests down to below
3 seconds.

See also #17
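
The "relax the grammar, validate in Python" idea from the comment above can be sketched as follows. This is not the change made in the referenced commit: it uses the standard-library `re` module as a stand-in for the pyparsing grammar, covers only plain `YYYY`, `YYYY-MM` and `YYYY-MM-DD` strings, and `parse_simple_date` is a hypothetical helper name.

```python
import calendar
import re

# Loose pattern: any digits in the right positions, no per-field rules.
LOOSE_DATE = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")

DAYS_IN_MONTH = (31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)


def days_in_month(year, month):
    if month == 2 and calendar.isleap(year):
        return 29
    return DAYS_IN_MONTH[month - 1]


def parse_simple_date(text):
    """Return (year, month, day), with None for missing parts, or None if invalid."""
    match = LOOSE_DATE.fullmatch(text)
    if match is None:
        return None
    year, month, day = (int(g) if g is not None else None for g in match.groups())
    # Range checks live in Python rather than in the grammar.
    if month is not None and not 1 <= month <= 12:
        return None
    if day is not None and not 1 <= day <= days_in_month(year, month):
        return None
    return (year, month, day)
```

The pattern accepts strings such as `2017-13`, and the Python checks then reject them, so the matching step stays simple while the result is still validated.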
cogat removed their assignment Jan 5, 2020