Issue #12189 Added mechanism to propagate tripped QC test flags to all affected parameters #60
base: master
Conversation
Where are the unit tests to make sure the QC tests are propagating as they are supposed to?
Also, I'd like to see some timing results. Will this affect normal stream queries, or does it only affect the NetCDF download?
Need to look into why the unit test is failing.
@@ -10,10 +10,30 @@
@log_timing(_log)
Looks like this annotation should still be applied to _get_stream_metadata_list() since this function is called multiple times in _get_stream_metadata_list().
Please restate this. You used the same function name for both the calling and the called function. I refactored the first line of _get_stream_metadata to create _get_stream_metadata_list. Looking at these two functions, I don't see what you mean when you say one is called multiple times by the other.
def get_refdes_streams_for_subsite_node(subsite, node):
    return [(Dict['referenceDesignator']['subsite']+'-'+Dict['referenceDesignator']['node'] +
It seems a bit odd to me that you have to reconstruct the reference designator when it's in the dictionary as a key.
From what I've seen, the data returned from the call to _get_stream_metadata_list has a reference_designator map that contains only the subsite, node and sensor elements.
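To make the shape concrete, here is a small sketch of that structure and the reconstruction; the subsite/node/sensor keys come from the snippet above, while the example values and variable names are invented:

# Sketch only: example metadata record per the comment above; values are made up.
record = {'referenceDesignator': {'subsite': 'CE01ISSM', 'node': 'MFD37', 'sensor': '03-CTDBPC000'}}
rd = record['referenceDesignator']
refdes = '-'.join((rd['subsite'], rd['node'], rd['sensor']))  # 'CE01ISSM-MFD37-03-CTDBPC000'

If that is the case, the full reference designator string is not available directly and has to be rebuilt by concatenation, which is what the snippet above does.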
Dan - I believe this will affect all stream_engine queries. I'll work on generating timing results. Keep in mind most of what happens is triggered when a QC test failure is recognized. I'll work on adding a test to test_stream_request.py.

Mark - I'm trying to make sense of the Travis CI test failure. I'll get back to you on that.
In general, the _run_qc and the _initialize methods have become very long. It would be a major help to readability and understanding if some of the logic could be abstracted away into new functions.
@@ -10,10 +10,30 @@
@log_timing(_log)
def _get_stream_metadata_list():
The _get_stream_metadata_list() function is getting called in 3 functions: _get_stream_metadata(), get_refdes_streams_for_subsite_node(), and get_streams_for_subsite_node_sensor(). It looks like stream_request.py calls at least two of these functions in _run_qc(). I'm not sure how expensive the metadata_service_api.get_stream_metadata_records() and underlying query are, but it may be a good idea to try to limit how often the metadata lookup occurs.
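One possible way to limit the lookups, sketched here under the assumption that the metadata records do not change for the life of the process (only metadata_service_api.get_stream_metadata_records() is from the code under review; the cache variable is hypothetical):

# Sketch: module-level memoization so the three callers share a single query.
_stream_metadata_records = None

def _get_stream_metadata_list():
    global _stream_metadata_records
    if _stream_metadata_records is None:
        _stream_metadata_records = metadata_service_api.get_stream_metadata_records()
    return _stream_metadata_records

If the records can change while Stream Engine is running, a per-request cache (for example, fetching the list once in _run_qc and passing it to the helpers) would be safer than a module-level one.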
QC_RESULTS = '_qc_results'
RELATED = 'R'
# these tests were deemed not ready for propagation by the Data Team
QC_TEST_FAIL_SKIP_PROPAGATE = ('dataqc_gradienttest', 'dataqc_localrangetest', 'dataqc_polytrendtest')
Would this constant be better off defined in the config file, config/default.py? Then code changes would not be necessary if the Data Team saw reason to add or remove entries from this group.
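For illustration, a minimal sketch of that suggestion; it assumes config/default.py is loaded into the application config the way the project's other settings are, which is an assumption on my part:

# config/default.py (sketch): tests the Data Team excluded from propagation
QC_TEST_FAIL_SKIP_PROPAGATE = ('dataqc_gradienttest',
                               'dataqc_localrangetest',
                               'dataqc_polytrendtest')

The QC code would then read the tuple from the loaded config (for example, something like app.config['QC_TEST_FAIL_SKIP_PROPAGATE'] if a Flask-style config object is in use) instead of hard-coding it, so the Data Team could change the list without a code change.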
for dataset in stream_dataset.datasets.itervalues():
    self.qc_executor.qc_check(param, dataset)
    # check for QC test failure on 1+ data particles for the parameter
    if (param_qc_executed in dataset.keys()) and (param_qc_results in dataset.keys()) and \
            (dataset[param_qc_executed].values.min() > dataset[param_qc_results].values.min()):
I don't understand how the comparison of min values is supposed to indicate whether or not there were any QC failures. I'm thinking dataset[param_qc_executed].values looks something like [21, 21, 21, 21, ..., 21] and dataset[param_qc_results].values is the same structure, but with potentially different numbers. Not sure if these values are possible, but for example, if the qc_executed values were [21, 21, 16, 4] and the qc_results values were [21, 21, 8, 4], then the minimum values would be equal and I think the check would fail to see there was a QC failure (executed value of 16 vs results value of 8).
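A quick NumPy sketch of that counterexample (array values taken from this comment, variable names invented):

import numpy as np

qc_executed = np.array([21, 21, 16, 4])  # bitmask of QC tests that ran, per particle
qc_results = np.array([21, 21, 8, 4])    # bitmask of QC tests that passed, per particle

# The min() comparison misses the failure on the third particle:
print(qc_executed.min() > qc_results.min())   # False (4 > 4), so no failure is detected

# A per-particle bitwise check catches it: any bit set in qc_executed but
# cleared in qc_results means at least one executed test did not pass.
print(np.any(qc_executed & ~qc_results))      # True (third particle: 16 & ~8 != 0)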
                          if i & variance and self.qc_test_flag_map[i][1]]
if failed_tests_to_propagate:
    qc_executed_values[position] = \
        qc_executed_values[position] | self.qc_propagate_flag
I see that comments 12 and 13 of the 12189 ticket mention this, but I don't understand why QC is propagated by changing bits in the qc_executed parameter instead of the qc_results parameter. My understanding was that _qc_executed indicated what QC tests were run for param and _qc_results indicated which QC tests passed for param. It seems strange to me, therefore, to indicate a propagated failure in qc_executed.
Never mind, this should be OK. I didn't realize there is a special QC propagation flag and thought other test flags were getting overwritten.
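For anyone following the thread, a small illustration of the bit semantics being described; the flag value and bit layouts are invented for the example, and only the idea of OR-ing a dedicated propagation flag into qc_executed comes from the change under review:

# qc_executed: one bit set per QC test that ran on a particle
# qc_results:  the same bit set only if that test passed
QC_PROPAGATE_FLAG = 0b10000000  # hypothetical bit reserved for propagated failures

upstream_qc_executed = 0b00010101  # three tests ran on the source parameter
upstream_qc_results = 0b00010001   # one of them failed (the 0b100 bit is cleared)

if upstream_qc_executed & ~upstream_qc_results:  # some executed test did not pass
    # Flag the derived parameter's qc_executed without overwriting any of the
    # bits recording its own tests.
    derived_qc_executed = 0b00000001 | QC_PROPAGATE_FLAG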
I am concerned that the logic for determining which parameters QC failures should propagate to does not match the way Stream Engine is resolving parameters. When there are multiple potential sources for an input parameter, it is important that the propagation of QC failures reflects only the input parameter that is actually used to calculate the derived parameter.
@sgfoote - I suspect that we should not continue work on this until Stream Engine has been updated to use Preload's parameter resolution. This was also causing difficulty for me in resolving other Stream Engine issues.
Had a brief chat with Dan, and the solution of recording the derivation map as it is determined can proceed. We just need to keep in mind that the current logic for choosing input parameters to algorithms will eventually be replaced by Preload's parameter resolution.