fix: sort objects in dicts/sets by normalize_token rather than str #10918
Conversation
Can one of the admins verify this patch? Admins can comment
Here's a very minimal reproducer involving awkward array:

```python
import awkward as ak
from dask.base import normalize_token

a = ak.Array([1, 2, 3, 4, 5])
layout, report = ak.typetracer.typetracer_with_report(a.layout.form_with_key())
b = ak.Array(layout)
print(report.data_touched, report.shape_touched)

d = {"what": b}

# Sorting by normalize_token leaves the typetracer report untouched...
normalized = sorted(d.items(), key=lambda x: str(normalize_token(x)))
print(report.data_touched, report.shape_touched)

# ...whereas sorting by str marks data and shapes as touched.
normalized = sorted(d.items(), key=str)
print(report.data_touched, report.shape_touched)
```

resulting in:
Which means str is intrusive to these objects while normalize_token itself is not, so the sorting could instead be handled through `normalize_token`. This should aid the deterministic-keys effort in the long run.
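As a sketch of that approach (the `Lazy` type here is hypothetical, not awkward's), a normalizer can be registered with dask's `normalize_token` dispatch so that `tokenize` never needs to call `str()` on the object:

```python
from dask.base import normalize_token

class Lazy:
    """Hypothetical stand-in for an object whose __str__ would
    materialize or touch its underlying buffers."""

    def __init__(self, form):
        self.form = form

    def __str__(self):
        raise RuntimeError("str() would touch the buffers")

@normalize_token.register(Lazy)
def _normalize_lazy(obj):
    # Built from cheap metadata only; str(obj) is never called.
    return "Lazy", obj.form
```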
Darn, this methodology has the unfortunate outcome that we are sensitive to the history of the graph. So if you start from a different graph and optimize it, that is not the same as constructing an optimized graph by hand. This is why the SubgraphCallable tests are failing.
Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0   15 suites ±0   3h 11m 48s ⏱️ -6m 0s
Results for commit f1e196f. ± Comparison against base commit 7726d31.

♻️ This comment has been updated with latest results.
Hmm, not fixed by including #10883 (in fact, that PR breaks this case further).
OK - this seems to get it all in reasonable shape, doesn't break any existing rules, and now has a test that shows what the problem is. After getting it all in shape I tried merging #10883 again and that PR still causes the test to break. |
@lgray I'd be tempted to remove the test dependency on awkward, and instead make a mock type that throws an error in `__str__`.
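A hedged sketch of that suggested test (names and structure are illustrative, not from the PR): a mock type whose `__str__` raises but which registers its own normalizer, so that — assuming this PR's `key=tokenize` sorting — tokenizing a set of them only succeeds if the objects are never stringified:

```python
from dask.base import tokenize, normalize_token

class NoStr:
    def __init__(self, value):
        self.value = value

    def __str__(self):
        raise AssertionError("tokenize must not call str()")

    __repr__ = __str__

@normalize_token.register(NoStr)
def _normalize_nostr(obj):
    return "NoStr", obj.value

def test_tokenize_does_not_stringify():
    # Sorting set elements by their token keeps str() out of the path;
    # with the old key=str sorting this would raise.
    tokenize({NoStr(1), NoStr(2)})
```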
```diff
@@ -73,6 +73,7 @@ complete = [
 ]
 test = [
     "pandas[test]",
+    "awkward",
```
-1 from me. I'd rather not add test dependencies to our already heavy test suite.
Also this doesn't actually run the added test in CI.
Superseded by #10919
```diff
@@ -1075,7 +1075,7 @@ def normalize_set(s):
     # Note: in some Python version / OS combinations, set order changes every
     # time you recreate the set (even within the same interpreter).
     # In most other cases, set ordering is consistent within the same interpreter.
-    return "set", _normalize_seq_func(sorted(s, key=str))
+    return "set", _normalize_seq_func(sorted(s, key=tokenize))
```
This is quite expensive as it will tokenize everything twice. The benefit compared to str is only (1) when str is expensive and (2) when different objects have the same str output.
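A hedged sketch of one way to address that cost (`normalize_set_once` is a hypothetical name, not dask's code): normalize each element exactly once and sort the already-normalized tokens, so nothing is tokenized twice and `str()` only ever sees the cheap token, never the original object:

```python
from dask.base import normalize_token

def normalize_set_once(s):
    # Normalize every element a single time, then sort the normalized
    # tokens; str() of a token cannot touch the original object.
    normalized = [normalize_token(x) for x in s]
    return "set", sorted(normalized, key=str)
```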
This ensures that `__dask_tokenize__` is the only entrypoint that other modules need to care about when interacting with `dask.base.tokenize`.

- `pre-commit run --all-files`
- Closes dask-contrib/dask-awkward#468
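For illustration, a minimal sketch of the `__dask_tokenize__` protocol the description refers to (the `Record` class and its attributes are hypothetical):

```python
from dask.base import tokenize

class Record:
    def __init__(self, name, payload):
        self.name = name
        self.payload = payload

    def __dask_tokenize__(self):
        # Return something cheap and deterministic; tokenize() hashes
        # this instead of falling back to str()/pickle.
        return ("Record", self.name)

# Equal tokens: only the metadata returned above feeds the hash.
assert tokenize(Record("a", object())) == tokenize(Record("a", object()))
```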