Improve Performance and Reduce memory usage #4
…s inline to save memory
Oh, yeah, in the plot, "negative cases" refers to the number of negative labels in the binary classification data set I used to test this. The negative cases dominate the data set.
Hi @gerner, thanks for your PR. I think it would be better to split this into 3 PRs: one for the reservoir sampler, one for the nearest neighbors search, and maybe another one for logging. I have some comments regarding these:
For all 3 PRs, I need to have unit tests.
[1] Li, Kim-Hung (4 December 1994). "Reservoir-Sampling Algorithms of Time Complexity O(n(1+log(N/n)))". ACM Transactions on Mathematical Software. 20 (4): 481–493.
@johny-c thanks for the feedback. I saw the 3-part PR request coming. I'll do that and add unit tests.
Reservoir Sampling: Yes, I pulled the algorithm from Wikipedia, the so-called "Algorithm L", as you cite. Are you looking for that reference in the code or somewhere else? In terms of testing, are you just looking for a comparison of reservoir sampling vs. post-hoc sampling in general? Would a unit test that compares timing and checks means/stdev be sufficient? What about just running these tests offline and presenting a plot of histograms and some results?
NearestNeighbors
eprint
Hi, I also checked river but found nothing there either, so let's go with your implementation.
Documentation: I mean something like the docstring of the
Reservoir sampling: I want to see some evidence here (yes, like plots / histograms) showing comparisons in terms of accuracy, speed and memory. The unit tests should only check whether the implementation is correct. So you need to come up with some small examples where you know the ground truth or where it is trivial. They shouldn't run for too long, and definitely not on real data sets.
NN search: if that is the case, I wonder if that should then rather be a PR in
logging: This is a large conversation that affects more than just this library. This issue in scikit-learn has been open for 10 years now, so I guess it's not that straightforward. For now I would keep things as they are, as I have not received any complaints about logging. I don't mind if you open an issue or a PR though, to have a more in-depth discussion here.
I'm wrapping up a new PR for reservoir sampling. It looks to be quite a lot slower than rng.choice. To make it competitive, I think it would have to be cythonized or written in native code. However, I think that's a small amount of time compared to the rest of the work that's getting done. I'll have more details in a separate PR. I tried the sklearn NN-based implementation, and you are correct that brute-force search performs the same as the hand-coded one I did, so I'll cut that PR. Thanks for pushing on that; I was wrong. It sounds like I'm unlikely to solve the logging issue right now. I might still address progress logging in my fork if I end up shipping this to prod.
Hey @gerner, no problem.
Reduce memory usage and improve performance by:
Below is a comparison of different methods to select target neighbors. A degree-2 polynomial trendline is included. NearestNeighbors is the original implementation, which constructs a NearestNeighbors object from sklearn and then uses kneighbors. Dualtree forces the use of a ball_tree and uses the dual_tree option for finding kneighbors. pairwise_distances_chunked is the implementation from this PR, which uses sklearn.metrics.pairwise_distances_chunked directly. The pairwise_distances_chunked method is much faster than constructing the NearestNeighbors object and using its kneighbors method to find the target neighbors.
Note that the dimensionality of this dataset is very high, much higher than the cutoff of 15 that NearestNeighbors uses to switch to brute force.
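For readers unfamiliar with the chunked API, here is a minimal sketch of a k-nearest-neighbor lookup built on sklearn.metrics.pairwise_distances_chunked. It is not the code from this PR: the helper name k_nearest_indices is hypothetical, it ignores class labels (target neighbors in LMNN are restricted to the same class), and it assumes Euclidean distance and k smaller than the number of samples.

```python
import numpy as np
from sklearn.metrics import pairwise_distances_chunked


def k_nearest_indices(X, k):
    """Indices of the k nearest other rows for every row of X (hypothetical helper)."""

    def reduce_func(dist_chunk, start):
        # dist_chunk holds distances from rows [start, start + m) to all rows of X.
        rows = np.arange(dist_chunk.shape[0])
        # Mask each point's distance to itself so it is never its own neighbor.
        dist_chunk[rows, start + rows] = np.inf
        # Partial selection of the k smallest distances per row (unordered).
        part = np.argpartition(dist_chunk, k - 1, axis=1)[:, :k]
        # Order those k candidates by distance, nearest first.
        order = np.argsort(dist_chunk[rows[:, None], part], axis=1)
        return part[rows[:, None], order]

    # Only one chunk of the distance matrix is held in memory at a time.
    chunks = pairwise_distances_chunked(X, reduce_func=reduce_func)
    return np.vstack(list(chunks))
```

Because reduce_func shrinks every chunk to k columns before the next chunk is computed, the full n-by-n distance matrix is never materialized.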
Reservoir sampling is used to do on-line uniform random sampling from the set of impostors, rather than sampling after all impostors have been selected. Holding all impostors for all nodes turns out to be very memory-intensive for large datasets, so on-line sampling helps a lot.
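As an illustration of the idea, here is a minimal pure-Python sketch of reservoir sampling following the "Algorithm L" scheme discussed in this thread. It is not the implementation from this PR; the names reservoir_sample and _open_unit are hypothetical, and a production version would likely operate on NumPy arrays of impostor pairs instead of a generic iterable.

```python
import itertools
import math
import random

_MISSING = object()  # sentinel marking an exhausted stream


def _open_unit(rng):
    """Draw a float strictly inside (0, 1) so that log() is always defined."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def reservoir_sample(stream, k, rng=None):
    """Return k items drawn uniformly without replacement from an iterable."""
    if k <= 0:
        return []
    if rng is None:
        rng = random.Random()
    it = iter(stream)
    reservoir = list(itertools.islice(it, k))
    if len(reservoir) < k:
        return reservoir  # stream shorter than k: keep everything
    # Clamp w just below 1 so that log1p(-w) stays finite.
    max_w = 1.0 - 2.0 ** -53
    w = min(math.exp(math.log(_open_unit(rng)) / k), max_w)
    while True:
        # Number of items to skip before the next replacement.
        skip = math.floor(math.log(_open_unit(rng)) / math.log1p(-w))
        item = next(itertools.islice(it, skip, None), _MISSING)
        if item is _MISSING:
            return reservoir
        # Replace a uniformly chosen slot, then shrink w for the next jump.
        reservoir[rng.randrange(k)] = item
        w *= math.exp(math.log(_open_unit(rng)) / k)
```

Only the k reservoir slots are ever stored, so memory use does not depend on how many candidates the stream produces.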
Finally, I moved all the prints to write to stderr instead of stdout. Logging is useful, but writing it to stdout can pollute application-specific results, which makes it hard to use the verbose option in some settings.
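A minimal sketch of what such a helper could look like is shown below. The name eprint and its signature are illustrative assumptions, not necessarily what this PR contains.

```python
import sys


def eprint(*args, **kwargs):
    """Like print(), but writes to stderr unless a file is given explicitly."""
    # Look up sys.stderr at call time so later redirection is respected.
    kwargs.setdefault("file", sys.stderr)
    print(*args, **kwargs)
```

With this in place, a script can pipe its stdout into another tool while verbose progress messages still show up on the terminal.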
Tests still pass, although I had to reduce the precision of one of the tests from 6 digits to 5. I also made the tests run with sklearn 0.24, which moved the test utils to a private module. I think most of those are available in numpy, but I wasn't sure about all of them, so I left them as they are and wrapped the import in a try/except that falls back to the other import path.
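The import fallback described above can be sketched roughly as follows. The two helpers shown are just examples and may not be the ones the test suite actually uses, and the final fallback to numpy.testing is an addition for illustration only.

```python
try:
    # Newer scikit-learn: the test helpers live in a private module.
    from sklearn.utils._testing import assert_allclose, assert_array_equal
except ImportError:
    try:
        # Older scikit-learn: public module, since removed.
        from sklearn.utils.testing import assert_allclose, assert_array_equal
    except ImportError:
        # Assumed fallback: these two helpers are also provided by NumPy.
        from numpy.testing import assert_allclose, assert_array_equal
```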