Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Performance and Reduce memory usage #4

Closed

Conversation

gerner
Copy link

@gerner gerner commented Jan 28, 2021

Reduce memory usage and improve performance by:

  • use pairwise_distances_chunked which is much faster than constructing NearestNeighbors and using it for kneighbors (see plot below)
  • sample impostor indices using a reservoir sampler instead of sampling after the fact

Below is a comparison of different methods to select target neighbors. A polynomial degree 2 trendline is included. NearestNeighbors is the original implementation that constructs a NearestNeighbors from sklearn and then uses kneighbors. Dualtree forces the use of a ball_tree and uses the dual_tree option for finding kneighbors. pairwise_distances_chunked is the implementation from this PR that uses sklearn.metrics.pairwise_distances_chunked directly. The pairwise_distances_chunked method is much faster than constructing the NearestNeighbors object and using it's kneighbors method to find the target neighbors.

Note, the dimenstionality of this dataset is very high, much higher than the 15 cutoff NearestNeighbors uses to switch to brute force.

image

The reservoir sampling is used to do on-line uniform random sampling from the set of imposters, rather than sampling after all imposters have been selected. Holding all impostors for all nodes turns out to be very memory intensive for large datasets, so on-line sampling helps a lot.

Finally, I moved all the prints to write to stderr instead of stdout. Having logging is useful, having it write to stdout can pollute application specific results that makes it hard to use the verbose option in some settings.

Tests still pass, although I had to reduce the precision of one of the tests from 6 digits to 5. I also made the tests run with sklearn 0.24 which moved the test utils to a private module. I think most of those are avialable in numpy, but I wasn't sure about all of them, so I just left it and have a try/import/except/other import around those.

@gerner
Copy link
Author

gerner commented Jan 28, 2021

Oh, yeah, in the plot "negative cases" refers to the number of negative labels in the binary classification data set I used to test this. The negative cases dominate the data set.

@gerner gerner changed the title Reduce memory usage Improve Performance and Reduce memory usage Jan 29, 2021
@johny-c
Copy link
Owner

johny-c commented Jan 30, 2021

Hi @gerner , thanks for your PR. I think it would be better to split this in 3 PRs, one for the Reservoir Sampler, one for the nearest neighbors search and maybe another one for logging . I have some comments regarding these:

  • Reservoir Sampler: This is a good idea. It would be nice to add documentation to this. I guess the appropriate reference for the particular algorithm is [1], according to Wikipedia (?). Also, I would prefer using numpy instead of math for these operations and giving the variables more descriptive names. Maybe, there is some official implementation out there? Finally, I would like to have a comparison for speed and accuracy with and without reservoir sampling for 3-4 diverse datasets.
  • NN Search: I am not sure about this one. For sure, there can be cases / data sets where going "brute force" is faster. But I believe the implementation in scikit-learn is taking into account a variety of cases and switches to a different algorithm (ball-tree, kd-tree, etc.) accordingly. I think I used brute force at first myself. If you really need to enforce using brute force, you can pass {algorithm: "brute"} through neighbors_params. That would do the same as your implementation, right?
  • logging: Sorry, I don't like the eprint solution so much. If there is going to be a change in printing, I would switch to using python's logging module. Although, this could cause other issues.

For all 3 PRs, I need to have unit tests.

[1] "Li, Kim-Hung (4 December 1994). "Reservoir-Sampling Algorithms of Time Complexity O(n(1+log(N/n)))". ACM Transactions on Mathematical Software. 20 (4): 481–493."

@gerner
Copy link
Author

gerner commented Jan 30, 2021

@johny-c thanks for feedback.

I saw the 3 part PR request coming. I'll do that and add unit tests.

Reservoir Sampling
I looked around for some off the shelf reservoir sampler, or any kind of on-line sampling from numpy or sklearn, but couldn't find anything.

Yes, I pulled the algo from Wikipedia, the so called "Algorithm L", as you cite. Are you looking for that reference in code or somewhere else?.

In terms of testing, are you just looking for a comparison or reservoir sampling vs post-hoc sampling in general? Would a unit test that compares timing and checks means/stdev be sufficient? What about just running these tests offline and presenting a plot of histograms and some results?

NearestNeighbors
The data set I tested on for NN was of high enough dimension that NN was using brute-force for the tests I did. That's the blue line in my plot. I'll double check that to make sure. I hope I'm right because I didn't want to do this work :) If I am correct I'll include a little more info about that.

eprint
I prefer logging for this kind of thing as well. But using logging requires a little bit of setup that doesn't belong in a library (i.e. logging.basicConfig) to get any output so I thought that was a non-starter for folks that just want verbose=2 to work. I thought this was a reasonable compromise. Basically I want some kind of logging, but I can't have it going to stdout. Here are some options:

  • Go with logging, as you suggest, and just deal the fact that setting verbose=2 is not sufficient. I still need to figure out how verbose would interact with turning on/off loggers and logging levels which is a little weird. Maybe it's easiest to still wrap logging.info calls with verbose as currently done?
  • Change all your "prints" to print(..., file=sys.stderr). A minor change, but error prone and doesn't protect future work.
  • Have an optional callback or a couple of callbacks. An app can set verbose=2 and get the current behavior, or verbose=0 plus set some callbacks and still get some customized progress feedback. I'm imagining a callback around selecting target neighbors (e.g. class plus how many records have neighbors selected so far) and another around learning the distance metric (e.g. the same thing you're printing right now: iterations, loss, active triplets). The callbacks are just for progress notification, not for any other purpose.

@johny-c
Copy link
Owner

johny-c commented Jan 30, 2021

Hi, I also checked river but nothing there either, so let's go with your implementation.

Documentation: I mean something like the docstring of the LargeMarginNearestNeighbor class, describing how the sampling method works, the variables used and putting the reference at the end.

Reservoir sampling: I want to see some evidence here (yes, like plots / histograms) showing comparisons in terms of accuracy, speed and memory. The unit tests should only check if the implementation is correct. So, you need to come up with some small examples where you know the groundtruth or it is trivial - they shouldn't run for too long, definitely not on real data sets.

NN search: if that is the case, I wonder if that should then rather be a PR in scikit-learn regarding their implementation of NearestNeighbors.

logging: This is a large conversation, that affects more than just this library. This issue in scikit-learn is open for 10 years now, so I guess it's not that straightforward. For now I would keep things as they are, as I have not received any complaints about logging. I don't mind if you open an issue or a PR though, to have a more in-depth discussion here.

@gerner
Copy link
Author

gerner commented Feb 1, 2021

I'm wrapping up a new PR for Reservoir sampling. It looks to be slower than rng.choice by quite a lot. To make it competitive I think it'd have to be cythonized or in native code. However, I think that's a small amount of time compared to the rest of the work that's getting done. I'll have more details in a separate PR.

I tried the sklearn NN based implementation, and you are correct, that brute force search performs the same as the hand coded one I did, so I'll cut that PR. Thanks for pushing on that, I was wrong.

It sounds like I'm unlikely to solve logging issue right now. I might still address progress logging it in my fork if I end up shipping this to prod.

@johny-c
Copy link
Owner

johny-c commented Feb 2, 2021

Hey @gerner , no problem.
Regarding Reservoir sampling, I suspect you can get a speedup by precomputing 2 vectors of random numbers out of the for loop with numpy, instead of generating them one by one in the for loop. This might not be enough, but it's definitely an easy change.

@gerner
Copy link
Author

gerner commented Feb 3, 2021

@johny-c I just opened #5 which has an improved version of the reservoir sampling change plus various tests. I'll close this PR now.

@gerner gerner closed this Feb 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants