# Use reservoir sampling for impostors to save memory #5
## Conversation
I should also point out that on the data set I'm actually using, finding the target neighbors takes far longer than training the metric. And in my case, the current master implementation that does post-hoc sampling of impostors uses more memory than is available on the machine I'm training on, causing the training process to get killed. So this is a great trade-off for me.
Hi @gerner, thanks for your work. A few comments: …
Sure, I like your idea about configuration. Are you suggesting making it configurable via a parameter? And sure, I'll add a test that compares performance of the different choices for impostor storage/sampling.
That's right, how impostors are sampled is independent of how they are stored. Therefore maybe there should be a separate parameter (e.g. `impostor_sampler`).
Here's a performance comparison on the iris dataset. Note that I picked a very small `max_impostors` to (hopefully) ensure we actually do sampling:

```python
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# imports added for completeness; lmnn is the module from this branch,
# which adds the impostor_sampler parameter
from pylmnn import lmnn

rng = np.random.RandomState(42)

samplers = ['uniform', 'reservoir']
scores = {impostor_sampler: list() for impostor_sampler in samplers}
n_tries = 500

iris = datasets.load_iris()
perm = rng.permutation(iris.target.size)
iris_data = iris.data[perm]
iris_target = iris.target[perm]

for impostor_sampler in samplers:
    for i in tqdm(range(n_tries)):
        X_train, X_test, y_train, y_test = train_test_split(iris_data, iris_target)
        lmnn_mdl = lmnn.LargeMarginNearestNeighbor(
            impostor_sampler=impostor_sampler,
            max_impostors=5)
        lmnn_mdl.fit(X_train, y_train)
        knn = KNeighborsClassifier(n_neighbors=lmnn_mdl.n_neighbors_)
        LX_train = lmnn_mdl.transform(X_train)
        knn.fit(LX_train, y_train)
        LX_test = lmnn_mdl.transform(X_test)
        scores[impostor_sampler].append(knn.score(LX_test, y_test))

fig, ax = plt.subplots(figsize=(10, 10))
for impostor_sampler in samplers:
    ax.hist(scores[impostor_sampler], bins=30, alpha=0.3,
            label=f'{impostor_sampler} mean={np.mean(scores[impostor_sampler]):.3f}')
ax.legend()
plt.show()
```
I made reservoir sampling optional, controlled by an `impostor_sampler` parameter, similar to what you suggested.
Thanks for the changes :) I think the RS has another advantage, namely that we can use one object for the whole loop over classes. Therefore I would create a separate class for it.
You're a tough customer. Do I get a citation out of this? j/k. I think I see what you're saying, but it's kind of marginal, right? More than half the code will be identical. What if I make an alternative to ReservoirSampler, UniformSampler? The two are polymorphic, so the only difference will be which one we instantiate. We get rid of the if/elses and abstract some of the code around extending arrays, checking lengths, doing sampling, etc. That makes more sense to me.

Below are 100 trials on digits using default params other than `impostor_sampler`, and another 100 trials on digits using max_impostors=50 to compare the effect of forcing more sampling. And for faces I did a single run of each: the score was 0.96 for uniform and 1.0 for reservoir with default parameters.
I converted the uniform sampling approach to use a class polymorphic with ReservoirSampler. This cleaned up `_find_impostors_blockwise` and I think accomplishes your goal of readability without duplicating the actual interesting logic (about margin radii and such).
Oh, I also re-ran the timing and memory tests, and the updated uniform sampling implementation is comparable to the previous one.
Haha, I know - I just want to make sure things will work as expected. Regarding citations, there is actually a PR to scikit-learn itself.
Are you suggesting creating a new file, impostor_sampling.py, to hold the sampler classes?
Yes, I don't mind, but I think …
Done. Let me know what else I can do to streamline the merge and publish back to PyPI. I love giving back to open source and I've learned a little bit in this process, but mostly I want to get this into the hands of my colleagues without passing around a custom tarball :)
Hi @gerner. I finally found some time to have a look at this. I updated the code a bit and will leave some review comments in GitHub. I think there are still issues that need fixing and some code that I'm not yet familiar with. One general comment is to have a look at the pull request checklist suggested by scikit-learn (https://scikit-learn.org/stable/developers/contributing.html#pull-request-checklist) so that your code does not have PEP8 errors and so on when you push. Also, I think I need to first abstract the impostor storage in a similar way as we did for impostor sampling. Then we could perhaps use one Sampler object per iteration, which would make more sense algorithmically.
```python
k = self.sample_size
if n > k:
    ind = self.random_state.choice(n, k, replace=False)
    return np.asarray(self._values, dtype=object)[ind]
```
What is the overhead here, when values contains both distances and indices? Maybe there is a more efficient way to sample from multiple arrays? We could also use the knowledge that there are always either 1 or 2 arrays used.
I agree, this is not ideal, and you're right that the root cause is needing to support sampling from parallel arrays of different types (just indices, or indices and distances). I saw this as a bigger problem and didn't address it, in favor of brevity.
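Not the PR's code, but a sketch of the alternative being discussed: draw one index set and apply it to each typed array (assuming equal-length parallel arrays), so no object array is needed.

```python
import numpy as np

rng = np.random.RandomState(0)
ind_imp = np.arange(10_000)    # impostor indices (stand-in data)
dist_imp = rng.random(10_000)  # corresponding distances

k = 500
sel = rng.choice(len(ind_imp), k, replace=False)  # one draw shared by both arrays
sampled_ind, sampled_dist = ind_imp[sel], dist_imp[sel]
```

Each array keeps its own dtype, and the single-array case is simply `ind_imp[sel]`.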
```python
# NOTE: At this point, the reservoir might still be not full

# general case: sample from latest_values
while self._next_i - offset < len(latest_values):
```
I think this while loop can be vectorized. I need to look at this in more depth, but I think a lot of time can be saved.
Yeah, I agree, and this is where things are slower for the reservoir sampler. Related to this is the idea of pre-computing the sample indices in blocks. I could easily do some of this (https://github.com/johny-c/pylmnn/pull/5/files/42d42ab02d310aa4f5a8d7b716ae2bb50396aa16#diff-adeb34b813966f09fdcba22cd080044cf631ad669d04a94a05f971dd51c868b0R131-R139), but my math on random distributions isn't quite good enough to vectorize all of it.
More concisely: I believe it's possible to create a vector of all the necessary draws from the geometric distribution for the given block we're extending from; I just don't know how. (See the sketch below.)
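For what it's worth, here is a sketch of the batching idea (a hypothetical helper, untested against this PR): in Algorithm L, log(w) advances additively (log w += log(u)/k after each accepted item), so a whole batch of successive weights, and the geometric skips derived from them, can be computed with one cumulative sum.

```python
import numpy as np

def batched_skips(rng, k, w, m):
    # Pre-compute m geometric skip lengths for Algorithm L, assuming rng
    # is a numpy Generator, k is the reservoir size, and w is the current
    # Algorithm-L weight. Returns the skips and the updated weight.
    steps = np.log(rng.random(m)) / k
    log_w = np.log(w) + np.concatenate(([0.0], np.cumsum(steps[:-1])))
    ws = np.exp(log_w)
    # each skip is geometric with success probability (1 - w_i)
    skips = np.floor(np.log(rng.random(m)) / np.log1p(-ws)).astype(np.int64)
    w_next = float(np.exp(log_w[-1] + steps[-1]))
    return skips, w_next
```

The consumer would walk `skips` sequentially and discard any unused tail once the current block is exhausted.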
```python
k = self.sample_size
rng = self.random_state

latest_values = list(latest_values)  # TODO: This seems unnecessary
```
This would make a copy even when it's not necessary. Also, here we could exploit the fact that there are always 1 or 2 arrays.
This is here because the caller might be sending in an iterable (it is, in the current implementation), not necessarily a list, which makes some of the operations a little trickier (e.g. random access, getting the length). I agree there's a better solution, and it's tied up in needing to support parallel np arrays of different types.
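One hedged option (a hypothetical helper, not something in this PR): materialize the iterable into typed arrays in a single pass up front, so the sampler only ever deals with numpy arrays and random access and `len()` are cheap.

```python
import numpy as np

def materialize(latest_values):
    # Hypothetical helper: turn an iterable of (index, distance) pairs
    # into two typed arrays in one pass over the input.
    idx, dist = [], []
    for i, d in latest_values:
        idx.append(i)
        dist.append(d)
    return (np.asarray(idx, dtype=np.intp),
            np.asarray(dist, dtype=np.float64))
```

With typed arrays in hand, the sampler can draw one index set and apply it to both arrays, as in the earlier sketch.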
I get the feeling you're not just going to merge as-is :) It seems like there are three problems (matching your three review comments above):

1. the overhead of sampling parallel arrays through a single object array,
2. the while loop in the reservoir sampler that could be vectorized,
3. the unnecessary copy when converting `latest_values` to a list.

Anything else? In your mind, what's required for merge?
Well, yes, I really don't want to break something that works up to now. Regarding extra requirements: …

This only makes sense if we have concluded on how the code should look, but you may want to do it now already to catch any potential bugs in the implementation of RS and to get a feeling for its "real-world" performance. I implemented a storage class (see branch …). Some thoughts: …
My top post in this thread (the PR description) generates a few largish datasets (10k examples) with different properties and shows the memory and runtime properties of both sampling strategies. A later post covers the statistical properties of the sampling, comparing reservoir to uniform, as well as running on the faces sklearn sample dataset. I've also been running my branch in my own production setting on a large data set with tens or hundreds of thousands of examples, each with hundreds of features (which is why I need this work). So I have confidence in that regard. From my perspective, the run-time trade-offs here are very small compared to just handling feature data before and after training. In terms of testing on more example datasets, did you have something else in mind?

Your thoughts about refactoring how impostors are chosen and stored seem reasonable. They also seem like a bigger change that might require a lot more work. That sounds like a good idea in concept, but what do you think about locking in this functionality first? What if I clean up any PEP8 violations and how the 1 vs 2 array cases are handled, especially avoiding making extra copies of data? Do you think we could merge that change?
Yes, sorry, I missed your post about faces and digits. If you tackle these issues, it should be fine. I just want to make sure we don't cause a regression.
Thanks for the note, and agreed on not causing a regression. I'm thinking of just special-casing one param vs two. That will make the uniform sampling approach more similar to what was happening before (just with an extra method call or two instead of inline). It'll probably take me a week or so to find the time.
Sure, I was thinking maybe it makes sense to make …
## TL;DR

Reservoir-sampling impostors as they are generated avoids O(n^2) memory usage and recovers the underlying distribution about as well as `numpy.RandomState.choice`, but is slower (~2.5x). For cases with few impostors (relative to `max_impostors`) this has a very small effect, but where there are many impostors relative to the sample size it can add a lot of time to training, up to ~60% in the tests below.

## Overview
While generating impostors blockwise, store their indices (and optionally distances) in a reservoir sampler instead of storing all of them and sampling after generating. Usually this isn't an issue, because the impostor indices (even all of them) are small relative to other state, e.g. the feature matrix. However, the number of possible impostors grows as O(n_samples^2) in the worst case, so in some cases, depending on the layout of examples and how gradient descent proceeds, they can be very large relative to the rest of the state.
This pulls out and improves the ReservoirSample from #4.
The reservoir sampling algorithm is "Algorithm L", as described on the Wikipedia page for reservoir sampling, which in turn references Li, Kim-Hung (1994), "Reservoir-Sampling Algorithms of Time Complexity O(n(1+log(N/n)))", ACM Transactions on Mathematical Software 20(4).
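The PR samples blockwise from parallel index/distance arrays; for orientation only, here is a minimal standalone sketch of Algorithm L (the names and structure are mine, not the PR's):

```python
import itertools
import math
import random

def reservoir_sample(stream, k, rng=random):
    # "Algorithm L": uniform sample of k items from a stream of unknown
    # length, touching only O(k * log(n/k)) items once the reservoir is full.
    it = iter(stream)
    reservoir = list(itertools.islice(it, k))
    if len(reservoir) < k:
        return reservoir  # stream shorter than k: keep everything
    w = math.exp(math.log(rng.random()) / k)
    while True:
        # geometric skip: items to discard before the next accepted one
        skip = int(math.log(rng.random()) / math.log(1.0 - w))
        nxt = list(itertools.islice(it, skip, skip + 1))
        if not nxt:
            return reservoir  # stream exhausted
        reservoir[rng.randrange(k)] = nxt[0]  # replace a uniformly chosen slot
        w *= math.exp(math.log(rng.random()) / k)
```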
## Caveat

Note that when we do have to sample impostors, this implementation comes with a run-time cost: it takes about 2.5x as long as `numpy.RandomState.choice`. This can be a large part of training time in some cases, especially when there are many impostors. But when there are few impostors (relative to `max_impostors`), it has a small impact on training time.

## Tests
In addition to the unit test included in the PR (and coverage from existing tests), these are external tests to verify correctness and analyze performance.

### Correctness
#### Recovery of Population Mean
First we verify that the sampler performs as well as `numpy.RandomState.choice` with respect to measuring the mean of the underlying population (sketch below):
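The original verification code and plots were attached here; a minimal sketch of such a check, reusing the `reservoir_sample` sketch above (so not the PR's actual test code), could look like:

```python
import random
import numpy as np

rng = np.random.RandomState(0)
population = rng.normal(loc=3.0, scale=2.0, size=100_000)
k, n_trials = 1_000, 50

res_means = [np.mean(reservoir_sample(population, k, random)) for _ in range(n_trials)]
cho_means = [np.mean(rng.choice(population, k, replace=False)) for _ in range(n_trials)]

print(f"population mean: {population.mean():.4f}")
print(f"reservoir:       {np.mean(res_means):.4f} +/- {np.std(res_means):.4f}")
print(f"choice:          {np.mean(cho_means):.4f} +/- {np.std(cho_means):.4f}")
```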
#### Distribution Match

Now we check whether the distribution of the sample "matches" the underlying population being sampled, and compare to `numpy.RandomState.choice`.
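The original post compared plots; one way to quantify such a check (an assumption on my part, not the PR's method) is a two-sample Kolmogorov-Smirnov test, again reusing the `reservoir_sample` sketch above:

```python
import random
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
# bimodal population, so a mean check alone wouldn't catch a mismatch
population = np.concatenate([rng.normal(0, 1, 50_000), rng.normal(5, 0.5, 50_000)])

res = np.asarray(reservoir_sample(population, 2_000, random))
cho = rng.choice(population, 2_000, replace=False)

# large p-values => no evidence the samples differ from the population
print(stats.ks_2samp(res, population))
print(stats.ks_2samp(cho, population))
```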
### Performance
#### Memory

Here I look at memory usage in a simplified setting, similar to how sampling is used in lmnn.py. I measure it via max RSS, so each run below is from a fresh Python process (restarted Jupyter notebook kernel).

- Max RSS usage increased ~20 MB.
- Max RSS usage increased nearly 400 MB.
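The per-run code was attached in the original post; a sketch of how such a measurement can be taken (assumed methodology, Linux semantics for `ru_maxrss`):

```python
import resource

def max_rss_mb():
    # ru_maxrss is reported in KB on Linux (bytes on macOS)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

before = max_rss_mb()
# ... run one sampling strategy here, in a fresh process per run ...
after = max_rss_mb()
print(f"max RSS increased by {after - before:.0f} MB")
```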
#### Timing

These results show `ReservoirSample` takes ~2.5x as long as `np.RandomState.choice`. This is after some optimization, notably generating random numbers and doing some of the arithmetic in batches (of size = sample size); a naive implementation took over 5x as long as `numpy.RandomState.choice`. We'll see below how this can impact end-to-end `LargeMarginNearestNeighbor` training time.

### End-to-End Test
Here we look at end-to-end training performance using both implementations, taken from current master (6f5e385) vs this change, each test with a fresh Python process. Specifically, we compare worst-case performance (impostors totally mixed with class neighbors) against a couple of cases where classes are better separated.

#### Test set up

Three synthetic datasets (the generation code for each case was attached in the original; a sketch follows the list):

- Totally random classes
- Partially offset classes
- Well separated classes
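A hedged sketch of how three such datasets could be constructed (shapes and offsets are my assumptions; this is not the original code):

```python
import numpy as np

rng = np.random.RandomState(0)
n, d, n_classes = 10_000, 20, 3          # assumed sizes (original used 10k examples)
y = rng.randint(n_classes, size=n)
noise = rng.randn(n, d)

X_random = noise                          # labels carry no signal: worst case
X_partial = noise + 0.5 * y[:, None]      # class means partially offset
X_separated = noise + 10.0 * y[:, None]   # classes well separated
```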
#### Results

Per-iteration training time (taken from logging), `numpy.RandomState.choice` vs `ReservoirSample`: [tables not preserved]

Max RSS increase, `numpy.RandomState.choice` vs `ReservoirSample`: [tables not preserved]