Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use reservoir sampling for impostors to save memory #5

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

gerner
Copy link

@gerner gerner commented Feb 3, 2021

TL;DR

Reservoir sampling impostors as they are generated avoids O(n^2) memory usage, recovers the underlying distribution about as well as numpy.RandomState.choice, but is slower (~2.5x). For cases with few impostors (relative to max_impostors) this has a very small effect. But this can add a lot of time to training, up ~60% in tests below, where there are a lot of impostors relative to the sample size.

Overview

While generating impostors blockwise, store their indicies (and optionally distance) in a reservoir sampler instead of storing all of them and sampling after generating. This usually isn't an issue because the size of the impostor indicies (even all of them) should be small relative to other state, e.g. the feature matrix. However the number of possible impostors grows with O(n_samples^2) in the worst case. So in some cases, depending on the layout of examples and how gradient descent proceeds, this can be very large relative to the size of other state.

This pulls out and improves the ResevoirSample from #4

The reservoir sampling algorithm is taken from "Algorithm L" described in the Wikipedia page for reservoir sampling which in turn references:

.. [1] Li, Kim-Hung (4 December 1994).
           "Reservoir-Sampling Algorithms of Time Complexity O(n(1+log(N/n)))"
           ACM Transactions on Mathematical Software. 20 (4): 481–493.

Caveat

Note, in the case that we have to sample impostors this implementation comes with a run-time cost. The implementation is takes about 2.5x as long as numpy.RandomState.choice. This can be a large part of training time in some cases, especially when there are many impostors. But when there are few impostors (relative to max_imposters), this has a small impact on training time.

Tests

In addition to the unit test included in the PR (and coverage from existing tests), these are external tests to verify correctness and analyze performance.

Correctness

Recovery of Population Mean

First we verify that the sample performs as well as numpy.RandomState.choice with respect to measuring the mean of the underlying population:

rng = np.random.RandomState(0)
n_samples = 10000
block_n_rows = 880
n_sample = 100
n_tries = 1000

large_sample_means = []
reservoir_means = []
choice_means = []

for i in range(n_tries):
    # sort X so it's elements are not uniformly distributed w.r.t. position
    # this will help to illustrate sampling bias
    X = np.sort(rng.randn(n_samples))

    sampler = utils.ReservoirSample(n_sample, rng)
    for chunk in gen_batches(n_samples, block_n_rows):
        sampler.extend(X[chunk])

    large_sample_means.append(np.mean(X))
    reservoir_means.append(np.mean(sampler.current_sample()))
    choice_means.append(np.mean(rng.choice(X, n_sample, replace=False)))

fig, ax = plt.subplots(figsize=(10,10))
ax.hist(large_sample_means, bins=30, alpha=0.3, label=f'source pop means mean={np.mean(large_sample_means):.3f} std={np.std(large_sample_means):.3f}')
ax.hist(choice_means, bins=30, alpha=0.3, label=f'rng.choice mean={np.mean(choice_means):.3f} std={np.std(choice_means):.3f}')
ax.hist(reservoir_means, bins=30, alpha=0.3, label=f'ReservoirSample mean={np.mean(reservoir_means):.3f} std={np.std(reservoir_means):.3f}')
ax.legend()
plt.show()

image

Distribution Match

Now we check if the distribution "matches" the underlying population being sampled, and compare to numpy.RandomState.choice.

n_samples = 100000
block_n_rows = 880
n_sample = 1000
n_tries = 1000

X = np.sort(rng.randn(n_samples))
choice_sample = None
reservoir_sample = None

sampler = utils.ReservoirSample(n_sample, rng)
for chunk in gen_batches(n_samples, block_n_rows):
    sampler.extend(X[chunk])
    choice_sample = rng.choice(X, n_sample, replace=False)
    reservoir_sample = sampler.current_sample()

fig, ax = plt.subplots(figsize=(10,10))
ax.hist(X, density=True, bins=30, alpha=0.3, label=f'source pop means mean={np.mean(X):.3f} std={np.std(X):.3f}')
ax.hist(choice_sample, density=True, bins=30, alpha=0.3, label=f'rng.choice mean={np.mean(choice_sample):.3f} std={np.std(choice_sample):.3f}')
ax.hist(reservoir_sample, density=True, bins=30, alpha=0.3, label=f'ReservoirSample mean={np.mean(reservoir_sample):.3f} std={np.std(reservoir_sample):.3f}')
ax.legend()
plt.show()

image

Performance

Memory

Here I look at memory usage in a simplified setting similar to how sampling is used in lmnn.py. I measure it using maxrss size. So each run below is from a fresh python process (restarted jupyter notebook kernel).

n_samples = 10000000
block_n_rows = 80000
n_sample = 500000
print(f'max rss usage before {resource.getrusage(resource.RUSAGE_SELF).ru_maxrss}')
sampler = utils.ReservoirSample(n_sample, rng)
for _ in range(int(n_samples/block_n_rows)):
    X_chunk = rng.randn(block_n_rows)
    sampler.extend(X_chunk)

sample = sampler.current_sample()
print(f'max rss usage after {resource.getrusage(resource.RUSAGE_SELF).ru_maxrss}')
max rss usage before 132584
max rss usage after 153092

max rss ussage increased 20MB.

print(f'max rss usage before {resource.getrusage(resource.RUSAGE_SELF).ru_maxrss}')
ind = []
for _ in range(int(n_samples/block_n_rows)):
    X_chunk = rng.randn(block_n_rows)
    ind.extend(X_chunk)

sample = rng.choice(len(ind), n_sample, replace=False)
print(f'max rss usage after {resource.getrusage(resource.RUSAGE_SELF).ru_maxrss}')
max rss usage before 131712
max rss usage after 528252

max rss ussage increased nearly 400MB.

Timing

n_samples = 10000
block_n_rows = 800
n_sample = 100
n_tries = 1000

X = rng.randn(n_samples)

start_time=time.time()
for i in range(n_tries):
    ind = []
    for chunk in gen_batches(n_samples, block_n_rows):
        ind.extend(X[chunk])
    
    sample = X[rng.choice(len(ind), n_sample, replace=False)]
choice_time = time.time() - start_time
print(f'choice time: {choice_time/n_tries:0.5f}s')

start_time=time.time()
for i in range(n_tries):
    sampler = utils.ReservoirSample(n_sample, rng)
    for chunk in gen_batches(n_samples, block_n_rows):
        sampler.extend(X[chunk])

    sample = sampler.current_sample()
    
reservoir_time = time.time() - start_time
print(f'reservoir time: {reservoir_time/n_tries:0.5f}s')
choice time: 0.00081s
reservoir time: 0.00207s

These results show ReservoirSample takes ~2.5x as long as np.RandomState.choice. This is after some optimization, notably generating random numbers and doing some of the arithmetic in batches (of size = sample size). Naively doing this took over 5x as long as numpy.RandomState.choice. We'll see below how this can impact end-to-end LargeMarginNearestNeighbors training time.

End-to-End Test

Here we look at end-to-end performance of training using both implementations. These were taken from current master
6f5e385 vs this change, each test with a fresh python process. Specifically we're looking at worst case performance (impostors are totally mixed with class neighbors) vs a couple of other cases where classes are better separated.

Test set up

n_samples = 10000
feature_dim = 5
n_classes = 2

Totally random classes:

X = rng.rand(n_samples, feature_dim)
y = rng.randint(0, n_classes, n_samples)
print(f'max rss usage before {resource.getrusage(resource.RUSAGE_SELF).ru_maxrss}')
lmnn.LargeMarginNearestNeighbor(n_components=2, n_neighbors=10, max_iter=5, verbose=2, n_jobs=7).fit(X, y)
print(f'max rss usage after {resource.getrusage(resource.RUSAGE_SELF).ru_maxrss}')

Partially offset classes:

X = np.concatenate([rng.rand(n_samples//2, feature_dim), rng.rand(n_samples//2, feature_dim)+0.75])
y = np.concatenate([np.asarray([0]*(n_samples//2)),np.asarray([1]*(n_samples//2))])
print(f'max rss usage before {resource.getrusage(resource.RUSAGE_SELF).ru_maxrss}')
lmnn.LargeMarginNearestNeighbor(n_components=2, n_neighbors=10, max_iter=5, verbose=2, n_jobs=7).fit(X, y)
print(f'max rss usage after {resource.getrusage(resource.RUSAGE_SELF).ru_maxrss}')

Well separated classes:

X = np.concatenate([rng.rand(n_samples//2, feature_dim), rng.rand(n_samples//2, feature_dim)+1])
y = np.concatenate([np.asarray([0]*(n_samples//2)),np.asarray([1]*(n_samples//2))])
print(f'max rss usage before {resource.getrusage(resource.RUSAGE_SELF).ru_maxrss}')
lmnn.LargeMarginNearestNeighbor(n_components=2, n_neighbors=10, max_iter=5, verbose=2, n_jobs=7).fit(X, y)
print(f'max rss usage after {resource.getrusage(resource.RUSAGE_SELF).ru_maxrss}')

Results

Per-iteration training time (taken from logging):

numpy.RandomState.choice ReservoirSample
Random classes 5.96s 9.76s
Offset classes 0.52s 0.78s
Separated classes 0.33s 0.33s

Max RSS increase:

numpy.RandomState.choice ReservoirSample
Random classes 1018MB 161MB
Offset classes 77MB 80MB
Separated classes 31MB 31MB

@gerner
Copy link
Author

gerner commented Feb 3, 2021

I should also point out that on the data set I'm actually using, finding the target neighbors takes way longer than training the metric. And in my case the current master implementation that does post-hoc sampling of impostors uses more memory than is available on the machine I'm training on causing the training process to get killed. So this is a great trade-off for me.

@johny-c
Copy link
Owner

johny-c commented Feb 8, 2021

Hi @gerner , thanks for your work. A few comments:

  • Since there is a trade off wrt memory and speed, I would keep the previous code and add reservoir as another option of impostor computation and storage (parameter impostor_storage).
  • Regarding data sets, could you do/show these comparisons for some real data sets (e.g. digits, iris, faces, mnist)? There I would also want to see how/if the classification accuracy is affected with vs without RS.
  • I would name the class ReservoirSampler instead of ReservoirSample.

@gerner
Copy link
Author

gerner commented Feb 8, 2021

Sure, ReservoirSampler is a better name.

I like your idea about configuration. Are you suggesting making reservoir a choice for the same parameter? As in, list | sparse | reservoir from fastest/least memory efficient to slowest/most memory efficient? If so I would make reservoir mean use a sparse matrix and reservoir sampling. I ask because sparse and reservoir are functionally independent: you could do one without the other and vice versa. Personally I think conflating these is fine because it's a little subtle and if you are looking to save memory while sampling impostors then you almost certainly don't want a dense n x n distance matrix laying around.

Sure, I'll add a test that compares performance of the different choices for impostor storage/sampling.

@johny-c
Copy link
Owner

johny-c commented Feb 8, 2021

That's right, how impostors are sampled is independent of how they are stored. Therefore maybe there should be a separate parameter (e.g. impostor_sampler). Regarding the density of the matrix, I think that's mostly controlled by the max_impostors parameter.

@gerner
Copy link
Author

gerner commented Feb 8, 2021

here's a performance comparison on the iris dataset. Notice I picked a very small number of max_impostors to (hopefully) ensure we actually do sampling:

samplers = ['uniform', 'reservoir']
scores = {impostor_sampler: list() for impostor_sampler in samplers}
n_tries = 500

iris = datasets.load_iris()
perm = rng.permutation(iris.target.size)
iris_data = iris.data[perm]
iris_target = iris.target[perm]

for impostor_sampler in samplers:
    for i in tqdm(range(n_tries)):
        X_train, X_test, y_train, y_test =train_test_split(iris_data, iris_target)


        lmnn_mdl = lmnn.LargeMarginNearestNeighbor(
                impostor_sampler=impostor_sampler,
                max_impostors=5)

        lmnn_mdl.fit(X_train, y_train)
        knn = KNeighborsClassifier(n_neighbors=lmnn_mdl.n_neighbors_)
        LX_train = lmnn_mdl.transform(X_train)
        knn.fit(LX_train, y_train)

        LX_test = lmnn_mdl.transform(X_test)

        scores[impostor_sampler].append(knn.score(LX_test, y_test))

fig, ax = plt.subplots(figsize=(10,10))
for impostor_sampler in samplers:
    ax.hist(scores[impostor_sampler], bins=30, alpha=0.3, label=f'{impostor_sampler} mean={np.mean(scores[impostor_sampler]):.3f}')
ax.legend()
plt.show()

image

@gerner
Copy link
Author

gerner commented Feb 8, 2021

I made reservoir sampling optional, controlled by a parameter similar to impostor_store, with a default auto choice that uses the same heuristic as sparse. Setting both params to auto will get a sparse store and reservoir sampling. I moved the sampling into the method _find_impostors_blockwise which takes a new parameter to enable reservoir sampling.

@johny-c
Copy link
Owner

johny-c commented Feb 9, 2021

Thanks for the changes :) I think the RS has another advantage, namely that we can use one object for the whole loop over classes. Therefore I would create a separate _find_impostors_with_reservoir function. This would also make the _find_impostors_blockwise more readable (aka less if ... else clauses). I can take this over if you want, but it's gonna take some weeks until I have some free time. Some more comments:

  • Right now, you have a check for impostor_store being set to reservoir. I guess this shouldn't be there if we have the impostor_sampler param.
  • Would you run the same as you did for iris but keeping all other params to their defaults (e.g. max_impostors)?
  • Would you run the same for some of the larger / more "real world" data sets (faces, digits, mnist)?

@gerner
Copy link
Author

gerner commented Feb 9, 2021

You're a tough customer. Do I get a citation out of this? j/k

I think I see what you're saying, but it's kind of marginal, right? More than half the code will be identical. What if I make an alternative to ReservoirSampler, UniformSampler. The two are polymorphic so the only difference will be when we create an instance of ReservoirSampler vs UniformSampler. We get rid of the if/elses and abstract some of the code around extending arrays, checking length, doing sampling, etc. That makes more sense to me.

I set max_impostors low because otherwise I think there's no sampling with the iris dataset. Below are the results with default params.

image

Below are 100 trials on digits using default params other than impostor_sampler. Again I think there's little to no sampling happening.

image

Below are 100 trials on digits using max_impostors=50, to compare the effect of using numpy.RandomState.choice vs ReservoirSampler.

image

And for faces I did a single run of each, score was 0.96 for uniform and 1.0 for reservoir with default parameters. With max_impostors=50 scores were uniform=0.99 and reservoir=0.98.

@gerner
Copy link
Author

gerner commented Feb 9, 2021

I converted the uniform sampling approach to use a polymorphic class with ReservoirSampler. This cleaned up _find_impostors_blockwise and I think accomplishes your goal of readability without duplicating the actual interesting logic (about margin radii and such).

@gerner
Copy link
Author

gerner commented Feb 9, 2021

Oh, I also re-ran timing and memory tests and the updated uniform sampling implementation is comparable to the previous one.

@johny-c
Copy link
Owner

johny-c commented Feb 10, 2021

Haha, I know - I just want to make sure things will work as expected. Regarding citations, there is actually a PR to scikit-learn and a plan to submit this to metric-learn. So, if the RS goes in, I would certainly add your name.
The UniformSampler is a good idea and design. I would put these two then in a new file (impostor_sampling).
I'll go over the whole thing in a few weeks and merge if ready. In the meantime I can only spot a few typos in the docstrings. Thanks for your patience ;)

@gerner
Copy link
Author

gerner commented Feb 10, 2021

Are you suggesting creating a new file impostor_sampling.py to hold ReservoirSampler and UniformSampler? These aren't actually about impostors per se. They're more general than that which is why I put them into utils.py. If you think they should go in their own file, how about just sampling.py?

@johny-c
Copy link
Owner

johny-c commented Feb 10, 2021

Yes, I don't mind but I think impostor_sampling is more descriptive with respect to LMNN.

@gerner
Copy link
Author

gerner commented Feb 11, 2021

Done.

Let me know what else I can do to streamline the merge and publish back to pypi. I love to give back to open source and I've learned a little bit in this process, but mostly I want to get this into the hands of my colleagues without passing around a custom tarball :)

@johny-c
Copy link
Owner

johny-c commented Mar 27, 2021

Hi @gerner . I finally found some time to have a look at this. I updated the code a bit and I will leave some review comments in github. I think there are still issues that need fixing and some code that I'm not yet familiar with. One general comment is to have a look at the pull request checklist suggested by scikit-learn (https://scikit-learn.org/stable/developers/contributing.html#pull-request-checklist) so that your code does not have PEP errors and so on, when you push.

Also, I think I need to first abstract the impostor storage in a similar way as we did for impostor sampling. Then, we could perhaps use one Sampler object per iteration, which would make more sense algorithmically.

k = self.sample_size
if n > k:
ind = self.random_state.choice(n, k, replace=False)
return np.asarray(self._values, dtype=object)[ind]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the overhead here, when values contains both distances and indices? Maybe there is a more efficient way to sample from multiple arrays? We could also use the knowledge that there are always either 1 or 2 arrays used.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, this is not ideal, and you're right the root cause is needing to support sampling from parallel arrays of different types (just indicies or indicies and distances). I saw this as a bigger problem and didn't address any of it in favor of brevity.

# NOTE: At this point, the reservoir might still be not full

# general case: sample from latest_values
while self._next_i - offset < len(latest_values):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this while loop can be vectorized. I need to look at this in more depth, but I think a lot of time can be saved.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I agree, and this is where things are slower for the reservoir sampler. Related to this is the idea of pre-computing the sample indicies in blocks. I could easily do some of this (https://github.com/johny-c/pylmnn/pull/5/files/42d42ab02d310aa4f5a8d7b716ae2bb50396aa16#diff-adeb34b813966f09fdcba22cd080044cf631ad669d04a94a05f971dd51c868b0R131-R139), but my math of random distributions isn't quite good enough to vectorize all of it.

More concisely: I believe it's possible to create a vector of all the necessary elements from the geometric distribution for the given block we're extending from. I just don't know how.

k = self.sample_size
rng = self.random_state

latest_values = list(latest_values) # TODO: This seems unnecessary
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would make a copy even when not necessary. Also here we have to maybe exploit the fact that there are 1 or 2 arrays.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is here because the caller might be (is in the current impl) sending in an iterable, not necessarily a list which makes some of the operations a little trickier (e.g. random access, getting the length). I agree, there's a better solution and it's tied up in needing to support parallel np arrays of different types.

@gerner
Copy link
Author

gerner commented Apr 1, 2021

I get the feeling you're not just going to merge as is :)

seems like there's three problems:

  1. some stylistic violations (i.e. pep8 which I could solve by using flake8 as per https://scikit-learn.org/stable/developers/contributing.html#pull-request-checklist)
  2. the current approach to multiple parallel arrays is too clunky and has unnecessary overhead
  3. the sampling could be way faster if we can figure out how to vectorize it in the extend method

Anything else? In your mind, what's required for merge?

@johny-c
Copy link
Owner

johny-c commented Apr 1, 2021

Well, yes, I really don't want to break something that works up to now.

Regarding extra requirements:

  • run a comparison (in terms of accuracy, memory and runtime) on some of the larger data sets as suggested before.

This only makes sense if we have concluded on how the code should look like, but you may want to do it now already to catch any potential bugs in the implementation of RS and get a feeling of its "real-world" performance.

I implemented a storage class (see branch impostor_storage) to abstract some of that logic too. In the end, it would probably make sense to merge these two classes (ImpostorStore and ImpostorSampler) into one.

Some thoughts:

  • I'm rethinking about the way impostors are represented / stored. A triplet format would make sampling easier as we would only need to sample from one data structure. The obvious downside is that we would need to use and sample from python lists instead of numpy arrays, which would make things slower overall.
  • The find_impostors function is where most time is spent. Maybe there is a better way to integrate it with the rest of the algorithm, that would also alleviate some of the issues we face here.

@gerner
Copy link
Author

gerner commented Apr 1, 2021

My top post in this thread generates a few large-ish datasets (10k examples) with different properties and shows the memory runtime properties of both sampling strategies. A later post includes properties of the sampling comparing reservoir to uniform as well as running on the faces sklearn sample dataset. I've also been running my branch in my own production setting on a large data set with tens or hundreds of thousands of examples with hundreds of features each (which is why I need this work). So I have confidence in that regard. From my perspective the run-time trade-offs here are very small compared to just handling feature data before and after training. In terms of testing on more example datasets, did you have something else in mind?

Your thoughts about refactoring how impostors are chosen and stored seem reasonable. They also seem like a bigger change that might require a lot more work. That sounds like a good idea in concept, but what do you think about locking in this functionality first?

What if I clean up any pep8 violations and how the 1 vs 2 array cases are handled, especially avoiding making extra copies of data? Do you think we could merge that change?

@johny-c
Copy link
Owner

johny-c commented Apr 4, 2021

Yes, sorry I missed your post about faces and digits. Yes, if you tackle these issues, it should be fine. I just want then to make sure ImpostorSampler is consistent and works as before for the non-reservoir sampler case.

@gerner
Copy link
Author

gerner commented Apr 7, 2021

Thanks for the note, and agreed on not causing a regression.

I'm thinking of just special casing one param vs two. That will make the uniform sampling approach more similar to what was happening before (just with an extra method call or two instead of in-line). It'll probably take me a week or so to get time.

@johny-c
Copy link
Owner

johny-c commented Apr 8, 2021

Sure, I was thinking maybe it makes sense to make find_impostors_blockwise a generator method, so that the Sampler object, whether ReservoirSampler or other , is only created in the find_impostors method.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants