Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for run containers #12

Open
lemire opened this issue Sep 19, 2015 · 14 comments
Open

Add support for run containers #12

lemire opened this issue Sep 19, 2015 · 14 comments

Comments

@lemire
Copy link
Member

lemire commented Sep 19, 2015

As of version 0.5, the java version of Roaring bitmaps supports run containers...

https://github.com/lemire/RoaringBitmap/blob/master/src/main/java/org/roaringbitmap/RunContainer.java

These can be created by the "runOptimize" function...

https://github.com/lemire/RoaringBitmap/blob/master/src/main/java/org/roaringbitmap/RoaringBitmap.java#L1286

These containers can improve the compression ratio in some cases.

@josephglanville
Copy link
Contributor

I would like to take this on, run containers provide very good wins for my workload.

I propose adding a new Vec based container that stores intervals, either as pair tuples or as a struct:

struct Interval {
  start: u16,
  end: u16,
}

Run(Vec<Interval>)

Not 100% sure (will verify) but I think the struct takes the same space as the tuple (u16, u16).

I think storing start, end will result in slightly nicer code vs storing start, run_length.
Serialisation will of course follow the format specification.

@Nemo157
Copy link
Collaborator

Nemo157 commented Mar 24, 2020

@josephglanville sounds good, let me know if you have any questions.

@saik0
Copy link
Contributor

saik0 commented Feb 14, 2022

Picking this up.

I'm not clear what equality / cmp semantic we want for roaring bitmaps.

What do the other lang roaring libs do?

let input: RoaringBitmap = (0..100).collect();

// input is not run compressed
assert!(!input.remove_run_compression())

let mut compressed = input.clone();

// compressed is run compressed
assert!(input.run_compress())

// Should this assertion succeed or panic?
assert!(input == compressed)

@saik0
Copy link
Contributor

saik0 commented Feb 20, 2022

Another problem I'm running into when implementing this:

Most of the integration tests make assumptions about the implementation details of roaring's containers. The optimal container for all of the following are run containers. Adding run containers is going to require a large refactoring of our existing tests.

tests/union_with.rs

#[test]
fn array_and_bitmap() {
    let mut bitmap1 = (0..2000).collect::<RoaringBitmap>();
    let bitmap2 = (1000..8000).collect::<RoaringBitmap>();
    let bitmap3 = (0..8000).collect::<RoaringBitmap>();

    bitmap1 |= bitmap2;

    assert_eq!(bitmap1, bitmap3);
}

@Kerollmops
Copy link
Member

Kerollmops commented Feb 22, 2022

Do you think that it could be a good way to fix that by introducing an iterator type that wraps another iterator i.e. 100..10000 with some density parameter i.e. Density::dense(0..1000), Density::sparse(0..1000)?

Or should we just use a set of bitmaps from the official datasets?

@saik0
Copy link
Contributor

saik0 commented Feb 24, 2022

I think the underlying problem is more is more fundamental. Integration tests are supposed to be black box tests. These have knowledge of implementation details. Containers are non-public, thus these tests should have no knowledge of container types. IMO: They should be moved to unit tests.

@gsson
Copy link

gsson commented Jul 27, 2022

Just as an FYI and to provide a possible reason for this to happen :)

I ran into this while trying to deserialise bitmaps generated by the Java version (which failed spectacularly). For my use-case the serialization/deserialization interoperability is more important than being able to generate run containers.

I tried the https://github.com/Kerollmops/roaring-rs/tree/run-containers branch which seems to work, but it's quite far behind master. I tried to make it up to date, but there had been some moving around of code that I ran out of time trying to fix.

Being able to easily compile to wasm is also important to me, which is why I wanted to use the native Rust version.

@aersam
Copy link

aersam commented Jul 7, 2023

Hi there
Any chance to get this running? I'm trying to use roaring-rs to decode deltalake's deletion vectors and this is the last blocker I have

@Kerollmops
Copy link
Member

Kerollmops commented Jul 7, 2023

Hey @aersam,

Any chance to get this running? I'm trying to use roaring-rs to decode deltalake's deletion vectors and this is the last blocker I have

We worked on that but didn't finish the job. However, it could be really simple to accept the running containers and convert them into either array or bitmap containers when deserializing 🤔

@aersam
Copy link

aersam commented Jul 7, 2023

well, as long as I can read them I don't care about implementation details - especially write is not important for me (at least for some time) . Others might have different requirements, but I think reading those would be a first step

@Kerollmops
Copy link
Member

Kerollmops commented Jul 7, 2023

Hey @aersam,

Would it be possible for you to try this PR in your project and tell me that everything works fine, please? This PR makes it possible to read bitmaps with run containers, don't do any operations as it converts them into array or bitmap containers.

@aersam
Copy link

aersam commented Jul 7, 2023

It compiles and it runs, to test if the indexes are correct is a bit more complicated, but they seem very reasonable
I think I can properly test the indexes on monday, if you want

@Kerollmops
Copy link
Member

Kerollmops commented Jul 7, 2023

I think I can properly test the indexes on monday, if you want

That would be great, I added a test that makes sure that bitmaps are correctly deserialized but it would be better with more tests in the wild.

@aersam
Copy link

aersam commented Jul 10, 2023

I can confirm that it properly works, I checked it against the java implementation and all indexes are equal

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants