Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update murmur3 length / family to match current uses #236

Merged
merged 5 commits into from
Oct 15, 2021
Merged

update murmur3 length / family to match current uses #236

merged 5 commits into from
Oct 15, 2021

Conversation

willscott
Copy link
Contributor

@willscott willscott commented Oct 14, 2021

0x22 is currently used widely through IPFS as representing the murmur3_64 length. This makes the table match reality, and allocates 0x24 for the 128 bit length.

@willscott willscott requested a review from Stebalien October 14, 2021 00:45
table.csv Outdated Show resolved Hide resolved
table.csv Outdated
murmur3-32, multihash, 0x23, draft,
murmur3-128, multihash, 0x24, draft,
Copy link

@aschmahmann aschmahmann Oct 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do people want to use this? Should it be higher up in the table? I'm willing to wait on a user here asking for this before allocating a codec.

Also, murmur3-128 is not a defined thing. There is a 32bit and 64bit version of it. So we'd need a codec for each of the ones we care about. (e.g. murmur3-128-x86 and murmur3-128-x64)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we've got 128 exported as a symbol in a lot of places at this point. e.g. https://github.com/multiformats/go-multicodec/blob/master/code_table.go#L70 i'd prefer not removing it but have no problem with a larger number

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to look at those as 0x22 which has "murmur3-128" here already. Let's just fix up 0x22 to have a better name, or fix up the uses of it such that it's clear what kind of sorcery is being invoked and not add new entries. Nobody is happy with murmur3 in this table anymore so lets not expand it, at least here in the 1 byte range.

table.csv Outdated Show resolved Hide resolved
table.csv Outdated Show resolved Hide resolved
@rvagg
Copy link
Member

rvagg commented Oct 14, 2021

I know this seems like bikeshedding, but murmur3 is a real problem so if we're going to be changing names (which is annoying to users), I suggest we do it properly. The existing MURMUR3_128 (etc.) symbols don't have to change and break anyone using it, we could add additional MURMUR3_X64_128 as well if we want to be consistent.

Copy link
Member

@rvagg rvagg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, but I'd like to leave this open for enough time to let eyes that normally see these to weigh in since this is a thorny one

Copy link
Member

@vmx vmx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not deeply in this topic, but what @rvagg wrote sounds very reasonable, I agree.

Copy link
Contributor

@warpfork warpfork left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

I definitely approve of the choice to change the name of the 0x22 code.

I'm indifferent to what we change the name to -- and trust that the discussion here covered that already, so 👍 .

Agree with @rvagg that I don't even see it as important to add a code for murmur3-128. But if we do, I also don't object to something in a multibyte range (and the PR now has it as 0x1022, so, alright, fine by me). Good by me to merge with that; also good by me to drop that new entry and merge.

Copy link
Member

@Stebalien Stebalien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

although I'd mark it as stable

table.csv Outdated Show resolved Hide resolved
@rvagg rvagg merged commit 36789e0 into master Oct 15, 2021
@rvagg rvagg deleted the murmur3 branch October 15, 2021 02:32
@@ -21,7 +21,7 @@ keccak-512, multihash, 0x1d, draft,
blake3, multihash, 0x1e, draft, BLAKE3 has a default 32 byte output length. The maximum length is (2^64)-1 bytes.
sha2-384, multihash, 0x20, permanent, aka SHA-384; as specified by FIPS 180-4.
dccp, multiaddr, 0x21, draft,
murmur3-128, multihash, 0x22, draft,
murmur3-x64-64, multihash, 0x22, permanent, The first 64-bits of a murmur3-x64-128 - used for UnixFS directory sharding.
Copy link

@aschmahmann aschmahmann Oct 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rvagg @willscott I don't know what the rush was to have this PR merged, but as mentioned both this and the newer x64-128 codes are not particularly well specced especially in relation to each other.

mumur3 implementations are a rats nest of unspecified confusion, I described the situation in multiformats/go-multihash#135 and linked here. This lack of spec (i.e. relying on behaviors of one particular library) has also been described in ipld/specs#131 (comment).

A concrete example of this is what are the 0x22 and 0x1022 hashes for The quick brown fox jumps over the lazy dog? This thread (aappleby/smhasher#6) seems to indicate that it should be 6C1B07BC7BBC4BE347939AC4A93C437A however, as described in spaolacci/murmur3#21 (and verified by a simple test) the library proposed in multiformats/go-multihash#150 gives e34bbc7bbc071b6c7a433ca9c49a9347.

So which one is murmur3-x64-128 supposed to be and what is 0x22 supposed to be? It seems possible they could be different endianness/that x64-128 is not a longer version of 0x22.

Copy link
Member

@rvagg rvagg Oct 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aschmahmann merging isn't ruling out additional tweaks here, but I'm not sure what that would involve because the murmur3 "ecosystem" is such a mess. In JS we've had compatibility problems too, some of them JS specific, but most recently we landed on https://cimi.io/murmurhash3js-revisited/, which for The quick brown fox jumps over the lazy dog gets e34bbc7bbc071b6c7a433ca9c49a9347 also.

I think the real problem relates to the conversion to a byte array and the lack of an agreed standard in doing so. If I run applyby/smhasher against The quick brown fox jumps over the lazy dog I get 16378391709484522348 + 8809951995912426311 as my 128 bit output. Writing those to a byte array in that order as little-endian I get 6c1b07bc7bbc4be347939ac4a93c437a, but writing them as big-endian I get e34bbc7bbc071b6c7a433ca9c49a9347. So the output integers are correct, but the lack of agreed encoding of those integers to a byte array seems to explain the descrepancy. It seems that in both the JS and Go implementations we're using that we're doing this as big-endian. But is that something we want to document here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I'll also note that aappleby/smhasher#6 has the erroneous output for Guava as 4cae51b5316602c01c7c5642843e5fe7, which doesn't figure in any of the above, it seems to be an actual error in that implementation).

Copy link

@aschmahmann aschmahmann Oct 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But is that something we want to document here?

@rvagg my understanding is that generally the that a major point of the multicodec table is to be able to have different groups reuse and implement the same schemes and identify their data the same way such that they can build compatible systems. In the case of multihash I'd generally assume that we're specifying data of the form <hash code><hash len><hash bytes> and how a user could take in a set of bytes and compute the same multihash output.

As a result, it seems like knowing how to order the output bytes to create the codec seems important for reproducibility. Am I missing something?

In general I'd hope that we could have some criteria for most codec entries which is to be able to point people at a spec (even if it's a living one) for how to implement these things and what the human meaning behind any particular 0xABCD actually means.


My main objections to what was merged here were:

  1. Labeling 0x1022 as having a code despite it being both unspecified and unrequested
    • It doesn't personally matter to me what we specify, but I'd hope the next unfortunate soul to come along here and try to implement this could figure out whether the spaolacci approach (i.e. big endian) or the https://github.com/phensley/python-smhasher (i.e. small endian) was the one they should use to produce compatible results
    • We should be trying to make things better over time not worse, so let's try and push towards specifying what is meant by a new code.
  2. Labeling 0x22 as murmur3-x64-64 despite knowing that this was incompletely specified and could still lead to confusion.
    • So let's label what we're doing. We can throw it in the notes field if we don't want to add more to the label name, and let any future soul who wants murmur3-x64-64-little-endian claim that if they want it.

Copy link

@Brn2bern Brn2bern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants