-
Notifications
You must be signed in to change notification settings - Fork 208
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
update murmur3 length / family to match current uses #236
Conversation
table.csv
Outdated
murmur3-32, multihash, 0x23, draft, | ||
murmur3-128, multihash, 0x24, draft, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do people want to use this? Should it be higher up in the table? I'm willing to wait on a user here asking for this before allocating a codec.
Also, murmur3-128 is not a defined thing. There is a 32bit and 64bit version of it. So we'd need a codec for each of the ones we care about. (e.g. murmur3-128-x86 and murmur3-128-x64)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we've got 128 exported as a symbol in a lot of places at this point. e.g. https://github.com/multiformats/go-multicodec/blob/master/code_table.go#L70 i'd prefer not removing it but have no problem with a larger number
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you need to look at those as 0x22
which has "murmur3-128" here already. Let's just fix up 0x22
to have a better name, or fix up the uses of it such that it's clear what kind of sorcery is being invoked and not add new entries. Nobody is happy with murmur3 in this table anymore so lets not expand it, at least here in the 1 byte range.
I know this seems like bikeshedding, but murmur3 is a real problem so if we're going to be changing names (which is annoying to users), I suggest we do it properly. The existing |
Co-authored-by: Rod Vagg <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, but I'd like to leave this open for enough time to let eyes that normally see these to weigh in since this is a thorny one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not deeply in this topic, but what @rvagg wrote sounds very reasonable, I agree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
I definitely approve of the choice to change the name of the 0x22 code.
I'm indifferent to what we change the name to -- and trust that the discussion here covered that already, so 👍 .
Agree with @rvagg that I don't even see it as important to add a code for murmur3-128. But if we do, I also don't object to something in a multibyte range (and the PR now has it as 0x1022, so, alright, fine by me). Good by me to merge with that; also good by me to drop that new entry and merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
although I'd mark it as stable
@@ -21,7 +21,7 @@ keccak-512, multihash, 0x1d, draft, | |||
blake3, multihash, 0x1e, draft, BLAKE3 has a default 32 byte output length. The maximum length is (2^64)-1 bytes. | |||
sha2-384, multihash, 0x20, permanent, aka SHA-384; as specified by FIPS 180-4. | |||
dccp, multiaddr, 0x21, draft, | |||
murmur3-128, multihash, 0x22, draft, | |||
murmur3-x64-64, multihash, 0x22, permanent, The first 64-bits of a murmur3-x64-128 - used for UnixFS directory sharding. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rvagg @willscott I don't know what the rush was to have this PR merged, but as mentioned both this and the newer x64-128
codes are not particularly well specced especially in relation to each other.
mumur3 implementations are a rats nest of unspecified confusion, I described the situation in multiformats/go-multihash#135 and linked here. This lack of spec (i.e. relying on behaviors of one particular library) has also been described in ipld/specs#131 (comment).
A concrete example of this is what are the 0x22
and 0x1022
hashes for The quick brown fox jumps over the lazy dog
? This thread (aappleby/smhasher#6) seems to indicate that it should be 6C1B07BC7BBC4BE347939AC4A93C437A
however, as described in spaolacci/murmur3#21 (and verified by a simple test) the library proposed in multiformats/go-multihash#150 gives e34bbc7bbc071b6c7a433ca9c49a9347
.
So which one is murmur3-x64-128
supposed to be and what is 0x22 supposed to be? It seems possible they could be different endianness/that x64-128 is not a longer version of 0x22.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aschmahmann merging isn't ruling out additional tweaks here, but I'm not sure what that would involve because the murmur3 "ecosystem" is such a mess. In JS we've had compatibility problems too, some of them JS specific, but most recently we landed on https://cimi.io/murmurhash3js-revisited/, which for The quick brown fox jumps over the lazy dog
gets e34bbc7bbc071b6c7a433ca9c49a9347
also.
I think the real problem relates to the conversion to a byte array and the lack of an agreed standard in doing so. If I run applyby/smhasher against The quick brown fox jumps over the lazy dog
I get 16378391709484522348
+ 8809951995912426311
as my 128 bit output. Writing those to a byte array in that order as little-endian I get 6c1b07bc7bbc4be347939ac4a93c437a
, but writing them as big-endian I get e34bbc7bbc071b6c7a433ca9c49a9347
. So the output integers are correct, but the lack of agreed encoding of those integers to a byte array seems to explain the descrepancy. It seems that in both the JS and Go implementations we're using that we're doing this as big-endian. But is that something we want to document here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I'll also note that aappleby/smhasher#6 has the erroneous output for Guava as 4cae51b5316602c01c7c5642843e5fe7
, which doesn't figure in any of the above, it seems to be an actual error in that implementation).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But is that something we want to document here?
@rvagg my understanding is that generally the that a major point of the multicodec table is to be able to have different groups reuse and implement the same schemes and identify their data the same way such that they can build compatible systems. In the case of multihash I'd generally assume that we're specifying data of the form <hash code><hash len><hash bytes>
and how a user could take in a set of bytes and compute the same multihash output.
As a result, it seems like knowing how to order the output bytes to create the codec seems important for reproducibility. Am I missing something?
In general I'd hope that we could have some criteria for most codec entries which is to be able to point people at a spec (even if it's a living one) for how to implement these things and what the human meaning behind any particular 0xABCD actually means.
My main objections to what was merged here were:
- Labeling 0x1022 as having a code despite it being both unspecified and unrequested
- It doesn't personally matter to me what we specify, but I'd hope the next unfortunate soul to come along here and try to implement this could figure out whether the spaolacci approach (i.e. big endian) or the https://github.com/phensley/python-smhasher (i.e. small endian) was the one they should use to produce compatible results
- We should be trying to make things better over time not worse, so let's try and push towards specifying what is meant by a new code.
- Labeling 0x22 as murmur3-x64-64 despite knowing that this was incompletely specified and could still lead to confusion.
- So let's label what we're doing. We can throw it in the notes field if we don't want to add more to the label name, and let any future soul who wants murmur3-x64-64-little-endian claim that if they want it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@
0x22 is currently used widely through IPFS as representing the
murmur3_64
length. This makes the table match reality, and allocates 0x24 for the 128 bit length.