update murmur3 length / family to match current uses #236

Merged
merged 5 commits into from
Oct 15, 2021
3 changes: 2 additions & 1 deletion table.csv
@@ -21,7 +21,7 @@ keccak-512, multihash, 0x1d, draft,
blake3, multihash, 0x1e, draft, BLAKE3 has a default 32 byte output length. The maximum length is (2^64)-1 bytes.
sha2-384, multihash, 0x20, permanent, aka SHA-384; as specified by FIPS 180-4.
dccp, multiaddr, 0x21, draft,
-murmur3-128, multihash, 0x22, draft,
+murmur3-x64-64, multihash, 0x22, permanent, The first 64-bits of a murmur3-x64-128 - used for UnixFS directory sharding.

@aschmahmann aschmahmann Oct 15, 2021

@rvagg @willscott I don't know what the rush was to have this PR merged, but as mentioned, both this and the newer x64-128 codes are not particularly well specced, especially in relation to each other.

murmur3 implementations are a rat's nest of unspecified confusion. I described the situation in multiformats/go-multihash#135 and linked here. This lack of spec (i.e. relying on the behavior of one particular library) has also been described in ipld/specs#131 (comment).

A concrete example: what are the 0x22 and 0x1022 hashes for "The quick brown fox jumps over the lazy dog"? This thread (aappleby/smhasher#6) seems to indicate that it should be 6C1B07BC7BBC4BE347939AC4A93C437A; however, as described in spaolacci/murmur3#21 (and verified by a simple test), the library proposed in multiformats/go-multihash#150 gives e34bbc7bbc071b6c7a433ca9c49a9347.

So which one is murmur3-x64-128 supposed to be and what is 0x22 supposed to be? It seems possible they could be different endianness/that x64-128 is not a longer version of 0x22.

@rvagg rvagg Oct 18, 2021

@aschmahmann merging doesn't rule out additional tweaks here, but I'm not sure what that would involve because the murmur3 "ecosystem" is such a mess. In JS we've had compatibility problems too, some of them JS specific, but most recently we landed on https://cimi.io/murmurhash3js-revisited/, which for "The quick brown fox jumps over the lazy dog" also gives e34bbc7bbc071b6c7a433ca9c49a9347.

I think the real problem relates to the conversion to a byte array and the lack of an agreed standard for doing so. If I run aappleby/smhasher against "The quick brown fox jumps over the lazy dog" I get 16378391709484522348 + 8809951995912426311 as my 128-bit output. Writing those to a byte array in that order as little-endian I get 6c1b07bc7bbc4be347939ac4a93c437a, but writing them as big-endian I get e34bbc7bbc071b6c7a433ca9c49a9347. So the output integers are correct, but the lack of an agreed encoding of those integers to a byte array seems to explain the discrepancy. It seems that both the JS and Go implementations we're using do this as big-endian. But is that something we want to document here?
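The byte-order point can be checked without any murmur3 library at all. The sketch below starts from the two 64-bit integers quoted in the comment and only varies how they are serialized; the helper name `encode` is made up for illustration and is not part of any of the libraries mentioned.

```python
# The two 64-bit halves reported for the fox sentence in the comment above.
h1 = 16378391709484522348
h2 = 8809951995912426311


def encode(h1: int, h2: int, byteorder: str) -> bytes:
    """Serialize each 64-bit half with the given byte order, h1 first.

    `encode` is a hypothetical helper for this illustration only.
    """
    return h1.to_bytes(8, byteorder) + h2.to_bytes(8, byteorder)


little = encode(h1, h2, "little").hex()
big = encode(h1, h2, "big").hex()

print(little)  # 6c1b07bc7bbc4be347939ac4a93c437a
print(big)     # e34bbc7bbc071b6c7a433ca9c49a9347
```

Both hex strings seen in the thread fall out of the same pair of integers, so the disagreement is about serialization rather than about the hash computation itself.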

(I'll also note that aappleby/smhasher#6 lists the erroneous output 4cae51b5316602c01c7c5642843e5fe7 for Guava, which doesn't figure in any of the above; it seems to be an actual error in that implementation.)

@aschmahmann aschmahmann Oct 18, 2021

> But is that something we want to document here?

@rvagg my understanding is that a major point of the multicodec table is to let different groups reuse and implement the same schemes and identify their data the same way, so that they can build compatible systems. In the case of multihash I'd generally assume that we're specifying data of the form <hash code><hash len><hash bytes> and how a user could take in a set of bytes and compute the same multihash output.

As a result, knowing how to order the output bytes to create the codec seems important for reproducibility. Am I missing something?
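To make the reproducibility concern concrete, here is a minimal sketch of assembling <hash code><hash len><hash bytes> from the digests quoted earlier in this thread. It assumes the code and length are encoded as unsigned varints, and that the 0x22 digest is simply the first 8 bytes of the 128-bit output (as the table note says); the function names `uvarint` and `multihash` are hypothetical, not taken from any multiformats library.

```python
def uvarint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def multihash(code: int, digest: bytes) -> bytes:
    """Build <hash code><hash len><hash bytes> (hypothetical helper)."""
    return uvarint(code) + uvarint(len(digest)) + digest


# 128-bit digests for the fox sentence, as quoted in the thread.
big = bytes.fromhex("e34bbc7bbc071b6c7a433ca9c49a9347")
little = bytes.fromhex("6c1b07bc7bbc4be347939ac4a93c437a")

mh_128_be = multihash(0x1022, big).hex()
mh_64_be = multihash(0x22, big[:8]).hex()
mh_64_le = multihash(0x22, little[:8]).hex()

print(mh_128_be)  # a22010e34bbc7bbc071b6c7a433ca9c49a9347
print(mh_64_be)   # 2208e34bbc7bbc071b6c
print(mh_64_le)   # 22086c1b07bc7bbc4be3
```

The last two values carry the same code and length but different hash bytes, which is the ambiguity an implementer faces if the table entry does not state the byte order.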

In general I'd hope that we could have a criterion for most codec entries: being able to point people at a spec (even if it's a living one) that says how to implement these things and what any particular 0xABCD actually means.


My main objections to what was merged here were:

  1. Assigning a code to 0x1022 despite it being both unspecified and unrequested.
    • It doesn't matter to me personally what we specify, but I'd hope the next unfortunate soul who comes along and tries to implement this could figure out whether the spaolacci approach (i.e. big endian) or the https://github.com/phensley/python-smhasher approach (i.e. little endian) is the one to use to produce compatible results.
    • We should be trying to make things better over time, not worse, so let's try to push towards specifying what is meant by a new code.
  2. Labeling 0x22 as murmur3-x64-64 despite knowing that this was incompletely specified and could still lead to confusion.
    • So let's label what we're doing. We can throw it in the notes field if we don't want to add more to the label name, and let any future soul who wants murmur3-x64-64-little-endian claim that if they want it.

murmur3-32, multihash, 0x23, draft,
ip6, multiaddr, 0x29, permanent,
ip6zone, multiaddr, 0x2a, draft,
@@ -132,6 +132,7 @@ sha2-256-trunc254-padded, multihash, 0x1012, permanent, SHA2-
sha2-224, multihash, 0x1013, permanent, aka SHA-224; as specified by FIPS 180-4.
sha2-512-224, multihash, 0x1014, permanent, aka SHA-512/224; as specified by FIPS 180-4.
sha2-512-256, multihash, 0x1015, permanent, aka SHA-512/256; as specified by FIPS 180-4.
+murmur3-x64-128, multihash, 0x1022, draft,
ripemd-128, multihash, 0x1052, draft,
ripemd-160, multihash, 0x1053, draft,
ripemd-256, multihash, 0x1054, draft,