update murmur3 length / family to match current uses #236

willscott · 2021-10-14T00:44:19Z

0x22 is currently used widely through IPFS as representing the murmur3_64 length. This makes the table match reality, and allocates 0x24 for the 128 bit length.

table.csv

aschmahmann · 2021-10-14T00:50:33Z

table.csv

 murmur3-32,                     multihash,      0x23,           draft,
+murmur3-128,                    multihash,      0x24,           draft,


Do people want to use this? Should it be higher up in the table? I'm willing to wait on a user here asking for this before allocating a codec.

Also, murmur3-128 is not a defined thing. There is a 32bit and 64bit version of it. So we'd need a codec for each of the ones we care about. (e.g. murmur3-128-x86 and murmur3-128-x64)

we've got 128 exported as a symbol in a lot of places at this point. e.g. https://github.com/multiformats/go-multicodec/blob/master/code_table.go#L70 i'd prefer not removing it but have no problem with a larger number

I think you need to look at those as 0x22 which has "murmur3-128" here already. Let's just fix up 0x22 to have a better name, or fix up the uses of it such that it's clear what kind of sorcery is being invoked and not add new entries. Nobody is happy with murmur3 in this table anymore so lets not expand it, at least here in the 1 byte range.

table.csv

rvagg · 2021-10-14T01:16:25Z

I know this seems like bikeshedding, but murmur3 is a real problem so if we're going to be changing names (which is annoying to users), I suggest we do it properly. The existing MURMUR3_128 (etc.) symbols don't have to change and break anyone using it, we could add additional MURMUR3_X64_128 as well if we want to be consistent.

Co-authored-by: Rod Vagg <[email protected]>

rvagg

lgtm, but I'd like to leave this open for enough time to let eyes that normally see these to weigh in since this is a thorny one

vmx

I'm not deeply in this topic, but what @rvagg wrote sounds very reasonable, I agree.

warpfork

LGTM.

I definitely approve of the choice to change the name of the 0x22 code.

I'm indifferent to what we change the name to -- and trust that the discussion here covered that already, so 👍 .

Agree with @rvagg that I don't even see it as important to add a code for murmur3-128. But if we do, I also don't object to something in a multibyte range (and the PR now has it as 0x1022, so, alright, fine by me). Good by me to merge with that; also good by me to drop that new entry and merge.

Stebalien

although I'd mark it as stable

table.csv

resolved

aschmahmann · 2021-10-15T13:55:04Z

table.csv

@@ -21,7 +21,7 @@ keccak-512,                     multihash,      0x1d,           draft,
 blake3,                         multihash,      0x1e,           draft,     BLAKE3 has a default 32 byte output length. The maximum length is (2^64)-1 bytes.
 sha2-384,                       multihash,      0x20,           permanent, aka SHA-384; as specified by FIPS 180-4.
 dccp,                           multiaddr,      0x21,           draft,
-murmur3-128,                    multihash,      0x22,           draft,
+murmur3-x64-64,                 multihash,      0x22,           permanent, The first 64-bits of a murmur3-x64-128 - used for UnixFS directory sharding.


@rvagg @willscott I don't know what the rush was to have this PR merged, but as mentioned both this and the newer x64-128 codes are not particularly well specced especially in relation to each other.

mumur3 implementations are a rats nest of unspecified confusion, I described the situation in multiformats/go-multihash#135 and linked here. This lack of spec (i.e. relying on behaviors of one particular library) has also been described in ipld/specs#131 (comment).

A concrete example of this is what are the 0x22 and 0x1022 hashes for The quick brown fox jumps over the lazy dog? This thread (aappleby/smhasher#6) seems to indicate that it should be 6C1B07BC7BBC4BE347939AC4A93C437A however, as described in spaolacci/murmur3#21 (and verified by a simple test) the library proposed in multiformats/go-multihash#150 gives e34bbc7bbc071b6c7a433ca9c49a9347.

So which one is murmur3-x64-128 supposed to be and what is 0x22 supposed to be? It seems possible they could be different endianness/that x64-128 is not a longer version of 0x22.

@aschmahmann merging isn't ruling out additional tweaks here, but I'm not sure what that would involve because the murmur3 "ecosystem" is such a mess. In JS we've had compatibility problems too, some of them JS specific, but most recently we landed on https://cimi.io/murmurhash3js-revisited/, which for The quick brown fox jumps over the lazy dog gets e34bbc7bbc071b6c7a433ca9c49a9347 also.

I think the real problem relates to the conversion to a byte array and the lack of an agreed standard in doing so. If I run applyby/smhasher against The quick brown fox jumps over the lazy dog I get 16378391709484522348 + 8809951995912426311 as my 128 bit output. Writing those to a byte array in that order as little-endian I get 6c1b07bc7bbc4be347939ac4a93c437a, but writing them as big-endian I get e34bbc7bbc071b6c7a433ca9c49a9347. So the output integers are correct, but the lack of agreed encoding of those integers to a byte array seems to explain the descrepancy. It seems that in both the JS and Go implementations we're using that we're doing this as big-endian. But is that something we want to document here?

(I'll also note that aappleby/smhasher#6 has the erroneous output for Guava as 4cae51b5316602c01c7c5642843e5fe7, which doesn't figure in any of the above, it seems to be an actual error in that implementation).

But is that something we want to document here?

@rvagg my understanding is that generally the that a major point of the multicodec table is to be able to have different groups reuse and implement the same schemes and identify their data the same way such that they can build compatible systems. In the case of multihash I'd generally assume that we're specifying data of the form <hash code><hash len><hash bytes> and how a user could take in a set of bytes and compute the same multihash output.

As a result, it seems like knowing how to order the output bytes to create the codec seems important for reproducibility. Am I missing something?

In general I'd hope that we could have some criteria for most codec entries which is to be able to point people at a spec (even if it's a living one) for how to implement these things and what the human meaning behind any particular 0xABCD actually means.

My main objections to what was merged here were:

Labeling 0x1022 as having a code despite it being both unspecified and unrequested

It doesn't personally matter to me what we specify, but I'd hope the next unfortunate soul to come along here and try to implement this could figure out whether the spaolacci approach (i.e. big endian) or the https://github.com/phensley/python-smhasher (i.e. small endian) was the one they should use to produce compatible results

We should be trying to make things better over time not worse, so let's try and push towards specifying what is meant by a new code.

Labeling 0x22 as murmur3-x64-64 despite knowing that this was incompletely specified and could still lead to confusion.

So let's label what we're doing. We can throw it in the notes field if we don't want to add more to the label name, and let any future soul who wants murmur3-x64-64-little-endian claim that if they want it.

Brn2bern

@

update murmur3 length / family to match current uses

c229a2d

willscott requested a review from Stebalien October 14, 2021 00:45

aschmahmann previously requested changes Oct 14, 2021

View reviewed changes

2nd pass

676e0d2

rvagg reviewed Oct 14, 2021

View reviewed changes

table.csv Outdated Show resolved Hide resolved

rvagg reviewed Oct 14, 2021

View reviewed changes

table.csv Outdated Show resolved Hide resolved

Apply suggestions from code review

af6be4f

Co-authored-by: Rod Vagg <[email protected]>

rvagg approved these changes Oct 14, 2021

View reviewed changes

no commas in comments

3d0d269

rvagg approved these changes Oct 14, 2021

View reviewed changes

vmx approved these changes Oct 14, 2021

View reviewed changes

warpfork mentioned this pull request Oct 14, 2021

murmur3 support multiformats/go-multihash#150

Merged

warpfork approved these changes Oct 14, 2021

View reviewed changes

Stebalien approved these changes Oct 14, 2021

View reviewed changes

rvagg reviewed Oct 15, 2021

View reviewed changes

table.csv Outdated Show resolved Hide resolved

Update table.csv

04706db

rvagg merged commit 36789e0 into master Oct 15, 2021

rvagg deleted the murmur3 branch October 15, 2021 02:32

aschmahmann reviewed Oct 15, 2021

View reviewed changes

Brn2bern reviewed Oct 16, 2021

View reviewed changes

rvagg mentioned this pull request Nov 4, 2021

Options for adding docs and test fixtures for multicodec entries #240

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

update murmur3 length / family to match current uses #236

update murmur3 length / family to match current uses #236

willscott commented Oct 14, 2021 •

edited

Loading

aschmahmann Oct 14, 2021 •

edited

Loading

willscott Oct 14, 2021

rvagg Oct 14, 2021

rvagg commented Oct 14, 2021

rvagg left a comment

vmx left a comment

warpfork left a comment

Stebalien left a comment

aschmahmann Oct 15, 2021 •

edited

Loading

rvagg Oct 18, 2021 •

edited

Loading

rvagg Oct 18, 2021

aschmahmann Oct 18, 2021 •

edited

Loading

Brn2bern left a comment

		murmur3-32, multihash, 0x23, draft,
		murmur3-128, multihash, 0x24, draft,

update murmur3 length / family to match current uses #236

update murmur3 length / family to match current uses #236

Conversation

willscott commented Oct 14, 2021 • edited Loading

aschmahmann Oct 14, 2021 • edited Loading

Choose a reason for hiding this comment

willscott Oct 14, 2021

Choose a reason for hiding this comment

rvagg Oct 14, 2021

Choose a reason for hiding this comment

rvagg commented Oct 14, 2021

rvagg left a comment

Choose a reason for hiding this comment

vmx left a comment

Choose a reason for hiding this comment

warpfork left a comment

Choose a reason for hiding this comment

Stebalien left a comment

Choose a reason for hiding this comment

aschmahmann Oct 15, 2021 • edited Loading

Choose a reason for hiding this comment

rvagg Oct 18, 2021 • edited Loading

Choose a reason for hiding this comment

rvagg Oct 18, 2021

Choose a reason for hiding this comment

aschmahmann Oct 18, 2021 • edited Loading

Choose a reason for hiding this comment

Brn2bern left a comment

Choose a reason for hiding this comment

willscott commented Oct 14, 2021 •

edited

Loading

aschmahmann Oct 14, 2021 •

edited

Loading

aschmahmann Oct 15, 2021 •

edited

Loading

rvagg Oct 18, 2021 •

edited

Loading

aschmahmann Oct 18, 2021 •

edited

Loading