-
Notifications
You must be signed in to change notification settings - Fork 233
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
IPIP-0421: HTTP Delegated Routing Reader Privacy Upgrade #421
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for submitting this @ischasny.
I'm OoO for the rest of the week, but dropping some quick questions after very rushed first pass (fine to link to prior discussions, if these were already asked before).
- **`MH`** is the [Multihash](https://github.com/multiformats/multihash) contained in a `CID`. It corresponds to the | ||
digest of a hash function over some content. `MH` is represented as a 32-byte array. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean CIDs with longer hash functions are truncated at 32-byte mark?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the 32-byte length as indeed it can be longer.
Co-authored-by: Marcin Rataj <[email protected]>
Co-authored-by: Marcin Rataj <[email protected]>
LGTM! |
* Refine paragraphs for better readability. * Change section on router selection based on code, since from multihash code alone we cannot determine wheterh whether the request is encrypted or not. * Update alternatives section to explain how the IPIP can be enhanced with OHTTP and Tor.
Refine routing specification and add byte frame diagram to clearly illustrate the content of SALT values.
|
||
#### Backwards Compatibility | ||
|
||
Users will need to deliberately activate Reader Privacy on their nodes. A new flag could be introduced into IPFS implementations such as Kubo's HTTP Delegated Content Router configuration to streamline this process. Users on older nodes can continue using the existing API and switch on Reader Privacy later. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd hope this doesn't need to be the case in an application that has some IPFS smarts (rather than a simple HTTP client). If enough features are expressed through something like #388 then the client should be able to have plausible defaults here (e.g. if my delegated router supports IPNI + DHT, but only IPNI has double-hashing support and the client can run its own DHT client it could choose to send double-hashed requests to the delegated router for IPNI and do the DHT lookups itself).
Obviously some clients will still offer configurability (e.g. would you rather ask the delegated router to do DHT lookups for you in cleartext, or not do them at all) but having reasonable default behavior should be possible.
|
||
All salts below are 64-bytes long and represent a string padded with `\x00`. | ||
|
||
- `SALT_DOUBLEHASH`: The string value `CR_DOUBLEHASH`, where each if the 13 characters are represented by their byte value. The remainder of the 64 bytes is filled with null bytes represented by `\x00`. This results in 51 null bytes after the `CR_DOUBLEHASH` string. The following illustrates its corresponding byte frame diagram: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- `SALT_DOUBLEHASH`: The string value `CR_DOUBLEHASH`, where each if the 13 characters are represented by their byte value. The remainder of the 64 bytes is filled with null bytes represented by `\x00`. This results in 51 null bytes after the `CR_DOUBLEHASH` string. The following illustrates its corresponding byte frame diagram: | |
- `SALT_DOUBLEHASH`: The string value `CR_DOUBLEHASH`, where each of the 13 characters are represented by their byte value. The remainder of the 64 bytes is filled with null bytes represented by `\x00`. This results in 51 null bytes after the `CR_DOUBLEHASH` string. The following illustrates its corresponding byte frame diagram: |
43 52 5F 44 4F 55 42 4C 45 48 41 53 48 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ||
``` | ||
|
||
- `SALT_ENCRYPTIONKEY`: The string value `CR_ENCRYPTIONKEY`, where each if the 15 characters are represented by their byte value. The remainder of the 64 bytes is filled with null bytes represented by `\x00`. This results in 49 null bytes after the `CR_ENCRYPTIONKEY` string. The following illustrates its corresponding byte frame diagram: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- `SALT_ENCRYPTIONKEY`: The string value `CR_ENCRYPTIONKEY`, where each if the 15 characters are represented by their byte value. The remainder of the 64 bytes is filled with null bytes represented by `\x00`. This results in 49 null bytes after the `CR_ENCRYPTIONKEY` string. The following illustrates its corresponding byte frame diagram: | |
- `SALT_ENCRYPTIONKEY`: The string value `CR_ENCRYPTIONKEY`, where each of the 16 characters are represented by their byte value. The remainder of the 64 bytes is filled with null bytes represented by `\x00`. This results in 48 null bytes after the `CR_ENCRYPTIONKEY` string. The following illustrates its corresponding byte frame diagram: |
43 52 5F 45 4E 43 52 59 50 54 49 4F 4E 4B 45 59 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ||
``` | ||
|
||
These magic values are utilized to compute distinct digests from identical values for varying purposes. For instance, a hash of a Multihash employed for lookups should differ from the one used for key derivation, despite originating from the same value. To achieve this, the Multihash is concatenated with different magic values before applying the hash function: `SALT_DOUBLEHASH` for lookups and `SALT_ENCRYPTIONKEY` for key derivation as elaborated in the `Glossary`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not strictly needed, but might be nice to explain a bit why the hashes should be different for when people coming looking at this later.
- **`CID`** stands for [Content IDentifier](https://github.com/multiformats/cid). | ||
- **`MH`** refers to the [Multihash](https://github.com/multiformats/multihash) contained in a `CID`. It corresponds to the hash function's digest over certain content. | ||
- **`HASH2`** is a second hash over the multihash. Second Hashes must follow the `Multihash` format with `SHA2_256` codec. The digest must be calculated as `hash(SALT_DOUBLEHASH || MH)`. | ||
- **`ProviderRecord`** is a JSON with Provider Record as described in the [HTTP Delegated Routing Specification](http-routing-v1.md). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
- **`ProviderRecord`** is a JSON with Provider Record as described in the [HTTP Delegated Routing Specification](http-routing-v1.md). | |
- **`ProviderRecord`** is a JSON object with Provider Record as described in the [HTTP Delegated Routing Specification](http-routing-v1.md). |
|
||
- `EncProviderRecordKeys` is a list of base64 encoded `EncProviderRecordKey`; | ||
|
||
#### `GET /routing/v1/encrypted/metadata/{HashProviderRecordKey}` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question about encoding as for HASH2
- **`ProviderRecord`** is a JSON with Provider Record as described in the [HTTP Delegated Routing Specification](http-routing-v1.md). | ||
- **`ProviderRecordKey`** is a concatenation of `peerID || contextID`. Explicit encoding lengths are unnecessary as they are inherently encoded as part of the multihash format. Max `contextID` length is 64 bytes. | ||
- **`EncProviderRecordKey`** is `Nonce || enc(deriveKey(multihash), Nonce, ProviderRecordKey)`. Max `EncProviderRecordKey` is 200 bytes. | ||
- **`HashProviderRecordKey`** is a hash over `ProviderRecordKey`, calculated as `hash(SALT_DOUBLEHASH || ProviderRecordKey)`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a SHA256 multihash as well?
- **`CID`** stands for [Content IDentifier](https://github.com/multiformats/cid). | ||
- **`MH`** refers to the [Multihash](https://github.com/multiformats/multihash) contained in a `CID`. It corresponds to the hash function's digest over certain content. | ||
- **`HASH2`** is a second hash over the multihash. Second Hashes must follow the `Multihash` format with `SHA2_256` codec. The digest must be calculated as `hash(SALT_DOUBLEHASH || MH)`. | ||
- **`ProviderRecord`** is a JSON with Provider Record as described in the [HTTP Delegated Routing Specification](http-routing-v1.md). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it? The routing-v1 spec allows for opaque blobs in the provider record. Where's the line between "metadata" and "provider record" here?
```json | ||
{ | ||
"EncProviderRecordKeys": [ | ||
"EBxdYDhd.....", | ||
"IOknr9DK....." | ||
] | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be helpful/illustrative for the encrypted data to also show what it's expected the unencrypted form would look like.
|
||
### Notes | ||
|
||
Assembling a full `ProviderRecord` from the encrypted data requires multiple server roundtrips. The first fetches a list of `EncProviderRecordKey`s, followed by one for each `EncProviderRecordKey` to retrieve `EncMetadata`. To minimize the number of roundtrips to one, the client implementation should use the local libp2p peerstore for multiaddress discovery and [libp2p multistream select](https://github.com/multiformats/multistream-select) for protocol negotiation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@masih we've talked about this before and it's more exploratory, but I'm noting here to make it public and see if folks have any thoughts about how this might impact the interface/API here in the future.
Note: This is not a "please rewrite everything" request.
How well this works is a function of how important the metadata is to performing a useful retrieval, and how important the metadata is depends on the distribution of information between the "ProviderRecord" and the "(ProviderRecord)Metadata".
IIUC the reason it's implemented this way is to keep data storage in routing backends like IPNI from needing to store the same data but encrypted many many times (i.e. once per multihash advertised).
At the extremes we have:
- A lot of space could be saved by making
ProviderRecord
just contain a small pointer andMetadata
contain all the actual provider record information (e.g. peerID, multiaddrs, protocols, ...). This means two round-trips unless the client already has the pointer information locally (e.g. if the pointer was a peerID, then having the multiaddrs, etc. locally, and if the pointer was some system-specific unique-ID then having that cached from a prior lookup). Also, unless there's some aggregation service/proxy it allows correlation between many different requests that use the same metadata (not necessarily a big deal here). - The second round-trip could disappear if we store all the information in the
ProviderRecord
portion. However, this means encrypting all the data for every advertised multihash
I could situationally see reasons to shuffle data between these two, depending on things like:
- how reusable the metadata is
- how frequently the metadata information is to be cached
- cost models for routing systems
...
Two areas where I could see the extremes being in use:
- Save storage by just returning the target's "identifier": e.g. libp2p peerID or an unauthenticated HTTP+libp2p URL (that has .well_known for protocol negotiation) since peer routing makes sense to be separable in libp2p, and protocol negotiation to be separate for unauthenticated HTTP + libp2p.
- Save round-trips by returning all the information: e.g. a webseeds-like advertisement that points to an outboard blake3 HTTP URL and an HTTP URL for the data (data could be a separate advertisement)
This makes me wonder if there's a better way to do this. For example:
encrypted/providers
returns a (JSON) blob mimicking the routing-v1 results- It contains some information indicating if there's metadata and/or what might be in the metadata
encrypted/metadata
provides the metadata
This could allow systems like IPNI to optimize data layouts on ingestion and have some flexibility without breaking downstream clients.
CC @gammazero @masih @guillaumemichel @lidel