MSC3468: MXC to Hashes #3468
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,197 @@ | ||
# MSC3468: MXCs to Hashes | ||
|
||
Currently, Matrix media/content repositories work with an MXC-to-blob mapping, fetching the media | ||
from the domain embedded in the MXC to present it to the user. | ||
|
||
However, this becomes a problem when media retention, redaction, and resiliency come into play: | ||
the single MXC URI becomes a point of failure once the backing server retracts it, either | ||
deliberately (the aforementioned redaction) or accidentally (via a server reset or loss of the backing media). | ||
|
||
This is at odds with how MXCs are used in Matrix today: much like Discord media URLs, they are treated as | ||
immutable and always online, and links are copied and reused across rooms. | ||
|
||
## Proposal | ||
|
||
I propose reworking MXCs into pointers to hashes. | ||
Hashes cause problems when we want to delete media: because media is referenced only by itself without the context of an event, we need a unique identifier to allow users to delete their uploaded copy of the media. This further plays into terms of service stuff where typically the user has the intellectual property rights of their upload, which may not be the case in a shared identifier. For further context, matrix-media-repo originally used hashes as identifiers but quickly changed away from that to maintain those intellectual property rights as well as ensuring that in the future it will be possible for people to delete their own uploads.

Hashes can be garbage-collected once no locally-known MXC points to them. In that respect, they only serve as a performance detail, to make sure multiple MXCs don't duplicate the same data, and/or also duplicate the same data in transit. I don't exactly understand what you mean with property rights in the context of practicality: what's to stop someone from downloading and uploading the media behind an MXC from another server? In that respect, you have the exact same operation. Users can still delete media IDs; only if their media has been copied somewhere, the underlying hash may not be garbage collected on other servers. I think that's indeed a problem when thinking about property rights and copyright. However, I think that it practically makes absolutely no difference, as media is copied, cached, and downloaded in the exact same fashion across the federation. If anything, this'd give more tools to police media, as media can then be banned by its hash on local servers.

And shared moderation lists can then be used to propagate bans across multiple servers. If you want to say that shared identifiers, like hashes, aren't an option because of copyright issues, then that's practically not enforceable, and thus moot, as MXCs already act as such a 'shared identifier', with servers being able to query media from a proxy server by an MXC, returning its local cache.

I don't really understand the purpose of the hash, especially on the client-server side. Can you explain why a client would care about the hash of a file? Also, it seems to me like this MSC is proposing two independent things: exposing the hash of an mxc:// url, and allowing for cloning of media. I don't really see them as being related.

This MSC formally transforms MXCs into aliases to content hashes, and the clone operation just copies the hash in a performant manner. I think that one would not make sense without the other. (Why only an MSC to change MXCs to being hash-based? What's the purpose behind that? And if I'd propose a clone endpoint only, then it'd be of dubious utility without it being low-cost here as it is.)

Clients might care about hashes for moderation, de-duplication, or debugging purposes. A bot like mjolnir can submit to ban a hash on a shared list, as a "known" bad image, so not even cloning can propagate it. The clone endpoint exists then to allow clients to easily reference media under a new MXC when forwarding messages to new rooms, or for whatever other reason, to preserve the underlying media for longer than the original URI would have.

I don't understand how the clone operation can just copy a hash. Surely if you want to clone a file from a remote server, you'd want to copy the whole file locally. And if you're cloning a local file, then I don't think that anyone should care what mechanism you're using internally to deduplicate.

So, it might be useful for clients to be able to easily get a hash for a file, but I don't think that they need to care about the hash used internally by the server.

The file is already cloned, but this is also about easy deduplication and resiliency: as files don't have to be copied and uploaded/downloaded with a whole new "ID" every time they are cloned, it reduces federation traffic and the disk space needed, with the benefit of added resiliency (if the server the file was originally uploaded from goes down).

The "hash" here is also the identifier, so you're essentially asking the server "hey, that namespaced identifier? what is the underlying shared identifier for that?", in this case that hash, and it might be interesting to ban or tag that shared identifier for moderation purposes, and/or publish them onto public warn lists, for stuff like known abusive material (if it is entirely unmodified). Also, utilising could have the benefit of deduplicating even on download->reupload, if there is no pre-processing on the client or server-side. Say, for example, Alice uploads a file on server A, and then Bob downloads it on server B, Bob then downloads the file, and posts it to a public list. Now, Charlie likes the file, and uploads it to their own server C, in a room that is shared with servers A and B. Now A and B only have to ask C what hash the file is, and if it is one they already have locally, then they can serve from disk, instead of downloading the file separately from C.

Right, but if you're cloning a file that your server has already copied, then it doesn't have to refetch, but nobody should care what method it's using to de-duplicate. That was what my second sentence was trying to get at.

My main issue here is that in this MSC, you're saying that each file will be identified by one single hash, but we don't have to just stick with one single hash per file, and it doesn't have to be tied to how the media repository does deduplication. I'm not saying that hashes are a bad idea; I'm saying that we don't need to have a concept of "what is the hash for this file?" Instead, we could just allow users/servers to ask, for example, "What is the sha256 hash for this file", or "What is the sha512 hash for this file", or whatever other algorithm the server supports, and the server that's being asked could calculate the hash on-demand, or it could store it in advance, or do whatever. The requester shouldn't really care how the server does it internally; all it really needs to care about is whether the server supports the hash algorithm that we want.

I already tackle this in the proposal (see

Alright, good point, but I'm just saying that it's easiest for the server to simply copy the hash received from the other server onto a new local MXC. However, in conjunction with the first response, I think that I should relax that wording, as servers might "know about" a hash (let's say MD5), but cannot resolve a hash with it, so it cannot download the remote file, perform a hash, and link an MXC with that hash. In the same vein, it cannot "trust" the remote server's hash to a degree (or it shouldn't, anyways).

So, what about the following: the server internally represents an MXC with one or multiple hashes, with which it resolves and verifies the file (implementations could have a "master" hash with which it can easily key files, but those are details). Other servers can request hashes (one "requested" type, and multiple "understood" types), and the server must return the requested hash type (and "understood" ones if it has them cached locally). Then the server can fetch the file via this hash+type, and verify locally. The benefit of all of this is that it allows describing the "required-to-provide-on-demand" hashes (sha256 and sha512 for now), which an implementation must always support, while it allows them to experiment or work with additional hash-types.

This is probably a bit of overengineering, and I probably have to size it down, but the basic idea of a file represented by multiple hashes, and an MXC linking to those multiple hashes, would be the core of the proposal. This could potentially also (accidentally) solve another problem: collision, where a paranoid server might verify the file via the multiple hash-types before serving it to the user. (I'm increasingly waning off of multihash by thinking about this though, it's much easier and much more expansive to re-invent it as a tuple of

It may be, but I'd consider it an implementation detail.

I don't think we need an endpoint to get a file by its hash. I think all we need is an endpoint to get the hash for a file, given the hash algorithm. Then, when a client/server sees an This avoids the problem of server A using internally sha256 for deduplication, but server B not trusting sha256 and wanting to use sha3 instead, since the internal hash algorithm isn't exposed at all.

Alright, I think we've come to a conclusion then;

Might as well add sha512 as an optional hash too. Or maybe required? I think any library that gives you sha256 will also give you sha512.
||
|
||
This has the added benefit of decoupling alias pointers (such as the MXC) from the underlying media. | ||
|
||
Alongside this change, I also propose an additional client-server endpoint which can quickly "clone" | ||
an MXC. The server does this by looking up the MXC's hash | ||
and then creating a new MXC that also references that hash. | ||
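
The clone-by-pointer idea can be illustrated with a minimal in-memory sketch. Everything here is an assumption made for illustration: the `MediaStore` class, its method names, the use of SHA-256, and the random media IDs are hypothetical and are not part of this proposal or of any existing media repository.

```python
import hashlib
import secrets


class MediaStore:
    """Toy in-memory media repository keyed by content hash (hypothetical)."""

    def __init__(self, server_name: str) -> None:
        self.server_name = server_name
        self.blobs: dict[str, bytes] = {}       # hex hash -> media bytes
        self.mxc_to_hash: dict[str, str] = {}   # MXC URI -> hex hash

    def _new_mxc(self, digest: str) -> str:
        # Mint a fresh local MXC that points at an existing hash.
        mxc = f"mxc://{self.server_name}/{secrets.token_urlsafe(16)}"
        self.mxc_to_hash[mxc] = digest
        return mxc

    def upload(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)  # identical uploads share one blob
        return self._new_mxc(digest)

    def clone(self, mxc: str) -> str:
        # Cloning copies only the pointer, never the blob itself.
        return self._new_mxc(self.mxc_to_hash[mxc])

    def delete(self, mxc: str) -> None:
        digest = self.mxc_to_hash.pop(mxc)
        # Garbage-collect the blob once no local MXC points to it.
        if digest not in self.mxc_to_hash.values():
            del self.blobs[digest]
```

In this sketch, deleting one MXC leaves the blob in place for as long as another MXC still references the same hash, which mirrors the garbage-collection behaviour described in the review discussion above.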
|
||
The client-server content API would expose a method for the client to retrieve the hash of a | ||
particular MXC, alongside the aforementioned method to clone it. | ||
|
||
The server-server content API would add dedicated methods for fetching the hash for an MXC and | ||
for fetching the media for a hash. | ||
|
||
### Specification | ||
|
||
#### Client-Server | ||
|
||
This proposal adds the following two methods to the client-server (CS) API: | ||
|
||
``` | ||
POST _matrix/media/v1/clone/{serverName}/{mediaId} | ||
|
||
Rate-limited: Yes | ||
Authentication: Yes | ||
|
||
Responses: | ||
200: JSON (see below) | ||
429: Ratelimited | ||
503: Could not fetch remote MXC-to-hash mapping | ||
``` | ||
200 response: | ||
```json | ||
{ | ||
"m.clone.mxc": "mxc://local.server/media_id" | ||
} | ||
``` | ||
|
||
``` | ||
GET _matrix/media/v1/hash/{serverName}/{mediaId} | ||
|
||
Rate-limited: Yes | ||
Authentication: Yes | ||
|
||
Responses: | ||
200: JSON (see below) | ||
429: Ratelimited | ||
503: Could not fetch remote MXC-to-hash mapping | ||
``` | ||
|
||
200 response: | ||
```json5 | ||
{ | ||
"m.mxc.hash": "1234567890abcdef" // hex-encoded hash | ||
} | ||
``` | ||
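
As a rough client-side sketch of the two endpoints above, the helpers below build the request path from an `mxc://` URI and read the documented response fields. The function names are hypothetical, and the percent-encoding of path segments is an assumption; the proposal itself does not say how `serverName` and `mediaId` are escaped.

```python
import json
from urllib.parse import quote, urlsplit


def endpoint_path(action: str, mxc: str) -> str:
    """Build the request path for `clone` or `hash` from an mxc:// URI."""
    parts = urlsplit(mxc)
    if parts.scheme != "mxc" or not parts.netloc or len(parts.path) < 2:
        raise ValueError(f"not a valid MXC URI: {mxc!r}")
    server_name = quote(parts.netloc, safe=":")  # keep an optional port readable
    media_id = quote(parts.path[1:], safe="")    # drop the leading "/"
    return f"/_matrix/media/v1/{action}/{server_name}/{media_id}"


def cloned_mxc(body: str) -> str:
    """Extract the new MXC from a 200 response of the clone endpoint."""
    return json.loads(body)["m.clone.mxc"]


def media_hash(body: str) -> bytes:
    """Decode the hex-encoded hash from a 200 response of the hash endpoint."""
    return bytes.fromhex(json.loads(body)["m.mxc.hash"])
```

A client would `POST` to `endpoint_path("clone", mxc)` and `GET` `endpoint_path("hash", mxc)`, both with authentication, and should expect `429` and `503` as listed above.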
|
||
#### Server-Server | ||
|
||
This proposal adds the following two endpoints to the server-server (S2S) API: | ||
|
||
``` | ||
GET _matrix/federation/v1/media/hash | ||
|
||
Rate-limited: No | ||
Authentication: Yes | ||
|
||
Query parameters: | ||
media_id: string, the local part of an MXC for which the hash is queried | ||
|
||
Responses: | ||
200: Pure-binary encoding of corresponding hash | ||
404: Media ID does not exist | ||
``` | ||
|
||
``` | ||
GET _matrix/media/v1/media/fetch/{hash} | ||
|
||
Rate-limited: Yes | ||
Authentication: Yes | ||
|
||
Responses: | ||
200: Blob of data corresponding to hash | ||
404: Hash-media not found | ||
429: Ratelimited | ||
``` | ||
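
The two federation calls combine into a "resolve the hash, then fetch and verify" flow, sketched below. The transport is abstracted behind two caller-supplied functions so the example stays self-contained; `fetch_hash`, `fetch_blob`, and `local_blobs` are hypothetical names, and SHA-256 with a hex-encoded path segment is an assumption, since the hash function is still open (see the next section).

```python
import hashlib
from typing import Callable


def fetch_remote_media(
    media_id: str,
    fetch_hash: Callable[[str], bytes],   # wraps GET .../media/hash?media_id=...
    fetch_blob: Callable[[str], bytes],   # wraps GET .../media/fetch/{hash}
    local_blobs: dict[str, bytes],        # hex hash -> media already on disk
) -> bytes:
    # Step 1: ask the origin server which hash the media ID points to.
    digest = fetch_hash(media_id)
    key = digest.hex()

    # If the content is already known locally, skip the second round trip.
    if key in local_blobs:
        return local_blobs[key]

    # Step 2: fetch the blob by hash and verify it before storing it.
    blob = fetch_blob(key)
    if hashlib.sha256(blob).digest() != digest:
        raise ValueError("blob does not match the advertised hash")
    local_blobs[key] = blob
    return blob
```

Because the blob is checked against the hash, `fetch_blob` could in principle talk to any server that holds the content, not only the origin.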
|
||
### "Which hash?" | ||
|
||
*Note: this is an area for feedback; it will be removed in the final draft.* | ||
|
||
So far, the definition of "hash" has been vague. I think converging on a specific hash function | ||
could lock us in and hinder future expansion. | ||
|
||
So, I'd like to propose using [`multihash`](https://github.com/multiformats/multihash) for this | ||
purpose, as it provides a common, self-describing format for the hashes used. | ||
|
||
For now, only a fixed set of hash functions would be included (see | ||
[here](https://github.com/multiformats/multicodec/blob/master/table.csv) for a full table), which | ||
can be expanded or deprecated in subsequent Matrix spec releases without changing the format of | ||
the hash, documenting checks to differentiate the types of hash used, or reinventing multihash. | ||
|
||
However, this is up for debate. | ||
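
To make the multihash layout concrete, here is a minimal hand-rolled sketch of encoding and decoding for two functions. It assumes only `sha2-256` (code `0x12`) and `sha2-512` (code `0x13`) from the multicodec table are allowed; that choice and the helper names are illustrative and not part of the proposal. A real implementation would more likely use an existing multihash library.

```python
import hashlib

# Multicodec code and hashlib constructor for each allowed function (assumed set).
MULTIHASH_CODES = {
    "sha2-256": (0x12, hashlib.sha256),
    "sha2-512": (0x13, hashlib.sha512),
}


def multihash_encode(name: str, data: bytes) -> bytes:
    """Return <code><digest length><digest> for the given data."""
    code, hash_fn = MULTIHASH_CODES[name]
    digest = hash_fn(data).digest()
    # Codes and lengths used here are below 128, so each varint is one byte.
    return bytes([code, len(digest)]) + digest


def multihash_decode(mh: bytes) -> tuple[str, bytes]:
    """Split a multihash into (function name, raw digest)."""
    if len(mh) < 2:
        raise ValueError("multihash too short")
    code, length, digest = mh[0], mh[1], mh[2:]
    if len(digest) != length:
        raise ValueError("digest length does not match the length prefix")
    for name, (known_code, _) in MULTIHASH_CODES.items():
        if known_code == code:
            return name, digest
    raise ValueError(f"unsupported hash function code: {code:#x}")
```

The two-byte prefix is what makes the value self-describing: a receiver can tell which function produced the digest, and reject functions it does not support, without any out-of-band negotiation.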
|
||
## Motivation | ||
|
||
This MSC wishes to unblock efforts for media retention and redaction; | ||
- https://github.com/matrix-org/synapse/issues/6832 | ||
- https://github.com/matrix-org/matrix-doc/issues/701 | ||
|
||
With the addition of the `/clone` endpoint, any client wishing to preserve media can do so simply by | ||
having it fetched and stored locally, reducing the link rot that remote servers redacting media | ||
could cause. | ||
|
||
This MSC also aims to make Matrix more flexible for diverse media delivery systems. | ||
|
||
Mapping MXCs to hashes could allow the hashes themselves to become self-verifying keys in any | ||
(centralized or distributed) KV store. | ||
|
||
This, in turn, could better prepare Matrix for P2P efforts. | ||
|
||
This MSC also aims to make Matrix content delivery more resilient: apart from the step of mapping an | ||
MXC alias to a hash, the media for a hash could be retrieved from anywhere and still be self-verifying, | ||
considerably reducing the single-point-of-failure risk and allowing for better-distributed load (see the first | ||
"future extension" in the section below). | ||
|
||
## Potential issues | ||
|
||
This could incur a slight performance hit, as an extra round trip (RTT) between servers is needed to fetch the | ||
actual media after fetching the hash corresponding to it. | ||
|
||
I think this is an acceptable tradeoff. An alternative would be to side-channel the hash in a | ||
header on an endpoint that fetches directly from an MXC. | ||
|
||
## Future extensions | ||
|
||
*Note: this is free-form speculation, and serves to illustrate how future MSCs can extend the | ||
behavior this MSC is enabling.* | ||
|
||
A possible extension would be a server-server endpoint which returns the recommended content | ||
endpoints from which to fetch hashes. | ||
|
||
(I.e. a server would query `/media/endpoints`, and the remote server could respond with | ||
`["https://common.caching.server", "https://matrix.org"]`, in decreasing order of priority.) | ||
|
||
This can be helpful when servers share a common "media server", as is the case today with | ||
[matrix-media-repo](https://github.com/turt2live/matrix-media-repo), which "tricks" federation by | ||
redirecting any request for media to itself. This future extension would formalize this process. | ||
|
||
This would also help in dealing with "thundering herds", as servers can be redirected to | ||
multiple servers from which to fetch the media for a hash. | ||
|
||
(However, as-is, this could have security problems with DoS-ing, issues with cache invalidation | ||
after redacting media, and possibly more. This is only to illustrate flexibility.) | ||
|
||
Another possible extension could be to tap natively into decentralized media stores, which | ||
often key their data by hash. This could make P2P media easier to implement and work with. | ||
|
||
One last possible extension is to add a `410` response to every endpoint pertaining to fetching media; this could | ||
help communicate to servers and clients that media has been deleted. | ||
|
||
## Security considerations | ||
|
||
A big part of this MSC's motivation is to unblock media redaction/retention efforts. However, that | ||
does not mean this MSC should be blind to the struggle of containing unsavory media across | ||
federation. | ||
|
||
This MSC adds a `/clone` endpoint, by which a client, on any server, could easily "copy" media, | ||
seemingly making containment efforts useless. | ||
|
||
However, at the room level, and possibly the server level, hashes themselves could be banned. This can | ||
be implementation-specific or built into bots like mjolnir. | ||
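
A hash ban of this kind could be as simple as the sketch below. The function names, the use of SHA-256 hex digests, and the idea of merging several ban lists are assumptions for illustration only; this does not describe how mjolnir or any server actually implements bans.

```python
import hashlib
from collections.abc import Iterable


def merge_ban_lists(*ban_lists: Iterable[str]) -> set[str]:
    """Combine several shared ban lists of hex-encoded hashes into one set."""
    return set().union(*ban_lists)


def is_allowed(blob: bytes, banned_hashes: set[str]) -> bool:
    """Return False if the media's hash appears on the ban list."""
    return hashlib.sha256(blob).hexdigest() not in banned_hashes
```

A server could run `is_allowed` before serving or cloning media. Note that this only catches byte-identical copies: any re-encoding or modification of the file yields a different hash.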
|
||
## Unstable prefix | ||
|
||
This MSC uses the unstable prefix `nl.automatia.msc3468`: | ||
|
||
- `_matrix/media/nl.automatia.msc3468/clone/{serverName}/{mediaId}` | ||
- `_matrix/media/nl.automatia.msc3468/hash/{serverName}/{mediaId}` | ||
- `_matrix/federation/nl.automatia.msc3468/media/hash` | ||
- `_matrix/media/nl.automatia.msc3468/media/fetch/{hash}` | ||
- `nl.automatia.msc3468.clone.mxc` | ||
- `nl.automatia.msc3468.mxc.hash` |
This MSC conflicts with the ideas of #3911