MSC4083: Delta-compressed E2EE file transfers #4083
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,113 @@ | ||||||
# MSC4083: Delta-compressed E2EE file transfers | ||||||
|
||||||
## Problem | ||||||
|
||||||
When collaborating on a large file of some kind, it's common to store that file in Matrix, and then need a way to | ||||||
express incremental changes to it. For instance, in Third Room, you might store a large glTF scene graph as a GLB | ||||||
file, and then want to express a small change to it (e.g. using the editor to transform part of the scene graph). Or | ||||||
you might want to store a change to a markdown or HTML file. | ||||||
|
||||||
Currently, your only option is to save a whole new copy of the file - or invent your own delta-compression scheme at | ||||||
the application layer. Instead, we could make Matrix itself aware of delta-compression, letting the content repository | ||||||
help users efficiently collaborate around updates to binary files, regardless of what the file is. | ||||||
|
||||||
## Solution | ||||||
|
||||||
When uploading a file, specify that it's a delta against a previous piece of content, using a given algorithm. | ||||||
|
||||||
* `delta_base` is the mxc URL of the content the delta applies to | ||||||
* `delta_format` is the file format of the binary diff | ||||||
* This MSC defines `m.vcdiff.v1.gzip` to describe gzipped RFC3284 compatible binary VCDIFF payloads, picked for | ||||||
computation efficiency rather than patch size (whereas bsdiff + bzip might provide better patch size at worse | ||||||
computation complexity; other MSCs are welcome to propose different diff formats). | ||||||
|
||||||
Clients should upload a new snapshot of a piece of content if the sum of the deltas relative to the last snapshot | ||||||
is larger than 50% of the original piece of content. | ||||||
|
||||||
Review comment: Clients must also upload a new snapshot when needed to ensure that secrecy is preserved in encrypted rooms. For example, if a new user joins, a new snapshot must be uploaded; otherwise the new user would need to be able to decrypt the file state from before they joined the room. |
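The 50% snapshot rule could be checked client-side along these lines. This is only a sketch: the function name and the choice to measure sizes in bytes are assumptions, and "the original piece of content" is taken to mean the last full snapshot.

```python
def should_upload_snapshot(snapshot_size: int, delta_sizes: list[int]) -> bool:
    """Return True when the deltas since the last snapshot add up to more
    than 50% of that snapshot's size (all sizes in bytes, by assumption)."""
    # Integer arithmetic avoids float rounding: sum > size / 2  <=>  2 * sum > size
    return 2 * sum(delta_sizes) > snapshot_size


# Hypothetical sizes: a 1000-byte snapshot followed by two deltas.
print(should_upload_snapshot(1000, [200, 200]))  # 400 bytes of deltas
print(should_upload_snapshot(1000, [300, 201]))  # 501 bytes of deltas
```

Exactly 50% does not trigger a snapshot here, since the text says "larger than".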
|
||||||
For instance: | ||||||
|
||||||
`POST /_matrix/media/v3/upload?delta_base=mxc://matrix.org/b4s3v3rs10n&delta_format=m.vcdiff.v1.gzip` | ||||||
|
||||||
returning: | ||||||
```json | ||||||
{ | ||||||
"content_uri": "mxc://matrix.org/n3wv3rs10n" | ||||||
} | ||||||
``` | ||||||
|
||||||
(or with the same parameters for MSC2246-style `POST /_matrix/media/v3/create`). | ||||||
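As an illustration, a client might assemble the upload URL as in the sketch below. The helper name, the homeserver base URL and the `unstable` flag are assumptions made for this example; the parameter names and the `org.matrix.msc4083.` prefix come from this proposal. Note that `urlencode` percent-encodes the mxc URL, whereas the example above shows it unescaped for readability.

```python
from urllib.parse import urlencode

UPLOAD_PATH = "/_matrix/media/v3/upload"
UNSTABLE_PREFIX = "org.matrix.msc4083."


def delta_upload_url(homeserver: str, delta_base: str,
                     delta_format: str = "m.vcdiff.v1.gzip",
                     unstable: bool = False) -> str:
    """Build the upload URL for a delta against `delta_base` (an mxc URL)."""
    prefix = UNSTABLE_PREFIX if unstable else ""
    query = urlencode({
        prefix + "delta_base": delta_base,
        prefix + "delta_format": delta_format,
    })
    return f"{homeserver.rstrip('/')}{UPLOAD_PATH}?{query}"


# Hypothetical homeserver; the mxc URL is the one used in the example above.
print(delta_upload_url("https://matrix.org", "mxc://matrix.org/b4s3v3rs10n"))
```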
|
||||||
The server tracks the graph of which deltas apply to which files, so it can hand only the relevant deltas to clients | ||||||
when they download them. | ||||||
|
||||||
For instance, when downloading a delta-compressed piece of content, a client might ask to pull in any delta dependencies | ||||||
it doesn't already have stored locally, relative to the last version that it has a full copy of: | ||||||
|
||||||
`GET /_matrix/media/v3/download/matrix.org/n3wv3rs10n?delta_base=mxc://matrix.org/b4s3v3rs10n` | ||||||
|
||||||
This would return an ordered multipart download of the deltas (once decrypted, if needed) to apply to the base-version | ||||||
to get a copy of the new-version. | ||||||
|
||||||
Review comment: For encrypted files, how do clients discover the encryption keys for each delta and the base file? |
Reply: I just realised the same thing :) I guess this pushes it back towards putting the delta links on the m.file events rather than the content repository, and using aggregations perhaps as a way to grab all the events needed to download a given file. |
Reply: Alternatively, could be evil and specify the same IV & Key for every event which is a diff on a given file - but calculate the actual IV used to encrypt/decrypt the diff as |
I'm not convinced this is better than using an aggregation API to say "give me all the events for the diffs needed to construct this $event_id", and then firing off a tonne of parallel reqs to the media repo to grab the required media files (which is arguably only 2 roundtrips too). But it avoids having to fiddle around with events at all. |
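One way a server might work out which deltas to return, and in what order, is sketched below. It assumes the server keeps a mapping from each delta's mxc URI to its `delta_base`; that mapping, the function name and the error types are illustrative and not part of this proposal.

```python
def resolve_delta_chain(delta_bases: dict[str, str], target: str, have: str) -> list[str]:
    """Return the mxc URIs of the deltas to apply, oldest first, to turn the
    version the client already has (`have`) into `target`.

    `delta_bases` maps a delta's mxc URI to the mxc URI it applies to.
    """
    chain: list[str] = []
    seen: set[str] = set()
    current = target
    while current != have:
        if current in seen:
            raise ValueError("cycle detected in delta graph")
        seen.add(current)
        if current not in delta_bases:
            # Reached a full snapshot without meeting `have`.
            raise LookupError(f"{have} is not an ancestor of {target}")
        chain.append(current)
        current = delta_bases[current]
    chain.reverse()  # walked newest to oldest; apply oldest first
    return chain


# Hypothetical three-version history: v1 (snapshot) <- v2 <- v3
graph = {"mxc://example.org/v2": "mxc://example.org/v1",
         "mxc://example.org/v3": "mxc://example.org/v2"}
print(resolve_delta_chain(graph, "mxc://example.org/v3", "mxc://example.org/v1"))
```

The walk costs one lookup per delta on the server, which is what lets the client fetch the whole chain in a single request.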
|
||||||
## Alternatives | ||||||
|
||||||
### Track deltas on events rather than media repository | ||||||
|
||||||
Alternatively we could store the delta info on the `m.file` event itself as a mixin. This would allow us to shift the | ||||||
task of tracking deltas purely to clients, and protect the delta info within the E2EE payload. However, this would then | ||||||
force the client to do many more roundtrips to spider the events (if needed) and files (if needed) one by one in order | ||||||
to calculate diffs, which would be O(N) latency with the number of diffs rather than O(1) for the above API. Given the | ||||||
traffic pattern of these requests would reveal the delta graph to the server anyway, it's not clear that it provides a | ||||||
sufficient advantage. This would look like this: | ||||||
|
||||||
```json | ||||||
{ | ||||||
"content": { | ||||||
"filename": "something-important.doc", | ||||||
"info": { | ||||||
"mimetype": "application/msword", | ||||||
"size": 46144 | ||||||
}, | ||||||
"msgtype": "m.file", | ||||||
"url": "mxc://example.org/n3wv3rs10n", | ||||||
"delta_base": "$1235135aksjgdkg", | ||||||
"delta_format": "m.vcdiff.v1.gzip" | ||||||
} | ||||||
} | ||||||
``` | ||||||
|
||||||
We could go even further down this path by defining an arbitrary CRDT for tracking these deltas, a bit like the | ||||||
[Saguaro CRDT-over-Matrix](https://github.com/matrix-org/collaborative-documents/blob/main/docs/saguaro.md) proposal, | ||||||
with files decorating each event - effectively modelling the problem as a collaborative document problem (with binary | ||||||
diffs attached) rather than a binary file diffing problem. | ||||||
|
||||||
### Other alternatives | ||||||
|
||||||
We could use HTTP PATCH rather than POST when sending diffs. This feels needlessly exotic, imo. | ||||||
|
||||||
Rather than having a delta_format field, we could use the MIME type of the upload to indicate that it's a patch to a | ||||||
given underlying MIME type. However, Matrix doesn't currently have to parse MIME types anywhere, so it's more matrixy | ||||||
to destructure this in JSON. | ||||||
|
||||||
For unencrypted files, the server could apply the diffs serverside as a convenience to clients who don't know | ||||||
how to apply the diffs themselves (or who don't have CPU to apply the diffs, or want to benefit from the server caching | ||||||
diff results). This could be proposed as a separate MSC. | ||||||
|
||||||
## Security considerations | ||||||
|
||||||
This exposes the metadata of which file is a delta to which other file to the server. | ||||||
|
||||||
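A client could attempt a denial of service by uploading too many deltas. | ||||||
|
||||||
A client could attempt a denial of service by using async uploads to create a cycle in the delta graph. | ||||||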
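A server could guard against both cases when a delta is registered, roughly as sketched below. The chain-length limit of 64, the function name and the `delta_bases` mapping (delta mxc URI to its `delta_base`) are all assumptions for illustration; this proposal does not define any limit.

```python
MAX_CHAIN_LENGTH = 64  # hypothetical limit, not defined by this proposal


def validate_delta(delta_bases: dict[str, str], new_uri: str, delta_base: str,
                   max_chain_length: int = MAX_CHAIN_LENGTH) -> int:
    """Check that registering `new_uri` as a delta against `delta_base` neither
    creates a cycle nor exceeds the chain-length limit.

    Returns the resulting chain length (number of deltas above the snapshot).
    """
    length = 1  # counts the new delta itself
    current = delta_base
    while True:
        if current == new_uri:
            # Possible with async uploads, where an mxc URI exists before its content.
            raise ValueError("delta would create a cycle")
        if current not in delta_bases:
            return length  # reached a full snapshot
        length += 1
        if length > max_chain_length:
            raise ValueError("delta chain too long")
        current = delta_bases[current]


# Hypothetical cycle: B was created with delta_base=A, then A is uploaded with delta_base=B.
try:
    validate_delta({"mxc://example.org/B": "mxc://example.org/A"},
                   "mxc://example.org/A", "mxc://example.org/B")
except ValueError as err:
    print(err)
```

Because the length limit bounds the loop, the check terminates even if the stored graph were already corrupted.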
|
||||||
## Unstable prefix | ||||||
|
||||||
| Param | Unstable prefixed param | | ||||||
| ------------ | -------------------------------- | | ||||||
| delta_base | org.matrix.msc4083.delta_base | | ||||||
| delta_format | org.matrix.msc4083.delta_format | | ||||||
|
||||||
## Dependencies | ||||||
|
||||||
None, although [MSC4016](https://github.com/matrix-org/matrix-spec-proposals/pull/4016) was sketched out at the same | ||||||
time and the two are siblings. |
Couldn't you also use an existing delta format like the one used by OTA updates on Android and encrypt that separately here? Or is the concern that due to e2ee shenanigans, intermediate deltas are lost here? (I am not saying that this is a good approach. Just an alternative that also came to mind for me)
but this proposal does propose using an existing delta format (vcdiff - rfc3284) and encrypting the diffs?
I must have been asleep while writing this comment I guess 😱 sorry.