MSC4083: Delta-compressed E2EE file transfers #4083

Open
wants to merge 2 commits into
base: main
113 changes: 113 additions & 0 deletions proposals/4083-delta-compressed-file-transfers.md
@@ -0,0 +1,113 @@
# MSC4083: Delta-compressed E2EE file transfers

## Problem

When collaborating on a large file of some kind, it's common to store that file in Matrix, and then need a way to
express incremental changes to it. For instance, in Third Room, you might store a large glTF scene graph as a GLB
file, and then want to express a small change to it (e.g. using the editor to transform part of the scene graph). Or
you might want to store a change to a markdown or HTML file.

Currently, your only option is to save a whole new copy of the file - or invent your own delta-compression scheme at
Contributor

Couldn't you also use an existing delta format, like the one used by OTA updates on Android, and encrypt that separately here? Or is the concern that due to e2ee shenanigans, intermediate deltas are lost here? (I am not saying that this is a good approach. Just an alternative that also came to mind for me)

Member Author

but this proposal does propose using an existing delta format (vcdiff - rfc3284) and encrypting the diffs?

Contributor

I must have been asleep while writing this comment I guess 😱 sorry.

the application layer. Instead, we could make Matrix itself aware of delta-compression, letting the content repository
help users efficiently collaborate around updates to binary files, regardless of what the file is.

## Solution

When uploading a file, specify that it's a delta against a previous piece of content, using a given algorithm.

* `delta_base` is the mxc URL of the content the delta applies to
* `delta_format` is the file format of the binary diff
* This MSC defines `m.vcdiff.v1.gzip` to describe gzipped RFC3284 compatible binary VCDIFF payloads, picked for
Member

Suggested change
* This MSC defines `m.vcdiff.v1.gzip` to describe gzipped RFC3284 compatible binary VCDIFF payloads, picked for
* This MSC defines `m.vcdiff.v1.gzip` to describe gzipped [RFC3284](https://datatracker.ietf.org/doc/html/rfc3284) compatible binary VCDIFF payloads, picked for

computation efficiency rather than patch size (whereas bsdiff + bzip might provide better patch size at worse
computation complexity; other MSCs are welcome to propose different diff formats).

Clients should upload a new snapshot of a piece of content if the sum of the deltas relative to the last snapshot
Member

Clients must also upload a new snapshot when needed to ensure that secrecy is preserved in encrypted rooms. e.g. if a new user joins, a new snapshot must be uploaded, otherwise the new user would need to be able to decrypt the file state from before they joined the room.

is larger than 50% of the original piece of content.
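
A minimal sketch of that snapshot rule, assuming "original piece of content" means the last full snapshot and that the client tracks the byte size of each delta uploaded since then. The function name and the configurable `threshold` parameter are illustrative, not part of the proposal:

```python
def should_upload_snapshot(snapshot_size: int, delta_sizes: list[int],
                           threshold: float = 0.5) -> bool:
    """Return True once the deltas since the last snapshot exceed
    `threshold` (50% by default) of that snapshot's size."""
    return sum(delta_sizes) > threshold * snapshot_size
```

A client would call this before each upload and send a full copy instead of a delta when it returns `True`.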

For instance:

`POST /_matrix/media/v3/upload?delta_base=mxc://matrix.org/b4s3v3rs10n&delta_format=m.vcdiff.v1.gzip`

returning:
```json
{
  "content_uri": "mxc://matrix.org/n3wv3rs10n"
}
```

(or with the same parameters for MSC2246-style `POST /_matrix/media/v3/create`).
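
As a rough illustration, a client could build the upload URL like this. The helper name and the homeserver base URL are assumptions made for the example; only the endpoint path and the two query parameters come from the proposal:

```python
from urllib.parse import urlencode


def delta_upload_url(homeserver: str, delta_base: str,
                     delta_format: str = "m.vcdiff.v1.gzip") -> str:
    """Build the upload URL for a delta against `delta_base`.

    `homeserver` is the client's base URL (hypothetical here);
    the mxc URL is percent-encoded so it survives as a query value.
    """
    query = urlencode({"delta_base": delta_base, "delta_format": delta_format})
    return f"{homeserver}/_matrix/media/v3/upload?{query}"
```

The request body would then be the gzipped VCDIFF payload itself (encrypted first, in an E2EE room).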

The server tracks the graph of which deltas apply to which files, so it can hand only the relevant deltas to clients
when they download them.
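
One way a server might resolve which deltas to hand out is sketched below. It assumes the delta graph is held as a plain mapping from each delta's mxc URL to its `delta_base`, and that the client names the last version it holds in full; the function and its data model are hypothetical, since the proposal does not prescribe a storage layout:

```python
def deltas_to_send(delta_base_of: dict[str, str], target: str,
                   client_has: str) -> list[str]:
    """Walk from `target` back to `client_has` and return the deltas
    the client is missing, ordered oldest-first for application."""
    chain: list[str] = []
    current = target
    while current != client_has:
        if current not in delta_base_of:
            # Reached a full snapshot without meeting the client's version.
            raise LookupError(f"{target} is not derived from {client_has}")
        chain.append(current)
        current = delta_base_of[current]
        if len(chain) > len(delta_base_of):
            # More hops than edges in the graph means we are looping.
            raise ValueError("cycle in delta graph")
    chain.reverse()
    return chain
```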

For instance, when downloading a delta-compressed piece of content, a client might ask to pull in any delta dependencies
it doesn't already have stored locally, relative to the last version that it has a full copy of:

`GET /_matrix/media/v3/download/matrix.org/n3wv3rs10n?delta_base=mxc://matrix.org/b4s3v3rs10n`

This would return an ordered multipart download of the deltas (once decrypted, if needed) to apply to the base-version
Member

For encrypted files, how do clients discover the encryption keys for each delta and the base file?

Member Author

i just realised the same thing :) i guess this pushes it back towards putting the delta links on the m.file events rather than the content repository, and using aggregations perhaps as a way to grab all the events needed to download a given file.

Member Author

@ara4n ara4n Dec 18, 2023

Alternatively, could be evil and specify the same IV & Key for every event which is a diff on a given file - but calculate the actual IV used to encrypt/decrypt the diff as IV' = H(IV, $content_id). This would mean that diffs have to be created as async uploads so you know their content_id before they can be encrypted by the client though; and the multipart download would have to include content IDs.

I'm not convinced this is better than using an aggregation API to say "give me all the events for the diffs needed to construct this $event_id", and then firing off a tonne of parallel reqs to the media repo to grab the required media files (which is arguably only 2 roundtrips too). But it avoids having to fiddle around with events at all.

to get a copy of the new-version.
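
A minimal sketch of how a client might rebuild the new version from such a download, assuming the multipart parts have already been split out and decrypted. Python's standard library has no VCDIFF decoder, so `vcdiff_decode` is a caller-supplied function (hypothetical here) that takes the source bytes and a raw RFC 3284 patch and returns the patched bytes:

```python
import gzip
from typing import Callable, Iterable


def rebuild(base: bytes, gzipped_deltas: Iterable[bytes],
            vcdiff_decode: Callable[[bytes, bytes], bytes]) -> bytes:
    """Apply each `m.vcdiff.v1.gzip` delta in order, starting from `base`."""
    content = base
    for part in gzipped_deltas:
        patch = gzip.decompress(part)            # strip the gzip layer
        content = vcdiff_decode(content, patch)  # apply the VCDIFF patch
    return content
```

Because each delta is relative to the result of the previous one, the parts must be applied in the order the server returns them.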

## Alternatives

### Track deltas on events rather than media repository

Alternatively we could store the delta info on the `m.file` event itself as a mixin. This would allow us to shift the
task of tracking deltas purely to clients, and protect the delta info within the E2EE payload. However, this would then
force the client to do many more roundtrips to spider the events (if needed) and files (if needed) one by one in order
to calculate diffs, which would be O(N) latency with the number of diffs rather than O(1) for the above API. Given the
traffic pattern of these requests would reveal the delta graph to the server anyway, it's not clear that it provides a
sufficient advantage. This would look like this:

```json
{
  "content": {
    "filename": "something-important.doc",
    "info": {
      "mimetype": "application/msword",
      "size": 46144
    },
    "msgtype": "m.file",
    "url": "mxc://example.org/n3wv3rs10n",
    "delta_base": "$1235135aksjgdkg",
    "delta_format": "m.vcdiff.v1.gzip"
  }
}
```

We could go even further down this path by defining an arbitrary CRDT for tracking these deltas, a bit like the
[Saguaro CRDT-over-Matrix](https://github.com/matrix-org/collaborative-documents/blob/main/docs/saguaro.md) proposal,
with files decorating each event - effectively modelling the problem as a collaborative document problem (with binary
diffs attached) rather than a binary file diffing problem.

### Other alternatives

We could use HTTP PATCH rather than POST when sending diffs. This feels needlessly exotic, imo.

Rather than having a delta_format field, we could use the MIME type of the upload to indicate that it's a patch to a
given underlying MIME type. However, Matrix doesn't currently have to parse MIME types anywhere, so it's more matrixy
to destructure this in JSON.

For unencrypted files, the server could apply the diffs serverside as a convenience to clients who don't know
how to apply the diffs themselves (or who don't have CPU to apply the diffs, or want to benefit from the server caching
diff results). This could be proposed as a separate MSC.

## Security considerations

This exposes the metadata of which file is a delta to which other file to the server.

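DoS by uploading too many deltas (for example, a very long delta chain that is expensive to serve and apply).

DoS by using async uploads to create a cycle: an async upload reserves an mxc URL before its content exists, so a
delta could name a base that is itself later uploaded as a delta on top of it.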
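
A server could guard against both problems at upload time. The sketch below is one possible check, not something the proposal specifies: it assumes the same mapping of delta mxc URL to `delta_base`, and `MAX_DELTA_DEPTH` is an invented limit.

```python
MAX_DELTA_DEPTH = 32  # hypothetical limit; the MSC does not define one


def check_delta_upload(delta_base_of: dict[str, str], new_mxc: str,
                       delta_base: str,
                       max_depth: int = MAX_DELTA_DEPTH) -> None:
    """Raise ValueError if storing `new_mxc` as a delta on `delta_base`
    would create a cycle or a chain longer than `max_depth`."""
    seen = {new_mxc}
    current = delta_base
    depth = 1
    while True:
        if current in seen:
            raise ValueError("delta would create a cycle")
        if depth > max_depth:
            raise ValueError("delta chain too deep")
        if current not in delta_base_of:
            return  # reached a full snapshot, so the chain is acceptable
        seen.add(current)
        current = delta_base_of[current]
        depth += 1
```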

## Unstable prefix

| Param | Unstable prefixed param |
| ------------ | -------------------------------- |
| delta_base | org.matrix.msc4083.delta_base |
| delta_format | org.matrix.msc4083.delta_format |

## Dependencies

None. Although [MSC4016](https://github.com/matrix-org/matrix-spec-proposals/pull/4016) was sketched out at the same
time and the two are siblings.