Consider providing a facility for integer-fraction timescales #422
…I just realized I'd misremembered where timescales are specified when writing this (they're on the segment, not the track). Still, the same concept applies, just requiring common-multiple rates (though the |
Hi,
I see this was brought up as an issue in the GitHub repository and am cross-posting to the cellar working group.
On Sep 7, 2020, at 12:02 AM, rcombs ***@***.***> wrote:
It's pretty well-established that Matroska's poor timebase support is one of the format's worst properties. While it supports very precise timestamps (down to the nanosecond), it's very inefficient to do so (and the resulting values still aren't exact for most input rates), so muxers tend to default to 1ms timestamps, which can lead to a variety of subtle issues, especially with high-packet-rate streams (e.g. audio) and VFR video content. Muxers can choose rates that are closer to the time base of their inputs (or the packet rate of the content), but exactly how best to do so has always been unclear, and some of the possible options would lead to either worse player behavior or timestamp drift. I'm proposing a format addition to remedy this.
The only actual normative change I propose is this: in addition to the classic nanosecond-denominator time scale, muxers could provide 2 additional integers, serving as a numerator and denominator time base value, which would be required to round to the existing nanosecond-scaled value.
This should be paired with some advice for muxer implementations on how to make use of this feature. This depends on the properties of the input. For reference, here are some examples of the error produced by rounding a variety of common time bases to the nearest nanosecond, scaled by 3 hours (a reasonable target for the duration of a film):
nearest_ns(x) = round(x * 1,000,000,000) / 1,000,000,000
ceil_ns(x) = ceil(x * 1,000,000,000) / 1,000,000,000
floor_ns(x) = floor(x * 1,000,000,000) / 1,000,000,000
nearest_error(x) = 1 - (x / nearest_ns(x))
ceil_error(x) = 1 - (x / ceil_ns(x))
floor_error(x) = 1 - (x / floor_ns(x))
nearest_error_3h(x) = nearest_error(x) * 60 * 60 * 3
ceil_error_3h(x) = ceil_error(x) * 60 * 60 * 3
floor_error_3h(x) = floor_error(x) * 60 * 60 * 3
e(x) = nearest_error_3h(1 / x)
ce(x) = ceil_error_3h(1 / x)
fe(x) = floor_error_3h(1 / x)
# Integer video frame rates
e(24) => 8.64e-5
e(25) => 0
e(30) => -0.0001
e(48) => -0.0002
e(50) => 0
e(60) => 0.0002
e(120) => -0.0004
# NTSC video frame rates
e(24/1.001) => -8.6314e-5
e(30/1.001) => 0.0001
e(48/1.001) => 0.0002
e(60/1.001) => -0.0002
e(120/1.001) => 0.0004
# TrueHD frame rates
e(44100/40) => -0.0057
e(48000/40) => -0.0043
e(88200/40) => 0.0062
e(96000/40) => 0.0086
# AAC frame rates
e(44100/960) => -0.0002
e(48000/960) => 0
e(88200/960) => 0.0003
e(96000/960) => 0
e(44100/1024) => 0.0002
e(48000/1024) => -0.0002
e(88200/1024) => -0.0003
e(96000/1024) => 0.0003
# MP3 frame rates
e(44100/1152) => 8.4375e-6
e(48000/1152) => 0
e(88200/1152) => -0.0004
e(96000/1152) => 0
# Other audio frame rates
e(44100/128) => -0.0012
e(48000/128) => 0.0013
e(88200/128) => -0.0012
e(96000/128) => -0.0027
e(44100/2880) => -7.425e-5
e(48000/2880) => 2.3981e-12
e(88200/2880) => -7.425e-5
e(96000/2880) => 2.3981e-12
# GCF of common short-first audio frame sizes
e(44100/64) => -0.0012
e(48000/64) => -0.0027
e(88200/64) => 0.0062
e(96000/64) => 0.0054
# Raw audio sample rates
e(44100) => 0.1253
e(48000) => -0.1728
e(88200) => 0.1253
e(96000) => 0.3456
fe(44100) => -0.351
ce(48000) => 0.3456
fe(88200) => -0.8273
fe(96000) => -0.6912
# MPEGTS time base
e(90000) => -0.108
ce(90000) => 0.8639
# Common multiples
e(30000) => -0.108
e(60000) => 0.216
e(120000) => -0.432
e(240000) => 0.8639
e(480000) => -1.7283
ce(30000) => 0.216
fe(60000) => -0.432
ce(120000) => 0.8639
fe(240000) => -1.7283
ce(480000) => 3.4549
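The error functions above can be reproduced with exact rational arithmetic; here is a minimal Python sketch (our own rendering, not code from the thread):

```python
from fractions import Fraction

NS = 10**9  # nanoseconds per second

def nearest_ns(x: Fraction) -> Fraction:
    """Round a duration in seconds to the nearest whole nanosecond."""
    return Fraction(round(x * NS), NS)

def e(rate) -> float:
    """Relative error from nanosecond rounding of 1/rate, scaled to 3 hours
    (the same quantity as e(x) in the listing above)."""
    x = 1 / Fraction(rate)
    return float((1 - x / nearest_ns(x)) * 60 * 60 * 3)
```

For example, e(24) evaluates to about 8.64e-5 and e(25) to exactly 0, matching the listing.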
As we can see, rounding common video and audio frame rates (including e.g. the least common multiple of 24 and 60 for that VFR case) produces a negligible amount of error over a reasonable duration. This means that for content where all timestamps can reasonably be expressed in integer values of those rates, there would be no significant error over common file durations, even if different streams were muxed with different time bases.
There are a few real-world time bases that would produce significant rounding error (upwards of 100ms) over the course of 3 hours when used in existing players: MPEGTS's 90000Hz, all common raw audio sample rates, and least-common-multiples between integer and NTSC video frame rates. This essentially means that mixing these rates with others would produce significant desync over a reasonable duration for static on-disk content; the same issue could occur when muxing very lengthy content (e.g. streaming).
All of these issues can be addressed in one of the following ways:
Using a lower rate (e.g. 90,000Hz isn't usually the real content rate but instead an artifact of its previous container; expressing timestamps in samples rather than frames is usually unnecessary)
Choosing the highest of the input rates for all streams (e.g. 48000 is a multiple of many common frame rates, including 24/1.001)
Choosing a more precise common-multiple rate that may create a larger total drift, but does so equally for all streams (see the "Common multiples" section; 1/30000 is suitable for mixing 24fps and 30/1.001fps content alongside most common framed audio rates, while the later listed bases are suitable for increasingly large sets).
Round some tracks' nanosecond timescales in the opposite direction, creating a larger drift, but potentially one with the same sign (and thus a closer value) as the drift in other tracks (this is probably too complex and niche to have substantial use)
Fall back to classic rounded nanosecond-based timestamps (and don't write an integer-fraction time base at all)
Use the extension, resulting in significant sync drift in older players that haven't implemented the change
This last option is usually unacceptable, but may be fine for files that use codecs that become available after the change is made (and thus are unavoidably non-backwards-compatible anyway).
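For the common-multiple option above, the widest usable tick is the greatest common divisor of the input frame durations, taken as exact fractions. A sketch (our own helper; requires Python 3.9+ for math.lcm):

```python
from fractions import Fraction
from math import gcd, lcm

def common_tick(durations):
    """GCD of a list of frame durations (as Fractions, in seconds):
    the largest tick of which every duration is an integer multiple."""
    nums = [d.numerator for d in durations]
    dens = [d.denominator for d in durations]
    return Fraction(gcd(*nums), lcm(*dens))
```

With 24 fps (1/24 s) and 30000/1001 fps (1001/30000 s) this yields 1/30000, the rate mentioned above.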
If combined with clear advice in the spec on how muxers SHOULD (or MAY) decide on time bases for various possible input cases, I think this extension could get actual adoption in muxers and solve one of the format's longest-standing problems.
This has been discussed on the list before, though I don't remember clear consensus on how to address this. Steve even compiled a list of discussions on this at https://mailarchive.ietf.org/arch/msg/cellar/ZpZxhG1gML9xVx_ir1Jf6_gcI8U/.
I proposed an option in https://mailarchive.ietf.org/arch/msg/cellar/mTprgjNqVbe20e6hyYxns8ZnVwY/ where one of the existing reserved bits of the Block Header (in the byte that contains the keyframe, invisible, and lacing flags) would be used as a flag for Timescale Alignment.
With this approach, new elements could be added to the track header with a numerator and denominator of a rational time scale, and if Timescale Alignment were set to true, then the nearest increment of the rational time scale would be used. Example:
Thus if the frame rate in the track header is 120000/1001, then:
If the Matroska timecode is 4 and Enable TimeScale Alignment is 0, then it is at 4 / (1000000000 / TimecodeScale).
If the Matroska timecode is 4 and Enable TimeScale Alignment is 1, then it is at 0 / 120000 (the nearest increment of the rational frame rate).
If the Matroska timecode is 17 and Enable TimeScale Alignment is 0, then it is at 17 / (1000000000 / TimecodeScale).
If the Matroska timecode is 17 and Enable TimeScale Alignment is 1, then it is at 2002 / 120000 (the nearest increment of the rational frame rate).
If a Matroska demuxer doesn't understand the new num/denom elements or the Alignment flag, it would simply use the existing nanosecond timestamp system.
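The alignment rule in the example can be sketched in code as follows (a hypothetical rendering with our own function name; TimecodeScale defaults to 1,000,000 ns):

```python
from fractions import Fraction

def aligned_timestamp(timecode: int, timecode_scale_ns: int,
                      rate_num: int, rate_den: int, align: bool) -> Fraction:
    """Interpret a Matroska timecode in seconds; if the alignment flag is
    set, snap it to the nearest increment of the rational frame rate
    rate_num/rate_den (frames per second)."""
    t = Fraction(timecode * timecode_scale_ns, 10**9)
    if not align:
        return t
    frame = Fraction(rate_den, rate_num)  # frame duration in seconds
    return round(t / frame) * frame       # nearest frame boundary
```

With a 120000/1001 frame rate, timecode 17 aligns to 2002/120000 s, as in the example.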
In that thread there were other proposals, for example Steve discussed using a float to depict a point in time.
Dave
|
Did anyone suggest storing the rounding error as a fraction? (With denominator stored in the track header, this is only 3 bytes per packet in the best case.) Of course all of these ideas are terrible hacks compared to just storing it in the correct way. |
On Sep 8, 2020, at 11:52 AM, wm4 ***@***.***> wrote:
Did anyone suggest storing the rounding error as a fraction? (With denominator stored in the track header, this is only 3 bytes per packet in the best case.)
That sounds interesting: to have the rounding error numerator in each block and the rounding error denominator in the track header. Perhaps a rounding error denominator could also be in the block but defaults to the one within the track header.
Of course all of these ideas are terrible hacks compared to just storing it in the correct way.
Yes, it is a challenge to fix this and maintain reverse compatibility.
Dave
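The scheme being discussed (per-block error numerator over a track-level denominator) might look like the following sketch (hypothetical helper names; it assumes the track-level denominator expresses the error exactly):

```python
from fractions import Fraction

def encode_block_error(true_t: Fraction, coarse_ms: int, err_den: int) -> int:
    """Return the rounding error (true timestamp minus the stored 1 ms
    timestamp) as an integer numerator over the track-level err_den."""
    num = (true_t - Fraction(coarse_ms, 1000)) * err_den
    assert num.denominator == 1, "err_den cannot express this error exactly"
    return int(num)

def decode_block_error(coarse_ms: int, num: int, err_den: int) -> Fraction:
    """Reconstruct the exact timestamp from the coarse value and the error."""
    return Fraction(coarse_ms, 1000) + Fraction(num, err_den)
```

For a 1024-sample AAC frame at 44100 Hz (true time 1024/44100 s, stored as 23 ms), a denominator of 441000 gives a numerator of 97, and decoding recovers the exact value.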
|
I don't like this, as storing rounding errors is imprecise as well (unless the global timestamp scaling factor is a multiple of the rounding error's denominator). I'm also quite unsure which denominator a multiplexer should choose. In order to express a rounding error precisely it must have a much higher resolution than the usual 1ms resolution of Matroska timestamps. For example, with 1001/30000 FPS content the rounding error will always be below one frame duration, therefore you'll have to make the denominator much larger.

Something else that came to mind when reading our previous discussion that Dave linked to: please keep in mind that any solution that sets values for the whole track in the track header will inevitably fail with mixed frame rate content or content with different interlacing, e.g. when multiplexing from an MPEG transport stream recorded from a DVB broadcast. Those bloody streams change frame rates all the time when the program changes, e.g. when transitioning to and from commercials (or just from an announcement to the movie). With our new and shiny precise timestamp calculation we'll either have to forbid such changes (unrealistic) or provide facilities to signal such changes in the form of some type of global index similar to cues. Unlike cues, though, such an index would have to be mandatory (a file without cues can be played just fine; even seeking works, similar to seeking in Ogg files, meaning some kind of binary search).

File types whose timestamps are based solely on a stream's regular sampling frequency (MP4 usually is, but doesn't have to be; Ogg is, too) all share those issues. MPEG TS on the other hand uses a 90 kHz-based clock which is fine for most video stuff but doesn't have enough resolution for sample-precise timing of audio tracks with high sampling frequencies.
Due to what I've written above I'm pretty sure that there is no one correct way to store timestamps for a general-purpose container that allows its content to change its time base in the middle. In theory Matroska's timestamps can have sample precision already (just make the global timestamp scale small enough to match all of the tracks' time bases). The problem is the waste of space that follows, due to the bloody 16-bit integer offset in Block & SimpleBlock. So if we're thinking about breaking compatibility anyway, why not think about a whole new SimpleBlock V2 that allows for much larger relative timestamps? That would make all existing players incompatible, though. Another idea that only wastes space but doesn't destroy existing players' ability to play the file: adding a new child to |
It can be 100% exact. It's the rounding error after all - the number that needs to be added to the "classic" ms timestamp to get the fractional timestamp
It seems the denominator of the rounding error is simply the denominator of the original timestamp. E.g. in this case, the rounding error would have denominator 30000 and numerator
What does Matroska do if the codec changes? Transport streams can do that, Matroska can't do that. I feel like bringing up such cases just complicates the whole discussion. You can't fix everything at the same time. But you can stall any progress by wanting to consider every possible future feature and requirement. Besides, as was suggested in a previous post, the denominator part could be overridden per packet. This would cause some bytes of overhead in such obscure cases as mixing multiple framerates that are not known in advance.
I guess you mean the fact that every packet will need its own cluster. But AFAIK that still doesn't give a way to get fractional timestamps? So, not an option.
Obviously not an option. If it were specified, it's likely everyone would disable this by default, except people who use Matroska in special setups where they control producer and consumer.
I thought that was what I proposed here (except I wanted to use fractional numbers). |
PS: I think obsessing about a few bytes per packet isn't useful. Having precise timestamps, even if it introduces overhead, is much more important. Nobody will discard Matroska as an option because it doesn't go to the edge of the theoretically possible for saving overhead. |
True. The difference is that having multiple time bases in the same track is something that exists & works today. I'm really not trying to prevent progress here, and I'm not talking about each and every possible situation. I am talking about one specific situation that is in widespread use today. What I am trying to prevent is implementing a scheme that's supposed to improve one aspect while simultaneously making another aspect worse. Hence me talking about ways to signal a change in time base mid-stream. We'd also have to signal a precise timestamp at the point of change in time base so that the player can reconstruct the whole timeline properly without having to read blocks at each change in time base. |
It seems like there are a few ways discussed to correct this:
All of these would still require the current timestamp to still exist and thus would be compatible with current demuxers but newer demuxers would be able to read/derive more precise timestamps.
Close. When I saw this first suggested, I did some math and figured out what it would be for the case of 44.1kHz AAC audio (this is what really sparked this conversation; see below). In this case the samples are 1024/44100 seconds long with the MKA using 1ms precision on the timestamps and so the error can be expressed as
Also worth noting: the duration would likely need the same treatment. Aside: I sparked this conversation in an internal discussion with @rcombs about AAC 44.1kHz audio in an MKA format. I was remuxing this to MPEG-TS and the MKA had only 1ms precision timestamps. A simple remux would multiply these timestamps by 90 to match MPEG-TS. This simple remux resulted in packets at times 0, 2070, 4140, 6300, 8370 … which gave them effective durations of 0, 2070, 2070, 2160, 2070, and this inconsistency would cause stuttering audio in Apple's HLS demuxer. So this meant that remuxing MKA -> MPEG-TS required opening a codec in lavc to get more precise durations and thus derive timestamps without error. P.S. These imprecise timestamps were one of the more annoying things we had to deal with in Perian's MKV demuxer, and that was over a decade ago. |
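The aside's numbers are easy to reproduce: store each AAC frame time with 1 ms precision, then naively multiply by 90 for MPEG-TS. A sketch (our own, under those assumptions):

```python
from fractions import Fraction

SAMPLES, RATE = 1024, 44100  # AAC frame length and sample rate

def mka_to_mpegts(n_frames: int) -> list:
    """90 kHz timestamps obtained by rounding each frame time to 1 ms
    (as in the MKA file) and multiplying by 90 (the naive remux)."""
    out = []
    for i in range(n_frames):
        exact_s = Fraction(i * SAMPLES, RATE)  # true frame time in seconds
        ms = round(exact_s * 1000)             # 1 ms Matroska timestamp
        out.append(ms * 90)                    # MPEG-TS 90 kHz ticks
    return out
```

This yields [0, 2070, 4140, 6300, 8370], reproducing the uneven effective durations that upset Apple's HLS demuxer.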
The reason not to introduce a SimpleBlock v2 is that any hardware/software player that doesn't know it won't be able to play the files. It can be done in Matroska v5. Such files, being unreadable by older parsers, will also be marked as such. We might as well call it Matroska2 or something, just like WebM shares a lot with Matroska. The practical question is whether there is a convenient way to have precise timestamps in v4 and make it work in existing players (and WebM; I know that's something they want as well).

The question about VFR (Variable Frame Rate) is not really an issue IMO. In the end you only have 1 or 2 frame rates mixed, maybe with the same denominator. All you need is a fraction that handles both. Facebook even created a timebase that covers most common timebases for video. As long as you know the timebases you'll have to deal with before muxing you should be fine.

An important thing to note is that floating point should not be used at all (we want precision). All we have is the Matroska timebase x/1,000,000,000 s (x=TimestampScale) and the source material timebase(s) (a/44100, b/48000, c/24, d*1001/30000, etc.). They are all fractions. So we should be able to find something that works with just fractions, using common denominators, fraction reduction, etc. It can get to large numbers very quickly as there are multiple tracks with different timebases (or odd fractions when a track uses VFR, see above).

What we have now is a timestamp for each Block as a fraction of TimestampScale/1,000,000,000. What we want is a timestamp for each Block as a fraction of the source material. The difference between the two values is still a fraction. We can store this difference as a fraction. And we must also store the source material fraction. Now we just have to do the math to find this "difference as a fraction", in particular to minimize the storage needed to do so if possible (if not, mandating BlockGroup for precise tracks is always an option).
If we can fit it inside the (3) reserved bits of the SimpleBlock it would be perfect. |
ISO/IEC 14496-12 "ISO base media file format" uses a "timescale" (counts per second) and "media sample durations". If timescale=30000 and media sample duration is 1001, you get NTSC fractional frame rate. Similarly, ISO/IEC 14496-10 "Advanced Video Coding" has a clock tick defined as num_units_in_tick divided by the time_scale (see equation C-1). The presence of these in VUI is indicated by the timing_info_present_flag. For NTSC, time_scale may be 30000 and num_units_in_tick may be 1001. |
Following my "pure rational numbers" approach we can say the following, for a Track sampled at the original The real timestamp for each sample S is: The Matroska timestamp for the same sample is The Cluster timestamp is just a value to add to S to get the proper value, so we can skip it for now. The difference between the real timestamp and the one we get from Matroska is:
We can already deduce a few things from this:
That gives some sampling frequencies where it's possible to achieve 0 error per sample:
That leaves out a lot of common ones:
The other way to reduce the error, is to reduce the value of S. We already effectively reduce the value we store to a 16 bits integer, so the value is always between -32,768 and 32,767. If we were to store the error in the remaining 3 bits of a By limiting the possible values of S in a Cluster to A I think the scope where it works, even with the proper muxing guidelines, is too narrow to be worth using all the reserved bits. In particular because common frequencies like 44100 Hz or 30000/1001 fps will introduce errors no matter what and will need to use this system. There could be other clever ways to do this. We could use a bit in the Block that says the timestamp "shift" is stored after/before the Block data, but that would be incompatible with all existing readers. That would be equivalent to using a new lacing format. Another way would be to force using a |
It seems one of the aspects of this not discussed is how the rounding of the current system works and how it could be adapted. We assume that we start with the current system and try to fit the correct fraction in there. We may do it the other way around, i.e. have the fraction and use that to set the Block/Cluster timestamp value. The rounding error is then on older parsers assuming a timestamp value when in fact it's another value. But the old system is already known to be imprecise/inaccurate. It's not assumed to be sample precise. So a little more, a little less rounding error should not be a big deal. What we cannot really do is add some information per-track to modify how the Block/SimpleBlock values are interpreted. That would break backward compatibility. For that we would need BlockV2 and SimpleBlockV2. So we could store the TimestampScale and a fraction that is the actual fraction it's based on. Let's see what happens for 29.97fps video, or 30000/1001 Hz. The most accurate TimestampScale is 33,366,667 (nanoseconds per frame/lace, rounded). We also store the Segment timestamp fraction as {30000, 1001}:
The Old Parser timestamp is the timestamp older parsers would see: Block Value * TimestampScale. The Real timestamp is the one using the fraction: Block Value * 1001 / 30000. For 44100 Hz audio we get the following, with a TimestampScale of 22,676 (nanosecond per frame/lace, rounded).
The difference is less than one sample.
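The per-frame error discussed here can be checked with exact fractions; a sketch (helper name is ours):

```python
from fractions import Fraction

def rounding_residual(index: int, rate_num: int, rate_den: int,
                      timestamp_scale_ns: int) -> Fraction:
    """Exact difference between the true timestamp (index frames of a
    rate_num/rate_den Hz stream) and the nearest whole tick of
    timestamp_scale_ns nanoseconds."""
    true_t = Fraction(index * rate_den, rate_num)  # seconds
    tick = Fraction(timestamp_scale_ns, 10**9)     # seconds
    return true_t - round(true_t / tick) * tick
```

For 30000/1001 fps with TimestampScale 33,366,667 the residual after one frame is exactly -1/3 ns; for 44100 Hz with 22,676 it stays well under one sample.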
We get less than 1 sample error with 47393 frames stored, or 42s worth of samples in a Cluster. The worst case scenario is the highest, not easily divisible, frequency 352800. It gives:
Here we achieve less than one sample of error when there are fewer than 6067 samples in a Cluster. This can be doubled by using signed values for the Block timestamp value. The range to get less than one sample error becomes [-6067,6066]. And by packing samples by at least 11 samples, we always get less than 1 sample error. With 22 samples we get less than half a sample duration error, which should be enough with rounding. So with single track files we can probably achieve sample precision easily. With mixed frequencies it becomes more complicated. For example the 29.97 fps video with the 44100 Hz audio. We have 1001/30000 and 1/44100 so the fraction to use would be 1001/reduced(30000, 44100), where reduced(A, B) is the two numbers multiplied and divided by a common divisor. In this case (30000*44100)/100 = 13230000. That gives a round TimestampScale of 75661 ns/tick. That gives these Blocks:
For the video track we would get something like this:
We can store almost 5 s in a Cluster. For the audio track, on the other hand, we cannot recover each sample easily.
The Block Value doesn't map to an exact sample timestamp (and vice versa). It seems that if we apply a factor of 3 we may get better results. So we could have a Segment fraction of 1001/(3*13230000), with a rounded TimestampScale of 25220 ns/tick.
We lose about 1 sample of precision every 10 samples, or 10%. For a full Block that's about 166ms of shift (or rather half when using signed 16 bits). That's a lot. Even packed at 40 samples per frame that's still about 20ms, when such a frame is 1 ms. If we use the full fraction {1001, 30000*44100} we cannot store more than one video frame per Cluster. There doesn't seem to be a system where it works by storing the Block value as a real fraction value, at least when mixing "heterogeneous" frequencies. It works with single tracks or frequencies that are easily divisible. And not if we want to keep backward compatibility (Block/SimpleBlock). |
A little background on this: for adaptive streaming it's important when you switch from one "quality" (representation) to another to switch exactly at the frame and audio you want. I don't know if they are sample exact for audio, especially as each codec (or different encoding parameters) may pack a different amount of samples per frame. So the boundaries don't totally overlap. Maybe there's an offset that tells on which sample to start. Or an exact clock gives the exact timestamp for each sample in each representation anyway. Given that, the important phrase here is
In adaptive streaming you don't (usually) use muxed tracks. So you can pick each channel independently with the best possible choice at any given time. So in these conditions we can be sample precise. All we need is to tell the original clock (numerator/denominator) of the Track. A new parser would use that value with the I'll send a proposal for new elements to store this fraction and the necessary changes on how to interpret the timestamps. |
The larger problem is that we want a rational number that works for all the tracks (theoretically possible) and at the same time has a sensible value that will not require huge values of the numerator for each timestamp in a Block. We only have 16 bits there. As seen above, in most cases it doesn't work. And that's because we have one global "clock", defining all Block (and Cluster and more) "ticks". We could however alter the interpretation of each Block value to adjust to a better "clock" that works for that track, so that we end up with a better range of values for the numerator. And luckily we already have But just like we introduce a rational number to use instead of the So a Block timestamp would be In a new parser |
So let's take the previous example that didn't work: 29.97 fps video with the 44100 Hz audio. Now the critical part is the Cluster tick value. To have sample accurate values on each Block it also has to provide ticks that are sample accurate for both tracks. In this case a (rational) TimestampScale of 1/(30000 * 441) should do it. There is a slight problem though. The rounded Now what is the magic formula to get the proper rational TimestampScale ( with this value the rational
For video the Block ticks would result in:
It seems we have a system that works well for two tracks. It works just as well with more tracks as long as the GCD of all SamplingFreq Denominators is big enough, resulting in a rounded legacy |
There is a small problem with the audio in the example above: we only get 65536/44100 seconds possible per Cluster. But audio samples are usually packed by a fixed number of samples, or a variable number of samples with a common base number, or even a multiple of 4. That packing unit number can be set as the numerator of the audio |
So what happens when using only the legacy values to compute the timestamps? In the example above, the
The Block value being the integer stored in the Block based on the real timestamp, the In the end, in the whole Cluster, the difference is always less than 11392 ns. That's less than one audio tick. |
We could however add the original clock in each track to give an accurate way for the reader to round the values (ie, get the values from the second column, when the values of the fourth column are computed). We have the |
I did a test program to run different scenarios: all the audio/video sampling frequencies listed above mixed (1 audio/1 video). The program can be found here. The result of the run of this program is found in this dirty Markdown file. In some cases there are some rounding errors that can't be recovered. There are also many cases where the possible duration of audio in a The video errors are always negligible as they occur after very long durations. Durations that are impossible to reach given the duration constraints on the audio. Maybe I can add an extra layer and try the common packing sizes mentioned by @rcombs. But from a first look it seems to solve both the limited duration of audio in a Cluster and the possible rounding errors. |
Following the investigation in #422 it seems like a good tool to achieve sample accurate timestamp in many (more) cases.
After computing the In the end the only errors (half a tick, so the wrong sample/tick would be assumed on the output of the demuxer) occur on video tracks, in rare cases and after a long duration in a Cluster (145 s minimum, which is a lot). The only problem remaining is that the amount of audio samples possible in a Cluster with such small |
In the end the packing problem is directly related to the sampling frequency of the audio. This problem exists regardless of the sample accuracy of timestamps. High sampling frequencies require enough packing of samples to fit a useful duration in a Cluster. This problem aside, we can always use Mixing more than one audio track might cause some problems if the sampling frequencies differ too much (don't fit the GCD). But for 2 tracks it's achievable all the time. |
I made a small calculation error in my tests as the original sample frequency numerator was not used to compute the real timestamp. With examples where the numerator was artificially inflated (1/24 = 1000/24000 = 2000/48000) to try to match audio ranges it gave an incorrect error. In fact in all cases, there is no error on the audio or video tracks. The |
So the real problem left is the amount of audio possible per Cluster. As said before, high sampling frequencies require enough packing of samples to fit a useful duration in a Cluster. This is not a new problem. And audio codecs usually pack samples with a fixed amount of samples (or a few possible fixed values that may change in the same stream). Raw audio can do the same. In this case we can compute the If we consider it should always be possible to store at least 5s of audio per Cluster, then the problem starts at frequencies higher than 13107 Hz (65536 ticks / 5 s). That's pretty much all the time. With packing we don't get the timestamp of each individual sample. We only get the timestamp of the first sample of each pack of samples. But since we know the sampling frequency of the audio ( |
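The 13107 Hz threshold follows directly from the 16-bit Block offset: 65536 ticks per Cluster. A trivial sketch:

```python
from fractions import Fraction

def max_cluster_seconds(tick_hz: int) -> Fraction:
    """Longest Cluster span at a given tick rate: the 16-bit Block
    timestamp offset covers at most 65536 ticks."""
    return Fraction(65536, tick_hz)
```

At 13107 Hz this is just over 5 s; at 44100 Hz it drops to about 1.5 s, hence the need for packing.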
For Variable Framerate of video (I suppose it's rare for audio) there is another problem. There isn't one frequency. For example there might be film source (24 fps) and NTSC video source (29.97 fps) mixed in the same Segment. Or video captures that are sometimes at 60 fps, sometimes 144 fps and sometimes values that just occur when they can (if the game is able to control the V-Sync directly). I suppose other containers will also have a hard time giving an exact timestamp value for each frame. In the case of 2 fixed sources mixed together, it should be possible to accommodate the For too many or too heterogeneous sources there's not really a good solution. But these sources are doomed to never have accurate timestamps anyway. In that case a resolution of 0.1 ms (10000 Hz) should give a good estimate and enough duration (6.5 s) per Cluster. |
Given all this I think #437 is a good all around solution. It may not even require to store the exact fraction of the original (although it's probably needed to remux into other containers).
The proposed solution radically changes that. Almost all the time the |
I think libavformat and the libmatroska based demuxers (including VLC) should handle this properly. That already covers a lot of players, demuxers, muxers.
Most TV/streaming boxes are probably not using libavformat or libmatroska so I'm not sure they handle this properly either. |
The fact that each So I think with all (audio) codec we should mention the number of samples per frame that can be used safely. (sampling frequency / that number = packing frequency). That should be done in the codec specs. |
There is also the question of Cues and Chapters. The timestamps are stored in "absolute" values (which means in nanoseconds). Introducing the So Cues/Chapters referencing a particular frame (or audio block) should use the same value that will come out of the demuxer. The value is not always exactly the same value that was written but the error is small enough not to mistake it for another sample. But since demuxers/players will compare values when seeking it's better to match exactly the value that will be read in the file. Using packed audio with a factor on the |
Actually once you have the timestamp of the first sample, you don't need to know the rest of the timing in the packed audio. It's outside of the container level that the right sample will be picked. The timestamp in Cues/Chapters can either use the real timestamp (in nanosecond) of the sample. Or they could shift it the same way the timestamp of the first sample in the Block is shifted. The difference is a rounding error that in the end will result in the sample referenced. Except for this the container doesn't use the values for exact comparison so it will have no impact there. I think it's better to use the real sample timestamp in that case. |
I modified libavformat to be able to create files with the right values; a dedicated muxer option enables it. Files created this way play in VLC after some changes in VLC and libmatroska. It turns out libmatroska was not as ready for this as expected.
Related libmatroska and VLC patches
Following the investigation in #422 it seems like a good tool to achieve sample-accurate timestamps in many (more) cases.
I think we can close this issue. We also need to provide the number of samples per block per codec, but that's for #439.
Edit: This appears to already have been implemented in #425. I'm fairly sure that the proposed fix is not a fix at all, just a workaround for the underlying issue. Approximating framerates in multiples of nanoseconds will not give you accurate results; instead a rational (integer-fraction) time base is needed.
This permanently fixes the issue of poor time bases in Matroska by using simple math.
This sounds like you think your proposal is a solution to this problem. I disagree:

a) I thought that we were looking for a compatible change, not for something that won't work with older players.

b) Relying on players supporting a deprecated feature that is intended to solve something else and is hardly ever used is bad enough, but you are using it with semantics that differ from the current specifications. The latter point alone is IMO reason enough not to reuse this field.

I think it has already been mentioned earlier, but there is a way to add precise timestamps in a compatible way: add a rational, track-dependent timebase to each TrackEntry and specify a precise way to convert from the time as currently parsed to the exact time. Just round the inexact time to the nearest integral multiple of the track's timebase (the rounding in case it is exactly in the middle of two such integral multiples also needs to be exactly defined, e.g. "always round up in this case").

This procedure has the advantage that we don't need the least common multiple of the timebases of multiple tracks; yet it shares with the other procedures the disadvantage that if a track's timebase is small, then we need to use a small global timebase (it must be smaller than the track's timebase so that every one of the track's possible timestamps is attainable). Of course, if we already know that the content is cfr, then we can use this knowledge to choose a bigger timebase, thereby allowing longer Clusters. This is similar to your proposal.

Of course, we would also need to add a rational analogue of the default duration (or we use a method just like the above for it: round the ordinary default duration to the nearest multiple of the track's timebase).

PS: Sorry for ignoring this issue for so long.
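A small sketch of that rounding procedure, assuming a 1 ms legacy timebase and a cfr 24 fps track timebase (both picked for illustration):

```python
from fractions import Fraction

# Legacy (imprecise) container timebase and per-track rational timebase.
# Recovery works because the worst-case write error (0.5 ms) is far below
# half a frame duration (~20.8 ms).
legacy = Fraction(1, 1000)  # seconds per legacy tick (1 ms)
track = Fraction(1, 24)     # seconds per frame (24 fps)

def recover(written):
    """Round a parsed legacy time to the nearest multiple of the track timebase."""
    return round(written / track) * track

for frame in range(1000):
    exact = frame * track                      # exact presentation time
    written = round(exact / legacy) * legacy   # what a 1 ms muxer stores
    assert recover(written) == exact           # original time recovered exactly
print("all frames recovered exactly")
```

The same `recover` step works per track, so no least-common-multiple across tracks is ever needed; only the legacy timebase must be finer than each track's timebase.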
Correct, it's a workaround that's not purely mathematically correct. But it's compatible with existing players, at least spec-wise. I'm fine with adding a mathematically correct feature in Matroska v5. In the end it will likely be a fractional value of the timescale.
To be honest, I'm not a fan of the proposed method. As we all seem to agree, it's at best a workaround. And here's where I don't like it: workarounds are a tradeoff between different concerns, e.g. the amount of time required to fix it properly vs. the remaining incorrectness of the workaround.

The thing is, the workaround doesn't actually work with existing players out there. While it is comparatively easy to implement support for the method in the most notable software implementations (VLC, ffmpeg/libav*, MKVToolNix), it is completely unrealistic to assume that existing hardware players will ever be updated to support this method, making all files created with this method unplayable on hardware devices. For me this tilts the balance so far in the wrong direction that I'm not in favor of this method.
I agree with @mbunkus, and I thought we would implement the RationalType.
For the record, I'm also fond of the solution @mkver mentions (keeping the backwards-compatible imprecise timestamps in Cluster, Block, etc., but rounding them to the nearest multiple of the per-track rational timebase). This just means that when handling files with very short frames, we'd need to require that the legacy timebase be at least twice as precise as the codec's rational timebase (so e.g. 453514 ns for 48 kHz TrueHD).
Let q and q' be timebases with q finer than q' (i.e. q < q'). Let t be exactly representable in q', i.e. t = k' * q' with integral k', and let t = (k + r) * q with integral k and -0.5 <= r <= 0.5. Then k * q = k' * q' - r * q = (k' + r') * q' with r' = -r * q / q'. Also |r'| < |r| <= 1/2 due to the assumption of q being finer than q'.

This implies that if q' is a timebase in which all timestamps of a given track are exactly representable, if q is a finer timebase than q', and if we use "round-to-nearest" for the transformation from q' to q, then the "round-to-nearest" transformation from q back to q' will exactly recover the original timestamps. It does not matter what value is used in case the nearest timestamp representable in q is not uniquely determined. It also follows that there is a unique nearest value for the transformation from q to q' as long as one restricts oneself to timestamps exactly representable in q that emanate from timestamps in q' by "rounding-to-nearest".

(As an example, consider q = 1/4 ms and q' = 1/3 ms. Then the timestamp 1/3 ms will be rounded to 1 * 1/4 ms and 2/3 ms will be rounded to 3 * 1/4 ms; when transformed back, the original values of 1/3 ms and 2/3 ms are recovered and there is no uniqueness problem. There is one for 2 * 1/4 ms, but no value representable in q' leads to that value.)

It is not necessary for this that q is at least twice as precise as q'. If it is, then there will never be multiple nearest values during the transformation from q' to q; but as seen above, this is not even a problem.

The same reasoning as above also establishes the following variants: if the muxer always rounds up/down, then the timestamps can be exactly recovered if the demuxer always rounds down/up. This is due to r and r' having different signs (if nonzero).
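The 1/4 ms vs. 1/3 ms example can be checked numerically (Python's `round` rounds halves to even, but no exact halves occur for these values, so the choice of tie-breaking doesn't matter here):

```python
from fractions import Fraction

q = Fraction(1, 4)    # finer timebase, in ms
qp = Fraction(1, 3)   # coarser timebase, in ms

def to_nearest(t, base):
    """Round t to the nearest integral multiple of base."""
    return round(t / base) * base

for kp in range(1, 100):
    t = kp * qp                        # exactly representable in q'
    stored = to_nearest(t, q)          # e.g. 1/3 ms -> 1/4 ms, 2/3 ms -> 3/4 ms
    assert to_nearest(stored, qp) == t # round-trip recovers the original exactly
print("round-trip exact for all tested timestamps")
```

Note that the stored tick counts only ever hit values of the form 4m, 4m+1, and 4m+3, which is why the ambiguous 2 * 1/4 ms case never arises from a q'-representable timestamp.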
From the above it follows that the common case of cfr content can be easily supported via an additional rational timebase; moreover, for common codecs, it is not even necessary to remux the files, as the common 1ms timebase is good enough: All that is left to do is add some header elements via mkvpropedit. |
It's pretty well-established that Matroska's poor timebase support is one of the format's worst properties. While it supports very precise timestamps (down to the nanosecond), it's very inefficient to do so (and the resulting values still aren't exact for most input rates), so muxers tend to default to 1 ms timestamps, which can lead to a variety of subtle issues, especially with high-packet-rate streams (e.g. audio) and VFR video content. Muxers can choose rates that are closer to the time base of their inputs (or the packet rate of the content), but exactly how best to do so has always been unclear, and some of the possible options would lead to either worse player behavior or timestamp drift. I'm proposing a format addition to remedy this.
The only actual normative change I propose is this: in addition to the classic nanosecond-denominator time scale, muxers could provide 2 additional integers serving as the numerator and denominator of a rational time base, whose timestamps are required to round to the existing nanosecond-scaled values.
This should be paired with some advice for muxer implementations on how to make use of this feature. This depends on the properties of the input. For reference, here are some examples of the error produced by rounding a variety of common time bases to the nearest nanosecond, scaled by 3 hours (a reasonable target for the duration of a film):
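To put numbers on this, here is a sketch that computes the accumulated error for a few common tick rates, assuming timestamps are generated as tick count times the tick duration rounded to whole nanoseconds (the rates are plausible examples, not an exhaustive reproduction of any table):

```python
from fractions import Fraction

NS = 1_000_000_000
SECONDS = 3 * 3600  # a 3-hour film

def drift_ns(tick_rate):
    """Accumulated error after SECONDS seconds when each timestamp is
    tick_count * (tick duration rounded to whole nanoseconds)."""
    exact_tick = Fraction(NS) / tick_rate   # exact tick duration in ns
    rounded_tick = round(exact_tick)        # what a ns-based muxer stores
    ticks = SECONDS * Fraction(tick_rate)   # tick count after 3 hours
    return float(ticks * (exact_tick - rounded_tick))

print(drift_ns(Fraction(24)))           # 24 fps film: about -86 µs, negligible
print(drift_ns(Fraction(30000, 1001)))  # NTSC video: about -0.1 ms
print(drift_ns(Fraction(90000)))        # MPEG-TS clock: about +108 ms
print(drift_ns(Fraction(48000)))        # raw 48 kHz audio: about +173 ms
```

The pattern is the one described below: video frame rates accumulate microseconds of error over 3 hours, while high-rate tick clocks like 90000 Hz and raw audio sample rates drift past 100 ms.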
As we can see, rounding common video and audio frame rates (including e.g. the least common multiple of 24 and 60 for that VFR case) produces a negligible amount of error over a reasonable duration. This means that for content where all timestamps can reasonably be expressed in integer values of those rates, there would be no significant error over common file durations, even if different streams were muxed with different time bases.
There are a few real-world time bases that would produce significant rounding error (upwards of 100ms) over the course of 3 hours when used in existing players: MPEGTS's 90000Hz, all common raw audio sample rates, and least-common-multiples between integer and NTSC video frame rates. This essentially means that mixing these rates with others would produce significant desync over a reasonable duration for static on-disk content; the same issue could occur when muxing very lengthy content (e.g. streaming).
All of these issues can be addressed in one of the following ways:
This last option is usually unacceptable, but may be fine for files that use codecs that become available after the change is made (and thus are unavoidably non-backwards-compatible anyway).
If combined with clear advice in the spec on how muxers SHOULD (or MAY) decide on time bases for various possible input cases, I think this extension could get actual adoption in muxers and solve one of the format's longest-standing problems.