
Conceptualizing subtitles

Alexander Cerutti edited this page Feb 28, 2023 · 8 revisions

To understand something, our mind needs to build an image of how it works. This image can come from reading, from having things explained, or from looking at pictures. Sometimes even well-known concepts need to be explained from scratch so that new points of view can be unlocked. We are going all in and using all of these approaches 😄 🚀.

Subtitles can be seen as a sequence of phrases and words distributed non-linearly over time. These words are usually associated with the audio track, although this depends on the media content. In theory, we can draw a distinction:

  • Subtitles are for people who do not understand a language and therefore need some support to understand it;
  • Captions are for people who have hearing issues and cannot hear the audio correctly. For them, additional details might be required.

Technically speaking, both are rendered in the same way: the only difference is the amount of information they carry. Captions might include more details, such as sound descriptions. Hence, the difference lies in the "data" provided to be shown. The typical situation can be idealized with the scheme below:

[Figure: subtitles and captions sets]

Once the data is identified as something more than random binary data, it can be seen as a "track": a sequence of data with a specific meaning that is subsequently ordered.

Tracks are therefore composed of a set of cues, the phrases and words mentioned above, distributed non-linearly over time.

Each cue has its properties, which may or may not differ from the properties of another cue, but the most important ones are the starting time, ending time and the content itself.

It comes naturally to think, then, that some cues may overlap and be shown at the same time, or with some time offset from the previous one.

We can draw them out. What do we get? A timeline like the ones in video editing software.
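To make the idea of a cue concrete, here is a minimal sketch. The `Cue` field names, the millisecond unit, and the `overlaps` helper are assumptions made for illustration, not names taken from any specific library. Intervals are treated as closed, so two cues that merely touch count as overlapping.

```typescript
// Hypothetical cue shape: field names and units are assumptions.
interface Cue {
  startTime: number; // milliseconds from the start of the media
  endTime: number;   // milliseconds from the start of the media
  content: string;
}

// Two cues overlap when each one starts before (or exactly when) the other ends.
function overlaps(a: Cue, b: Cue): boolean {
  return a.startTime <= b.endTime && b.startTime <= a.endTime;
}

const hello: Cue = { startTime: 0, endTime: 2000, content: "Hello" };
const world: Cue = { startTime: 1500, endTime: 3000, content: "world" };
const later: Cue = { startTime: 5000, endTime: 6000, content: "Later" };

const helloOverlapsWorld = overlaps(hello, world); // true: both are visible between 1500 and 2000
const helloOverlapsLater = overlaps(hello, later); // false: "Hello" ends long before "Later" starts
```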

[Figure: timeline]

So, the principle is that at any given point in time we must show all the cues belonging to that moment. Hence, we need a suitable data structure that can store cues correctly and lets us retrieve only the ones we are looking for at a specific time.
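As a baseline, the requirement can be met with a plain linear scan. The sketch below (with a hypothetical `cuesAt` helper and an assumed `Cue` shape) checks every cue on every lookup, which is exactly the cost a more suitable data structure tries to avoid.

```typescript
// Hypothetical cue shape: field names and units are assumptions.
interface Cue {
  startTime: number; // milliseconds
  endTime: number;   // milliseconds
  content: string;
}

// Naive lookup: walk the whole track and keep the cues active at `time`.
function cuesAt(track: Cue[], time: number): Cue[] {
  return track.filter((cue) => cue.startTime <= time && time <= cue.endTime);
}

const track: Cue[] = [
  { startTime: 0, endTime: 2000, content: "Hello" },
  { startTime: 1500, endTime: 3000, content: "world" },
  { startTime: 5000, endTime: 6000, content: "Later" },
];

const visibleAt1800 = cuesAt(track, 1800).map((cue) => cue.content); // ["Hello", "world"]
const visibleAt4000 = cuesAt(track, 4000).map((cue) => cue.content); // []
```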

Interval Binary Tree

After some research, the most suitable structure seemed to be the Interval Binary Tree (IBT) because, unlike in normal binary trees, each node (or leaf) owns a starting point and an ending point. Each node also stores a maximum value, the highest ending point found in its subtree, so it can easily tell whether its subtree is worth querying for possible cues.

In the IBT, we navigate to the left child if the new node's low (in our case, the startTime) is lower than the current node's low. Otherwise, we navigate to the right. This continues until there are no more elements to navigate; that is where the new node gets positioned.

[Figure: timeline-ibt]

Each node owns a "max value", which is the maximum between its own high and the high of every node inserted below it (so not necessarily its direct child). Hence, if a new node with a higher ending point is inserted, all of its parent nodes are updated with the new maximum.
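The insertion rule and the max update can be sketched as follows. This is a minimal, unbalanced version written for illustration: `IntervalNode`, `insert`, and the sample cues are assumptions, not the actual implementation or the values from the figure.

```typescript
// Hypothetical shapes: names and units are assumptions.
interface Cue {
  startTime: number; // the node's "low", in milliseconds
  endTime: number;   // the node's "high", in milliseconds
  content: string;
}

interface IntervalNode {
  cue: Cue;
  max: number; // highest endTime in this node's subtree, itself included
  left: IntervalNode | null;
  right: IntervalNode | null;
}

function insert(node: IntervalNode | null, cue: Cue): IntervalNode {
  if (node === null) {
    // Nothing left to navigate: the new node is positioned here.
    return { cue, max: cue.endTime, left: null, right: null };
  }

  if (cue.startTime < node.cue.startTime) {
    node.left = insert(node.left, cue);
  } else {
    node.right = insert(node.right, cue);
  }

  // Every node on the path from the root refreshes its max.
  node.max = Math.max(node.max, cue.endTime);
  return node;
}

// Deliberately unordered input.
const cues: Cue[] = [
  { startTime: 10000, endTime: 15000, content: "A" },
  { startTime: 5000, endTime: 8000, content: "B" },
  { startTime: 12000, endTime: 20000, content: "C" },
  { startTime: 16000, endTime: 17000, content: "D" },
];

let root: IntervalNode | null = null;
for (const cue of cues) {
  root = insert(root, cue);
}
// Resulting shape: A is the root, B is its left child, C its right child,
// and D the right child of C. A.max is 20000 because C ends later than A.
```

Inserting the same cues sorted by `startTime` would only ever take the right branch, which produces the right-only tree typical of ordered input.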

This structure protects us from unordered cues. It follows that if all the cues are ordered and perfectly sequential, a right-only IBT is created.

When querying the IBT in the example for the range [ 15.300, 18.100 ] (that is, from 15 s 300 ms to 18 s 100 ms), the search proceeds only to the right and retrieves the root and its first child on the right.
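A range query of this kind can be sketched as below. The node shape, the `insert` and `query` helpers, and the cue timings are made up for illustration and are not the values from the figure. The query skips a subtree when its `max` ends before the range starts, and skips the right side when the current node already starts after the range ends.

```typescript
// Hypothetical node shape: names and units (milliseconds) are assumptions.
interface IntervalNode {
  low: number;
  high: number;
  max: number; // highest `high` in this node's subtree
  left: IntervalNode | null;
  right: IntervalNode | null;
}

function insert(node: IntervalNode | null, low: number, high: number): IntervalNode {
  if (node === null) {
    return { low, high, max: high, left: null, right: null };
  }
  if (low < node.low) {
    node.left = insert(node.left, low, high);
  } else {
    node.right = insert(node.right, low, high);
  }
  node.max = Math.max(node.max, high);
  return node;
}

function query(
  node: IntervalNode | null,
  low: number,
  high: number,
  found: IntervalNode[] = [],
): IntervalNode[] {
  // Nothing in this subtree ends at or after `low`: no overlap is possible.
  if (node === null || node.max < low) {
    return found;
  }

  query(node.left, low, high, found);

  if (node.low <= high && low <= node.high) {
    found.push(node);
  }

  // Everything on the right starts at or after node.low, so it is only
  // worth visiting when node.low is not already past the end of the range.
  if (node.low <= high) {
    query(node.right, low, high, found);
  }
  return found;
}

// Ordered, sequential cues: this builds a right-only tree.
const timings: Array<[number, number]> = [
  [15000, 17000],
  [17500, 19000],
  [19500, 21000],
  [22000, 23000],
];

let root: IntervalNode | null = null;
for (const [low, high] of timings) {
  root = insert(root, low, high);
}

// Range [15.300, 18.100] expressed in milliseconds.
const matches = query(root, 15300, 18100).map((node) => [node.low, node.high]);
// matches: [[15000, 17000], [17500, 19000]], the root and its first right child
```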