AV1 OBU Support #28

Open
morgabm opened this issue May 2, 2024 · 4 comments

@morgabm
Contributor

morgabm commented May 2, 2024

Sorry to open an issue for this, but I was wondering if you have considered using work in h264nal and h265nal as the basis for a similar project to parse AV1 obu structures? If so, I may have a need and would be interested in helping with such a project.

Please close this at your discretion.

@chemag
Owner

chemag commented May 2, 2024

Quick answer is yes.

The long answer is: what is the value? I use h265nal and h264nal on a regular basis, mostly to understand the contents of Annex B (h265 or h264) streams. I also script around the tools (e.g. see https://github.com/chemag/itools, which uses h265nal to provide information on the SPS colorimetry of HEIC images).

Now, there are other parsers that do a similar job. First, ffmpeg's BSF (bitstream filter) does much the same thing. I have two main issues with ffmpeg:

  • (a) ffmpeg does not believe in the porcelain/plumbing command separation. All commands are indeed porcelain. You need to get the output, parse it, and hope that there are no changes in the BSF syntax that break your scripts.
  • (b) ffmpeg is painful to extend, for several reasons.

For h265, there is also https://github.com/strukturag/libde265. It is a deeper parser than either h265nal or ffmpeg's BSF. For example, the itools project mentioned above uses a fork of libde265 (https://github.com/chemag/libde265) to analyze the QP values used on a per-CTU, per-frame basis. The main issue IMO is that the author is not interested (see strukturag/libde265#201).

Now, what I would like is a parser that goes in both directions. Basically a way to convert a binary format (preferably defined using a structured format for binary definition, like kaitai) into a structured text format, and back.

Use cases:

  • You start with an Annex B stream, convert it to text, and then cherry-pick the values you are interested in.
  • Sometimes you change one value, and then want to write back to the binary format. For example, I would like to be able to change the color range (SPS header) in an Annex B file, and then experiment with the results.

In order to implement this, the idea is to use an intermediate structured format that allows (a) binary to structured-format conversion, and (b) structured-format to text conversion. For the structured format, I like protobuf, which gives you (b) for free by using protobuf-text (I also like the protobuf text format).

+----------+ Mpeg2TsParser::Parse* +---------------+ protobuf +-----------+
|bin mpegts|---------------------->|protobuf mpegts|<-------->|text mpegts|
+----------+<--------------------- +---------------+          +-----------+
            Mpeg2TsParser::Dump*

This is the Section 4 Figure in https://github.com/chemag/m2pb, where the idea was applied to mpeg2-ts streams.

Now, the Parse/Dump functionality in Mpeg2TsParser is hand-written. I would prefer to use something like https://github.com/kaitai-io/kaitai_struct.git, but last time I checked, kaitai did not have (a) the functionality to implement structures based on the RBSP syntax (or whatever it is called in AV1), (b) a dumper, and (c) a way to produce protobuf structures. If we get this, adding support for a new codec would be as easy as feeding the RBSPs to a generic tool.

@morgabm
Contributor Author

morgabm commented May 2, 2024

I like these thoughts, and while my interests are primarily related to implementing Vulkan video extensions, which mandate that apps handle muxing/demuxing and parsing, I think those use cases would benefit as well from your proposed solution. Is this something that has an established effort anywhere? If not, have you considered owning such a project? Regardless, I also like protobufs for this approach and would be interested in joining efforts.

Have you looked at Hammer? I tried looking at it briefly, and it seems to implement a software-defined intermediate structure. But from the limited documentation I was unable to determine whether it supports defining grammars capable of handling entropy-encoded values such as the ones present in h264/h265.

I have experience with libav and ffmpeg as well but in addition to some of the reasons you mentioned, I take issue with it due to the code base being ancient, and as a result it seems to suffer from many issues including the somewhat ideological nature of its development as well as the horrid documentation.

I think a tool built on modern technologies such as protobuf and modern C++, with an emphasis on code quality and a succinct API, would do very well in this space.

@morgabm
Contributor Author

morgabm commented May 2, 2024

Rereading your last paragraph, it occurs to me that maybe some of my comments were not super clear. Basically, it seems like some existing projects take a similar approach to this problem, but a generic solution may solve this issue for all of these use cases and beyond. One where a developer may bring their own grammar (regardless of the form factor of such a grammar), and within a succinct, lightweight framework provided by this solution they are able to provide any additional complexities needed by their parsing algorithm. Something like protobufs alone does not allow for the control needed to parse dynamic/entropy-encoded values to my knowledge; please correct me if you know differently.
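
To make the concern about dynamically sized values concrete, here is a minimal sketch of unsigned Exp-Golomb coding, the `ue(v)` descriptor used in h264/h265 parameter sets. This assumes that `ue(v)` is the kind of value meant above. The function names and the use of a `'0'`/`'1'` string as the bit source are illustrative choices, not part of h264nal or h265nal.

```python
from typing import Tuple


def read_ue(bits: str, pos: int = 0) -> Tuple[int, int]:
    """Decode one unsigned Exp-Golomb value from a string of '0'/'1'.

    Returns (value, position of the next unread bit).
    """
    leading_zeros = 0
    while bits[pos + leading_zeros] == "0":
        leading_zeros += 1
    # Skip the zeros and the terminating '1', then read as many suffix bits
    # as there were leading zeros.
    start = pos + leading_zeros + 1
    suffix = bits[start:start + leading_zeros]
    value = (1 << leading_zeros) - 1 + (int(suffix, 2) if suffix else 0)
    return value, start + leading_zeros


def write_ue(value: int) -> str:
    """Encode a non-negative integer as an unsigned Exp-Golomb bit string."""
    code = bin(value + 1)[2:]
    return "0" * (len(code) - 1) + code
```

The width of the field is only known after reading its prefix, which is why a fixed-width schema cannot describe it and the parser needs a small amount of custom logic per descriptor.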

@chemag
Owner

chemag commented May 5, 2024

> I like these thoughts, and while my interests are primarily related to implementing Vulkan video extensions, which mandate that apps handle muxing/demuxing and parsing, I think those use cases would benefit as well from your proposed solution. Is this something that has an established effort anywhere? If not, have you considered owning such a project? Regardless, I also like protobufs for this approach and would be interested in joining efforts.

I'm not sure exactly what you are describing here ("video extensions which mandate apps handle muxing/demuxing and parsing"), or how h265nal (or an AV1 parser) would fit in. These parsers are just binary-to-text converters. Right now we're just printing the text values we convert to. My idea is to get the reverse conversion, so as to allow editing the binary streams (Annex B).

> Have you looked at Hammer? I tried looking at it briefly, and it seems to implement a software-defined intermediate structure. But from the limited documentation I was unable to determine whether it supports defining grammars capable of handling entropy-encoded values such as the ones present in h264/h265.

I took a look at it. It looks like a layer that facilitates the traditional C parsing approach by defining a series of functions so that you do not have to read byte by byte, and then do hton/ntoh[ls] conversions.

What I really want is to be able to feed in a video bitstream syntax. The MPEG formats use the acronym "RBSP", and AV1's is very similar. For example, for AV1, I'd like to start with all the syntax definitions, e.g.

obu_header() {
  obu_forbidden_bit  : f(1)
  obu_type  : f(4) 
  obu_extension_flag  : f(1) 
  obu_has_size_field  : f(1)
  obu_reserved_1bit  : f(1) 
  if (obu_extension_flag == 1) {
    obu_extension_header() 
  }
}

This should autogenerate (a) a parser that accepts a raw (Annex B) AV1 stream and produces a set of protobuf objects representing OBUs and the descendent objects, and (b) a dumper that does the opposite operation. That means that the only work for creating an AV1 parser would be to collect the whole list of syntax definitions from the standard, and then write the skeleton of a full parser/dumper.
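
As a rough illustration of what the generated parser/dumper pair could look like for the `obu_header()` syntax above, here is a hand-written sketch. It is an assumption about the shape of the output, not existing code: the `ObuHeader` class and both function names are hypothetical, a plain dataclass stands in for the protobuf object, and the optional `obu_extension_header()` is left out.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ObuHeader:
    """Stand-in for a generated message; fields follow obu_header()."""
    obu_forbidden_bit: int
    obu_type: int
    obu_extension_flag: int
    obu_has_size_field: int
    obu_reserved_1bit: int


def parse_obu_header(data: bytes) -> ObuHeader:
    """Binary to object: unpack the first byte, most significant bit first."""
    b = data[0]
    return ObuHeader(
        obu_forbidden_bit=(b >> 7) & 0x1,   # f(1)
        obu_type=(b >> 3) & 0xF,            # f(4)
        obu_extension_flag=(b >> 2) & 0x1,  # f(1)
        obu_has_size_field=(b >> 1) & 0x1,  # f(1)
        obu_reserved_1bit=b & 0x1,          # f(1)
    )


def dump_obu_header(h: ObuHeader) -> bytes:
    """Object to binary: the exact inverse of parse_obu_header()."""
    b = (
        (h.obu_forbidden_bit << 7)
        | (h.obu_type << 3)
        | (h.obu_extension_flag << 2)
        | (h.obu_has_size_field << 1)
        | h.obu_reserved_1bit
    )
    return bytes([b])


# Parse, change one value, write back.
header = parse_obu_header(b"\x12")
edited = replace(header, obu_type=1)
new_bytes = dump_obu_header(edited)
```

The parse, edit, dump sequence at the end is the round trip described earlier in the thread: unchanged fields survive bit-exact and only the edited field differs in the output.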

The closest thing I've seen for this is the kaitai syntax. Syntax is not very nice IMO, but e.g. it allows defining an ethernet header like this:

meta:
  id: ethernet_frame
  license: CC0-1.0
  ks-version: 0.7
  imports:
    - ipv4_packet
    - ipv6_packet
seq:
  - id: dst_mac
    size: 6
  - id: src_mac
    size: 6
  - id: ether_type
    type: u2be
    enum: ether_type_enum
  - id: body
    size-eos: true
    type:
      switch-on: ether_type
      cases:
        'ether_type_enum::ipv4': ipv4_packet
        'ether_type_enum::ipv6': ipv6_packet
-includes:
  - ipv4_packet.ksy
enums:
  # http://www.iana.org/assignments/ieee-802-numbers/ieee-802-numbers.xhtml
  ether_type_enum:
    0x0800: ipv4
    0x0801: x_75_internet
    0x0802: nbs_internet
    0x0803: ecma_internet
    0x0804: chaosnet
    0x0805: x_25_level_3
    0x0806: arp
    0x86dd: ipv6
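
For comparison, the hand-written equivalent of the `dst_mac`, `src_mac`, `ether_type`, and `body` fields in the definition above might look like the sketch below. The function name and the dictionary layout are illustrative assumptions; the `!` in the format string requests network (big-endian) byte order, which is the step that `ntohs` performs in the traditional C approach.

```python
import struct


def parse_ethernet_header(frame: bytes) -> dict:
    """Split an Ethernet frame into its fixed 14-byte header and the body."""
    # 6-byte destination MAC, 6-byte source MAC, 16-bit big-endian EtherType.
    dst_mac, src_mac, ether_type = struct.unpack_from("!6s6sH", frame, 0)
    return {
        "dst_mac": dst_mac.hex(":"),
        "src_mac": src_mac.hex(":"),
        "ether_type": ether_type,
        "body": frame[14:],  # counterpart of "size-eos: true"
    }


frame = bytes.fromhex("ffffffffffff" "001122334455" "0800") + b"payload"
parsed = parse_ethernet_header(frame)
```

Every new format written this way needs its own function, and a second one for the reverse direction, which is the duplication a declarative definition is meant to avoid.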

I think a better syntax would allow a more generic if/then mechanism to drive the parser or set default values. In fact, I think a good solution will start with a better syntax for a language like this.
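
One way to picture such an if/then mechanism is a syntax table in which every field may carry a predicate over the fields parsed so far, interpreted by a single generic parser and a single generic dumper. The sketch below is only that, a sketch: the table layout and the `parse`/`dump` names are hypothetical, the three extension fields are taken from the AV1 specification's `obu_extension_header()` rather than from this thread, and the nested structure is flattened into the parent for brevity.

```python
from typing import Callable, Dict, List, Optional, Tuple

Fields = Dict[str, int]
# Each entry: (field name, width in bits, optional "if" predicate).
Syntax = List[Tuple[str, int, Optional[Callable[[Fields], bool]]]]


def has_extension(f: Fields) -> bool:
    return f["obu_extension_flag"] == 1


OBU_HEADER: Syntax = [
    ("obu_forbidden_bit", 1, None),
    ("obu_type", 4, None),
    ("obu_extension_flag", 1, None),
    ("obu_has_size_field", 1, None),
    ("obu_reserved_1bit", 1, None),
    ("temporal_id", 3, has_extension),
    ("spatial_id", 2, has_extension),
    ("extension_header_reserved_3bits", 3, has_extension),
]


def parse(syntax: Syntax, data: bytes) -> Fields:
    """Walk the table, reading only the fields whose predicate holds."""
    bits = int.from_bytes(data, "big")
    remaining = len(data) * 8
    fields: Fields = {}
    for name, width, cond in syntax:
        if cond is not None and not cond(fields):
            continue
        remaining -= width
        fields[name] = (bits >> remaining) & ((1 << width) - 1)
    return fields


def dump(syntax: Syntax, fields: Fields) -> bytes:
    """Walk the same table in the same order, writing instead of reading."""
    acc = 0
    nbits = 0
    for name, width, cond in syntax:
        if cond is not None and not cond(fields):
            continue
        acc = (acc << width) | fields[name]
        nbits += width
    acc <<= -nbits % 8  # zero-pad up to the next byte boundary
    return acc.to_bytes((nbits + 7) // 8, "big")
```

Because both directions are driven by the same table, adding a field or a condition in one place keeps the parser and the dumper in sync.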

> Rereading your last paragraph, it occurs to me that maybe some of my comments were not super clear. Basically, it seems like some existing projects take a similar approach to this problem, but a generic solution may solve this issue for all of these use cases and beyond. One where a developer may bring their own grammar (regardless of the form factor of such a grammar), and within a succinct, lightweight framework provided by this solution they are able to provide any additional complexities needed by their parsing algorithm. Something like protobufs alone does not allow for the control needed to parse dynamic/entropy-encoded values to my knowledge; please correct me if you know differently.

The parsing process needs to produce something that can be operated upon. My idea of "operation" includes getting text-based versions of that something (so we get the binary-to-text conversion feature), editing that something (changing values, removing items, etc.), and writing back to binary.

  • ffmpeg BSF, for example, just prints the values it reads. That's a limited part of what I'm looking for.
  • We could use ad hoc C/C++ structures here. That would likely be the most efficient set of objects to hold the parsed binary objects. Then we'd have to implement a mechanism to autogenerate the object-to-text and text-to-object features.
  • What protobuf gives you here is free object-to-text and text-to-object features, at the cost of heavy objects: protobuf objects allow for serialization, which is something I haven't found a need for here.
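
To show what the object-to-text and text-to-object pair involves when written by hand, here is a minimal sketch that uses a dataclass and a `name: value` line format as a stand-in for protobuf and its text format. The `ColorInfo` class, its field selection, and the `to_text`/`from_text` helpers are all hypothetical and only handle integer fields.

```python
from dataclasses import asdict, dataclass


@dataclass
class ColorInfo:
    """Hypothetical parsed object holding a few colorimetry fields."""
    video_full_range_flag: int
    colour_primaries: int
    transfer_characteristics: int
    matrix_coeffs: int


def to_text(obj) -> str:
    """Object to text: one 'name: value' line per field."""
    return "".join(f"{name}: {value}\n" for name, value in asdict(obj).items())


def from_text(cls, text: str):
    """Text to object: parse 'name: value' lines back into cls."""
    values = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, _, value = line.partition(":")
        values[name.strip()] = int(value)
    return cls(**values)


original = ColorInfo(0, 1, 1, 1)
text = to_text(original)
# Edit one value in the text form, then read it back as an object.
edited_text = text.replace("video_full_range_flag: 0", "video_full_range_flag: 1")
edited = from_text(ColorInfo, edited_text)
```

With real protobuf messages this pair does not need to be written at all: in the Python bindings, `google.protobuf.text_format` provides `MessageToString` and `Parse` for any generated message type.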

The usual processing in my case does not have strict performance requirements: I typically use small Annex B streams, so I don't mind paying the overhead that protobufs impose.
