-
Notifications
You must be signed in to change notification settings - Fork 630
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
convert dwa encoder to use algorithm quantize #1948
convert dwa encoder to use algorithm quantize #1948
Conversation
Convert the quantization from a massive table to a bit-fiddling algorithm which is more code but significantly faster (in some cases, almost 2x). This also eliminates the giant lookup tables from the binary, which reduces the compiled size of the core library by half on x86, with avx2 / f16c enabled anyway. Signed-off-by: Kimball Thurston <[email protected]>
Signed-off-by: Kimball Thurston <[email protected]>
Signed-off-by: Kimball Thurston <[email protected]>
Signed-off-by: Kimball Thurston <[email protected]>
Signed-off-by: Kimball Thurston <[email protected]>
also propagate cross platform function fixes to test suite Signed-off-by: Kimball Thurston <[email protected]>
Signed-off-by: Kimball Thurston <[email protected]>
@@ -103,12 +103,68 @@ static exr_result_t LossyDctDecoder_execute ( | |||
// value in the buffer - with the index into zig zag | |||
// order data. If we return 0, we have DC only data. | |||
// | |||
static int LossyDctDecoder_unRleAc ( | |||
// | |||
// This is assuminging that halfZigBlock is zero'ed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"assuminging"
@@ -215,7 +271,7 @@ LossyDctDecoder_base_construct ( | |||
d->_toLinear = toLinear; | |||
d->_width = width; | |||
d->_height = height; | |||
if (d->_toLinear == NULL) d->_toLinear = dwaCompressorNoOp; | |||
//if (d->_toLinear == NULL) d->_toLinear = dwaCompressorNoOp; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this commenting out intentional
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, yeah, I missed fully removing that. I also removed the no-op table in favor of an outer loop conditional to dispatch to do nothing instead of always looking through an identity table.
I think, based on the code that on non-intel the giant look up tables are eliminated as well, just checking because I could read this comment as the giant lookup tables still exist on not-intel. I'm asking though, because maybe some detail is hidden in platform switches not in the diff. We still have the giant table for unit testing. Do we need any touch on the cmake script to govern the creation of the table according to whether testing is enabled? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm approving on the assumption that the elimination of the tables is cross platform, not intel-only.
Happy New Year!
Happy New Year! Yes, sorry, bad wording: this is entirely a cross-platform / architecture change, just removing 1.3MB of binary tables. I mentioned avx2 / f16c only to indicate that is where the performance numbers I might quote are indicative of (my older core i9-9900k). I have not yet been able to test on newer / alternate hardware, but expect there to be a good boost for all systems with increasingly disparate memory speed vs. instruction execution. This is could be viewed as a step towards enabling a gpu version of this, or easier implementation in an embedded system, as examples, not that I have explicit plans to do any of that. This is not fully branchless unfortunately, as I could not work out the math yet to reason about the error tolerance being full 32-bit float precision, but wanting to quantize a 16-bit float. This code replicates the results precisely for the allowed range of quantization levels (0-100) for DWAA/B. However, if we were to change the error tolerance to be to a half float value, we could greatly simplify the code (faster still) and I believe have it be fully deterministic and mostly branchless (the odd check for inf / nan). But I didn't want to change the output values just yet... |
AFAIK, we long ago removed rules to create these files, only leaving the code which did so as reference, which I have not changed.... |
Signed-off-by: Kimball Thurston <[email protected]>
Convert the quantization from a massive table to a bit-fiddling algorithm which is more code but significantly faster (in some cases, almost 2x). This also eliminates the giant lookup tables from the binary, which reduces the compiled size of the core library by half on x86, with avx2 / f16c enabled anyway.
This will help with the dwa performance problems noted in #1915 where the compressor got much slower