Optimization potential #138
Comments
Ah, that does sound like SSE and friends could do a good job speeding things up. I guess the most portable way to do that is using GCC intrinsics?
Either GCC intrinsics (as portable as they are), or we check whether OpenCV/XNN already offers optimized versions of those two methods that we could use.
I'm aware that those were written because OpenCV doesn't do them, but happy to be proven wrong if we up the minimum OCV version 😄
Found this Q&A: https://answers.opencv.org/question/211941/how-to-use-simd-feature-of-opencv-as-third-party-library/
I'm also aware that OCV will try to use OpenCL and the GPU if they are available (it collided with TFLite when used in multiple threads!). More research needed!
[edited to add] Found this which looks near for
There is also libvolk for portable SIMD kernels (as used by GNU Radio and others), which is a standard package on many Linuxen.
I don't have any objections to using a library for those particular vectorized kernels, as doing those optimizations ourselves would be more work than is sensible. Also check whether other places in the current code base (e.g. the conv_bias stuff) could benefit from using the library.
Just talked with @BenBE, time for my regularly scheduled one-off maybe-useful comment :)
OpenCV has a well-hidden function for that, blendLinear. Only difficulty here would be that it requires the mask and anti-mask, both as float. I don't actually know if this will be much faster, but they do have vectorised and OCL implementations.
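To make the "mask and anti-mask, both as float" requirement concrete, here is a minimal plain C++ sketch that does not call OpenCV. The names Weights, make_weights and weighted_blend are hypothetical, and the per-pixel formula is only an assumed normalized weighted average of the two sources, not a copy of what blendLinear does internally.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for the two float weight planes.
struct Weights {
    std::vector<float> mask;  // weight of the first source, 0.0 .. 1.0
    std::vector<float> anti;  // weight of the second source, 1.0 - mask
};

// Convert an 8-bit mask (0..255) into float mask and anti-mask planes.
Weights make_weights(const std::vector<uint8_t>& mask8) {
    Weights w;
    w.mask.resize(mask8.size());
    w.anti.resize(mask8.size());
    for (std::size_t i = 0; i < mask8.size(); ++i) {
        const float a = mask8[i] / 255.0f;
        w.mask[i] = a;
        w.anti[i] = 1.0f - a;
    }
    return w;
}

// Assumed per-channel semantics: normalized weighted average, rounded.
uint8_t weighted_blend(uint8_t s1, uint8_t s2, float w1, float w2) {
    assert(w1 + w2 > 0.0f);  // holds whenever w2 == 1 - w1
    const float v = (s1 * w1 + s2 * w2) / (w1 + w2);
    return static_cast<uint8_t>(v + 0.5f);
}
```

The conversion loop in make_weights is exactly the uint8_t to float step whose cost would need measuring before this route can be called faster.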
Really surprised that there is no
OK! Thanks for the pointer towards
If the initial conversion from uint8_t to float takes so much time, there is a good chance we can avoid it when generating the mask in the first place: the output from TFLite is already float, so we currently spend time creating the uint8_t version and could just skip that part …
Yep - when we get down to shaving a few % that's worth it - looking for 10x improvement through SIMD first 😄
So: I tried MMX, loading a pixel at a time (3 channels) into an I went back to the original loop code, separated into its own source file and applied
Doing some proper calculations, it's quite obvious from the hardware architecture alone that loading 3 bytes at a time into one MMX register for processing is not gonna cut it: you'll have tons of unaligned memory accesses and half the pipeline usually unoccupied. Based on the size restrictions introduced by MJPG from the capture device, we know we'll always have a multiple of 16 pixels*, so we could unroll the per-pixel operation 16x and process 48 bytes at a time. If we then use some SSE magic, we get an inner kernel that could perform this blending operation using just 7 registers**.
Proper pipelining of the instructions inside the kernel loop should allow for the actual memory fetches and register assignments to run with minimal congestion.
** Technically we have 8 available for SSE, but with clever ordering of things you might be able to get down to only requiring 7 to hold all (intermediate) data.
*** If you happen to get this move scheduled before loading the second image, you could start linear-blending the source image while still loading the data for the target image.
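For reference, a minimal SSE2 sketch of a blend kernel in this spirit. It is not the kernel described above: it handles 16 bytes per iteration rather than 48, uses unaligned loads, and makes no attempt at the 7-register scheduling. The names blend_px and blend_bytes are hypothetical, and it assumes an 8-bit mask that has already been expanded to one mask byte per channel byte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Scalar reference: round((fg*m + bg*(255-m)) / 255) without a division.
static inline uint8_t blend_px(uint8_t fg, uint8_t bg, uint8_t m) {
    const unsigned x = fg * m + bg * (255u - m) + 128u;
    return static_cast<uint8_t>((x + (x >> 8)) >> 8);
}

// Blend n bytes; mask holds one value per byte of fg/bg (assumption).
void blend_bytes(const uint8_t* fg, const uint8_t* bg, const uint8_t* mask,
                 uint8_t* out, std::size_t n) {
    assert(fg && bg && mask && out);
    std::size_t i = 0;
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    const __m128i c255 = _mm_set1_epi16(255);
    const __m128i c128 = _mm_set1_epi16(128);
    for (; i + 16 <= n; i += 16) {
        const __m128i f = _mm_loadu_si128(reinterpret_cast<const __m128i*>(fg + i));
        const __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(bg + i));
        const __m128i m = _mm_loadu_si128(reinterpret_cast<const __m128i*>(mask + i));
        // Widen each 8-bit lane to 16 bits (low and high halves).
        const __m128i f_lo = _mm_unpacklo_epi8(f, zero), f_hi = _mm_unpackhi_epi8(f, zero);
        const __m128i b_lo = _mm_unpacklo_epi8(b, zero), b_hi = _mm_unpackhi_epi8(b, zero);
        const __m128i m_lo = _mm_unpacklo_epi8(m, zero), m_hi = _mm_unpackhi_epi8(m, zero);
        // x = fg*m + bg*(255-m) + 128; the maximum is 65153, so it fits in 16 bits.
        __m128i x_lo = _mm_add_epi16(
            _mm_add_epi16(_mm_mullo_epi16(f_lo, m_lo),
                          _mm_mullo_epi16(b_lo, _mm_sub_epi16(c255, m_lo))), c128);
        __m128i x_hi = _mm_add_epi16(
            _mm_add_epi16(_mm_mullo_epi16(f_hi, m_hi),
                          _mm_mullo_epi16(b_hi, _mm_sub_epi16(c255, m_hi))), c128);
        // (x + (x >> 8)) >> 8, same as the scalar reference.
        x_lo = _mm_srli_epi16(_mm_add_epi16(x_lo, _mm_srli_epi16(x_lo, 8)), 8);
        x_hi = _mm_srli_epi16(_mm_add_epi16(x_hi, _mm_srli_epi16(x_hi, 8)), 8);
        // Narrow back to 8 bits and store 16 blended bytes.
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), _mm_packus_epi16(x_lo, x_hi));
    }
#endif
    // Scalar tail, and the full path on targets without SSE2.
    for (; i < n; ++i) out[i] = blend_px(fg[i], bg[i], mask[i]);
}
```

Because the vector path performs the same integer arithmetic as blend_px, its output can be compared byte for byte against the scalar loop when timing both.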
Sounds like you know waay more than I ever will @BenBE 😄 I've never looked into SIMD stuff before now, so my feeble attempt was cut/paste/fixup of an example MMX alpha blend function from Stack Overflow. My current PR in #139 reduces the blend time to just under 1ms for the majority of executions (I'm simply using the execution timing built into the app). I would be delighted to see SIMD code based on the above analysis; then we can decide if the additional complexity is worth the time saving?
Just noticed you could even skip the channel reordering, because the pixel scaling is uniform across all channels, which makes the kernel even simpler … Will have to look into the actual GCC intrinsic stuff though.
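One way to read this, as a sketch only: if every channel of a pixel is scaled by the same mask value, the mask can be replicated once per channel byte, and the blend can then run over the interleaved buffer as a flat byte stream without separating channels. The helper name expand_mask is hypothetical, and 3 interleaved channels per pixel are assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Replicate one 8-bit mask value per pixel into one value per channel byte,
// so that a byte-wise blend needs no knowledge of the channel layout.
void expand_mask(const uint8_t* mask, uint8_t* mask3, std::size_t pixels) {
    assert(mask && mask3);
    for (std::size_t p = 0; p < pixels; ++p) {
        const uint8_t m = mask[p];
        mask3[3 * p + 0] = m;
        mask3[3 * p + 1] = m;
        mask3[3 * p + 2] = m;
    }
}
```

This trades an extra pass and three times the mask memory for a simpler inner loop; a real kernel might instead replicate the mask values inside registers.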
Doing some rudimentary profiling I noticed we are wasting a huge amount of the main thread's time in essentially 3 functions:
- alpha_blend (~40%)
- convert_rgb_to_yuyv (~25%)
- cap.retrieve (~20%)

Values are relative to the time spent in main, which takes up roughly the same amount of time as we actually spend processing images (main ~32.75%, bs_maskgen_process ~33.95% of total runtime).

On the positive: bs_maskgen_process spends ~95% of the time waiting for processing by TFLite.

FWIW: Timings are heavily affected by running under callgrind, but looking at the code of alpha_blend and convert_rgb_to_yuyv I'm not really surprised by these results. A rewrite of these functions using some vectorization should yield quite a bit of improvement.
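As a point of comparison for any vectorized rewrite, here is a scalar sketch of an RGB to YUYV conversion. It is not the project's convert_rgb_to_yuyv: the function name rgb_to_yuyv_scalar is hypothetical, and it assumes packed RGB byte order, BT.601 limited-range integer coefficients, and chroma computed from the rounded average of each horizontal pixel pair.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Convert packed RGB (3 bytes per pixel) to packed YUYV (2 bytes per pixel).
// Output layout per pixel pair: Y0 U Y1 V.
void rgb_to_yuyv_scalar(const uint8_t* rgb, uint8_t* yuyv, std::size_t pixels) {
    assert(rgb && yuyv);
    assert(pixels % 2 == 0);  // YUYV shares chroma between two pixels
    for (std::size_t p = 0; p < pixels; p += 2) {
        const uint8_t* a = rgb + 3 * p;  // first pixel of the pair
        const uint8_t* b = a + 3;        // second pixel of the pair
        // Luma per pixel (BT.601, limited range 16..235).
        const int y0 = ((66 * a[0] + 129 * a[1] + 25 * a[2] + 128) >> 8) + 16;
        const int y1 = ((66 * b[0] + 129 * b[1] + 25 * b[2] + 128) >> 8) + 16;
        // Chroma from the rounded average of both pixels.
        const int r = (a[0] + b[0] + 1) >> 1;
        const int g = (a[1] + b[1] + 1) >> 1;
        const int bl = (a[2] + b[2] + 1) >> 1;
        // The +128 offset is folded in as (128 << 8) before the shift,
        // so the shifted value is never negative.
        const int u = (-38 * r - 74 * g + 112 * bl + 128 + (128 << 8)) >> 8;
        const int v = (112 * r - 94 * g - 18 * bl + 128 + (128 << 8)) >> 8;
        uint8_t* o = yuyv + 2 * p;
        o[0] = static_cast<uint8_t>(y0);
        o[1] = static_cast<uint8_t>(u);
        o[2] = static_cast<uint8_t>(y1);
        o[3] = static_cast<uint8_t>(v);
    }
}
```

With these coefficients all results stay within 16..240, so no clamping is needed. Every pixel pair is independent, which is what makes a loop like this a natural candidate for vectorization.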