Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deprecate XorBytes #315

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Deprecate XorBytes #315

wants to merge 1 commit into from

Conversation

jech
Copy link
Member

@jech jech commented Sep 20, 2024

Now that we no longer support Go 1.19, we might as well deprecate XorBytes, remove the ARM-32 assembler version, and use the stdlib version instead.

 go.mod                               |   2 +-
 utils/xor/{xor_generic.go => xor.go} |   4 +-
 utils/xor/xor_arm.go                 |  60 ------------------
 utils/xor/xor_arm.s                  | 116 -----------------------------------
 utils/xor/xor_old.go                 |  77 -----------------------
 5 files changed, 3 insertions(+), 256 deletions(-)

This should not change anything on architectures other than 32-bit ARM. The only concern is that Go does not implement assembler code on 32-bit ARM (see https://go-review.googlesource.com/c/go/+/409394, which @adriancable never finished). The XorBytes benchmarks look pretty horrible (note that only the aligned versions are relevant to Pion):

name                        old time/op    new time/op    delta
XORAligned/8Bytes              107ns ± 0%     307ns ± 0%  +187.59%  (p=0.008 n=5+5)
XORAligned/128Bytes            148ns ± 0%     504ns ± 0%  +239.53%  (p=0.008 n=5+5)
XORAligned/2048Bytes          1.35µs ± 0%    3.43µs ± 0%  +153.87%  (p=0.016 n=5+4)
XORAligned/32768Bytes         22.1µs ± 1%    58.7µs ± 1%  +165.86%  (p=0.008 n=5+5)
XORUnalignedDst/8Bytes         116ns ± 0%     227ns ± 0%   +95.78%  (p=0.008 n=5+5)
XORUnalignedDst/128Bytes       164ns ± 0%     972ns ± 0%  +494.43%  (p=0.008 n=5+5)
XORUnalignedDst/2048Bytes     1.98µs ± 0%   12.68µs ± 0%  +541.67%  (p=0.016 n=4+5)
XORUnalignedDst/32768Bytes    32.7µs ± 1%   209.6µs ± 1%  +540.19%  (p=0.008 n=5+5)

name                        old speed      new speed      delta
XORAligned/8Bytes           75.0MB/s ± 0%  26.1MB/s ± 0%   -65.23%  (p=0.008 n=5+5)
XORAligned/128Bytes          862MB/s ± 0%   254MB/s ± 0%   -70.55%  (p=0.008 n=5+5)
XORAligned/2048Bytes        1.52GB/s ± 0%  0.60GB/s ± 0%   -60.61%  (p=0.016 n=5+4)
XORAligned/32768Bytes       1.49GB/s ± 1%  0.56GB/s ± 1%   -62.39%  (p=0.008 n=5+5)
XORUnalignedDst/8Bytes      69.2MB/s ± 0%  35.3MB/s ± 0%   -48.97%  (p=0.016 n=4+5)
XORUnalignedDst/128Bytes     782MB/s ± 0%   132MB/s ± 0%   -83.18%  (p=0.008 n=5+5)
XORUnalignedDst/2048Bytes   1.04GB/s ± 0%  0.16GB/s ± 0%   -84.42%  (p=0.016 n=4+5)
XORUnalignedDst/32768Bytes  1.00GB/s ± 1%  0.16GB/s ± 1%   -84.38%  (p=0.008 n=5+5)

However, 32-bit ARM does not implement hardware crypto, so once you factor in the cost of encryption, the difference is just 10% for CTR (which is seldom negotiated nowadays), and, as expected, no difference for GCM (which is used by modern browsers):

name                        old time/op    new time/op    delta
GenerateCounter                175ns ± 0%     192ns ± 0%   +9.91%  (p=0.008 n=5+5)
EncryptRTP/CTR-100            28.8µs ± 1%    30.7µs ± 0%   +6.47%  (p=0.008 n=5+5)
EncryptRTP/CTR-1000            132µs ± 1%     147µs ± 1%  +11.35%  (p=0.008 n=5+5)
EncryptRTP/GCM-100            35.9µs ± 5%    35.6µs ± 1%     ~     (p=0.841 n=5+5)
EncryptRTP/GCM-1000            214µs ± 1%     213µs ± 0%     ~     (p=0.095 n=5+5)
EncryptRTPInPlace/CTR-100     26.8µs ± 1%    28.4µs ± 0%   +5.70%  (p=0.008 n=5+5)
EncryptRTPInPlace/CTR-1000     120µs ± 2%     133µs ± 0%  +10.85%  (p=0.008 n=5+5)
EncryptRTPInPlace/GCM-100     33.5µs ± 1%    33.6µs ± 1%     ~     (p=0.151 n=5+5)
EncryptRTPInPlace/GCM-1000     199µs ± 1%     199µs ± 1%     ~     (p=0.421 n=5+5)
DecryptRTP/CTR                18.7µs ± 1%    20.1µs ± 1%   +7.55%  (p=0.008 n=5+5)
DecryptRTP/GCM                16.8µs ± 0%    17.1µs ± 2%   +2.10%  (p=0.008 n=5+5)
Write/CTR-100                 31.4µs ± 0%    33.5µs ± 1%   +6.75%  (p=0.008 n=5+5)
Write/CTR-1000                 124µs ± 1%     138µs ± 1%  +11.53%  (p=0.008 n=5+5)
Write/GCM-100                 37.5µs ± 1%    38.0µs ± 0%   +1.41%  (p=0.008 n=5+5)
Write/GCM-1000                 203µs ± 0%     204µs ± 0%   +0.24%  (p=0.032 n=5+5)
WriteRTP/CTR-100              27.1µs ± 2%    29.6µs ± 1%   +9.23%  (p=0.008 n=5+5)
WriteRTP/CTR-1000              120µs ± 1%     134µs ± 1%  +12.38%  (p=0.008 n=5+5)
WriteRTP/GCM-100              34.0µs ± 1%    33.8µs ± 1%     ~     (p=0.222 n=5+5)
WriteRTP/GCM-1000              200µs ± 1%     198µs ± 0%   -0.65%  (p=0.016 n=5+4)

name                        old speed      new speed      delta
EncryptRTP/CTR-100          3.88MB/s ± 1%  3.65MB/s ± 0%   -6.08%  (p=0.008 n=5+5)
EncryptRTP/CTR-1000         7.66MB/s ± 1%  6.88MB/s ± 1%  -10.19%  (p=0.008 n=5+5)
EncryptRTP/GCM-100          3.12MB/s ± 5%  3.15MB/s ± 1%     ~     (p=0.754 n=5+5)
EncryptRTP/GCM-1000         4.72MB/s ± 1%  4.75MB/s ± 0%     ~     (p=0.206 n=5+5)
EncryptRTPInPlace/CTR-100   4.17MB/s ± 1%  3.95MB/s ± 0%   -5.37%  (p=0.008 n=5+5)
EncryptRTPInPlace/CTR-1000  8.47MB/s ± 0%  7.61MB/s ± 0%  -10.13%  (p=0.016 n=4+5)
EncryptRTPInPlace/GCM-100   3.35MB/s ± 1%  3.33MB/s ± 1%     ~     (p=0.214 n=5+5)
EncryptRTPInPlace/GCM-1000  5.09MB/s ± 1%  5.08MB/s ± 1%     ~     (p=0.452 n=5+5)
DecryptRTP/CTR              1.50MB/s ± 1%  1.39MB/s ± 0%   -7.21%  (p=0.000 n=5+4)
DecryptRTP/GCM              2.03MB/s ± 0%  1.98MB/s ± 2%   -2.07%  (p=0.008 n=5+5)
Write/CTR-100               3.57MB/s ± 0%  3.34MB/s ± 1%   -6.44%  (p=0.016 n=4+5)
Write/CTR-1000              8.17MB/s ± 1%  7.33MB/s ± 1%  -10.35%  (p=0.008 n=5+5)
Write/GCM-100               2.99MB/s ± 1%  2.95MB/s ± 0%   -1.40%  (p=0.016 n=5+5)
Write/GCM-1000              4.98MB/s ± 0%  4.97MB/s ± 0%   -0.28%  (p=0.048 n=5+5)
WriteRTP/CTR-100            4.13MB/s ± 2%  3.78MB/s ± 1%   -8.48%  (p=0.008 n=5+5)
WriteRTP/CTR-1000           8.46MB/s ± 0%  7.53MB/s ± 1%  -11.01%  (p=0.008 n=5+5)
WriteRTP/GCM-100            3.30MB/s ± 0%  3.31MB/s ± 1%     ~     (p=0.183 n=5+5)
WriteRTP/GCM-1000           5.07MB/s ± 1%  5.10MB/s ± 0%     ~     (p=0.079 n=5+4)

Copy link

codecov bot commented Sep 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.99%. Comparing base (e8ac9ae) to head (e62d789).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #315      +/-   ##
==========================================
- Coverage   83.03%   82.99%   -0.04%     
==========================================
  Files          39       39              
  Lines        2729     2729              
==========================================
- Hits         2266     2265       -1     
- Misses        335      336       +1     
  Partials      128      128              
Flag Coverage Δ
go 82.90% <ø> (-0.04%) ⬇️
wasm 65.02% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Now that we no longer support Go 1.19, we might as well
deprecate XorBytes and use the stdlib version instead.
@jech jech force-pushed the deprecate-xorbytes branch from cc2a84f to e62d789 Compare September 20, 2024 09:56
@adriancable
Copy link
Contributor

adriancable commented Sep 20, 2024

@jech - it's a shame that the Go team weren't interested in picking up my PR for XorBytes. It just stalled after I completed the submission and as far as I can tell it was never reviewed. I would be happy to hear any suggestions you may have to get it picked up.

Our application (which runs on 32-bit ARM, still very common for embedded systems) is connecting to IoT cameras, many of which are old and most of which use CTR. This PR would decrease performance in those cases and we would have to switch to a fork of pion with optimized XorBytes manually add back in. I'm generally against performance regressions so I don't think this PR should be merged.

@jech
Copy link
Member Author

jech commented Sep 20, 2024

Please check the benchmarks, Adrian.

The patch causes a 10% regression in WriteRTP. Assuming your code spends 40% of its CPU doing WriteRTP, this is a 4% regression to your code's performance, which should be negligible.

Perhaps you could share some benchmarks of your application that show whether the change is significant?

@adriancable
Copy link
Contributor

@jech - my code actually spends a lot less than 40% CPU on WriteRTP (this is used only for the camera talkback function), but it spends almost all of its CPU time on ReadRTP of which most of DecryptRTP. The WebRTC receiver is more or less a tight ReadRTP loop and on our platform (Allwinner H3) a single camera consumes 14.0% of one core in processing. We have a budget of 1.5 cores to support as many cameras as we can (the rest is needed for non-WebRTC processing functions), which means we support up to 10 cameras streaming in parallel, consuming a total of 1.4 cores.

Without the ARM32 implemention of XorBytes, single camera CPU consumption increases to 15.2% (which matches your own benchmark closely enough), which means we need to drop the number of supported cameras from 10 to 9 to meet the total budget for WebRTC processing of 1.5 cores. In practice we wouldn't actually make things less capable in a product 'update' so we would need to maintain a separate pion fork with the XorBytes optimization back in. This is just extra work.

In summary, this PR causes a performance regression in circumstances which can be meaningful for some applications (like ours) and doesn't add any benefit, so I don't see a reason to merge it.

Copy link
Member

@Sean-Der Sean-Der left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feedback from Adrian

@Sean-Der
Copy link
Member

Thank you @jech! I should start a list of all the things that got added and we hope to remove someday. I completely forgot about this.

@adriancable as long as you need this it will not be removed! I don't want to see you on a fork.

@jech
Copy link
Member Author

jech commented Sep 20, 2024

a single camera consumes 14.0% of one core

[...]

Without the ARM32 implemention of XorBytes, single camera CPU consumption increases to 15.2%

These figures make sense.

Adrian, may I encourage you to push https://go-review.googlesource.com/c/go/+/409394 again? It would be great if we could remove some code from Pion.

@adriancable
Copy link
Contributor

@jech - yes, I will happily give it another go. I would love to see this merged into Go - that is where it really should live, rather than Pion.

@jech
Copy link
Member Author

jech commented Sep 20, 2024

doesn't add any benefito

The xor_asm.s file contains assembler code which, as far as I know, has never been properly reviewed by anyone other than the author (please correct me if I'm wrong). If there is a security issue in that code, it will endanger the whole Pion community, not just your application.

Moving this code into the Go stdlib, will ensure that it gets reviewed. Failing that, removing it altogether is the safe way out.

@adriancable
Copy link
Contributor

Hi @jech - the code in xor_asm.s (as something written specifically for Pion) has been reviewed in much the same way as all code written for the Pion code base has been. In general though, I would expect that Pion PRs get fewer eyes on them than Go PRs, which is somewhat inevitable given that we have a much smaller contributor base than Go has, but still want to maintain reasonable development velocity.

Security issues are definitely a valid concern but I don't think apply more to this piece of code in Pion than other contributions, beyond the fact I suppose that being written in ARM32 assembly language makes it a little more 'esoteric' than most contributions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants