
Reduce bundle size of ens_normalize function #28

Open
dawsbot opened this issue Feb 18, 2024 · 6 comments

dawsbot commented Feb 18, 2024

Hey @adraffy, I'm not sure if there are any easy wins on the bundle size, but I'm happy to help if so! Do you have any suggested areas to work on?

Seeing as both ethers and viem rely on this library and the ens_normalize function carries somewhere around 25kb into these packages, it might be possible to reduce that? You're the expert, let me know if or how I can help!

adraffy (Owner) commented Feb 18, 2024

There are some thoughts here: #21 (comment)

With browser detection, it would be possible to use engine features (like Unicode regex patterns).

I'll update this when I get some time.
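
For illustration, a minimal sketch of that kind of feature detection (an assumption about how it could be done, not code from this library): an engine without Unicode property escapes throws a SyntaxError when compiling such a pattern, so support can be probed at runtime before deciding which code path or payload to use.

// Hypothetical sketch: probe for Unicode property escape support.
// Engines without \p{...} support throw a SyntaxError here.
let hasUnicodeProps = false
try {
  new RegExp('\\p{Script=Latin}', 'u')
  hasUnicodeProps = true
} catch (err) {
  // fall back to the bundled character tables
}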

adraffy (Owner) commented Feb 20, 2024

Possible ideas:

  1. Compress spec.json using ANY technique

    • My current implementation takes this from 2.99 MB down to 14673 bytes for ENS data and 5588 bytes for Unicode NF data. There is also some overhead for the decompression code.
    • It's already using a bunch of tricks like arithmetic coding, various forms of run-length compression, and many domain-specific things (emoji encoded into a trie), etc. For reference, just the raw list of valid emoji as a string is larger than this entire library, yet this library has a function that produces that list.
    • To make any progress, you'll likely need to understand the structure of spec.json and how it's used. make.js is responsible for turning spec.json into the compressed data.
  2. Compress my compressed data using ANY technique. I don't include these files in the repo, but if you uncomment that file and run npm run make, you'll get two JSON files of byte[]. Those are the bytes that get turned into base64 (4/3 expansion factor) and fed into the decoder (a quick experiment along these lines is sketched after this list).

  3. Compress my uncompressed data using ANY technique. Same as above, but instead of writing out data which corresponds to the arithmetic coder format, write out enc.values instead, which will be int[]. These are the symbols that are fed into the arithmetic coder. They are biased towards low values (see the histogram in the link above), but they also include large values like a codepoint or Δcodepoint in situations where I had to encode a one-off value. My compressor deals with this by encoding [0-60ish] verbatim and then a separate symbol to imply that you should read the next value as a "large" value.

  4. It would be pretty easy for me to build other variants of ens-normalize that make assumptions about client features, like dynamically loading a version for browsers that are sufficiently modern. The bulk of the data is in the script data. If the browser is using a sufficient version of Unicode, this data can be derived from \p (see the sketch after this list).

  5. If the bundle is being served with compression, one layer of compression can be removed, although from my calculations, my compressor code + compressed data was still smaller than gzipped output.

  6. Some size is related to producing correct error messages. If no error message is required beyond "not normalized", additional reductions can be made.

  7. ens_tokenize() should be tree-shaken from your bundle, assuming you only need ens_normalize (see the import example at the end of this comment).
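
For (2) and (3), a hypothetical way to experiment, assuming you've written the byte[] JSON output of npm run make to a file (the file name below is a placeholder, not an actual repo path): feed the bytes to an off-the-shelf compressor and compare sizes against the library's own coder.

// Hypothetical experiment: compare the emitted bytes against generic compressors.
import { readFileSync } from 'node:fs'
import { gzipSync, brotliCompressSync } from 'node:zlib'

const bytes = Buffer.from(JSON.parse(readFileSync('ens-bytes.json', 'utf8'))) // placeholder file name
console.log('raw:   ', bytes.length)
console.log('gzip:  ', gzipSync(bytes).length)
console.log('brotli:', brotliCompressSync(bytes).length)

For (4), the idea of deriving script data from the engine itself relies on Unicode property escapes. A minimal sketch (illustrative only, and only valid if the engine's Unicode version matches the one the library targets):

// Hypothetical sketch: script membership via the engine's own Unicode tables.
const LATIN = /^\p{Script=Latin}+$/u
const GREEK = /^\p{Script=Greek}+$/u

console.log(LATIN.test('hello')) // true
console.log(GREEK.test('αβγ'))   // true
console.log(LATIN.test('αβγ'))   // false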

I'd estimate (4) is the easiest and could trim the entire NF payload + the bulk of the character data.

Also, there might be something I missed in (1) w/r/t compression, as I wrote the compressor at the beginning of this project, when I had to confirm that it was feasible to jam the entire Unicode character data into the library. Whereas now, the structure of spec.json is stable.
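
As a consumer-side note on (7): with a bundler that tree-shakes ES modules, importing only the named export should allow ens_tokenize and its dependencies to be dropped.

// Import only what you need so bundlers can drop ens_tokenize.
import { ens_normalize } from '@adraffy/ens-normalize'

console.log(ens_normalize('Nick.ETH')) // "nick.eth"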

dawsbot (Author) commented Feb 20, 2024

Incredible; I appreciate the breakdown, @adraffy! I'm going to be bold here and claim that I likely cannot make the in-the-weeds fixes you've recommended as well as you could, but I'm happy to take a stab at (4) if it seems high-value and you don't have time.

Again, my attempt likely won't come close to yours, seeing as I don't have this type of character encoding experience, but I'm happy to learn!

tmm commented Jun 4, 2024

> Seeing as both ethers and viem rely on this library and the ens_normalize function carries somewhere around 25kb into these packages, it might be possible to reduce that? You're the expert, let me know if or how I can help!

@dawsbot Worth noting that Viem inverts control so you can use another normalize function (or skip normalization if you know what you are doing). This also means that if you aren't using ENS with Viem, ens_normalize won't impact the final bundle!

import { normalize } from 'viem/ens' // proxies `ens_normalize` export

const ensAddress = await client.getEnsAddress({
  name: normalize('wevm.eth'),
})

import { custom_normalize } from 'custom-normalize' // use whatever normalize you like!

const ensAddress = await client.getEnsAddress({
  name: custom_normalize('wevm.eth'),
})

adraffy (Owner) commented Jun 5, 2024

I would be very careful with using non-standard normalization. While the bulk of ENS names are currently ASCII, I would expect that trend to change in the future.

Spoofing names is a serious attack vector—look how many people get scammed by address poisoning.

Is the current library size an actual problem? Relative to nearly every site I see, asset and code bloat dwarfs actual library code by a huge margin.

The goal of this library is to produce the correct result for ALL inputs across any engine that can run the library. It accomplishes that by internalizing everything (which includes the full Unicode spec).

FYI: The major browser vendors can't even agree on URL parsing.

When I update the library to Unicode 16 for the September release, I'll revisit the compression logic.

adraffy (Owner) commented Jun 6, 2024

I can supply a stub function if you want to async import(...) the rest of the library when the fast path isn't sufficient? But it would make the call site async.
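
For illustration, a rough sketch of that idea; everything here is hypothetical (the function name, the fast-path check, and the wiring are placeholders, not an API of this library), and a real fast path would have to enforce the actual ENSIP-15 rules for ASCII labels before skipping the normalizer.

// Hypothetical stub: handle an obviously-safe fast path inline and lazily
// import the full library for everything else. The ASCII check below is a
// placeholder; real code must match the ASCII rules of ENSIP-15.
const SIMPLE_ASCII = /^[a-z0-9]+(\.[a-z0-9]+)*$/

export async function normalizeLazy(name) {
  if (SIMPLE_ASCII.test(name)) return name
  const { ens_normalize } = await import('@adraffy/ens-normalize')
  return ens_normalize(name)
}

Note the trade-off mentioned above: every call site becomes async, even for names that never trigger the dynamic import.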
