Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

build: remove docstring examples when publishing packages? #1742

Open
MarcoGorelli opened this issue Jan 6, 2025 · 8 comments
Open

build: remove docstring examples when publishing packages? #1742

MarcoGorelli opened this issue Jan 6, 2025 · 8 comments

Comments

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Jan 6, 2025

Docstring examples account for 30% 40% of Narwhals' size

We love docs, but maybe this is a bit excessive 😄

I think it's pretty important to ship docstrings with source code, so that people can inspect docstrings while they're coding. But do we also need the docstring examples? I think they're kind of hard to read just from the IDE's preview anyway

Perhaps only removing the examples from the docstrings when building the wheel strikes a balance between:

  • minimising size (completely minify all code)
  • good developer experience

The docstring examples would still be visible in the API Reference


I just tried this out, and removing all examples from functions is enough to reduce the package size by 25% 40%

I also tried removing the "single-line-imports" rule, but that made a barely detectable difference (less than 0.1MB) - sometimes it actually becomes longer to have imports on multiple lines

@dangotbanned
Copy link

@MarcoGorelli reducing the size of narwhals by 40% does sound impressive, but how does that compare to the size of each backend?
I'm thinking this could be a relatively small gain in comparison (but that's just my hunch)

Personally, I quite like having examples in my IDE and with polars I've found it quite rare that I'll visit the API docs - really only when I'm trying to get a link to share

@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented Jan 9, 2025

Thanks @dangotbanned for your insights!

The wheel size is currently about 1% of DuckDB's, and on disk it's about 6%. Which is a bit more than what I'd like it to be...

One compromise could be to just have much shorter docstring examples, such as

>>> import polars as pl
>>> import narwhals as nw
>>> df_native = pl.DataFrame({'a': [1,2,3], 'b': [4,5,-6]})
>>> df = nw.from_native(df_native)
>>> df.with_columns(nw.all().abs().name.suffix('_abs')).to_native()
shape: (3, 4)
┌─────┬─────┬───────┬───────┐
│ aba_absb_abs │
│ ------------   │
│ i64i64i64i64   │
╞═════╪═════╪═══════╪═══════╡
│ 1414     │
│ 2525     │
│ 3-636     │
└─────┴─────┴───────┴───────┘

(with a roughly even split between backends, e.g. 1/3 of eager functions use pandas, 1/3 polars, 1/3 pyarrow) and then leave the longer "here's how to write a dataframe-agnostic function"-style ones to the website

🤔 not sure yet, curious to hear thoughts

@dangotbanned
Copy link

The wheel size is currently about 1% of DuckDB's, and on disk it's about 6%. Which is a bit more than what I'd like it to be...

ooh I see 🤔

You might be able to cut down on the import narwhals as nw lines by doing something like:

I'm not 100% on if this would be compatible with the stable.v(1|2) namespacing - or even if it is a desirable approach though.

@MarcoGorelli
Copy link
Member Author

I think that just removing a single import narwhals as nw line wouldn't make too much of a dent, it becomes significant when we're removing 40-50 across the whole api

@dangotbanned
Copy link

On the content of docstring examples, it reminded of a little writeup I did here vega/altair#3500 (comment)

Shout out to https://diataxis.fr/

@dangotbanned
Copy link

I think that just removing a single import narwhals as nw line wouldn't make too much of a dent, it becomes significant when we're removing 40-50 across the whole api

I'm not sure if I'm misunderstanding here, but what #1742 (comment) meant (or I suppose implied) was you'd have 1 import per file vs 1 per docstring

@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented Jan 9, 2025

Yup, clear, thanks! I just tried measuring that, and removing all the lines which match .*>>> import narwhals as nw reduces the size from 1683 bytes to 1667 bytes.

For contrast, removing all docstring examples reduces the size to 1078 bytes

We can probably infer that if we were to rewrite all docstrings in the style of #1742 (comment), then the size would reduce to ~1300 bytes.

@MarcoGorelli
Copy link
Member Author

Been thinking about this further

I think that:

  • instead of dynamically removing docstring examples before publishing to PyPI (which would end up with a mismatch between what's on github and what's on PyPI...not great)
  • we should dynamcally add docstring examples before publishing the docs

This way

  • the size problem is addressed just as well, but if something goes wrong, then it's only the docs that are affected (which is less critical than the package source)
  • PyPI and github stay in-sync
  • if leaves open the door to keep short-and-sweet docstring examples like the one in build: remove docstring examples when publishing packages? #1742 (comment) in the source code, whilst having longer ones in the published docs (so that devs like Dan can still view at least some docstring examples whilst in their IDE)

So, my suggestion would be:

  1. in the next release, we change things so that docstring examples are dynamically added when generating docs (and when running doctests)
  2. after that, we can do some work on adding short-and-sweet docstring examples to the source code (there's a chance I may be getting some interns to work on Narwhals at some point, so this may be a good first task to get them familiar with the project before they move onto harder things)

Any thoughts / objections?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment