[CI][Python] Improve vcpkg caching #43951

Closed
pitrou opened this issue Sep 4, 2024 · 17 comments

Comments

@pitrou
Member

pitrou commented Sep 4, 2024

Describe the enhancement requested

We use vcpkg to build bundled dependencies for Python wheels. Unfortunately, it often happens that the Docker image gets rebuilt, and therefore all the dependencies are recompiled from scratch. This makes build times very long (random example here).

It would be nice to use a vcpkg binary cache on CI here, especially as we always build the same dependency versions regardless of the targeted Python version. There are examples here: https://learn.microsoft.com/en-us/vcpkg/consume/binary-caching-github-actions-cache

Component(s)

C++, Continuous Integration, Python

@pitrou
Member Author

pitrou commented Sep 4, 2024

cc @kou @raulcd @assignUser

@assignUser
Member

assignUser commented Sep 4, 2024

It's a bit tricky to do this within docker but should be doable. There is a similar issue open for java-jars, cc @danepitkin
We are using vcpkg binary caching for the macOS jobs #cache vcpkg (see #43438 and apache/arrow-java#457)

@sjperkins
Contributor

FWIW got vcpkg caching working within cibuildwheel here:

https://github.com/ratt-ru/arcae/blob/cd3e7e8f7057a66aad7fedf7e4adf18334fbf2c9/.github/workflows/ci.yml#L157-L194

It mostly seems to depend on passing:

  • ACTIONS_CACHE_URL
  • ACTIONS_RUNTIME_TOKEN

into the container

@pitrou
Member Author

pitrou commented Sep 4, 2024

And also VCPKG_BINARY_SOURCES="clear;x-gha,readwrite" I suppose.
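The three settings mentioned so far could be handed to a container roughly as follows. This is only a sketch: the function name `vcpkg_cache_env_args`, the image name and the build script are made up for illustration and are not taken from the Arrow or arcae workflows. It relies on `docker run --env NAME` (without a value) copying the variable from the calling environment. It also assumes `ACTIONS_CACHE_URL` and `ACTIONS_RUNTIME_TOKEN` have already been exported to the step, which may need an extra workflow step as in the Microsoft guide linked in the issue.

```shell
# Sketch: forward the GitHub Actions cache credentials into a container
# so that vcpkg inside it can use the x-gha binary source.

# Print one `docker run` option per line.
# ACTIONS_CACHE_URL and ACTIONS_RUNTIME_TOKEN are passed by name only, so
# Docker copies their values from the calling environment and the token
# never appears on the command line.
vcpkg_cache_env_args() {
  printf '%s\n' \
    '--env' 'ACTIONS_CACHE_URL' \
    '--env' 'ACTIONS_RUNTIME_TOKEN' \
    '--env' 'VCPKG_BINARY_SOURCES=clear;x-gha,readwrite'
}

# Hypothetical usage (image name and script are placeholders):
#   docker run $(vcpkg_cache_env_args) my-wheel-image ./build-wheel.sh
```

The `VCPKG_BINARY_SOURCES` value is single-quoted because an unquoted `;` would be read by the shell as a command separator.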

@pitrou
Member Author

pitrou commented Sep 4, 2024

By the way, it seems other sources of binary artifacts are supported:
https://github.com/microsoft/vcpkg-docs/blob/main/vcpkg/reference/binarycaching.md#configuration-syntax

@sjperkins
Contributor

sjperkins commented Sep 4, 2024

It's a bit tricky to do this within docker but should be doable

Also, CIBW_CONTAINER_ENGINE: "docker; create_args: --network=host" might help vcpkg access a cache external to the container. This is useful in other CI systems, but I haven't needed it in GHA.

@jorisvandenbossche
Member

especially as we always build the same dependency versions regardless of the targeted Python version.

We could also consider building all wheels for the various Python versions in a single build (which is what typically happens when e.g. using cibuildwheel). It would make a single build longer of course, but reduce the overall CI time.

Now, our pyarrow build and test run take quite a while, so maybe this will get too long for a single build.

@pitrou
Member Author

pitrou commented Sep 5, 2024

It's quite bad for developer productivity to make the wheel build slower. I would rather we make the existing builds faster.

Currently, when the vcpkg step runs, a manylinux wheel build run takes 1h15. When the vcpkg step is cached in the Docker image, a manylinux wheel build run takes 20 minutes. vcpkg binary caching would hopefully achieve similar results (probably not as good, but still).

@jorisvandenbossche
Member

a manylinux wheel build run takes 20 minutes

Of which half is setting up the image and building Arrow C++, which strictly speaking also does not need to be repeated for every Python version.

But yes, if there is a build failing for a specific Python version, it would be annoying that they are all combined and that you couldn't easily trigger a single Python version (it's always a trade-off).

@pitrou
Member Author

pitrou commented Sep 5, 2024

Building Arrow C++ (and perhaps PyArrow) could be made faster using ccache/sccache. Apparently that's not the case currently:
https://github.com/apache/arrow/blob/f545b90748d5196af547abcec19d63a7b14e4daa/dev/tasks/python-wheels/github.linux.yml

@assignUser
Member

We could consolidate all wheels into a single workflow (or one per OS), where Arrow C++ and deps are built once with the best possible caching and the artifacts are distributed to multiple wheel build jobs. I think this would be the best compromise between overall runtime and efficient use of CI time.

@assignUser
Member

a build failing for a specific Python version

Thinking back on the last few releases, I think mostly all wheel jobs fail together rather than there being issues with a specific action, excluding maybe RCs of new Python versions.

@pitrou
Member Author

pitrou commented Sep 5, 2024

We could consolidate all wheels into a single workflow (or one per OS), where Arrow C++ and deps are built once with the best possible caching and the artifacts are distributed to multiple wheel build jobs. I think this would be the best compromise between overall runtime and efficient use of CI time.

That would also make local reproduction using archery docker ... more difficult, unless there's a way to automate that too.

@pitrou
Member Author

pitrou commented Sep 5, 2024

Also, perhaps there could be several sources: a GHA one and a file-based one as fallback (if not on GHA?).

Something like: VCPKG_BINARY_SOURCES=clear;x-gha,rw;files,/vcpkg-cache,rw with /vcpkg-cache being mapped as a Docker volume to the host's ~/.cache/vcpkg?
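One possible reading of this fallback idea, as a sketch only: it selects the source list depending on whether the `GITHUB_ACTIONS` variable is set to `true`, as it is on GitHub Actions runners. The function name `vcpkg_binary_sources` is hypothetical, and the access mode is spelled out as `readwrite`, the form used earlier in the thread, rather than `rw`.

```shell
# Sketch: pick vcpkg binary sources depending on where the build runs.
# /vcpkg-cache is the container-side mount point suggested above.
vcpkg_binary_sources() {
  if [ "${GITHUB_ACTIONS:-}" = "true" ]; then
    # On GitHub Actions: GHA cache first, then the file-based cache.
    echo 'clear;x-gha,readwrite;files,/vcpkg-cache,readwrite'
  else
    # Anywhere else (e.g. a local run): file-based cache only.
    echo 'clear;files,/vcpkg-cache,readwrite'
  fi
}

# Quote the value: an unquoted `;` would end the shell command.
VCPKG_BINARY_SOURCES="$(vcpkg_binary_sources)"
export VCPKG_BINARY_SOURCES

# Hypothetical local usage, mapping the host cache directory into the
# container (image name is a placeholder):
#   docker run -v "$HOME/.cache/vcpkg:/vcpkg-cache" \
#     --env VCPKG_BINARY_SOURCES my-wheel-image
```

Listing both sources unconditionally, as in the one-liner above, would be simpler, but how vcpkg behaves when the GHA credentials are missing is not established in this thread.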

@kou
Member

kou commented Nov 10, 2024

Implemented: #44644 (review)

It uses VCPKG_BINARY_SOURCES=clear;nuget,GitHub,readwrite. We can't use VCPKG_BINARY_SOURCES=clear;x-gha,rw;files,/vcpkg-cache,rw (GitHub Actions cache) with Crossbow, because Crossbow doesn't use the default branch in the normal way. (Do I need to explain this more?)

manylinux and java-jar jobs use the NuGet + GitHub Packages based cache.
Exception: manylinux2014 + aarch64 jobs don't use it, because NuGet doesn't work on that platform (its Mono is too old).
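The exception described here could be expressed as a small selection function. This is a sketch under assumptions, not the logic from #44644: the function name `nuget_cache_sources` and its two arguments are invented for illustration, and `GitHub` in the value is presumed to name a NuGet source configured elsewhere.

```shell
# Sketch: enable the NuGet-based cache everywhere except manylinux2014
# on aarch64.
#   $1 = manylinux version (e.g. 2014)
#   $2 = architecture (e.g. aarch64)
# The function name and its arguments are illustrative only.
nuget_cache_sources() {
  if [ "$1" = "2014" ] && [ "$2" = "aarch64" ]; then
    # Mono is too old to run NuGet here: print an empty value so the
    # caller leaves vcpkg's binary source configuration untouched.
    echo ''
  else
    echo 'clear;nuget,GitHub,readwrite'
  fi
}

# Hypothetical usage: only export the variable when a value was chosen.
sources="$(nuget_cache_sources 2014 x86_64)"
if [ -n "$sources" ]; then
  export VCPKG_BINARY_SOURCES="$sources"
fi
```

With an empty result the job would fall back to whatever caching is already in place, which per this thread is the Docker-level cache.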

kou added a commit that referenced this issue Nov 15, 2024
### Rationale for this change

We only use the Docker-level cache for the vcpkg build used for wheels. Any vcpkg-related change causes all vcpkg ports to be rebuilt, which is time-consuming.

### What changes are included in this PR?

Enable the NuGet + GitHub Packages based cache. It's a port-level cache, so we don't need to rebuild all ports when we have vcpkg-related changes.

See also: https://learn.microsoft.com/en-us/vcpkg/consume/binary-caching-github-packages

The NuGet + GitHub Packages based cache isn't enabled with manylinux2014 + aarch64, because EPEL for CentOS 7 + aarch64 provides an old Mono. (FYI: EPEL for CentOS 7 + x86_64 provides a newer Mono.) We can't use the old Mono to run NuGet on Linux.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #43951

Lead-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
kou added this to the 19.0.0 milestone Nov 15, 2024
@kou
Member

kou commented Nov 15, 2024

Issue resolved by pull request 44644
#44644

kou closed this as completed Nov 15, 2024
@pitrou
Member Author

pitrou commented Nov 15, 2024

Thanks a lot @kou !


6 participants
@kou @jorisvandenbossche @pitrou @sjperkins @assignUser and others