GH-43299: [Release][Packaging] Only include pyarrow folder when finding packages on setuptools #43325

Merged
merged 16 commits into from
Sep 5, 2024

Conversation

raulcd
Member

@raulcd raulcd commented Jul 18, 2024

Rationale for this change

Currently we include everything when building wheels, see:

$ pip install pyarrow
Collecting pyarrow
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (39.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39.9/39.9 MB 33.8 MB/s eta 0:00:00
Collecting numpy>=1.16.6
  Using cached numpy-2.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.3 MB)
Installing collected packages: numpy, pyarrow
Successfully installed numpy-2.0.0 pyarrow-17.0.0
(test-env)  $ ls test-env/lib/python3.10/site-packages/
benchmarks/                  distutils-precedence.pth     numpy-2.0.0.dist-info/       pip-22.0.2.dist-info/        pyarrow-17.0.0.dist-info/    setuptools-59.6.0.dist-info/ 
cmake_modules/               examples/                    numpy.libs/                  pkg_resources/               scripts/                     
_distutils_hack/             numpy/                       pip/                         pyarrow/                     setuptools/    

What changes are included in this PR?

Use include as seen here: https://setuptools.pypa.io/en/latest/userguide/package_discovery.html#finding-simple-packages

Are these changes tested?

Will check via the wheel builds on CI.

Are there any user-facing changes?

No and yes :)
We will remove unnecessary files

@raulcd
Member Author

raulcd commented Jul 18, 2024

@github-actions crossbow submit -g python wheel

⚠️ GitHub issue #43299 has been automatically assigned in GitHub to PR creator.

Unable to match any tasks for `wheel`
The Archery job run can be found at: https://github.com/apache/arrow/actions/runs/9993787632

@raulcd
Member Author

raulcd commented Jul 18, 2024

@github-actions crossbow submit -g python -g wheel

This comment was marked as outdated.

@raulcd
Member Author

raulcd commented Jul 18, 2024

There seemed to be a minor issue:

$ ls -lrt
total 77296
drwxr-xr-x 2 raulcd raulcd     4096 jul 18 15:11 pyarrow-18.0.0.dev15.dist-info
drwxr-xr-x 2 raulcd raulcd     4096 jul 18 15:11 pyarrow.
drwxr-xr-x 9 raulcd raulcd     4096 jul 18 15:11 pyarrow
$ ls -lrt pyarrow./
total 0

@raulcd
Member Author

raulcd commented Jul 18, 2024

@github-actions crossbow submit wheel-manylinux-2-28-cp310-amd64

Revision: 46d1afc

Submitted crossbow builds: ursacomputing/crossbow @ actions-db1794c125

Task Status
wheel-manylinux-2-28-cp310-amd64 GitHub Actions

@raulcd raulcd changed the title GH-43299: [Release][Packaging] Only include pyarrow and pyarrow.* when finding packages on setuptools GH-43299: [Release][Packaging] Only include pyarrow folder when finding packages on setuptools Jul 18, 2024
@raulcd
Member Author

raulcd commented Jul 18, 2024

@github-actions crossbow submit wheel-manylinux-2-28-cp310-amd64

Revision: d954d75

Submitted crossbow builds: ursacomputing/crossbow @ actions-922921124d

Task Status
wheel-manylinux-2-28-cp310-amd64 GitHub Actions

@raulcd
Member Author

raulcd commented Jul 18, 2024

@github-actions crossbow submit wheel-manylinux-2-28-cp310-amd64

Revision: 2fa434f

Submitted crossbow builds: ursacomputing/crossbow @ actions-0cd7a8ae05

Task Status
wheel-manylinux-2-28-cp310-amd64 GitHub Actions

@raulcd
Member Author

raulcd commented Jul 19, 2024

@github-actions crossbow submit wheel-manylinux-2-28-cp310-amd64

Revision: 204a27b

Submitted crossbow builds: ursacomputing/crossbow @ actions-5d877a45d2

Task Status
wheel-manylinux-2-28-cp310-amd64 GitHub Actions

@raulcd
Member Author

raulcd commented Jul 19, 2024

I am unsure why an empty directory named pyarrow. is being added and why pyarrow.tests isn't being excluded; that will require more investigation. However, the examples, cmake_modules, benchmarks and scripts directories have been removed from the generated wheel:

$ ls 
pyarrow  pyarrow.  pyarrow-18.0.0.dev19-cp310-cp310-manylinux_2_28_x86_64.whl  pyarrow-18.0.0.dev19.dist-info
$ ls -lart pyarrow./
total 8
drwxr-xr-x 2 raulcd raulcd 4096 jul 19 10:59 .
drwxrwxr-x 5 raulcd raulcd 4096 jul 19 13:32 ..
$ ls pyarrow/tests/
arrow_16597.py                    interchange                 test_adhoc_memory_leak.py  test_cuda_numba_interop.py  test_exec_plan.py       test_io.py      test_scalars.py        test_udf.py
arrow_39313.py                    pandas_examples.py          test_array.py              test_cuda.py                test_extension_type.py  test_ipc.py     test_schema.py         test_util.py
arrow_7980.py                     pandas_threaded_import.py   test_builder.py            test_cython.py              test_feather.py         test_json.py    test_sparse_tensor.py  util.py
bound_function_visit_strings.pyx  parquet                     test_cffi.py               test_dataset_encryption.py  test_flight_async.py    test_jvm.py     test_strategies.py
conftest.py                       pyarrow_cython_example.pyx  test_compute.py            test_dataset.py             test_flight.py          test_memory.py  test_substrait.py
data                              read_record_batch.py        test_convert_builtin.py    test_deprecations.py        test_fs.py              test_misc.py    test_table.py
extensions.pyx                    strategies.py               test_cpp_internals.py      test_device.py              test_gandiva.py         test_orc.py     test_tensor.py
__init__.py                       test_acero.py               test_csv.py                test_dlpack.py              test_gdb.py             test_pandas.py  test_types.py

@raulcd
Member Author

raulcd commented Jul 19, 2024

@github-actions crossbow submit wheel-manylinux-2-28-cp310-amd64

Revision: a1d73a2

Submitted crossbow builds: ursacomputing/crossbow @ actions-ebc75ddd3d

Task Status
wheel-manylinux-2-28-cp310-amd64 GitHub Actions

@raulcd raulcd marked this pull request as ready for review July 19, 2024 12:20
@@ -73,7 +73,9 @@ zip-safe=false
include-package-data=true

[tool.setuptools.packages.find]
where = ["."]
include = ["pyarrow"]
exclude = ["pyarrow/tests", "pyarrow."]
Member Author

This doesn't seem to have any effect. @jorisvandenbossche @pitrou, any idea?
The other stray directories are filtered from the wheel now

Collaborator

I believe the syntax is pyarrow.tests. What does pyarrow. refer to?
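
The setuptools documentation states that `include` and `exclude` patterns are matched against the full dotted package name, as it would appear in an import statement. The sketch below models only that name matching with the standard library's `fnmatchcase`; the `select_packages` helper and the `candidates` list are hypothetical and are not setuptools code. It suggests why a path-style pattern such as `pyarrow/tests` matches nothing, while `pyarrow.tests` does.

```python
from fnmatch import fnmatchcase


def select_packages(candidates, include=("*",), exclude=()):
    """Keep dotted names that match an include pattern and no exclude pattern."""

    def matches(name, patterns):
        return any(fnmatchcase(name, pattern) for pattern in patterns)

    return [
        name
        for name in candidates
        if matches(name, include) and not matches(name, exclude)
    ]


# Hypothetical discovery result for a source tree like the one in this PR.
candidates = ["benchmarks", "examples", "scripts", "pyarrow", "pyarrow.tests"]

# A path-style pattern never equals a dotted name, so it excludes nothing.
print(select_packages(candidates, include=["pyarrow*"], exclude=["pyarrow/tests"]))

# A dotted pattern matches the subpackage name.
print(select_packages(candidates, include=["pyarrow*"], exclude=["pyarrow.tests"]))
```

This covers package name filtering only. Files can still end up in a wheel through other mechanisms, such as package data or a custom build step, so a correct pattern does not by itself guarantee that a directory is left out.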

Member Author

The wheel includes an empty folder named pyarrow., which I was trying to remove. I'll try with pyarrow.tests.

Member Author

It doesn't remove the tests either.

Member

We don't care about removing the tests, they are comparatively small and can serve to test PyArrow on a third-party machine.

$ du -hs venv-3.10/lib/python3.10/site-packages/pyarrow
187M	venv-3.10/lib/python3.10/site-packages/pyarrow
$ du -hs venv-3.10/lib/python3.10/site-packages/pyarrow/tests/
4,0M	venv-3.10/lib/python3.10/site-packages/pyarrow/tests/

Member Author

What about the cpp and header files? Should we do something like pandas does here:
https://github.com/pandas-dev/pandas/blob/main/pyproject.toml#L131

Member Author

Currently the files under pyarrow/src/arrow/python are included in the wheel.

Member

I'll stress that the problem we're trying to solve is that installing a PyArrow wheel creates top-level directories outside of the pyarrow source tree. This is the urgency. Cleaning up the contents of pyarrow is a separate task, much less critical.

Member Author

I guess those are negligible, and we are still required to distribute them in the source distribution:

pyarrow/src $ du -hs .
728K	.

Member Author

I'll stress that the problem we're trying to solve is that installing a PyArrow wheel creates top-level directories outside of the pyarrow source tree.

OK, then this is solved by this PR. Are you OK with the current changes, @pitrou?

@raulcd raulcd requested a review from anjakefala July 19, 2024 12:22
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Jul 19, 2024
@timkpaine
Contributor

timkpaine commented Jul 19, 2024

was something wrong with #43281 from 3 days ago?

@raulcd
Member Author

raulcd commented Jul 19, 2024

was something wrong with #43281 from 3 days ago?

Sorry, I didn't see that issue. It would be good to mark the second one as a duplicate. Since the issue was introduced by a wrongly configured pyproject.toml, I also think it is worth fixing it there.

@raulcd
Member Author

raulcd commented Jul 22, 2024

@github-actions crossbow submit wheel-manylinux-2-28-cp310-amd64

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jul 22, 2024
@raulcd
Member Author

raulcd commented Sep 3, 2024

@github-actions crossbow submit wheel-manylinux-2-28-cp310-amd64

@raulcd
Member Author

raulcd commented Sep 3, 2024

@github-actions crossbow submit wheel-windows-cp39-amd64

github-actions bot commented Sep 3, 2024

Revision: 915d70f

Submitted crossbow builds: ursacomputing/crossbow @ actions-5b1cb7ff99

Task Status
wheel-manylinux-2-28-cp310-amd64 GitHub Actions

github-actions bot commented Sep 3, 2024

Revision: 915d70f

Submitted crossbow builds: ursacomputing/crossbow @ actions-aca9360964

Task Status
wheel-windows-cp39-amd64 GitHub Actions

@raulcd
Member Author

raulcd commented Sep 4, 2024

@github-actions crossbow submit wheel-windows-cp39-amd64

This comment was marked as outdated.

@raulcd
Member Author

raulcd commented Sep 4, 2024

@github-actions crossbow submit -g wheel

This comment was marked as outdated.

@raulcd
Member Author

raulcd commented Sep 4, 2024

@jorisvandenbossche the check now works in the macOS, manylinux and Windows wheel tests. Could you review again? Thanks!
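
A check of this kind could look roughly like the sketch below. It is not the check added in this PR; the `unexpected_top_level` and `make_fake_wheel` helpers are hypothetical and use only the standard library. It treats a wheel as a zip archive and reports any top-level entry other than the package itself and its `.dist-info` or `.data` directories. The allowed entries are an assumption and would need adjusting for wheels that legitimately ship other top-level directories.

```python
import io
import zipfile


def unexpected_top_level(wheel_file, package="pyarrow"):
    """Return top-level wheel entries other than the package and its metadata."""
    allowed_suffixes = (".dist-info", ".data")
    with zipfile.ZipFile(wheel_file) as archive:
        # The first path component of every member is its top-level entry.
        top_level = {name.split("/", 1)[0] for name in archive.namelist()}
    return sorted(
        entry
        for entry in top_level
        if entry != package and not entry.endswith(allowed_suffixes)
    )


def make_fake_wheel(paths):
    """Build an in-memory zip archive with empty files, for demonstration only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for path in paths:
            archive.writestr(path, "")
    buffer.seek(0)
    return buffer


# Layout resembling the stray directories reported for pyarrow 17.0.0.
polluted = make_fake_wheel([
    "pyarrow/__init__.py",
    "pyarrow-17.0.0.dist-info/METADATA",
    "benchmarks/__init__.py",
    "examples/__init__.py",
])
print(unexpected_top_level(polluted))

# Layout with only the package and its metadata.
clean = make_fake_wheel([
    "pyarrow/__init__.py",
    "pyarrow-18.0.0.dev19.dist-info/METADATA",
])
print(unexpected_top_level(clean))
```

Because `zipfile.ZipFile` accepts a path as well as a file object, the same function could be pointed at a built `.whl` file in a test job.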

@pitrou
Member

pitrou commented Sep 4, 2024

I think I've found the source of the stray pyarrow. directory. It's because of pypa/auditwheel#488 and the -L . option.
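
One way to read this, offered only as an assumption about how the directory name is formed: `auditwheel repair` places copied libraries in a directory whose name is the distribution name followed by the value of `-L` (default `.libs`), which would give `numpy.libs` for NumPy as seen in the listing at the top of this PR. The hypothetical `repaired_lib_dir` function below just models that concatenation; it does not call auditwheel.

```python
def repaired_lib_dir(distribution, lib_sdir=".libs"):
    """Model the library directory name as distribution name plus -L value.

    ASSUMPTION: this mirrors how the name is built; it is not auditwheel code.
    """
    return distribution + lib_sdir


# With the default value the name looks like the numpy.libs directory above.
print(repaired_lib_dir("numpy"))

# With "-L ." the same rule would produce the stray "pyarrow." name.
print(repaired_lib_dir("pyarrow", "."))
```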

@pitrou
Member

pitrou commented Sep 4, 2024

@github-actions crossbow submit wheelcp312

github-actions bot commented Sep 4, 2024

Revision: f87612a

Submitted crossbow builds: ursacomputing/crossbow @ actions-9544e3c33d

Task Status
wheel-macos-monterey-cp312-amd64 GitHub Actions
wheel-macos-monterey-cp312-arm64 GitHub Actions
wheel-manylinux-2-28-cp312-amd64 GitHub Actions
wheel-manylinux-2-28-cp312-arm64 GitHub Actions
wheel-manylinux-2014-cp312-amd64 GitHub Actions
wheel-manylinux-2014-cp312-arm64 GitHub Actions
wheel-windows-cp312-amd64 GitHub Actions

@pitrou
Member

pitrou commented Sep 4, 2024

@github-actions crossbow submit -g wheel

github-actions bot commented Sep 4, 2024

Revision: f87612a

Submitted crossbow builds: ursacomputing/crossbow @ actions-a951b5bfb8

Task Status
python-sdist GitHub Actions
wheel-macos-monterey-cp310-amd64 GitHub Actions
wheel-macos-monterey-cp310-arm64 GitHub Actions
wheel-macos-monterey-cp311-amd64 GitHub Actions
wheel-macos-monterey-cp311-arm64 GitHub Actions
wheel-macos-monterey-cp312-amd64 GitHub Actions
wheel-macos-monterey-cp312-arm64 GitHub Actions
wheel-macos-monterey-cp313-amd64 GitHub Actions
wheel-macos-monterey-cp313-arm64 GitHub Actions
wheel-macos-monterey-cp38-amd64 GitHub Actions
wheel-macos-monterey-cp38-arm64 GitHub Actions
wheel-macos-monterey-cp39-amd64 GitHub Actions
wheel-macos-monterey-cp39-arm64 GitHub Actions
wheel-manylinux-2-28-cp310-amd64 GitHub Actions
wheel-manylinux-2-28-cp310-arm64 GitHub Actions
wheel-manylinux-2-28-cp311-amd64 GitHub Actions
wheel-manylinux-2-28-cp311-arm64 GitHub Actions
wheel-manylinux-2-28-cp312-amd64 GitHub Actions
wheel-manylinux-2-28-cp312-arm64 GitHub Actions
wheel-manylinux-2-28-cp313-amd64 GitHub Actions
wheel-manylinux-2-28-cp313-arm64 GitHub Actions
wheel-manylinux-2-28-cp38-amd64 GitHub Actions
wheel-manylinux-2-28-cp38-arm64 GitHub Actions
wheel-manylinux-2-28-cp39-amd64 GitHub Actions
wheel-manylinux-2-28-cp39-arm64 GitHub Actions
wheel-manylinux-2014-cp310-amd64 GitHub Actions
wheel-manylinux-2014-cp310-arm64 GitHub Actions
wheel-manylinux-2014-cp311-amd64 GitHub Actions
wheel-manylinux-2014-cp311-arm64 GitHub Actions
wheel-manylinux-2014-cp312-amd64 GitHub Actions
wheel-manylinux-2014-cp312-arm64 GitHub Actions
wheel-manylinux-2014-cp313-amd64 GitHub Actions
wheel-manylinux-2014-cp313-arm64 GitHub Actions
wheel-manylinux-2014-cp38-amd64 GitHub Actions
wheel-manylinux-2014-cp38-arm64 GitHub Actions
wheel-manylinux-2014-cp39-amd64 GitHub Actions
wheel-manylinux-2014-cp39-arm64 GitHub Actions
wheel-windows-cp310-amd64 GitHub Actions
wheel-windows-cp311-amd64 GitHub Actions
wheel-windows-cp312-amd64 GitHub Actions
wheel-windows-cp313-amd64 GitHub Actions
wheel-windows-cp38-amd64 GitHub Actions
wheel-windows-cp39-amd64 GitHub Actions

@pitrou
Member

pitrou commented Sep 4, 2024

CI failures are unrelated. @jorisvandenbossche Do you want to take a last look?

@jorisvandenbossche
Member

I think I've found the source of the stray pyarrow. directory. It's because of pypa/auditwheel#488 and the -L . option.

Nice catch!

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Sep 5, 2024
@jorisvandenbossche jorisvandenbossche merged commit f545b90 into apache:main Sep 5, 2024
62 of 63 checks passed
@jorisvandenbossche jorisvandenbossche removed the awaiting merge Awaiting merge label Sep 5, 2024
@jorisvandenbossche
Member

Thanks @raulcd

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit f545b90.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them.

zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Sep 6, 2024
… finding packages on setuptools (apache#43325)

### Rationale for this change

Currently we include everything when building wheels, see:
```
$ pip install pyarrow
Collecting pyarrow
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (39.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39.9/39.9 MB 33.8 MB/s eta 0:00:00
Collecting numpy>=1.16.6
  Using cached numpy-2.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.3 MB)
Installing collected packages: numpy, pyarrow
Successfully installed numpy-2.0.0 pyarrow-17.0.0
(test-env)  $ ls test-env/lib/python3.10/site-packages/
benchmarks/                  distutils-precedence.pth     numpy-2.0.0.dist-info/       pip-22.0.2.dist-info/        pyarrow-17.0.0.dist-info/    setuptools-59.6.0.dist-info/ 
cmake_modules/               examples/                    numpy.libs/                  pkg_resources/               scripts/                     
_distutils_hack/             numpy/                       pip/                         pyarrow/                     setuptools/    
```

### What changes are included in this PR?

Use `include` as seen here: https://setuptools.pypa.io/en/latest/userguide/package_discovery.html#finding-simple-packages

### Are these changes tested?

Will check via the build wheel on CI

### Are there any user-facing changes?

No and yes :)
We will remove unnecessary files
* GitHub Issue: apache#43299

Lead-authored-by: Raúl Cumplido <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
khwilson pushed a commit to khwilson/arrow that referenced this pull request Sep 14, 2024
… finding packages on setuptools (apache#43325)