-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf(cargo-package): match certain path prefix with pathspec #14962
Conversation
3aabafb
to
4ccd9ae
Compare
To make error message more deterministic
`check_repo_state` checks the entire git repo status. This is usually fine if you have only a few packages in a workspace. For huge monorepos, it may hit performance issues. For example, on awslabs/aws-sdk-rust@2cbd34d the workspace has roughly 434 members to publish. `git ls-files` reported us 204379 files in this Git repository. That means git may need to check status of all files 434 times. That would be `204379 * 434 = 88,700,486` checks! Moreover, the current algorithm is finding the intersection of `PathSource::list_files` and `git status`. It is an `O(n^2)` check. Let's assume files are evenly distributed into each package, so roughly 470 files per package. If we're unlucky to have some dirty files, say 100 files. We will have to do `470 * 100 = 47,000` times of path comparisons. Even worse, because we `git status` everything in the repo, we'll have to it for all members, even when those dirty files are not part of the current package in question. So it becomes `470 * 100 * 434 = 20,398,000`! Instead of comparing with the status of the entire repository, this patch use the magic pathspec[1] to tell git only reports paths that match a certain path prefix. This wouldn't help the `O(n^2)` algorithm, but at least it won't check dirty files outside the current package. [1]: https://git-scm.com/docs/gitglossary#Documentation/gitglossary.txt-aiddefpathspecapathspec
4ccd9ae
to
2bc27e1
Compare
From #14955 (comment)
|
@@ -816,7 +816,8 @@ fn check_repo_state( | |||
// - ignored (in case the user has an `include` directive that | |||
// conflicts with .gitignore). | |||
let mut dirty_files = Vec::new(); | |||
collect_statuses(repo, &mut dirty_files)?; | |||
let pathspec = relative_pathspec(repo, pkg.root()); | |||
collect_statuses(repo, &[pathspec.as_str()], &mut dirty_files)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the readme is ../README.md
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this PR changed anything for relative paths. It is a bug itself.
They will be filtered out later in
cargo/src/cargo/ops/cargo_package.rs
Line 829 in 59b2ddd
.filter(|src_file| dirty_files.iter().any(|path| src_file.starts_with(path))) |
That said, I don't think either dirty symlinks or relative paths were correctly handled on master.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me create a test and verify it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is saying if a src_file
should start with a dirty_file
path, else filter it out. I'm not seeing how thats relevant to a ../README.md
. It would if Cargo.toml
is in the repo root but it is most likely to be in a sub-directory within the repo, so that README.md
could be among dirty_files
.
Or is src_files
not including these ../README.md
files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either way, isn't this cementing the bug in further, making it harder for us to fix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src_files
includes symlink files as ordinary files. By the time we get src_files
, file metadata has been lost.
I don't think this change worsen the situation, unless we consider rewriting list_files
to do more as a fix to #14955.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about the non-symlink case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
### What does this PR try to resolve? Dirty file paths in the original message were stripped relative to package root. User is not able to know the full path to find dirty files. This PR makes it relative to git workdir. ### How should we test and review this PR? This was found during #14962
What does this PR try to resolve?
Improve #14955
check_repo_state
checks the entire git repo status.This is usually fine if you have only a few packages in a workspace.
For huge monorepos, it may hit performance issues.
For example, on awslabs/aws-sdk-rust@2cbd34d the workspace has roughly 434 members to publish.
git ls-files
reported us 204379 files in this Git repository. That means git may need to check status of all files 434 times.That would be
204379 * 434 = 88,700,486
checks!Moreover, the current algorithm is finding the intersection of
PathSource::list_files
andgit status
. It is anO(n^2)
check. Let's assume files are evenly distributed into each package, so roughly 470 files per package. If we're unlucky to have some dirty files, say 100 files. We will have to do470 * 100 = 47,000
times of path comparisons.Even worse, because we
git status
everything in the repo, we'll have to it for all members,even when those dirty files are not part of the current package in question.
So it becomes
470 * 100 * 434 = 20,398,000
!Solution
Instead of comparing with the status of the entire repository,
this patch use the magic pathspec1 to tell git only reports
paths that match a certain path prefix.
This wouldn't help the
O(n^2)
algorithm,but at least it won't check dirty files outside the current package.
How should we test and review this PR?
Run this command against awslabs/aws-sdk-rust@2cbd34d,
and see if it is getting better.
Additional information
There are some other alternatives, like making
PathSource::list_files
additionally reports dirty files. I don't want to touch that part right now because it is already overly complicated.Some other approaches like
--no-vcs
to skip vcs at allO(n^2)
algorithm.This approach should be the most straighforward one.