Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add copybara file and the associated workflow #100

Closed
wants to merge 18 commits into from

Conversation

JehandadKhan
Copy link
Collaborator

@JehandadKhan JehandadKhan commented Oct 9, 2024

The cron for the workflow fires everyday at 0 12 * * * to fetch changes from upstream and create a PR in this repo. Which would automatically fire off the CI tests and can be merged manually.

@mrodden
Copy link
Collaborator

mrodden commented Oct 14, 2024

Lets run this at 6 or 7 am or after works hours in order to not have folks rebase in the middle of the day on the incoming changes.

Copy link
Collaborator

@mrodden mrodden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So after looking over how this is working and also examining the PR / commit that copybara is creating, I am not sure why we would want this over just doing a git pull / push to sync the history from upstream main to rocm/jax:main.

It appears that copybara is creating a new commit from all the file changes from the "origin" branch, which is suboptimal for a lot of reasons, but mainly I don't think its very useful if it is going to be creating a new "history" when it copies code from upstream commits into new commits on our main branch. This would pretty much always result in making it much more difficult for Git rebase to calculate the correct set of commits to move from our internal branches to upstream, as the new copybara commits will show up as "new" from the merge-base. This makes it much harder to port our internal changes to upstream when they are ready, as well as calculating differences from upstream to downstream easily.

Did you try just doing a straight git pull/push using an external script? You already have the credential management stuff for doing these kinds of operations on repos external to the one with Github Actions, so it would be easy to drop a new workflow in a different repo to have it do the sync if you still want to use GH Actions.

workflow_dispatch:
pull_request:
schedule:
- cron: '0 12 * * *' # Runs every day
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be best to run at 6 or 7 am.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that makes sense.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular reason? I have seen them upstreaming stuff anytime of the day

@JehandadKhan
Copy link
Collaborator Author

From your comment above, I see two primary issues that need addressing:

  • Copy Bara no longer maintaining history (as opposed to git pull/push)
  • New commits by Copybara making it harder to rebase to facilitate moving downstream ( rocm/jax) code to upstream
    The fundamental assumption here is that we would always be able to maintain clean history with upstream. Which is not going to be the case since we would always have some difference, for example:
  • CI specific files that have business upstream
  • The need to rename CI runners ( You can already see this in the copybara PR)
  • Workarounds fixes, for our local CI/ QA that have not been merged into upstream.
  • Patch Jax locally to point it to our preferred XLA commit. Likely, again due to differences between timelines for when ROCm specific changes make their way to upstream main branches.

Essentially, there will always be some diff with upstream that we need to maintain and manage for one reason or another. Which means that we need to copy code from upstream and apply our code transformations on top of those changes when necessary.

Currently, the approach to handle the above issues is to manage branches with diffs from upstream are maintained until upstream merges those changes, in which case they flow down automatically. However, this is error prone and unreliable unless there is some automation in place.

Finally, one might argue that why go through a new tool, when all we need is some python wrapped around a git pull and we may still be able to accomplish the goals above. However, I would argue that if we set out on that path, soon enough we would be rewriting copybara.

The history of our downstream repo is no longer relevant since we always directly merge stuff into upstream. This assumes that you use copybara again to automatically create a PR in upstream, since that is based on a diff, github's history is immaterial. Also, this is how upstream manages their repos as well, with changes flowing from their internal repo to the external one. For us the repos are reversed with changes flowing from upstream to downstream.

Please let me know your thoughts.

@charleshofer
Copy link
Collaborator

Finally, one might argue that why go through a new tool, when all we need is some python wrapped around a git pull and we may still be able to accomplish the goals above. However, I would argue that if we set out on that path, soon enough we would be rewriting copybara.

What do you envision us adding to that git pull that would be basically re-creating Copybara? Are there some specific features you think we'll need in the future? Otherwise, I'm a little skeptical that Copybara is necessary.

Given that we merge this and don't share a commit history with upstream, what're the git commands we'd use to open a PR upstream after we've got a working PR on our local? Would we pull upstream main into our local repo, create a new feature branch (or maybe just use main in our own fork of upstream), do a git cherry-pick onto a branch that shares the same history as upstream main, and then open a PR against that branch? I think not sharing the history with upstream isn't such a huge minus if we have a process that uses something like cherry-pick where the history doesn't matter.

Copy link

@Ruturaj4 Ruturaj4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM other than that some of @charleshofer questions are definitely valid.

workflow_dispatch:
pull_request:
schedule:
- cron: '0 12 * * *' # Runs every day

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular reason? I have seen them upstreaming stuff anytime of the day

@JehandadKhan
Copy link
Collaborator Author

What do you envision us adding to that git pull that would be basically re-creating Copybara? Are there some specific features you think we'll need in the future? Otherwise, I'm a little skeptical that Copybara is necessary.

Features such as

  • Fixing labels used by upstream CI to match what we have internally
  • Overriding the XLA commit and repo
  • Changes that have not yet been merged upstream but break our CI
  • Changes needed for new asics that we cannot merge to upstream yet

Given that we merge this and don't share a commit history with upstream, what're the git commands we'd use to open a PR upstream after we've got a working PR on our local? Would we pull upstream main into our local repo, create a new feature branch (or maybe just use main in our own fork of upstream), do a git cherry-pick onto a branch that shares the same history as upstream main, and then open a PR against that branch? I think not sharing the history with upstream isn't such a huge minus if we have a process that uses something like cherry-pick where the history doesn't matter.

I am hoping we should be able to use copybara to automate that process, such that when we approve a PR on our fork, coybara can cherry pick the diff to upstream and we would not have to manually do those steps. Creating a new PR automatically using copybara is trivial, however, rolling our own tool to do the same would need some work and maintenance on our part.

@charleshofer charleshofer self-requested a review October 18, 2024 16:25
Copy link
Collaborator

@charleshofer charleshofer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talked with Jehandad this morning at the standup. This all makes sense to me. I know on Thursday that Matt said that they had this pretty much figured out at IBM and that there's a better way than Copybara, so I'd still be curious to know how they were doing it.

@charleshofer
Copy link
Collaborator

Notes from my discussion with Matt on CI:

At IBM, there were several projects that were open source, but where IBM maintained a similar downstream repo to us. IBM would put some changes onto upstream releases, have long-running features that were specific to IBM code, and would push fixes to upstream. This was all doable with regular git commands and in a way that would make git handle most conflicts on its own. Part of what made these conflicts work well was preserving the history from upstream main in the downstream repo. This made it easy to see how far apart from upstream they were and let git handle conflict resolution.

JAX has some weird dependencies that are tricky to manage. These will pretty much always be ahead of upstream:

  • ROCm XLA - tracked by branch name?
  • PJRT - not sure how this gets specified
  • ROCm itself (not super problematic for code/git because this gets set a build time)

Features nice to have:

  • Preserve upstream history in our repo
  • Don't want random changes in upstream to break our CI all the time
  • Use GitHub Actions, not Jenkins

Solutions for the JAX team:

  • (this PR, basically) Use Copybara to pull upstream main to our main and patch with patch files, open up a PR against our main, close it, open the same PR against upstream
  • Have a main branch and a "develop" (probably don't call it this) branch. Our main mirrors upstream main. All of our patches belong on develop. Nightly update: pull to our main, rebase our develop to main.
  • Same as above, but nightly update is more complex:
    • Merge our develop with upstream main locally and make sure there are no conflicts and that tests pass. Notify us if they don't. If they pass, merge (or rebase) our develop and upstream main.

Going forward:

  • Think about the use-cases for different tasks (as a developer, I want to...)
    • Release
    • Nightly CI runs
    • PRs to downstream
    • PRs to upstream
    • Pull upstream code to our downstream repo
  • Add some ideas to the design document that compares the Copybara solution with some other ideas (probably pure git commands). Do some analysis and work through use-cases.
  • Should look at how other people handle this problem. It's pretty common for a team to manage a fork of a repo and contribute back to upstream a lot, so maybe there's a cookie-cutter way of doing it that we're missing.

@JehandadKhan
Copy link
Collaborator Author

We decided today to not use Copybara and to follow an alternate plan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants