Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create research-branch-policy.md #112

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from
Draft

Conversation

mairin
Copy link
Member

@mairin mairin commented Jul 4, 2024

Research engineers contributing to InstructLab have expressed the need for the ability to make rapid, experimental changes in a space where they can collaborate and move quickly without working through robust, production-level code review.

Software engineers contributing to InstructLab have expressed concern about allowing code into the main branch that does not meet the production-level quality standards enforced on main. We need to minimize the time from code merged into main to shipping in a release, and automating confidence in that shipment.

One of the more unique considerations InstructLab has as a project in contrast to other open source projects is the tight collaboration between research engineers and software engineers. There are cultural differences and different best practices in each discipline, and in order to enable this partnership we may need to take some new or less standard approaches to open source project management.

This devdoc proposes the creation of a policy in acknowledgement of the different ways of working inherent to the research engineer role in contrast to the production-level software engineer role. As a project, InstructLab has a goal of enabling upstream, open source collaboration between AI researchers, and this is one of the ways we can help achieve that goal.

Signed-off-by: Máirín Duffy <[email protected]>
@russellb russellb self-requested a review July 4, 2024 18:16
Copy link
Contributor

@leseb leseb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was there any alternative considered other than using a research branch?


# Research branch policy proposal

Regardless of the branching strategy of the repo, we propose allowing a `release` branch on each of the core `instructlab`. This branch would not be subject to the same PR review guidelines as the rest of the repository. The goal of this branch is to enable rapid, multi-person collaborative development to get the code to a state that can then go through a more stringent code review process that adheres to the project's standards:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Regardless of the branching strategy of the repo, we propose allowing a `release` branch on each of the core `instructlab`. This branch would not be subject to the same PR review guidelines as the rest of the repository. The goal of this branch is to enable rapid, multi-person collaborative development to get the code to a state that can then go through a more stringent code review process that adheres to the project's standards:
Regardless of the branching strategy of the repo, we propose allowing a `research` branch on each of the core `instructlab`. This branch would not be subject to the same PR review guidelines as the rest of the repository. The goal of this branch is to enable rapid, multi-person collaborative development to get the code to a state that can then go through a more stringent code review process that adheres to the project's standards:

Did you mean research?

- Research-oriented PRs must be made against the `research` branch.
- `research` branch maintainers will have the freedom to determine their PR merge policy. It will be documented in their branch's README or another easily-discoverable location of their choosing.
- (Need to address CI in this policy. Not sure how or what is fair.)
- When the `research` branch is in a state the maintainers are confident is good enough to attempt merging to main, they will coordinate with the repository core maintainers to start a review process.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes me very nervous. I'm concerned that we'll end up with an enormous amount of low-quality code being proposed to main through PRs. How can we ensure this doesn't happen? Ideally, we will need small batches instead of trying to merge the entire research branch in main.

I think this is a very tedious process because we will likely not take every commit from research and push it into main. Some of that history will need to be rewritten to meet main quality standards. This sounds like a significant amount of effort for whoever will be responsible for pushing code from research into main.

I'd like to avoid a situation where research moves so quickly that merging into main cannot keep up without lowering the quality standards of main, thereby putting pressure on main to merge more rapidly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If i said s/research/devel on the name of the branch, would that change anything for you?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, same problem.

## Background

Research engineers contributing to InstructLab have expressed the need for the ability to make rapid, experimental changes in a space where they can collaborate and move quickly without working through robust, production-level code review.
Software engineers contributing to InstructLab have expressed concern about allowing code into the `main` branch that does not meet the production-level quality standards enforced on main. We need to minimize the time from code merged into main to shipping in a release, and automating confidence in that shipment.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does the current proposal "minimizes the time from code merged into main to shipping in a release"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't actually impact it since the proposal only impacts the process previous t the merge to main. Just trying to fairly reflect the various positions and values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If thats the case then probably should remove the sentence: "We need to minimize the time from code merged into main to shipping in a release, and automating confidence in that shipment.".

The software engineers have expressed concern that having a non-standard branch like this on the main upstream repository and not in a fork is against standard upstream open source practice:

- It creates social strata between contributors who have the permissions to create a branch in the main repo vs contributors who can only create a branch in a user fork. "Official branches in the repo, all development in forks."
- PRs for both the `main` (production-quality) and `research` (rapid and experimental) will live in the same repository. This might cause confusion for PR review workflows - it may not be clear to the PR reviewer which branch the PR is destined for clearly depending on the tooling being used.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a valid concern that can be reduced by applying labels to PRs once they are pushed to research.


Longer-term, however, the researchers have expressed concern that having this `research` repository in a user's fork makes it more difficult for would-be research-oriented contributors to discover. They would like to have a `research` branch (akin to the common practice of having a `devel` branch) on the main `instructlab` repositories such that their work is discoverable in addition to not negatively impacting the corresponding production-quality `main` branch.

### Concern about a specialized branch in main
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What concerns me most about this proposal is that while researchers work on the research branch, main will continue to evolve. This will lead to constant rebasing and potential breakages from main into research. We need to keep research in sync with main somehow, but the constant back and forth will cause a lot of pain.

Copy link
Member Author

@mairin mairin Jul 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now we have a lot of pain the researchers are taking on with continuous code reviews causing friction and delay. The status quo isn't ok. Any ideas on alternatives to address? Noting that researchers' role is not to produce production-ready code?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can pair up researchers with software engs so they can embrace software eng practice a little more? I don't think both practices (research and soft eng) are mutually exclusive. Additionally, pushing small changes instead of a large amount of code helps get faster reviews and merges. Rapid iterations do not mean low-quality code.

Copy link
Member Author

@mairin mairin Jul 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sdg pr #72 is an example of where this isn't happening as advertised

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That’s not a fair response, @mairin. That review was given priority attention and good faith responses trying to find the best solution quickly. By the end of the day, I took on that role of helping get it shape for main by filing two new PRs incorporating the important fix in one, and the desired feature in another incorporating review feedback and adding automated testing. This all happened within a day. What more do you want?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The feedback received regarding the cited PR is valid as well as what you are saying is valid. The problem is we have two different ways of working that we need to accomodate, and we need to find an aproach that accomodates all. Thus I filed this devdoc in an attempt to follow the process for working out such processes, so it could be discussed. I do think my response was fair. It was not made in bad faith. More, to point out, there are two proposed approaches - the research approach and the SWE approach. Neither taken as typically practiced is going to address the concerns of the other. So I am trying to counter the notion that following the standard SWE PR review process on small PRs is not going to address the concerns of the research engineers and I am citing a specific PR for more information. It was not meant to be unfair or a point of criticism. It was meant to point out the problem that needs to be solved.

@mairin
Copy link
Member Author

mairin commented Jul 4, 2024

Was there any alternative considered other than using a research branch?

Nope. Open to suggestions!

@russellb
Copy link
Member

russellb commented Jul 4, 2024

Was there any alternative considered other than using a research branch?

Nope. Open to suggestions!

I’d like to see it weighed against keeping all active development not ready for main in repo forks. We talked about this a bunch just today so surprised to see this answer?

to be fair I haven’t had a chance to read the docs itself yet given the holiday

@mairin
Copy link
Member Author

mairin commented Jul 4, 2024

Was there any alternative considered other than using a research branch?

Nope. Open to suggestions!

I’d like to see it weighed against keeping all active development not ready for main in repo forks. We talked about this a bunch just today so surprised to see this answer?

to be fair I haven’t had a chance to read the docs itself yet given the holiday

yes, that is the temporary solution marked in the document, which i thought we were aligned on


### Proposed long-term solution

Longer-term, however, the researchers have expressed concern that having this `research` repository in a user's fork makes it more difficult for would-be research-oriented contributors to discover. They would like to have a `research` branch (akin to the common practice of having a `devel` branch) on the main `instructlab` repositories such that their work is discoverable in addition to not negatively impacting the corresponding production-quality `main` branch.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please share one or more projects that use a devel branch besides their main branch? Thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://nvie.com/posts/a-successful-git-branching-model/

any project that uses the gitflow branching strategy?

Copy link
Contributor

@bjhargrave bjhargrave Jul 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we need an instructlab-research GitHub organization where research forks of the main instructlab repos are kept. This would be a central place for people to discover ongoing InstructLab research efforts and using forks there will facilitate making pull requests back to the main repos when ready to drive the research work back into production code. This would also allow for many forks of a given repo (e.g instructlab-research/sdg-idea1, instructlab-research/sdg-idea2), where multiple disjoint research efforts do not step on each other.

Having "research" branches in the main repos does not make them discoverable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bjhargrave I actually really like this idea! what are the main downsides?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only downside I can think of is having to manage another GH organization. But that is also a feature. We probably need to have different maintainers for production work than for research work. Using branches in a single repo makes things complex (probably requires CODEOWNERS which vary by branch which is a PITA). With a separate GH org, you can manage the teams and their application to research repos in a more straightforward and common manner perhaps with more relaxed rules for who can be a committer in research repos than you would want for production repos.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a separate branch (e.g. research) is doable and some organizations like Helm did it when developing the v3 release. They used a dev-v3 branch and developed it in parallel with the main branch which was v2. Once v3 was ready to be released, the main moved to dev-v2 and v3 became main. During development of v3, any fixes or features that were required were pulled in the dev-v3 branch. Labeling was used for PRs and issues because they are common for both branches. In this case, the research branch should be kept in sync with main so that they don't diverge.

This is a good alternative @bjhargrave from using research branch. Its more self contained and mirrors usual best GH practice of users forking a repo. In this case also, the fork should be kept in sync with thr main repo so that they don't diverge. There is however an overhead in maintaining 2 different organizations.

@hickeyma hickeyma self-requested a review July 8, 2024 09:18
Copy link
Member

@hickeyma hickeyma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mairin for pushing this proposal.

Before we go down the route of implementing a solution for working around this issue - can we discuss if we can bridge the gap of researches being able to make contributions to the main repositories regularly? Can we list the current blockages to see if we can find a way to make this work? My reasoning is that separate development branches and separate GH organizations are expensive to maintain and manage, and should be a last resort in my opinion.

## Background

Research engineers contributing to InstructLab have expressed the need for the ability to make rapid, experimental changes in a space where they can collaborate and move quickly without working through robust, production-level code review.
Software engineers contributing to InstructLab have expressed concern about allowing code into the `main` branch that does not meet the production-level quality standards enforced on main. We need to minimize the time from code merged into main to shipping in a release, and automating confidence in that shipment.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If thats the case then probably should remove the sentence: "We need to minimize the time from code merged into main to shipping in a release, and automating confidence in that shipment.".


### Proposed long-term solution

Longer-term, however, the researchers have expressed concern that having this `research` repository in a user's fork makes it more difficult for would-be research-oriented contributors to discover. They would like to have a `research` branch (akin to the common practice of having a `devel` branch) on the main `instructlab` repositories such that their work is discoverable in addition to not negatively impacting the corresponding production-quality `main` branch.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a separate branch (e.g. research) is doable and some organizations like Helm did it when developing the v3 release. They used a dev-v3 branch and developed it in parallel with the main branch which was v2. Once v3 was ready to be released, the main moved to dev-v2 and v3 became main. During development of v3, any fixes or features that were required were pulled in the dev-v3 branch. Labeling was used for PRs and issues because they are common for both branches. In this case, the research branch should be kept in sync with main so that they don't diverge.

This is a good alternative @bjhargrave from using research branch. Its more self contained and mirrors usual best GH practice of users forking a repo. In this case also, the fork should be kept in sync with thr main repo so that they don't diverge. There is however an overhead in maintaining 2 different organizations.

@nathan-weinberg
Copy link
Member

@mairin what is going on with this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants