Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create research-branch-policy.md #112

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from
Draft
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
57 changes: 57 additions & 0 deletions docs/research-branch-policy.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,57 @@
# Research Branch Policy for InstructLab

This document proposes a research branch policy for InstructLab repositories (sdg, eval, training) in the InstructLab organization.

## Background

Research engineers contributing to InstructLab have expressed the need for the ability to make rapid, experimental changes in a space where they can collaborate and move quickly without working through robust, production-level code review.

Check failure on line 7 in docs/research-branch-policy.md

View workflow job for this annotation

GitHub Actions / markdown-lint

Trailing spaces

docs/research-branch-policy.md:7:240 MD009/no-trailing-spaces Trailing spaces [Expected: 0 or 2; Actual: 1] https://github.com/DavidAnson/markdownlint/blob/v0.34.0/doc/md009.md
Software engineers contributing to InstructLab have expressed concern about allowing code into the `main` branch that does not meet the production-level quality standards enforced on main. We need to minimize the time from code merged into main to shipping in a release, and automating confidence in that shipment.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does the current proposal "minimizes the time from code merged into main to shipping in a release"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't actually impact it since the proposal only impacts the process previous t the merge to main. Just trying to fairly reflect the various positions and values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If thats the case then probably should remove the sentence: "We need to minimize the time from code merged into main to shipping in a release, and automating confidence in that shipment.".


One of the more unique considerations InstructLab has as a project in contrast to other open source projects is the tight collaboration between research engineers and software engineers. There are cultural differences and different best practices in each discipline, and in order to enable this partnership we may need to take some new or less standard approaches to open source project management.

### Current temporary solution

A temporary working solution to enable research engineers to make forward progress is creating a `research` branch in a fork of the InstructLab repository under a user account. This enables the researchers to move forward rapidly and experiment in a collaborative manner. This will also necessitate a more involved merge process once the branch is ready to merge into the `main` branch of the `instructlab` repo under the InstructLab GitHub organization. That is a known and accepted tradeoff of not doing continuous production-level code review.

### Proposed long-term solution

Longer-term, however, the researchers have expressed concern that having this `research` repository in a user's fork makes it more difficult for would-be research-oriented contributors to discover. They would like to have a `research` branch (akin to the common practice of having a `devel` branch) on the main `instructlab` repositories such that their work is discoverable in addition to not negatively impacting the corresponding production-quality `main` branch.

Check failure on line 18 in docs/research-branch-policy.md

View workflow job for this annotation

GitHub Actions / markdown-lint

Trailing spaces

docs/research-branch-policy.md:18:467 MD009/no-trailing-spaces Trailing spaces [Expected: 0 or 2; Actual: 1] https://github.com/DavidAnson/markdownlint/blob/v0.34.0/doc/md009.md
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please share one or more projects that use a devel branch besides their main branch? Thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://nvie.com/posts/a-successful-git-branching-model/

any project that uses the gitflow branching strategy?

Copy link
Contributor

@bjhargrave bjhargrave Jul 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we need an instructlab-research GitHub organization where research forks of the main instructlab repos are kept. This would be a central place for people to discover ongoing InstructLab research efforts and using forks there will facilitate making pull requests back to the main repos when ready to drive the research work back into production code. This would also allow for many forks of a given repo (e.g instructlab-research/sdg-idea1, instructlab-research/sdg-idea2), where multiple disjoint research efforts do not step on each other.

Having "research" branches in the main repos does not make them discoverable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bjhargrave I actually really like this idea! what are the main downsides?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only downside I can think of is having to manage another GH organization. But that is also a feature. We probably need to have different maintainers for production work than for research work. Using branches in a single repo makes things complex (probably requires CODEOWNERS which vary by branch which is a PITA). With a separate GH org, you can manage the teams and their application to research repos in a more straightforward and common manner perhaps with more relaxed rules for who can be a committer in research repos than you would want for production repos.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a separate branch (e.g. research) is doable and some organizations like Helm did it when developing the v3 release. They used a dev-v3 branch and developed it in parallel with the main branch which was v2. Once v3 was ready to be released, the main moved to dev-v2 and v3 became main. During development of v3, any fixes or features that were required were pulled in the dev-v3 branch. Labeling was used for PRs and issues because they are common for both branches. In this case, the research branch should be kept in sync with main so that they don't diverge.

This is a good alternative @bjhargrave from using research branch. Its more self contained and mirrors usual best GH practice of users forking a repo. In this case also, the fork should be kept in sync with thr main repo so that they don't diverge. There is however an overhead in maintaining 2 different organizations.


### Concern about a specialized branch in main
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What concerns me most about this proposal is that while researchers work on the research branch, main will continue to evolve. This will lead to constant rebasing and potential breakages from main into research. We need to keep research in sync with main somehow, but the constant back and forth will cause a lot of pain.

Copy link
Member Author

@mairin mairin Jul 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now we have a lot of pain the researchers are taking on with continuous code reviews causing friction and delay. The status quo isn't ok. Any ideas on alternatives to address? Noting that researchers' role is not to produce production-ready code?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can pair up researchers with software engs so they can embrace software eng practice a little more? I don't think both practices (research and soft eng) are mutually exclusive. Additionally, pushing small changes instead of a large amount of code helps get faster reviews and merges. Rapid iterations do not mean low-quality code.

Copy link
Member Author

@mairin mairin Jul 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sdg pr #72 is an example of where this isn't happening as advertised

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That’s not a fair response, @mairin. That review was given priority attention and good faith responses trying to find the best solution quickly. By the end of the day, I took on that role of helping get it shape for main by filing two new PRs incorporating the important fix in one, and the desired feature in another incorporating review feedback and adding automated testing. This all happened within a day. What more do you want?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The feedback received regarding the cited PR is valid as well as what you are saying is valid. The problem is we have two different ways of working that we need to accomodate, and we need to find an aproach that accomodates all. Thus I filed this devdoc in an attempt to follow the process for working out such processes, so it could be discussed. I do think my response was fair. It was not made in bad faith. More, to point out, there are two proposed approaches - the research approach and the SWE approach. Neither taken as typically practiced is going to address the concerns of the other. So I am trying to counter the notion that following the standard SWE PR review process on small PRs is not going to address the concerns of the research engineers and I am citing a specific PR for more information. It was not meant to be unfair or a point of criticism. It was meant to point out the problem that needs to be solved.


The software engineers have expressed concern that having a non-standard branch like this on the main upstream repository and not in a fork is against standard upstream open source practice:

- It creates social strata between contributors who have the permissions to create a branch in the main repo vs contributors who can only create a branch in a user fork. "Official branches in the repo, all development in forks."
- PRs for both the `main` (production-quality) and `research` (rapid and experimental) will live in the same repository. This might cause confusion for PR review workflows - it may not be clear to the PR reviewer which branch the PR is destined for clearly depending on the tooling being used.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a valid concern that can be reduced by applying labels to PRs once they are pushed to research.


# Git branch strategies

Check failure on line 27 in docs/research-branch-policy.md

View workflow job for this annotation

GitHub Actions / markdown-lint

Multiple top-level headings in the same document

docs/research-branch-policy.md:27 MD025/single-title/single-h1 Multiple top-level headings in the same document [Context: "# Git branch strategies"] https://github.com/DavidAnson/markdownlint/blob/v0.34.0/doc/md025.md

There are different strategies around git branching:

- [Release-based branching](https://medium.com/@pooyagohardani/git-release-candidate-branches-strategy-balancing-quality-and-agility-in-software-development-21cc00842e03)
- [Feature branching](https://martinfowler.com/articles/branching-patterns.html)
- [Git flow](https://nvie.com/posts/a-successful-git-branching-model/)
- [GitHub flow](https://docs.github.com/en/get-started/using-github/github-flow)
- [Task branching](https://unity.com/how-to/devops-task-branch-workflow)
- [Trunk-based development](https://trunkbaseddevelopment.com/)
- Probably others!

(More information on some of these including a comparison chart is available in [Git Branching Strategies for DevOps: Best Practices for Collaboration](https://dev.to/angelotheman/git-branching-strategies-for-devops-best-practices-for-collaboration-35l8) by Angel Oduro-Temeng Twumasi on dev.to.)

The CLI repository [instructlab/instructlab](https://github.com/instructlab/instructlab) has a [Release Strategy devdoc](https://github.com/instructlab/instructlab/blob/main/docs/release-strategy.md) that dictates a release-based branching strategy. We do not have defined branching strategies for the other repos.

We should probably decide on explicit branching policies for each of the core repositories in the `instructlab.` This will set clear expectations for our ways of working as a community project.

# Research branch policy proposal

Check failure on line 45 in docs/research-branch-policy.md

View workflow job for this annotation

GitHub Actions / markdown-lint

Multiple top-level headings in the same document

docs/research-branch-policy.md:45 MD025/single-title/single-h1 Multiple top-level headings in the same document [Context: "# Research branch policy propo..."] https://github.com/DavidAnson/markdownlint/blob/v0.34.0/doc/md025.md

Regardless of the branching strategy of the repo, we propose allowing a `release` branch on each of the core `instructlab`. This branch would not be subject to the same PR review guidelines as the rest of the repository. The goal of this branch is to enable rapid, multi-person collaborative development to get the code to a state that can then go through a more stringent code review process that adheres to the project's standards:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Regardless of the branching strategy of the repo, we propose allowing a `release` branch on each of the core `instructlab`. This branch would not be subject to the same PR review guidelines as the rest of the repository. The goal of this branch is to enable rapid, multi-person collaborative development to get the code to a state that can then go through a more stringent code review process that adheres to the project's standards:
Regardless of the branching strategy of the repo, we propose allowing a `research` branch on each of the core `instructlab`. This branch would not be subject to the same PR review guidelines as the rest of the repository. The goal of this branch is to enable rapid, multi-person collaborative development to get the code to a state that can then go through a more stringent code review process that adheres to the project's standards:

Did you mean research?


- The `research` branch in each repository that has one will have named maintainers. Maintainer membership will be based on [the InstructLab maintainer policy already in place](https://github.com/instructlab/community/blob/main/GOVERNANCE.md#project-maintainers-overview).
- Research-oriented PRs must be made against the `research` branch.
- `research` branch maintainers will have the freedom to determine their PR merge policy. It will be documented in their branch's README or another easily-discoverable location of their choosing.
- (Need to address CI in this policy. Not sure how or what is fair.)
- When the `research` branch is in a state the maintainers are confident is good enough to attempt merging to main, they will coordinate with the repository core maintainers to start a review process.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes me very nervous. I'm concerned that we'll end up with an enormous amount of low-quality code being proposed to main through PRs. How can we ensure this doesn't happen? Ideally, we will need small batches instead of trying to merge the entire research branch in main.

I think this is a very tedious process because we will likely not take every commit from research and push it into main. Some of that history will need to be rewritten to meet main quality standards. This sounds like a significant amount of effort for whoever will be responsible for pushing code from research into main.

I'd like to avoid a situation where research moves so quickly that merging into main cannot keep up without lowering the quality standards of main, thereby putting pressure on main to merge more rapidly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If i said s/research/devel on the name of the branch, would that change anything for you?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, same problem.

- The responsibility for completing the merge process by responding to review feedback is on the software engineer maintainers of the `research` branch and is not the responsibility of the research engineer maintainers. The research engineer maintainers will assist the software engineers responsible for the merge, however.
- The merge code review for the `research` branch will adhere to the review requirements of the repository and follow the standard PR process already documented.

The creation of this policy is in acknowledgement of the different ways of working inherent to the research engineer role in contrast to the production-level software engineer role. As a project, InstructLab has a goal of enabling upstream, open source collaboration between AI researchers, and this is one of the ways we can help achieve that goal.
Loading