On pull request workflows for the GNU toolchain

Mon Sep 23 12:56:38 GMT 2024

On 19/09/2024 16:51, Joseph Myers via Gcc wrote:
> 1. Introduction
> 
> This message expands on my remarks at the Cauldron (especially the
> patch review and maintenance BoF, and the Sourceware infrastructure
> BoF) regarding desired features for a system providing pull request
> functionality (patch submission via creating branches that are then
> proposed using some kind of web interface or API, with a central
> database then tracking the status of each pull request and review
> comments thereon automatically), for use by the GNU toolchain (or one
> or more components thereof - there is no need for each component to
> make the same decision about moving to such software and workflow, and
> indeed we have no mechanism to make such decisions for the toolchain
> as a whole).
> 
> This does not advocate a particular choice of software for such
> functionality (though much of the discussion seemed to suggest Forgejo
> as the most likely starting point), or a particular choice of where to
> host it.  Hosting would of course need to meet appropriate security
> requirements, and to achieve a passing grade on the GNU Ethical
> Repository Criteria, and the software would need to be entirely free
> software.  Where relevant features are not already supported, it's
> important that the software is receptive to the addition of such
> features (including cases where part of the functionality is provided
> by software specific to the GNU toolchain or parts thereof - such as
> for the custom checks currently implemented with git hooks - and the
> underlying software provides appropriate interfaces to allow
> integration of such external pieces).  The list of features here may
> be a good basis for reviewing what particular forge software supports
> and whether other features can be added, directly or through use of
> appropriate APIs.
> 
> Forge software may provide other pieces such as bug tracking or wikis
> that we currently handle separately from git hosting.  In such cases,
> we should be able to disable those pieces and keep using the existing
> bug tracking and wiki software (while having the option to decide
> independently to migrate those if desired).
> 
> I consider the overall benefits of such a move to be having more
> structured data about all changes proposed for inclusion and their
> status (needing review, needing changes from the author, under
> discussion, needing merge from mainline, etc.), to help all people
> involved in the patch submission and review process to track such
> information and to find patches needing review as applicable, along
> with providing a more familiar workflow for many people that avoids
> many of the problems with email (which affect experienced contributors
> working around corporate email systems, not just new contributors).
> It would not of course by itself turn people with no interest in or
> understanding of systems software development into contributors (for
> example, people without knowledge of directories and hierarchical file
> storage, or people who only understand software development as web
> development).  Nor would it prevent the accumulation of large backlogs
> of unreviewed patches, as is evident from many large and active
> projects using PR workflows with large numbers of open PRs.
> 
> As Richard noted in his BoF, email sucks.  As I noted in reply, so do
> the web and web browsers when trying to deal with large amounts of
> patch review state (when one wishes to apply one's own view, not the
> forge's, of what is resolved and what needs attention).  As I also
> noted, in the Sourceware infrastructure BoF, tools such as patchwork
> and b4 are the right answer to the wrong question: trying to get
> structured data about patch submissions when working from the axiom
> that emails on a mailing list should be the primary source of truth
> for everything needing review, rather than starting from more
> structured data and generating emails as one form of output.
> 
> Moving to a pull request system is not expected to change policies
> regarding who can approve a change for inclusion, or the technical
> limits on who can cause a change to enter mainline (e.g. all people
> with commit access would be expected to be able to use a button in a
> web interface to cause to PR to be merged, though policy might limit
> when they should do so).  We can of course choose to change policies,
> either as part of adopting a PR system or later.
> 
> 
> 2. Key features
> 
> (a) Some forges have a design that emphasises the tree you get after a
> proposed contribution, but not the sequence of commits to get there.
> For the toolchain, we care about the clean, logical sequence of
> commits on the mainline branch.  (We also use linear history on
> mainline, but that's a secondary matter - though certainly we'd want
> any forge used to support such linear history so that property doesn't
> need to change as part of adopting pull request workflow.)  Having a
> clean sequence of commits has some implications for forge support:
> 
> * Support for reviewing the proposed commit message, not just the
>    diffs, is important (and it should be clear what commit message
>    would result from merging any pull request).
> 
> * Patch series and dependencies between patches are important.  In
>    such cases, fixes for issues from review should go into the
>    appropriate logical commit in the PR (with rebasing as necessary),
>    and it should be possible at all times to see what the sequence of
>    commits on mainline would look like.  (E.g. people use some
>    workarounds on GitHub to manage PR dependencies, but those result in
>    second and subsequent PRs in a series showing the full set of diffs
>    from a PR and those it depends on, rather than just the logical
>    diffs for one PR.)
> 
>    I consider patch series and dependencies to be separate but related
>    things: a patch series may not have strictly linear dependencies
>    (and it's often useful to merge the more straightforward patches
>    from a series while still discussing and revising others), while a
>    patch may depend on other patches that it's not logically part of
>    the same series as.  They are, however, closely related, and a
>    sufficient solution for dependencies might also be adequate for many
>    cases of series.
> 
>    Note that series can sometimes have hundreds of patches; any
>    solution for patch series and dependencies needs to scale that far.
> 
>    There is of course the common case of a single-patch submission,
>    where the patch is ready for inclusion after some number of fixes.
>    In those cases, it's probably convenient if it's not necessary to
>    rebase - provided it's clear that a particular PR would be
>    squash-merged, and also what the commit message would be for the
>    final fixed commit.
> 
> * Given the need for rebasing when working with patch series, it's
>    important to have good support for rebasing.  In particular, all
>    revisions of the changes for a PR that was rebased need to remain
>    permanently available (e.g. through appropriate documented refs to
>    fetch to get all revisions of all PRs).
> 
> (b) Various people have built efficient workflows for going through
> all patch submissions and comments (or all in a particular area), even
> when only reviewing a small proportion, and have concerns about
> efficiency of a web interface when working with many patches and
> comments.  It's important to have good API support to allow people to
> build tools supporting their own workflow like this without needing to
> use the browser interface (and following their own logic, not the
> forge's, for what changes are of interest).  Good API support might,
> for example, include a straightforward way to get all changes to PR
> and comment data and metadata since some particular point, as well as
> for actions such as reviewing / commenting / approving a PR.  Such API
> support might be similar to what's needed to ensure people can readily
> get and maintain a local replica of all the key data and its history
> for all PRs.
> 
> Replication like that is also important for reliably ensuring key data
> remains available even if the forge software subsequently ceases to be
> maintained.  Consider, for example, the apparent disappearance of the
> data from www-gnats.gnu.org (we occasionally come across references to
> old bug reports from that system in glibc development, but don't have
> any way to resolve those references).
> 
> Another use of such an API would be to allow maintaining local copies
> of all PR comments etc. in a plain text form that can be searched with
> grep.
> 
> (c) Given that a transition would be from using mailing lists, it's
> important to have good plain text outward email notifications for all
> new merge requests, comments thereon and other significant actions
> (changing metadata, approvals / merges, etc.) - through whatever
> combination of built-in support in a forge and local implementation
> using the API.  Such notifications would go to the existing mailing
> lists (note that the choice of mailing lists for a patch can depend on
> which parts of the tree it changes - for example, the choice between
> binutils and gdb-patches lists, or some GCC patches going to the
> fortran or libstdc++ lists) - as well as to individuals concerned with
> / subscribed to a given PR.  Some very routine changes, such as
> reports of clean CI results, might be omitted by default from
> notifications sent to the mailing lists.
> 
> "good" includes quoting appropriate diff hunks being comments on.
> It's OK to have an HTML multipart as well, but the quality of the
> plain text part matters.  Diffs themselves should be included in
> new-PR emails (and those for changes to a PR) unless very large.
> 
> I do not however suggest trying to take incoming email (unstructured)
> and turn it into more structured comments on particular parts of a PR
> in the database.
> 
> Similarly, commit emails should continue to go to the existing mailing
> lists for those.
> 
> (d) Rather than the forge owning the mainline branch in the sense of
> every commit having to go through an approved PR, at least initially
> it should be possible for people to push directly to mainline as well,
> so the transition doesn't have to be instantaneous (and in particular,
> if a change was posted before the transition and approved after it, it
> shouldn't be necessary to turn it into a PR).  Longer term, I think it
> would be a good idea to move to everything going through the PR system
> (with people able to self-approve and merge their own PRs immediately
> where they can commit without review at present), so there is better
> structured uniform tracking of all changes and associated CI
> information etc. - however, direct commits to branches other than
> mainline (and maybe release branches) should continue to be OK long
> term (although it should also be possible to make a PR to such a
> branch if desired).
> 
> Beyond putting everything through PRs, eventually I'd hope to have
> merges to mainline normally all go through a CI system that makes sure
> there are no regressions for at least one configuration before merging
> changes.
> 
> (e) All existing pre-commit checks from hooks should be kept in some
> form, to maintain existing invariants on both tree and commit contents
> (some hook checks make sure that commits don't have commit messages
> that would cause other automated processes to fall over later, for
> example).
> 
> (f) Existing cron jobs that read from or commit to repositories should
> use normal remote git access, not filesystem access to the repository,
> including using APIs to create and self-merge PRs when pushing to the
> repository is involved.
> 

Thanks for such a comprehensive write-up Joseph.  I need to read through 
this with a lot more care, but not until my Covid fug has cleared more 
(ugh!).

One thing we must do, however, is break requirements into a number of 
categories: must haves (red lines, we can't transition without this); 
should haves (important, but we can likely find acceptable 
work-arounds); and would like (this would make things really nice, but 
they won't really block a transition).

For the first two categories we also need to think hard about WHY this 
is the case - that is, we should be able to state clearly a 
justification for the requirement.

A quick read of your email suggests you've identified a number of these 
issues, but it's not quite as clear on how hard each requirements needs 
to be, or perhaps why you think it has to be considered.

I don't expect the list to be entirely static either: issues might arise 
as we do more detailed investigation, but this list looks like a good 
starting point.

R.

PS: Forgejo can support issues in bugzilla (and you can teach it a 
regexp rule to create links from our component/id format into a bugzilla 
link); it can also support external wikis.  So if (still quite a big if) 
we went the forgejo route, those issues can be handled.