#+TITLE:     Packaging software using Git
#+AUTHOR:    Manoj Srivastava
#+EMAIL:     srivasta@acm.org
#+DATE:      <2008-04-01 Tue>
#+LANGUAGE:  en
#+TEXT:      A comparison of schemes people use for packaging
#+OPTIONS:   H:3 num:t toc:t \n:nil @:t ::t |:t ^:t -:t f:t *:t TeX:t LaTeX:nil skip:t d:nil tags:not-in-toc

* Introduction

"Are you rebasing or merging?" seems to be the 64 dollar question over in [[http://vcs-pkg.org/][vcs-pkg]] discussions. Various people have offered their preferences, and several case studies of work flows have been presented. What is lacking is an _analysis_ of the work flows: an exploration of which methodology has which advantages, and whether there are scenarios in which the _other_ work flow would have been better.

Oh, what are all these work flows about, you ask? Most of the issues with packaging software for distributions have a few things in common. There is a mainline or upstream source of development. There are zero or more independent lines of development or ongoing bug fixes that are to be managed. And then there is the tree from which the distribution package is to be built. All this talk about packaging work flows is about how best to manage asynchronous development upstream and in the independent lines of development, and how to create a coherent, debuggable, integrated tree from which to build the distribution's package.

The rebasing question goes to the heart of how to handle the independent lines of development using git, since these lines of development are based off the main line of development and must be periodically synchronized with it.

What follows here is my first look at a couple of important factors that have a bearing on that question. This is heavily geared towards git (nothing else does rebases so easily, I think), but some of the concepts should be generic.

As a teaser, there is a third answer: neither.
You can just add an independent line of development and let it sit: don't rebase, and don't merge. In some circumstances that is a winning strategy.

* Interested constituencies

Take the rebasing issue. Whether you should rebase depends on a number of factors. First, there is the question of which of the stakeholder constituencies are most important to you. There are at least three constituencies involved here:

1. Upstream developers. These are the people who are the consumers of the independent lines of development. Usually people go out of their way to feed patches and code in the form preferred by upstream, and it is in our interest to do so: the more code pushed upstream, the less work there is to do ourselves. People upstream want patch submissions to be clean, free of extraneous crud that has to be removed, and they would like a nice, clean, uncomplicated history. They care about the description of each patch in a series, but are not really interested in the history (they do not care that it took 15 typographical and logic fixes to arrive at this juncture). Upstreams want topic branches to be rebased to their latest version, so that the patches apply cleanly.
2. Downstream topic developers. These are people who base their work on your topic branches, develop code, and feed their changes back to you. If you rewrite history and rebase your topic branches, downstream developers will find it very hard to merge or cherry pick commits from you back into their development tree. Indeed, if you publish your topic branches, rebasing is not an option.
3. The distribution and its users. This is a very important constituency; most of us packaging software for distributions are doing all this work precisely for this constituency.
   Through the integration branch, this constituency is downstream of you, although, since you control the integration branch, it is not strictly downstream. One interesting case study uses a single rebased patch branch and throw-away integration branches, using a patch series in the integration branch.

Now, depending on where you are on the totem pole, some of these constituencies are more important than others. At the very top, you don't have an upstream. Git development is an example: they just use a mainline and a rebased pending-updates branch. All their downstream developers are cautioned never to base work on the pending-updates series. So rebased lines of development work for them, since they are upstream, and that is the most important constituency in their work flow.

If you are too low on the totem pole to have any downstream developers, and you can live with throw-away integration branches, rebased lines of development work as well.

* Patch flow characteristics on the topic branches

Are the stakeholders the only factor in your decision? Not by a long shot. Take, for instance, the issue of how active your independent lines of development are, how big the patch series is, and whether you want to do a functional test for each topic branch.

- Small, inactive topic branch

  If you have a single, small patch (a simple bug fix, for example), you can just create a branch for the fix, merge it into the integration branch, and let the topic branch be. As upstream development happens, it gets merged into the integration branch, where you have already merged your bug fix branch. If there is no conflict on the integration branch, don't do anything. If there is a conflict, resolve the conflict, then merge the mainline into the topic branch (aka the bug fix branch), resolve the conflict there the same way you did in the integration branch, and let it be. This is delayed integration into topic branches.
- Larger patch-sets

  For a large patch on a topic branch, you will probably have to merge almost every upstream version, since the chances of some changes overlapping are higher. You might as well get into a pattern of merging upstream into the topic branch _and_ the integration branch. Also, if there are downstream developers (more likely with large features like this), you need to keep the topic branch up to date.

- Active development

  If your topic branch is being actively developed, you need to merge it into your integration branch constantly. Whenever you need to resolve a conflict on the integration branch, you need to merge the mainline into your topic branch and resolve the conflict there as well. Again, delayed integration into the topic branches is unlikely, and impossible if you have downstream developers.

* Other factors

- Testing each topic branch independently

  If you or a downstream developer need to compile and test each topic branch independently (instead of all together, in the integration branch), you must bring each upstream change into that topic branch. Whether you rebase or merge to do so depends on whether the branch is published.

- Public or private topic branches

  If your topic branches are public, then rebasing is out, unless you have strong warnings in place telling people not to base their work on the branch. Rewriting history (which is what a rebase is) out from under people basing their work on yours is rude, and may cause a lot of work to stitch things back together.

- Team or collaborative development

  If you are packaging software as part of a team, team members need to have access to each other's branches (to pick up if you are busy, to see pending changes, to avoid duplication of work). This means public topic branches, and thus no rebasing.
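
The factors listed so far can be read as a small decision procedure. The Python sketch below is one possible encoding of them, not a definitive rule. The =TopicBranch= fields and the function names are invented here for illustration, and it simplifies "unlikely" to "no" for large or actively developed branches.

```python
from dataclasses import dataclass


@dataclass
class TopicBranch:
    """Hypothetical description of one topic branch; field names are illustrative."""
    published: bool = False           # others can fetch this branch
    warned_unstable: bool = False     # strong "do not base work on this" warning
    has_downstream: bool = False      # someone bases their work on this branch
    team_maintained: bool = False     # packaged by a team, so branches are public
    large_patch_set: bool = False
    actively_developed: bool = False
    tested_standalone: bool = False   # built and tested outside the integration branch


def may_rebase(branch: TopicBranch) -> bool:
    """Rebasing rewrites history, so it is out once others rely on the branch."""
    if branch.has_downstream or branch.team_maintained:
        return False
    if branch.published and not branch.warned_unstable:
        return False
    return True


def may_delay_integration(branch: TopicBranch) -> bool:
    """Delayed integration: sync the topic branch only when a conflict forces it."""
    if branch.has_downstream or branch.tested_standalone:
        return False
    # Simplification: the text says "unlikely" for these cases, treated here as "no".
    return not (branch.large_patch_set or branch.actively_developed)


def advise(branch: TopicBranch) -> str:
    """Summarize both decisions as a short, human-readable recommendation."""
    history = "rebase allowed" if may_rebase(branch) else "merge only"
    if may_delay_integration(branch):
        sync = "sync topic on conflict only"
    else:
        sync = "sync topic with every upstream version"
    return f"{history}; {sync}"
```

For a private, small bug fix branch, =advise(TopicBranch())= gives "rebase allowed; sync topic on conflict only". A published branch with downstream developers gives "merge only; sync topic with every upstream version".
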
* Conclusion

So, which stakeholders you have, which ones are most important to you, how big your topic branch differences are, and how active the topic development is together decide whether you rebase or merge, and whether you do delayed integration into topic branches.

Now, my personal preferences. I am mostly low down on the totem pole, but I like to publish my topic branches. So I will not rebase my public topic branches. I will have persistent integration branches, since folks in derived distributions are likely to need them. I will also always merge new upstream versions into my topic branches, just in case someone is basing their work off my public topic branch.

But since I have to cater to upstream as well, I plan on having a private, rebasable submission branch for each topic, and cherry picking original commits from the topic branch on to it. The submission branches will be rebased to the latest upstream version before submission, or more often if I feel like doing so.

The topic branches will be named "topic/foo", submission branches will be named "submission/foo", and there will be a "tmp/bar" name space for ephemeral branches. This will make it easier to script things like new upstream versions.
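
As a rough illustration of how such a naming scheme lends itself to scripting, here is a minimal Python sketch that only builds the git command lines for a new upstream version and runs nothing. The function names, the ref names =upstream= and =master=, and the commit ids passed in are assumptions made for this example. Only the "topic/", "submission/" and "tmp/" prefixes come from the text.

```python
from typing import List


def topic(name: str) -> str:
    """Public topic branch: merged with upstream, never rebased."""
    return f"topic/{name}"


def submission(name: str) -> str:
    """Private, rebasable branch used to prepare patches for upstream."""
    return f"submission/{name}"


def scratch(name: str) -> str:
    """Ephemeral branch in the tmp/ name space."""
    return f"tmp/{name}"


def plan_new_upstream(topics: List[str], upstream: str = "upstream",
                      integration: str = "master") -> List[str]:
    """Commands for one new upstream version; 'upstream' and 'master' are assumed names."""
    cmds: List[str] = []
    for name in topics:
        # Always merge new upstream into each public topic branch.
        cmds += [f"git checkout {topic(name)}", f"git merge {upstream}"]
    # The persistent integration branch takes upstream plus every topic.
    cmds += [f"git checkout {integration}", f"git merge {upstream}"]
    cmds += [f"git merge {topic(name)}" for name in topics]
    return cmds


def plan_submission(name: str, commits: List[str],
                    upstream: str = "upstream") -> List[str]:
    """Cherry pick the given topic commits, then rebase onto the latest upstream."""
    cmds = [f"git checkout {submission(name)}"]
    cmds += [f"git cherry-pick {commit}" for commit in commits]
    cmds.append(f"git rebase {upstream}")
    return cmds
```

Calling =plan_new_upstream(["foo"])= returns the commands in order, starting with "git checkout topic/foo" and ending with "git merge topic/foo". Keeping the planner separate from execution makes it easy to review the sequence before running it by hand or from a wrapper.
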