Sat, 05 Mar 2011

Why rewrite git history? And why should commit messages be in imperative present tense?

There are tons of articles describing how you can rewrite history with git, but they do not answer "why should I do it?". A similar question is "what are the tradeoffs / how do I apply this in my distributed workflow?".
Also, git developers strongly encourage (or command) you to write commit messages in imperative present tense, but do not say why. So, why?
I'll try to answer these to the best of my abilities, largely based on how I see things. I won't get too detailed (there are enough manuals and tutorials for the exact concepts and commands).


Why rewrite git history?

Just like source code, git history mostly gets read and is written relatively infrequently.
You read history when you want to see what has changed, when hunting for a bug, when checking what the difference is between branches, and so on.
The argument "I want the history to look exactly how it really happened" is flawed, because very often your history is suboptimal: you commit a feature and shortly afterwards commit a fix for that feature, or you make a commit that contains several separate logical changes/bugfixes.
This makes history more complicated to read than it should be, so for all the folks who will ever look back at your history (even if you think that will only be yourself) a clean history is easier to "get", just like clean source code.
Also, part of the awesomeness of git is that juggling with features (needed for debugging, trying things out, ..) in your code is so flexible (see the git commit/branch model), but if you have logical changes spread over multiple commits, or one commit containing multiple logical changes, this gets painful very quickly.
Once you figure out history rewriting (and it's pretty easy to learn, really!) it only costs a little time to clean up your history, and that pays off many times over whenever you or somebody else wants to look at it or needs to work with it. (again, just like source code itself!)
This also means that you don't need to spend much time thinking about commit messages for commits that are merely fixups or small additions to other logical changes, because those will be squashed into the other commits anyway. I usually commit frequently but end up squashing many commits together; my commit log easily gets compressed by a factor of two or more. The less history, the better. (just like source code!)
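To make the "commit often, squash later" habit concrete, here is a minimal sketch in a throwaway repository. The file names and commit messages are made up for illustration; it assumes a git version that supports `--fixup` and `--autosquash`, and it is only one of several ways to squash.

```shell
#!/bin/sh
# Sketch: fold a late fix into the commit that introduced the feature.
# Everything happens in a temporary repository; names are hypothetical.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

echo "base" > README
git add README
git commit -q -m "Add README"

echo "feature" > feature.txt
git add feature.txt
git commit -q -m "Add feature"

echo "unrelated" > other.txt
git add other.txt
git commit -q -m "Add other file"

# The fix for the feature arrives late. Mark it as a fixup of "Add feature"
# (HEAD~1 at this point), so no real commit message is needed.
echo "feature, fixed" > feature.txt
git add feature.txt
git commit -q --fixup=HEAD~1

# Let git reorder and squash the fixup. GIT_SEQUENCE_EDITOR=true accepts
# the generated todo list unchanged instead of opening an editor.
GIT_SEQUENCE_EDITOR=true git rebase -q -i --autosquash HEAD~3

# The fixup commit is gone; "Add feature" now contains the fix.
git log --oneline
```

Without `GIT_SEQUENCE_EDITOR=true` the same rebase opens the todo list in your editor, which is where you would reorder or squash commits by hand.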
The commits you actually push (especially when pushing to a master branch) should of course be clean, accurately described and carry correct author information, for obvious reasons such as readability.

Note that there is some kind of paradox: you can only achieve "perfect history" if your commits are well-tested and every introduced feature has no bugs (has all bugfix commits squashed into it), but at the same time, you can only properly expose new code by making it public, and it only gets widely used and tested if it's in your main (master) branch.
This is one of the reasons why a workflow model based on topic branches (aka feature branches) works. By default, git rejects non-fast-forward pushes unless you explicitly force them, because you obviously don't want to break the history of other people following your stable (master branch) development. So once you push to master, it should usually be there for good.
As far as I can see, it is accepted in most projects (those run by folks with git expertise?) to push non-fast-forward to topic branches. The idea is that a topic branch is a "work in progress" branch, made public so multiple people can review/work on it. Based on that work/review, its history will often get rewritten through a non-fast-forward push. And if you're following/working on such a branch, you should be clever enough to deal with changed history.
So, a topic branch allows you to make changes public, get feedback, clean up the history of the patchset you (and maybe others) are working on, and when satisfied, you can push to master.
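The topic branch cycle described above can be sketched with a local bare repository standing in for the shared remote. The branch name `topic`, the file names and the use of `--force-with-lease` (rather than a plain `--force`) are my assumptions for illustration, not something every project prescribes.

```shell
#!/bin/sh
# Sketch: rewrite a published topic branch, then land it on master.
# A bare repository in a temp dir plays the role of the shared remote.
set -e
work=$(mktemp -d)
remote="$work/remote.git"
git init -q --bare "$remote"
git init -q "$work/dev"
cd "$work/dev"
git symbolic-ref HEAD refs/heads/master   # make sure the branch is called master
git config user.name "Example Author"
git config user.email "author@example.com"
git remote add origin "$remote"

echo "base" > file.txt
git add file.txt
git commit -q -m "Add base file"
git push -q origin master

# Publish work in progress on a topic branch.
git checkout -q -b topic
echo "draft" > topic.txt
git add topic.txt
git commit -q -m "Add topic draft"
git push -q origin topic

# After review, rewrite the topic commit.
echo "reviewed" > topic.txt
git add topic.txt
git commit -q --amend -m "Add topic file"

# A plain push is now a non-fast-forward update and gets rejected.
if git push -q origin topic 2>/dev/null; then
    plain_push=accepted
else
    plain_push=rejected
fi
echo "plain push: $plain_push"

# Forcing is the explicit opt-in; --force-with-lease refuses to overwrite
# work that somebody else pushed to the branch in the meantime.
git push -q --force-with-lease origin topic

# When satisfied, master only ever moves forward.
git checkout -q master
git merge -q --ff-only topic
git push -q origin master
```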
There is still a chance you'll later need to push bugfixes to master, but this will happen much less frequently. So while there is no perfect workflow model that combines perfect history (in master) with perfect usability (no need to handle non-fast-forward pushes), I find this model a quite good compromise.

To summarize, I would say:
You should care about clean vcs history for the same reasons you should care about clean code.
Just as git helps you progressively reach better software, git history rewriting helps you progressively reach a better git history. Version control on top of version control, if you will. A very crude form of version control, but I don't think it needs to be any more advanced than this.

Why should I write my commits in imperative present tense ('do foo') rather than past tense ('did foo')?

Git developers demand this (at least for the git project), but they did not document why. Some commonly cited reasons:

  • Consistency. That's how it is in many projects (including git itself). Also git tools that generate commits (like git merge or git revert) do it.
  • It's usually shorter
  • You can name commits more consistently with titles of tickets in your issue/feature tracker (which don't use past tense, although sometimes future)
Another reason I came up with: people not only read history to know "what happened to this codebase", but also to answer questions like "what happens when I cherry-pick this commit", or "what kind of new things will happen to my code base because of these commits I may or may not merge in the future". (Note that these are questions about the past, present and future.) This is a more subjective topic, but I feel that the best way to capture this time-independence of a commit is to write it down as time-agnostically as possible, and something like 'do foo' (which could be 'do foo in the future', for instance) is more generic than something with a sense of time hardwired in it ("did foo" or "will do foo").
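The consistency argument is easy to check for yourself. The sketch below, in a throwaway repository with made-up file names, lets `git merge` and `git revert` write their own messages and prints the subjects next to hand-written imperative ones.

```shell
#!/bin/sh
# Sketch: the messages git generates itself are in the imperative mood.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the branch is called master
git config user.name "Example Author"
git config user.email "author@example.com"

echo "hello" > greeting.txt
git add greeting.txt
git commit -q -m "Add greeting"

git checkout -q -b topic
echo "bye" > farewell.txt
git add farewell.txt
git commit -q -m "Add farewell"

# Let git write the merge message (--no-ff forces a real merge commit).
git checkout -q master
git merge -q --no-ff --no-edit topic
merge_subject=$(git log -1 --format=%s)

# Let git write the revert message for the topic commit.
git revert --no-edit topic >/dev/null
revert_subject=$(git log -1 --format=%s)

echo "$merge_subject"
echo "$revert_subject"
git log --format=%s
```

Both generated subjects start with a verb in the imperative ("Merge ...", "Revert ..."), so "Add greeting" sits next to them more naturally than "Added greeting" would.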

Comments

Hey Dieter, when you don't have a link with explanation, you make one! I like that approach :)

The first part on rewriting history makes sense. If I understand you correctly, you see the rewriting process as a compromise between clear and succinct history on one hand and not messing up other people's stable histories on the other. In git, that compromise happens at the level of "features" (one step above factual commits).

I wonder, with a more advanced VCS that could handle any rewriting perfectly and intuitively, would you go further? You know, features are always part of a larger functionality, which is a part of a library, program, which you do to improve yourself/make money, which you do to be happy/die with a sense of accomplishment... the rewriting and squashing would never end :-)

About imperatives in commit messages: the first three reasons are a joke, but I actually like the last one. You view commits not as tags explaining the new state of code, but rather as transformations of code. A function that, given the correct input, produces output. In this view imperatives make good sense (Python functions use imperative in exactly this sense). That's a tribute to git, which allows you to "re-use" commits (or features? the distinction is blurry in git with rewriting) easily via cherry-picking etc.
Radim,
> you see the rewriting process as a compromise
yes.

> In git, that compromise happens
Note that git is a very flexible tool allowing all kinds of workflows, I'm just describing what I think I see commonly and successfully being applied.
I think you misunderstood something with the "features". Maybe because I used the word in two contexts.  A feature branch is just a specific branch aimed to work on a specific feature, with the intention of possibly merging that work into a more mainline branch.  There's nothing really special about it. I've also used the word feature to describe a "logical unit of change".  One should try to keep 1 commit per logical unit of change, but the work to implement a feature can of course be several commits. Because to implement the feature you'll often need to write the final logical unit which can have a dependency tree of other logical units.  You'll usually rewrite history so that the dependencies come first, the dependents afterwards.


> I wonder, with a more advanced VCS
not sure what you mean.  I think git has a pretty good system that allows achieving a clean history in a reasonable way.  Surely one could go a lot further and implement a "VCS system to manage your VCS which manages your project", you could put a VCS on top of that, and so on.  But the git way is fairly easy to implement and to understand and gets the job done. So I don't see the need to make things more complicated.

> You view commits
Exactly.  You phrased my thoughts better than I could myself.  "transformation" is the right word, because when git merges a commit into another branch it can deal with applying it on top of changed source code (and if it's too different you need to resolve a merge conflict manually).  The end result is a new commit, with the same commit message and possibly a slightly different "implementation" (which lines are changed, to what, etc).  This emphasizes that a commit is a transformation, not just a set of additions and deletions in specific places.
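A small sketch of that "transformation" idea, in a throwaway repository with made-up content: the same change is cherry-picked onto a branch whose file has moved on in the meantime, and git produces a new commit with the same message.

```shell
#!/bin/sh
# Sketch: cherry-picking re-applies a change on top of different code.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the branch is called master
git config user.name "Example Author"
git config user.email "author@example.com"

printf '%s\n' one two three four five six seven eight nine > notes.txt
git add notes.txt
git commit -q -m "Add notes"

# On a topic branch, change the last line.
git checkout -q -b topic
printf '%s\n' one two three four five six seven eight NINE > notes.txt
git commit -q -a -m "Capitalize last note"

# Meanwhile master changes the first line, so the file differs from
# what the topic commit was originally written against.
git checkout -q master
printf '%s\n' ONE two three four five six seven eight nine > notes.txt
git commit -q -a -m "Capitalize first note"

# Apply the topic change on top of the changed file.
git cherry-pick topic >/dev/null

# Same message, different commit, both changes present.
git log -1 --format='%h %s' topic
git log -1 --format='%h %s' master
cat notes.txt
```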
Another thought: the second part (commits as functions) seems to have implications for the first part (rewriting) too. I'm still wrapping my head around the git workflow, but a useful rule of thumb could be that a single commit/logical unit/feature should be a minimal unit able to stand on its own, be useful to someone enough to be cherry-picked alone.

That's not an exact definition because it presupposes and second-guesses some imaginary audience, but it might help me :]

Btw the way I understand re-using code transformations in git (cherry-picking&co), it can only be done on code with shared history, a common commit ancestor. It makes merging tractable but also limits the range of acceptable input a lot -- otherwise people could start making mini-libraries of generic commit functions...
Well, I wouldn't start overengineering what commits should be.  Surely there are some "codebase transformations" many projects can benefit from, like "stripping trailing whitespace".
But for things like these, there are plenty of shell commands (e.g. sed one-liners) to fix this.  So the command is shared on some blog/faq/tutorial, you apply it, and commit the result with an appropriate commit message.  I think I get what you mean: this procedure is a bit more cumbersome than it could be, especially when you realise that if you want to apply a whitespace cleanup commit later than foreseen (i.e. new code has been added), you should do the same cleanup on the new code as well while you are merging.  (This is one of the reasons why I always mention the command I used in the commit message.)  I wouldn't try to somehow integrate transformations like these in a VCS, because that would overcomplicate things.
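As an illustration of that habit, a minimal sketch in a throwaway repository: strip trailing whitespace with a sed one-liner and record the command in the commit body. The file name and the exact wording of the message are made up; the sed expression is one common way to do it, not the only one.

```shell
#!/bin/sh
# Sketch: apply a generic cleanup and record the command in the commit.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

# A file with trailing spaces and a trailing tab.
printf 'alpha  \nbeta\t\ngamma\n' > code.txt
git add code.txt
git commit -q -m "Add code"

# Strip trailing whitespace. Writing to a temp file and moving it back
# avoids "sed -i", whose syntax differs between GNU and BSD sed.
sed 's/[[:space:]]*$//' code.txt > code.txt.tmp
mv code.txt.tmp code.txt

# Subject says what the commit does, the body says how it was produced,
# so the same command can be re-run on code that was added later.
git commit -q -a -m "Strip trailing whitespace" \
    -m "Command used: sed 's/[[:space:]]*\$//' code.txt"

git log -1 --format='%s%n%n%b'
```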

And, usually those "generic transformations" are project/editor specific anyway.
Generic code transformations are very powerful and certainly not editor specific. In fact, they are AI complete, given an expressive enough underlying language. The more expressivity, the less you have to spell out the changes byte-by-byte and the more it can resolve ambiguity automatically, before spitting out a merge conflict. Take English as an example of a powerful language: an ideal commit would simply equal its commit message: "Strip trailing whitespace", or "Add an index to all corpora". Like you would instruct a human to transform the code.

Obviously utopia atm.

Now I don't understand git's change tracking mechanism enough to make judgements (it seems to be diff based?), but surely its cherry picking is not the absolute best we can do in commit reuse. Once someone finds a coding niche where commit reuse makes sense, is still tractable plus makes life easier, I'm sure it will be a hit. Just like design patterns were. Nobody likes googling forums even for sed one-liners.

<end of theoretical rant>

I came across a link which takes an opposite view to rewriting history, what do you think?
http://paul.stadig.name/2010/12/thou-shalt-not-lie-git-rebase-ammend.html
I think you can rewrite history just fine, but you should only do it if you also use a workflow in which it is OK, and people working on the project know what they are doing.
Most arguments he mentions (git log, "is this commit in") just don't apply if you have a good workflow, such as when you have a main branch (i.e. "master") in which you don't rewrite history (like I explained in my post)
I'm not sure why he claims cherry-picking (or other methods of applying a commit) breaks git blame.  afaik that's just not the case.

About git rerere, I've never needed to do that. He may have a point there but the use case seems rare to me.

About rebasing, if the commits you rebased suddenly cause compilations to break or if the commit messages are not appropriate anymore, then you just did it wrong.  period. Rebasing is about changing/cleaning history, not f*** things up.
Sometimes you will have a merge conflict during a rebase process, yes.  But you should just fix that properly. (and the end result is still worth it - but that's only my opinion).  Yes `git bisect` is awesome, but it should not suffer from a rewritten history (in fact, bisecting will be a bit faster and it should be more clear what's going on, with a clean history)
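For what it's worth, here is a small sketch of bisecting a short, clean history in a throwaway repository. The "bug" is just a marker line in a made-up file, and the check passed to `git bisect run` is a stand-in for a real test suite.

```shell
#!/bin/sh
# Sketch: let git bisect find the commit that introduced a "bug".
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

# Five commits, one logical change each; step 4 introduces the bug.
for i in 1 2 3 4 5; do
    if [ "$i" = "4" ]; then
        echo "BUG" >> app.txt
    else
        echo "step $i" >> app.txt
    fi
    git add app.txt
    git commit -q -m "Add step $i"
done

# HEAD is known bad, the first commit is known good.
git bisect start HEAD HEAD~4 >/dev/null

# The check exits 0 (good) when the marker is absent, 1 (bad) otherwise.
git bisect run sh -c '! grep -q BUG app.txt' >/dev/null

# refs/bisect/bad now points at the first bad commit.
first_bad=$(git rev-parse refs/bisect/bad)
git log -1 --format='first bad commit: %h %s' "$first_bad"

git bisect reset >/dev/null 2>&1
```

With one logical change per commit, the commit bisect lands on tells you directly what went wrong; with a feature and its fixes scattered over several commits you'd have to piece that together yourself.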

I don't get why he relates a cleaned up commit to breakage.  Actually, if you don't clean up your history, you will have commits that introduce a feature but a bug at the same time.  The bugfix commit for that usually comes a bit later in the history.  The whole point of history rewriting is to move such bugfixes into the first commit.  The whole point is to improve the quality of your commits (while you can, i.e. in topic/dev branches, not in your master branch which you want to keep linear), so it actually becomes safer to cherry-pick commits.

There's an interesting link in the comments there:
http://lwn.net/Articles/328438/
From what I can tell, the model I described complies with Linus' recommendations, except that I think (and I believe some projects do it like this) that changing the history of a topic (development) branch is okay, even when it's public.  Linus basically said that in this case you should send patch series by email, and send them again when you have changed them, whereas we use a development topic branch for that.
I've asked Scott Chacon about this stuff when I met him at the Brussels Github drinkup and he basically told me:
  • it's up to the project maintainer to implement the (any) desired workflow, but the more complicated it is, the harder it becomes for newcomers (quite obvious actually, so you could see the desired "fanciness of workflow" as a compromise between clean history and openness towards new[bie] contributors)
  • my described approach works, but a more common approach is that, when topic branches need to be rewritten, they get pushed under a different name (like "topic-2") or sent by mail as a new patch series. This keeps all the benefits of a rewritten topic branch, while making things a bit more convenient for others who wrote patches on top of the outdated topic branch.  But the end goal stays the same: rewrite topic branches as needed (but push them under a different name), and when satisfied push to master (and never change master history, although you could if you wanted to.  But if other people follow your master, it's less convenient for them)
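The "push it under a different name" variant could look roughly like this, again with a local bare repository standing in for the shared remote. The names `topic` and `topic-2` follow the example above; everything else is made up for illustration.

```shell
#!/bin/sh
# Sketch: publish a rewritten topic branch under a new name, leaving
# the old published branch untouched for people who built on it.
set -e
work=$(mktemp -d)
remote="$work/remote.git"
git init -q --bare "$remote"
git init -q "$work/dev"
cd "$work/dev"
git symbolic-ref HEAD refs/heads/master   # make sure the branch is called master
git config user.name "Example Author"
git config user.email "author@example.com"
git remote add origin "$remote"

echo "base" > file.txt
git add file.txt
git commit -q -m "Add base file"
git push -q origin master

git checkout -q -b topic
echo "draft" > topic.txt
git add topic.txt
git commit -q -m "Add topic draft"
git push -q origin topic

# Rewrite the topic commit locally.
echo "reviewed" > topic.txt
git add topic.txt
git commit -q --amend -m "Add topic file"

# No force needed: the local branch "topic" goes to a new remote
# branch "topic-2", and the remote "topic" keeps its old history.
git push -q origin topic:topic-2

git --git-dir="$remote" log -1 --format='topic:   %h %s' topic
git --git-dir="$remote" log -1 --format='topic-2: %h %s' topic-2
```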

