SQLAlchemy Migrated from Mercurial to Git (sqlalchemy.org)
114 points by dbader on May 26, 2013 | 58 comments



Every project has the right to choose the VCS that it feels is right for it. However, I find the reasons they give for the switch pretty weak, except for the last one (github is more popular).

1. Repository size: Such size differences are uncommon, and most likely due to the fact that the git repo was a brand new repo created from scratch by converting the original hg repo. If they repacked the mercurial repo it would probably be greatly reduced in size. They could also do:

    hg convert --config "extensions.convert=" --branchsort sqlalchemy sqlalchemy-smaller
Which results in a reduction of the repository size from 45 MB in the original to 24 MB in the converted repo. This is not as good as git but quite close IMHO.

2. This is not correct: Mercurial bookmarks _used_ to be an extension (a long time ago!) but they have been built into mercurial for a long time now. They are basically equivalent to git branches. So nowadays this is just a process issue. If using named branches is a problem just tell your devs to not use them and use bookmarks instead!

3. Mercurial's rebase and mq extensions are part of core mercurial. They are packaged and distributed with mercurial, their test suites are part of the mercurial test suite and they are supported and developed by the core mercurial developers. The only difference between a core command and one of the official extensions is that the extensions must be manually enabled to be used (by adding a line to the mercurial config file). This is a good thing! This makes it harder to mess up your repo if you do not know what you are doing. If you are not knowledgeable enough to find and edit the mercurial config file you should definitely not be using rebase or collapsing your changesets.

The last reason on the other hand is one that I can understand and which I think has merit. Is it worth the hassle of changing their DVCS system? Perhaps. However it bothers me when legitimate _social_ reasons to switch from mercurial to git are supported by not quite as clear cut technical reasons such as the ones put forward in this blog post.


> If using named branches is a problem just tell your devs to not use them and use bookmarks instead!

but they do, because all the devs know git and are only using mercurial because they want to send you a pull request, and now their feature branch is forever.

Mercurial is unforgiving of mistakes - once in the repo, they are forever.

As for MQ, I've really tried to understand it, but it's just profoundly awkward IMHO. Mercurial's main benefit is that it's much easier to use and understand than git, so if I'm going to need to fire up the extra brainpower, I might as well learn git where these advanced patterns are extremely commonplace, and are part of the core functionality, not as a series of optional extensions that have been bolted on over the years. I've almost never seen anyone using MQ or rebase with mercurial (or even bookmarks for that matter).

I've tried carefully in my post to express that my points are all based on an overlap of social with technical issues. I am aware that technically, hg has answers for all these things. But these answers don't pan out in the real social world, for better or for worse. If git didn't exist, I'm sure there would be a much larger knowledge share for mercurial, and I was even betting on that in the beginning (certainly the "easier" tool will be more popular?) but it hasn't turned out that way.


If I want to contribute to a project I don't expect them to accommodate my workflow. I expect to be asked to comply with their workflow.

Let's say that I were the maintainer of a project in which we did not want to use named branches. If I got a pull request from someone that had created a named branch I would just kindly ask the contributor to remake his pull request, using bookmarks rather than a named branch.

The fact that they made a mistake _on their repo_ does not mean that that same mistake must end up on _my_ repo. As I said, this is just a social problem.

You say that mercurial is unforgiving of mistakes. I say that mercurial makes sure that your _shared_ history is not lost. You can edit unshared history as much as you want.

As for MQ... I would say that it is an acquired taste :-> If you don't like it there are some good alternatives, such as histedit (also a built-in extension) and the new evolve extension which is still a work in progress but which already lets you safely edit _shared_ history.


> Mercurial's main benefit is that it's much easier to use and understand than git

That's really only true if you're a beginner and you weren't properly taught how git actually works. Fundamentally, git's data model is really simple and transparent. The commands are somewhat arbitrary, but no more so than Vim's or Emacs's. As with Vim and Emacs, you don't need to know most of the commands anyway, except for convenience.


The simplicity of git's data model is the root of the problem; that is the reason it is hard to learn and hard to use. It operates at a lower level of abstraction than the level we are thinking at when we are actually developing code. Git is less of a version-control system and more of a library one could use to build one, but in absence of any canonical high-level git wrapper, we all end up implementing and running one in our heads. This is a waste of brain-power.


Um, no. I think the git data model is perfectly good for real life version control. It turns out you don't need a higher level of abstraction. The problem IMO is that the UI doesn't map cleanly onto the natural operations on the data model. It might just be my ignorance, but the fact that I can have an operation in my head in terms of the git data model (which is fairly automatic) but not automatically know how to translate it into a git command is problematic.


To say I disagree is a massive understatement. I've seen teams try to use wrappers and just mess everything up because they don't understand what's actually going on in the underlying system.

Git has the concepts it has because they're all important. It shows them all to you unfiltered because doing so is vital.


HG seems to manage the same problem with less complexity.


Can you give some concrete examples? It sounds to me like you haven't used git in an awfully long time, and are repeating things that were only true in the git 1.4 and early git 1.5 days.


I use git daily and have done so for years, but it's true that it has been a long time since I've actually tried to understand it. I gave up on the documentation, which seemed to be written in such a way that it would only make sense if you already knew what it was trying to tell you, and just memorized the handful of recipes I need to get my work done. It's not pretty; it's one of the bad tools, which I occasionally have to wrestle with, and not one of the good tools, which fit onto my brain like extensions of my body.


That's interesting. My experience was the opposite: I've also been using git since 2008, and I find it to be the most intuitive, friction-free VCS that I've ever used.

I think the difference is that I didn't start with the official documentation. I watched a video of an introductory talk given by a local Linux kernel developer. The talk started with an explanation of the underlying data model, and then showed how each command manipulates the underlying data structure and how that's useful for version control.

The talk is a bit long and a bit dry, but to date, it's the best introduction to git that I've ever seen. I highly recommend it:

http://excess.org/article/2008/07/ogre-git-tutorial/


> you weren't properly taught how git actually works

I've been using git (well, forced to, in fact) since 2009 and I've learned its guts. And I still can't see why I should care about the contents of .git unless I'm a git developer.

Ask yourself this: how come mercurial/bzr/fossil/veracity users aren't aware of those DVCS' internals?


You're confusing knowing the contents of the .git directory with knowing the semantic model of the revision history.

All DVCS systems involve creating and editing a series of changes to a directory tree, including metadata about each change. Those changes and their metadata are basically just a shared document that you're collaborating on with other people.

In other words, the revision history is a document that you're trying to edit. Git's document format is a bit like HTML: you can edit it blindly using a WYSIWYG editor, but it's going to seem confusing unless you at least understand concepts like elements, attributes, and entities, and maybe some CSS. Those concepts map directly to the HTML wire format, so in practice, you'll end up learning that, too.

Mercurial and bzr (I can't speak about fossil and veracity, since I haven't used them) are more like Flash or MS Word: They're easier to use for beginners, but they're more fragile and their internal formats are more obscure.


I'm pretty much aware of the semantic model of distributed version control. The problem is that git introduces new concepts (useless outside of git) which are widely used, with and without cause.

Check out various short descriptions of push command:

* git — "Update remote refs along with associated objects"

* mercurial — "push changes to the specified destination"

* bzr — "Update a mirror of this branch"

* veracity — "Push committed changes to another repository instance"

* fossil — "Push changes in the local repository over into a remote repository"

As you can see, all of them except the first are easily readable even by a person who has used some VCS before. But not git: to understand what the push command does, you have to have gitglossary open in front of you.

> All DVCS systems involve creating and editing a series of changes to a directory tree

git does not: it stores only snapshots, and changes are calculated every time you want to see them.
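The snapshot point can be illustrated with a minimal Python sketch. It is a toy model, not git's actual storage code: the `snapshots` list and `diff_snapshots` helper are hypothetical names, and real git additionally delta-compresses objects inside packfiles for storage efficiency. The idea shown is only that each commit holds a full tree state and a diff is derived when someone asks for it.

```python
import difflib

# Hypothetical in-memory history: each commit stores a full snapshot
# (path -> file content), not a delta against its parent.
snapshots = [
    {"README": "hello\n"},
    {"README": "hello\nworld\n", "setup.py": "# build script\n"},
]


def diff_snapshots(old, new):
    """Compute a unified diff between two snapshots on demand."""
    lines = []
    for path in sorted(set(old) | set(new)):
        # A path missing from one side is treated as an empty file.
        before = old.get(path, "").splitlines(keepends=True)
        after = new.get(path, "").splitlines(keepends=True)
        lines.extend(difflib.unified_diff(before, after, "a/" + path, "b/" + path))
    return "".join(lines)


print(diff_snapshots(snapshots[0], snapshots[1]))
```

Nothing resembling a "change" is stored in this model; the diff exists only for as long as it takes to print it.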

> the revision history is a document that you're trying to edit.

It is revision history, not "a document". To be precise, in a DVCS (darcs excepted) it is a directed acyclic graph, aka DAG, of source code revisions.

> Those concepts map directly to the HTML wire format, so in practice, you'll end up learning that

The one and only VCS (of the 7 I have used) whose guts I've ended up learning was Git.

> They're easier to use for beginners, but they're more fragile and their internal formats are more obscure.

In which way are they fragile, exactly? Could you please back this statement up with some examples?

And why should one care about VCS internals unless one is a VCS (plugin) developer?


> As for MQ, I've really tried to understand it, but it's just profoundly awkward IMHO. Mercurial's main benefit is that it's much easier to use and understand than git, so if I'm going to need to fire up the extra brainpower, I might as well learn git where these advanced patterns are extremely commonplace. I've almost never seen anyone using MQ or rebase with mercurial.

I believe the GP's reference to mq is for the strip command (to delete unwanted revisions), which for some odd reason that I've never understood is part of the mq extension.


> Largely due to the popularity of Github, Git has achieved a much higher userbase, to the degree where we regularly have users requesting us to move to Git so they can provide pull requests.

I too am a Mercurial refugee and prefer Git, but how is that even a reason? People don't want to contribute if it's on a Mercurial repo?? Python seems to be doing fine on HG.

Meanwhile, thanks for your amazing work on SqlAlchemy(&co), zzzeek!


I think what those users mean is that they want the project on Github. There's no concept of "pull requests" within Git.

Anyway, most Git users aren't Mercurial refugees; I wager we're Subversion refugees and kids.


> There's no concept of "pull requests" within Git.

http://git-scm.com/book/ch5-2.html "run the git request-pull command and e-mail the output to the project maintainer manually."

http://www.wired.com/wiredenterprise/2012/05/torvalds_github... "Git comes with a nice pull-request generation module, but GitHub instead decided to replace it with their own totally inferior version."

> I think what those users mean is that they want the project on Github.

that would be even worse. luckily as zzzeek explained below, it wasn't the case.


Wow. I had no idea about git pull requests, so I even googled it, and didn't see it on page one. My mistake, thanks!


Very smart move. Some time ago Riak did the same thing. Their rationale was more detailed, but came to the same conclusion: http://basho.com/a-few-more-details-on-why-we-switched-to-gi...


1. What a huge saving — 0,002¢ per repo copy. Don't spend it all in one place.

2. Bookmarks have been a core feature since March 2011. This basically means that those who pushed the "move to Git" decision haven't checked whether anything has been updated in Mercurial (bookmarks in particular) since then at least. This is pretty understandable: it is known that the only software that gets updated is the software you're paying attention to; the rest doesn't.

3. As ezquerra mentioned, the extensions Mike is probably referring to are shipped with Mercurial, and are enabled with only a single line added to .hgrc. Not to mention that those extensions do not _emulate_ Git features, they _reproduce_ them.

4. The only reason that makes sense.


Too bad there are no comments on the blog:

    SQLAlchemy's issue repository will remain hosted on Trac;
    while a Git repository can be mirrored in any number of
    places, an issue repository cannot (for now! Can someone
    please create a distributed issue tracker? Should be
    pretty doable, though getting Github/Bitbucket to use it,
    not so much...), so SQLAlchemy's long history of issue
    discussion remains maintained directly by the project.
http://fossil-scm.org !!!


There are tools like Bugs Everywhere [0], which mix well with distributed VCS. However, the question is if distributed bug tracking makes sense.

[0] http://bugseverywhere.org/


There are many different projects which implement distributed bug tracking[0], some better than others. The question of how much distributed bug tracking makes sense depends strongly on the structure of the project. If the project is developer heavy (such as developers responding to user help requests) it can work well, but if the project has several strata of developers and support people and users then it might not make as much sense.

[0] http://travisbrown.ca/blog.html#TooMuchAboutDistributedBugTr...


uh yikes! http://fossil-scm.org/index.html/dir?ci=601c15421a4a5ca5&...

although good point, I see they're doing distributed. will take a look at how they approach that.


I think you can customize most parts of the Web UI:

http://www.fossil-scm.org/fossil/wiki?name=Cookbook#css


Trac is pretty bad. Can you remove yourself from a subscribed issue yet?


Would something like Kiln Harmony work for storing the main upstream repository?

https://secure.fogcreek.com/kiln/

It seems that if KH translates seamlessly between HG and Git that a project could accept pushes from both, right?

I assume since KH is not open source / free that this would not be an acceptable method for maintaining the main repo, or that the translation would introduce interesting wrinkles, but it does seem like one could potentially have his cake and eat it too.


Thanks to hg-git, Mercurial users should have no problem using the git repository.


hg-git and Kiln Harmony serve different, though overlapping, purposes.

hg-git allows people who know both Git and Mercurial, but who prefer Mercurial, to work with Git repositories from Mercurial. No effort is made to hide the Git model, and no effort is made to ensure that the Mercurial repository generated from the Git one is idempotent (i.e., a given Git repo will always generate the same Mercurial repo) or can round-trip (i.e., I can trivially craft repos that hg-git can work with just fine, but where pushing to a bare Git repo will result in a different Git repository from the original). The benefit to this model is that the Git users don't need to do anything different, ever, and no one needs to know that you were using Mercurial, ever.

Kiln Harmony does something different: it's designed to let a team use whatever tools they want. This means that Mercurial users don't have to learn the Git model, Git users don't have to learn the Mercurial model, and generally, everything "just works" in that situation. Doing that requires a lot more processing power than hg-git requires, which is part of why we only offer it as a hosted solution for the moment. It also works best when your central repository is a Kiln Harmony repository.

I obviously think that's a fine trade-off, but if part of the main motivation of SQLAlchemy was GitHub, then Kiln Harmony probably isn't a good solution.


> Git manages the size of the repository more efficiently; while the Mercurial repository has been approaching 50M in size, the Git repository is only 17M.

17MB vs 50MB - almost a third of the size. That is definitely quite impressive.


I’m not the biggest git fan and I still think it’s the right choice, but how is this even a reason to switch? Unless you go into (older) Subversion ridiculousness, how is 40MB of disk space even a concern? They even listed it first. At best it would be "Oh, and we saved 40MB, the size of two raw pictures from a digital camera."


The size issues of the SQLAlchemy repository come from the way Mercurial handles copies and renames.

I prefer Mercurial because it is much easier to use, but this file rename issue always makes me feel uncomfortable when reorganizing code.

These days I'm giving Fossil a try, which still is easier to use than Git and the repository size sits between Git and Mercurial.


From the other side of the mirror, I mean from the position of someone used to git, this comment seems weird.

Git is not hard to use. It is adding a few articulations in your workflow, and they are just allowing you to run faster.

One example: interactive staging with git add -p. This articulation makes it much easier to debug: add prints all over the place, try some tweaks, find the one, stage this one snippet, check out the files, run the tests, and you're done.


When you consider that distributed repositories are going to be cloned dozens, hundreds, or thousands of times, it starts to add up.

Even if this larger project could handle that bandwidth, it is a significant factor for smaller projects, or larger projects like github. Meaning more of a chance that git remains the dominant choice.


The repo is hosted by third parties, so bandwidth is not a factor (it's all text in any event, so the compression factor is huge). There are actual differences worth talking about; file size is not one of them.


git repositories aren't giant masses of text files; they're compressed on disk and the additional gains from compressing them with lzma are minimal (12 MB on a 306 MB git repo I have lying around). I assume Mercurial does something similar as the difference would be a lot larger than 17 vs 50 MB if not.
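The "already compressed data barely shrinks further" observation is easy to reproduce in isolation. The sketch below is only an illustration under stated assumptions: it uses synthetic text built from a made-up word list rather than a real repository, and plain `zlib` followed by `lzma` rather than git's actual packfile format, so the absolute numbers say nothing about SQLAlchemy's repo.

```python
import lzma
import random
import zlib

# Synthetic, repetitive "source-like" text; the word list is made up.
rng = random.Random(0)
words = ["select", "insert", "session", "engine", "mapper", "column", "table", "query"]
text = " ".join(rng.choice(words) for _ in range(50_000)).encode("ascii")

deflated = zlib.compress(text)          # first pass: what an on-disk store might hold
recompressed = lzma.compress(deflated)  # second pass: compressing the compressed data

print("raw bytes:          ", len(text))
print("after zlib:         ", len(deflated))
print("after zlib + lzma:  ", len(recompressed))
```

The first pass removes most of the redundancy, so the second pass has very little left to work with.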


it's the time it takes to clone. Also with git I need to clone a lot less since I can create local feature branches that I can delete if they are abandoned.

edit: quick speed test, git clone from my server = 17.4 seconds, hg clone from the same server's hg repo = 25.4 seconds


Time to clone is not directly a function of absolute repository size.

Both Git and Mercurial use hardlinks for local clones by default, so the reason why Mercurial is slower than Git for local clones is primarily that its repository data generally contains more files.

Cloning/pulling/pushing speed across a network is in large part determined by the protocol used; a few years ago, Git's network performance was inferior to that of Mercurial [1]. I believe that has been largely fixed since then.

Finally, Mercurial allows you to have local feature branches that you can delete if they are abandoned just fine.

[1] https://code.google.com/p/support/wiki/DVCSAnalysis -- footnote 1


Your citation is comparing the "dumb http" protocol, which was replaced several years ago with the smart http protocol in git-1.7. Dumb http was only ever provided as a method of last resort. The git:// and ssh protocols have always been fast.

Another data point: when we converted to Git, I did a number of speed comparisons. Our repository was 77MB in Git versus 178 MB in Mercurial. Clone time from bitbucket over either (smart) http or ssh was 18 seconds with Git, versus 2 minutes with Hg. We can do a shallow clone (--depth 1) in 4 seconds (10 MB transferred) with Git, but Hg has no comparable feature.


The speed was also slower for the native protocol if you read the footnote to the end. Also, as I noted, this was years ago. I was making a point about cloning/pulling/pushing speed being dependent not just on the repository size, not about the relative superiority of one or the other tool [1].

[1] I find both Mercurial and Git adequate, but lacking in some aspects that are important to me (both with respect to architectural design and workflow considerations). For practical work, I consider the differences between Mercurial and Git to be relatively minor in comparison and cannot really get exercised over them.


Is it really helpful to provide numbers that you know are many years out of date? If you're really interested in this, try to import a big repository like, say, the Linux kernel using the latest version of Mercurial and see how it compares to the latest version of git.


I was making a point about what factors influence clone performance in general, not trying to contribute to the tedious Git vs. Mercurial debate. If I had found any data about, say, Gnu Arch vs. Codeville (or some other abandoned codebase), I would have used that instead.


Mercurial has "hg clone --uncompressed", which really cuts down clone time at the expense of bandwidth.


I would assume people cloning the repository on slower connections.


If that was an actual issue (I doubt it is), they could have done a one-off repack server side (reordering changesets to optimize compression).

Edit: I think the third point would have been enough; network effect for free software is important. Size (reorder your repo if it matters), branching (use bookmarks, not named branches, unless you know what you are doing), and history rewriting (use evolve) are dubious points.


Could someone please explain to me the argument about removing a branch in git vs closing it in mercurial? As I understand it, when you "remove" a branch in git, just as when you close it in mercurial, it doesn't delete commits from the tree either, since that could lead to disaster (if those commits were already pushed). So someone else could actually start a new branch from a commit which is in the branch you just "deleted". Am I wrong here?


If there are branches with commits that aren't referenced elsewhere, I believe deleting the branch will eventually cause the commits to be "garbage collected". So you can delete abandoned experiments and keep a clean house. On the other hand, commits that are referenced elsewhere (perhaps merged into a main branch) will stay. But you could still delete the branch it was merged from, and avoid having the list of branches grow huge.


You can delete commits from Hg using 'strip' (enabled via the 'mq' extension). But this is strict deletion (as opposed to marking it for gc after a grace period) and you can't push an "unreference this commit", so you need admin access on the remote to strip a commit there. If you push branches to bitbucket for review, the abandoned commits stick around as anonymous heads until you strip them via the web interface, but they still stick around for anyone that pulled from you. The Hg community is working on a system [1] for dealing with this, but it's still not ready for general use and I find it significantly more complicated than Git's model.

[1] http://mercurial.selenic.com/wiki/ChangesetEvolution


hg strip saves the stripped commits under .hg/strip-backup as a bundle.


> So someone else could actually start new branch from commit which is in branch you just "deleted"

Well yes, of course. They could also start a new branch from any other commit that they happen to have in their local repository. Branches aren't really 'things' in git. They're just a pointer to an arbitrary commit that automatically gets updated when you make new commits. It's rarely a useful thing to do, but you don't even have to be on a branch to make commits.


A commit isn't really "in" a branch in git in the same way as it is in hg. In hg, each commit "belongs" to a branch (I'm remembering correctly here, right?). In git, a "branch" is basically a variable pointing at a specific commit. A commit can be part of one or many branches (or even none).
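The "a branch is just a variable pointing at a commit" idea from the two comments above can be sketched in a few lines of Python. This is a toy model, not git's implementation: the `parents` and `branches` dictionaries, the commit ids, and the helper names are all hypothetical.

```python
# Toy model: commits map to their parent ids; branches map a name to one commit id.
parents = {
    "c1": [],
    "c2": ["c1"],
    "c3": ["c2"],  # tip of master
    "c4": ["c2"],  # tip of feature
}
branches = {"master": "c3", "feature": "c4"}


def ancestors(commit_id):
    """Return the commit plus everything reachable from it through parent links."""
    seen = set()
    stack = [commit_id]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def branches_containing(commit_id):
    """A commit is 'on' every branch whose tip can reach it, possibly none."""
    return sorted(name for name, tip in branches.items() if commit_id in ancestors(tip))


def commit(branch, new_id):
    """Committing adds a node and moves the branch pointer forward to it."""
    parents[new_id] = [branches[branch]]
    branches[branch] = new_id


print(branches_containing("c2"))  # shared history belongs to both branches
print(branches_containing("c4"))
```

In this model, deleting a branch is just `del branches["feature"]`: the commits themselves are untouched, they simply stop being reachable from that name.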


> Am I wrong here?

It sounds like you may be a little fuzzy on how Git works.

> when you "remove" branch in git...it doesn't delete commits from tree...since it can lead to disaster

When you delete a branch, only commits and filesystem states that are no longer accessible from any other branch, tag, index, or stash become eligible for garbage collection. So in normal usage [1] [2], you shouldn't ever be able to delete the contents of any commit that exists in the history of any named commit.

A commit that is newly eligible for garbage collection will not actually be garbage collected during a grace period, I believe one week. During this time the commit is still accessible by commit hash [3].
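The eligibility rule and the grace period described so far can be sketched as a reachability check. This is a simplified illustration, not git's gc code: the function names, the `orphaned_at` bookkeeping, and the sample history are hypothetical, and the seven-day grace value is simply taken from the "I believe one week" above rather than from a verified git default.

```python
from datetime import datetime, timedelta


def reachable(parents, refs):
    """Collect every commit reachable from any ref by following parent links."""
    seen = set()
    stack = list(refs.values())
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def prunable(parents, refs, orphaned_at, now, grace):
    """Commits that are unreachable and have been so for longer than the grace period."""
    live = reachable(parents, refs)
    return sorted(
        c for c in parents
        if c not in live and now - orphaned_at.get(c, now) > grace
    )


# "x" and "y" were the tips of two experiment branches that have since been deleted.
parents = {"a": [], "b": ["a"], "x": ["a"], "y": ["a"]}
refs = {"master": "b"}
now = datetime(2013, 5, 26)
orphaned_at = {"x": now - timedelta(days=30), "y": now - timedelta(days=1)}
grace = timedelta(days=7)  # assumed value, see the lead-in

print(prunable(parents, refs, orphaned_at, now, grace))
```

Only the long-abandoned commit is reported; the recently orphaned one survives until its grace period runs out, and anything reachable from `master` is never a candidate.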

Branches are kind of like variables in a garbage-collected language, and commits are like objects in that language. An object is eligible for GC if and only if there are no variables that point to it. Making new variables that point to the same object is a really lightweight operation since most of the data's shared. Unlike many OO languages, objects in Git are immutable, which means git can safely collapse identical copies of the same object to a single instance (an optimization which git is quite aggressive about).
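The "collapse identical copies" point follows from objects being named by a hash of their content. Below is a minimal sketch: `blob_id` follows git's blob naming scheme (SHA-1 over a `blob <size>` header, a NUL byte, then the content), while the `ObjectStore` class is a hypothetical in-memory stand-in for the object database, not git's real storage layer.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Name a blob the way git does: SHA-1 of b'blob <size>' + NUL + content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class ObjectStore:
    """Hypothetical in-memory object database keyed by content hash."""

    def __init__(self):
        self.objects = {}

    def put(self, content: bytes) -> str:
        oid = blob_id(content)
        # Objects are immutable, so identical content can share one entry.
        self.objects.setdefault(oid, content)
        return oid


store = ObjectStore()
first = store.put(b"same file contents\n")
second = store.put(b"same file contents\n")
print(first == second, len(store.objects))
```

Because the name is derived from the bytes, storing the same file under a thousand paths or in a thousand commits still costs one object.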

> it can lead to disaster (if those commits were pushed already)

If you want to cause a remote repository to delete a branch, look at the --mirror and --delete options for the push subcommand. AFAIK, pushing the deletion of a branch has the same effect as if you deleted the branch locally on the machine you're pushing the deletion to.

> someone else could actually start new branch from commit which is in branch you just "deleted"

If another developer has merely fetched the deleted branch, their remote tracking branch will be deleted when they fetch again. If your colleague has created an actual local branch with the changes you deleted (as opposed to a remote-tracking branch), when or whether your colleague deletes that branch is entirely under his/her control [4].

[1] Using low-level git subcommands to force git to delete objects which are still referenced by other objects is abnormal usage.

[2] Corrupting the object database in your .git directory through unclean shutdown, hard drive failure, or vindictive hex editing is also abnormal usage.

[3] If you neglected to write down the commit hash before you deleted the branch, you can likely find it in the reflog. The reflog is a history of the commit hash at the tip of each branch. Alternatively, you can use git subcommands to browse the object database and find orphaned commits.

[4] At this point, git has no technological way [5] for anyone but your colleague to delete the local branch from your colleague's machine. Socially, of course, you can use project management techniques to encourage your colleague to delete the branch (e.g., if the official maintainers announce that the deleted branch is considered obsolete and that they will not support it or accept changes based on the version of the project in the deleted branch, that can be a compelling reason for people to stop using the copies that are floating around. Or your colleague's boss can tell him to delete it or be fired.)

[5] If you have git push access to your colleague's repository, or you have shell access to your colleague's user account, you can delete their branches. Footnote [4] is referring to the usual case where your colleague's machine is a private box where you have no access.


Thanks for your answer. I really was never aware of git's GC mechanisms before, that's why I was sure that commits still hold inside a tree. And you perfectly answered my question (well, at least parts that I understood all details about, I clearly see I'd need to read more some time later).

I still find mercurial's branching model "the right thing" in terms of tree showing development history inside a branch, or overall (multi-branch) history-review (where you clearly see which commit was made in which branch), but I now see that it's really nice from repository-cleanup perspective to have features git has, to deal with non-needed branches and commits (via different mechanisms).


Git's branching model is "the right thing" once you learn the idioms of git development.

For example, if you want to build a new feature, make a local branch so your changes are isolated. You can make work-in-progress commits that split up a large change into pieces, some of which may be experimental, contain debugging statements, or temporarily break things.

Then when you're satisfied your changes are bug-free, you can use an interactive rebase (git rebase -i) to turn the commits into clean patches that would be something you might e.g. send to the mailing list if you're working on the flagship Git project, the Linux kernel.

I like to keep my history clean. Six months from now I won't be interested in all the bugs I wrote when I implemented a feature, and all the fixes that I came up with for them during initial testing. I won't want to see the implementation of that feature scattered over multiple commits. I just want to see one clean, bug-free patch that implements the feature.


TL;DR - we didn't read the HG manual, so we had to change as we didn't know how to use its features.


Or because none of their contributors who only installed hg to submit a patch had read the entire hg manual.


Of course, all they had to read and follow was the "contributing to sqlalchemy" manual.



