A Git Implementation in Awk (github.com/djanderson)
274 points by rohitpaulk on Oct 7, 2021 | 96 comments



Namespaces (GNU Awk 5.0+) make Awk pretty well suited for larger projects (as demonstrated by aho), but it never quite took off; this article (by the author of GoAwk[1]) is a nice look at the relevance of Awk in 2020: https://lwn.net/Articles/820829/

[1]: https://github.com/benhoyt/goawk


I forget where I read it but Brian Kernighan said, at some point, the lack of namespaces in original Awk was probably the biggest mistake that prevented bigger adoption in large projects. Now that we have it, I'm hoping to see a brilliant IDE surface in the near future.


"mistake"

I read the book written by the creators and it's pretty clear they never intended it to be a general purpose language lol. I'm paraphrasing but they basically wrote in the book "People are completely mad and are intent on using our DSL as a general programming language so we added half baked functions to it but it's bad and you should feel bad"


Do not use AWK for that. Perl is best suited for that, literally. It was born as a better version of AWK.

AWK is for tools less than 20 lines long. Anything more, use Perl, please.


Perhaps; but Awk is such a compact approach. My head feels cleaner when using it compared to Perl - perhaps as it's fairly close to C. The interpreter is a single binary on practically any OS, and is usually around 5MB in size. It's installed by default on essentially every OS (aside from Windows, which is an easy single exe download). Also with implementations like Mawk, it will destroy Perl in speed.


If your awk code is more than 20 lines, is it really cleaner?


AWK would work really well on plan9/9front, as most programs are short and really compact.

On "big" Unixen (and I am an OpenBSD/Void user), for larger tools, Perl is preferred.

On speed, that doesn't matter. Perl under OpenBSD has a pledge/unveil module, and it's used to write the package manager. Problem solved.


Yeah, but you gotta admit, this is cool. It may not be the most practical or useful thing, but it is cool that someone made Git in AWK.


Yeah, and so is a Z-machine (v3) in PostScript, but I wouldn't write an IF interpreter in AWK.


> use Perl

Perl is widely known as a write once read never language for a reason. In that sense I'll grant you awk is no better, but if you are stepping into scripting land you may as well use something generally readable (e.g. python, or just use bash).


>Perl is widely known as a write once read never language for a reason.

From clueless Z-ers with no actual Perl experience, maybe.


I am an X-er whose first experience with Perl was working in enterprise software tech support in 1999 and diagnosing a crash due to a syntax error that had somehow shipped in the install script for a GA product with a ~$100K annual license/support fee.

While “write once, read never” is hyperbolic, it captures a real issue for Perl compared to many other competitive languages (of which, to be fair, there were far fewer widely supported for the tasks Perl was most often chosen for 20+ years ago.)

That's not to say I don't find some things nice about Perl, and I’d love to be able to spend more time with Raku, which grew out of it.


Question: could it be possible to translate/"compile" perl into equivalent AWK-lib version?

Maybe I'm wrong, but I thought Perl code tended to be "higher-level", whereas AWK is optimised C throughout?


ok, I found https://github.com/noyesno/awka so it is possible, though it uses its own lib to link against - I don't know how much of that is based on AWK C vs original code.


AWK is an interpreter.


Awk has many interpreters and some compilers. awka is an awk compiler. lawk is an awk-JIT.


hence "compile" in quotes, and mention of AWK-lib instead of AWK. Ultimately, interpreted AWK will still call compiled machine code, no?


Apparently astrologers declared this the year of Git implementations. Production of Git implementations has increased five-fold.


Astrologers you say.. But is no one going to talk about how mercury retrograde wiped out Facebook ..?


My mother-in-law, who's into this sort of stuff, warned me about this a few days before the Facebook blackout. Her exact words were "Mercury is in retrograde and that is bad news for technology". I just brushed it off, and a couple of days later Facebook had the worst outage in its history. Gave me chills. But not enough to change my sceptic ways (again, her words).


Mercury retrograde happens every few months and lasts three weeks. And it's also "bad news" for travel, change, decisions, activity...

I'd get chills too, tbf, but your MIL's gnostic pronouncement is total foofoo


> Astrologers you say.. But is no one going to talk about how mercury retrograde wiped out Facebook ..?

For a fact, Facebook engineering uses mercurial instead of git in their tooling. They are going to do some 'hg rebase' whenever the stars are out of alignment =)


Nitpicking. :)

I know about 'shit' (== Shell Git) at https://git.sr.ht/~sircmpwn/shit from last year.

https://en.wikipedia.org/wiki/Git#Implementations lists JGit (<=2012), Go-git (2015), Dulwich (2008), libgit2 (<=2010), and JS-Git (<=2013).

A five-fold increase means at least 30 new implementations. :)


Production is the rate, per year.


Is it? Well, given last year's 'shit', there should be another four this year, yes?


I really want to see a C compiler written in shell.



Now can we unsee it?


It isn't a good idea for a production system or anything, but the code itself seems pretty clean. I've seen infinitely worse bash scripts than this compiler.


I wonder if it's good enough to bootstrap better and better compilers until you can build full on GCC?


Here's a couple of projects along those lines. https://bootstrappable.org/projects/mes.html

> Mes aims to create an entirely source-based bootstrapping path.

https://savannah.nongnu.org/projects/stage0/

> A class of minimal bootstrap binaries that has a reproducible build on all platforms. Providing a verifiable base for defeating the trusting trust attack.

(and comments at https://news.ycombinator.com/item?id=20264848 ).


haha love the subtle Heroes of Might and Magic reference


Among the Russian-speaking IT crowd, the "astrologers declared a week of X" joke is quite widespread, even overdone.


Is HOMM popular with the Russian speaking IT crowd?

HOMM2/3 are my favorite games ever.


HoMM3 has a cult following among lots of 25-45 year old males. During recent elections, one of the candidates mentioned that he enjoys playing HoMM3 a lot; comments under the video were like 80% "I'll vote for anyone who plays HoMM3" and the like.


In the 2000s they were among the go-to strategy games. The hot-seat mode probably helped. Of course, the publishers and devs saw little profit from all that following—unlike middle-aged dudes selling pirate CDs in mall stands.

HOMM 5 was developed by Nival, a major Russian dev and publisher at that time—though the 4 and 5 games weren't as famed, afaik.


Yeah, I fucking love subtlety. Tons of it.


someone needs to make a script that can scrape tweets/medium articles and predict what the next hot pet project will be. you could have your git implementation out and on your resume six months before the rush with a system like this.


Btw, I’m developing an IntelliJ IDEA plugin for Awk: https://github.com/xonixx/intellij-awk

You can already install it from marketplace.


The only really horrid thing about awk is the way you declare local variables in a function ... by just adding more parameters (and letting them be default-initialised on calls). Otherwise it can act like a relatively conventional scripting language, aware of associative arrays (ahead of its time).
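For anyone who hasn't seen the idiom, here is a minimal sketch of it, assuming any POSIX awk. The names `sum_to`, `sum`, and `limit` are made up for illustration. The extra parameters after the wide gap are never passed by the caller, so they start out empty and act as locals.

```shell
# Sums the integers 1..N. "i" and "total" are locals: they are listed as
# extra parameters that no caller supplies, so awk initialises them empty.
sum_to() {
    awk -v limit="$1" '
        function sum(n,    i, total) {
            for (i = 1; i <= n; i++)
                total += i
            return total
        }
        BEGIN {
            i = "untouched"        # global i; the function must not clobber it
            print sum(limit), i
        }'
}

sum_to 4    # prints "10 untouched"
```

The wide gap in the parameter list is only a convention for readers; awk itself treats all five names the same way.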


Back in Google New York, sometime after 2006, but prior to my team changing floors after the YouTube acquisition, my team sat about 2 cubicles over from Peter Weinberger, and a bit further from Brian Kernighan's office. My manager, John Sarapata was standing up and complaining how much he hated awk, and I ducked a bit and said "You know AWK stands for Aho, Kernighan (motion toward Brian's office), and Weinberger (motion toward Peter's desk), right?" They're both very nice people, so I don't think they would have been too offended, but it just felt odd to be boldly talking behind the backs of such well-recognized luminaries. Last I checked, John was very senior in Google NYC, so I guess he's a bit more careful to get full context before complaining loudly. Good times.


1. This is an incredible story haha. 2. I constantly talk about how much I hate python or other languages that I actually like. Sometimes the things you complain about most are the things you need most too


"There are two types of programming languages: those everyone complains about, and those nobody uses."


- Bjarne Stroustrup.

I agree with him. His book "The Design and Evolution of C++" was a fascinating read. His site has an overview of it.


A big problem with Awk is that it lacks garbage collection (including ref counting), and that puts major limitations on the language. You can't return an associative array from a function:

    $ awk 'function f() { a[1]=2; return a } BEGIN { f() }' </dev/null
    awk: cmd. line:1: fatal: attempt to use array `a' in a scalar context

You also can't have nested associative arrays, i.e. recursive or cyclic data structures are not allowed.

As far as I can tell, this is because a stack frame owns everything allocated within it, and when a function returns the whole stack frame is cleaned up unconditionally. You can pass arrays down but not up. It's very naive memory management (by modern standards; it's probably better than BASIC).

So I'd say it's clearly not expressive enough for general purpose programming. There are Lisps in awk but they do weird tricks with text as far as I remember.
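The usual workaround for "down but not up" can be sketched like this, assuming POSIX awk: the caller owns the array and the function fills it in by reference instead of returning it. The names `word_counts`, `tally`, and `seen` are hypothetical.

```shell
# Counts whitespace-separated words on stdin.
word_counts() {
    awk '
        # "counts" belongs to the caller and is passed by reference;
        # n, i and words are locals declared as extra parameters.
        function tally(line, counts,    n, i, words) {
            n = split(line, words)
            for (i = 1; i <= n; i++)
                counts[words[i]]++
        }
        { tally($0, seen) }
        END { for (w in seen) print w, seen[w] }
    ' | sort
}

printf 'a b a\nb c\n' | word_counts
```

The `sort` is there because the iteration order of `for (w in seen)` is unspecified.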


> There are Lisps in awk but they do weird tricks with text as far as I remember.

Not quite -- https://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/lang/... was the first and used two global associative arrays, car[n] and cdr[n], where n is an integer which iirc had to have a particular tag modulo 4, as the type tag. (Much as you'd do it in a lower-level C implementation.) I used almost the same scheme in my later Lisp.
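A rough sketch of that kind of scheme, not the original code: cells live in two global arrays, and the low bits of each value say what it is. The tag layout here (multiples of 4 are cons cells, 0 is nil, remainder 1 is a small integer) and all the function names are assumptions for illustration.

```shell
# Builds a Lisp-style list from the arguments and sums it by walking cdr[].
list_sum() {
    awk '
        # Hypothetical tag layout: value % 4 == 0 is a cons cell (0 itself
        # is nil), value % 4 == 1 is a small integer.
        function make_int(v)   { return 4 * v + 1 }
        function int_value(x)  { return (x - 1) / 4 }
        function is_pair(x)    { return x != 0 && x % 4 == 0 }
        function cons(a, d) {
            last_cell += 4
            car[last_cell] = a
            cdr[last_cell] = d
            return last_cell
        }
        BEGIN {
            list = 0                                  # nil
            for (i = ARGC - 1; i >= 1; i--)
                list = cons(make_int(ARGV[i]), list)
            for (p = list; is_pair(p); p = cdr[p])
                total += int_value(car[p])
            print total + 0
        }' "$@"
}

list_sum 1 2 3    # prints 6
```

Since every reference is just a number, nothing ever has to be returned as an array, which sidesteps the limitation discussed above.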


Yeah this is cool, I was thinking of the "normal" way of doing it where evaluating a Lisp function reinvokes the interpreter function. But there are obviously other ways to do it if you have global arrays, as awk does.


> it's clearly not expressive enough for general purpose programming

The problem of persisting arrays usually leads to handling them in bash.

Now you have two problems.[1]

[1] If you paraphrase Jamie Zawinski in an explanation, you then have two things to explain.


Not sure what you mean because bash has the same problem -- you can't return an array from a function.

And I still think Shell, Awk, and Make Should be Combined. In that world you'd still have one problem :)

https://www.oilshell.org/blog/tags.html?tag=awk#awk

Oil has recursive data structures, although I'm still working out how to compose shell-like "procs" and functions which could return an associative array. We could just add them both naively but I think it would cause many shell programs to be a lot messier.


You don't have to in bash because you can just declare that your array is global inside your function. In fact, this is the default behavior.


Sure but you can also declare global arrays in Awk. The point is that Awk and bash are the same in several respects, so you don't really gain anything by switching to one or the other:

    - they don't have any garbage collection or ref counting
    - they don't have recursive data structures
    - you can't return any kind of array from a function

In bash you also can't pass an array to a function, but in Awk you can.


Thank you. This comment alone is enough reason to never touch Awk if I can help it.


You definitely want to work with it if you can. I live much of my life in python, but there hasn't been a week, and usually not a day, in the last 10+ years where I don't write a quick awk snippet to do something useful to data.

The advantage of awk over python is that it almost always takes < 60 seconds from the point at which you say, "I want to summarize, review, filter, report" on some columnar data to the point at which you have output in front of you.

I'm fast with python doing that - but it's usually a 3-4 minute set of steps.

The only exception being json, which awk is clumsy with. (jq is my goto tool for the simple queries, python for anything that requires more than a minute of thought.)

Also - it's almost always the case that I'm using awk in conjunction with a bunch of other tools, cut/sed/uniq/grep to get what I want.
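A typical snippet of this kind, as a sketch assuming whitespace-separated input with a key in column 1 and a number in column 2 (the name `sum_by_key` and the sample data are made up):

```shell
# Sums column 2 per distinct value of column 1.
sum_by_key() {
    awk '{ sum[$1] += $2 } END { for (k in sum) print k, sum[k] }' | sort
}

printf 'get 120\npost 30\nget 80\n' | sum_by_key
```

In practice the input would come from `cut`, `grep` or similar earlier in the pipeline rather than from `printf`.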


Many BASICs had GC to clean up strings -- even on rinky-dink 8-bit micros.


I have really warm feelings about AWK, actually! It’s such a simple language that you can learn pretty much all of it in an afternoon, but it is still expressive enough to be really useful as a text processing language. I have an AWK script that turns my CSV bank statements into Ledger transactions, and AWK is just perfect for that kind of thing.
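A minimal sketch of what such a script might look like, not the actual one. It assumes a hypothetical export with a header row and `date,description,amount` columns, no quoted commas, spending shown as negative amounts, and made-up account names and currency.

```shell
# Turns "date,description,amount" CSV rows into Ledger transactions.
csv_to_ledger() {
    awk -F, '
        NR == 1 { next }                       # skip the header row
        {
            printf "%s %s\n", $1, $2
            # the bank lists spending as negative, so flip the sign
            printf "    Expenses:Unknown  %.2f USD\n", -$3
            printf "    Assets:Checking\n\n"
        }'
}

printf 'date,description,amount\n2021-10-07,Coffee Shop,-4.50\n' | csv_to_ledger
```

The second posting has no amount, which Ledger fills in so that the transaction balances.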


Didn't mean to sound negative about it, I love awk too. It is definitely small enough to learn in a day (or two), and it fills a niche people don't realise they have.


LOL that's what I do too.

A bit of pdf2txt, awk and some shell scripting, and budgeting becomes very easy...


Can you share? I've been thinking of doing the same and would love to see some prior art.


be wary that the initial parsing is probably going to be bespoke for your bank account csv export format. my own bank doesn't even share consistency in these exported csvs between the checking, savings, and the credit card I have with them.


Awk more or less introduced associative arrays to modern programming.


SNOBOL4 was earlier than AWK with associative arrays by a little under ten years. However, from the standpoint of manipulating strings with regular expressions, AWK introduced the concept as far as I know.


I did say "more or less" and "modern". Who's heard of SNOBOL4 now? :) My guess is that every other language that got AA's got them by way of influence from awk (or something else that got them from awk); awk itself may have gotten them from SNOBOL4 though.

But ok, we can at least say that awk is the oldest programming language to have associative arrays that's still in widespread use. (Yes, I'm hoping that someone will argue for SNOBOL4's current widespread use. :) )

> However, from the standpoint of manipulating strings with regular expressions, AWK introduced the concept as far as I know.

Two can play at this game! `awk` is really a successor of `sed`, which is all about manipulating strings with regular expressions. sed maybe isn't turing complete though (but someone's gonna prove me wrong here too) or at any rate not convenient to use in as general a way as awk.


> sed maybe isn't turing complete though (but someone's gonna prove me wrong here too)

Done! :) Since a turing machine has been written in sed, sed is turing complete. [1] [2] (both from Peter of Browserling).

  [1] https://catonmat.net/proof-that-sed-is-turing-complete
  [2] https://catonmat.net/ftp/sed/turing.txt


Thank you, kind friend.

> Christophe isn't the first person to realize that sed is almost a general purpose programming language. People have written tetris, sokoban and many other programs in sed.

Wow! But clearly, you wouldn't WANT to except for the fun of it.


I wouldn't want to :). It looks very expressive, much more than I had known just 3 days ago, but maybe it's a little loose around the edges for my taste.


String pattern matching in Snobol4 was still stronger than regex for a very long time, possibly still today. A language very ahead of its time.


Yep, I recommend Gimpel's old book Algorithms in SNOBOL4 for fun examples (from the days when such a title did not mean an undergrad data-structures-and-algorithms curriculum).


Agreed it is an excellent book. Even the Catspaw Spitbol manual is well written. Along with the Green Book. An often forgotten period of computing!


We are used to seeing awk as very condensed one-liners, so the code in the repository is surprisingly readable. This makes it a fun project which can be used to learn a lot about both awk and git. Nice!


Technically this isn't quite compatible, because it uses a different hash algorithm. Git as of v2.13.0 uses a hardened SHA-1 algorithm to counter the SHAttered attack. In normal cases you won't see the difference.


If you'll excuse me, I'm going to implement GIT in 6502 assembler.


Finally version control for Famicom BASIC, it was about time.


Back in the days of yore, I briefly borrowed this Chinese import thing: https://74.img.avito.st/640x480/1870138674.jpg

— which both could play cartridges and had Basic, with absolutely unlicensed Mario sprites in the mem.

But never connected the dots that there was an official thing like that, even though vaguely heard of the Famicom brand's various forays into neighboring markets of the day.


Yes, the Famicom had Family BASIC (it was a Family Computer after all!), an official Nintendo release only in Japan. What you had on your Subor was probably a totally unlicensed copy of Family BASIC, as that program did come with Mario and Donkey Kong assets to play with.


Git in Brainfuck - marriage made in hell


Afterwards follow up with PROLOG and Brainfuck, please.


I implemented a Whitespace interpreter in jq!

https://github.com/andrewarchi/wsjq


A version of PROLOG in awk is here - https://github.com/prolog8/awkprolog - but it would be interesting to see those problems for students in AI which are done with awk.


> Afterwards follow up with PROLOG and Brainfuck, please.

I think one of these languages is being insulted. Not sure which...


Getting started on a malbolge implementation, check back in approximately 1000 years.


> Getting started on a malbolge implementation, check back in approximately 1000 years.

This could be an appropriate software challenge for the Long Now Foundation [0], when they have finished that clock.

[0] https://longnow.org/clock/


Would it be cheating to use that Lisp interpreter written on-top of Malbolge?


Do it in FORTH - then we can have Git running on bare metal on some obscure architectures.


z80?


It's a pity that this apparently needs gawk. If it could run in the awk included in busybox/toybox it could actually be useful in tiny installations.


How would it be useful? Can anyone show me a non-SBC embedded system in which busybox/toybox are relevant?


But developing on SBCs is exactly the use case!


If one is gonna develop on a SBC, he can utilize proper GNU/Unix tools not trimmed-down equivalents from busybox/toybox. In fact, I fail to see a use-case for these tools in embedded space whether on SBCs or not.


https://github.com/djanderson/aho/blob/8cd5cb737a3296cd3b3fe...

Wouldn’t it be more in the spirit of awk to write this as

    editor = config::get("core.editor") || ENVIRON["EDITOR"] || "vi"


This would be in spirit of JS. In Awk, afaik, ‘A || B’ will result in boolean (1 or 0).


Yep, the closest thing is ?:

    editor = config::get("core.editor")

    if (!editor) editor = ENVIRON["EDITOR"] ? ENVIRON["EDITOR"] : "vi"
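A quick way to see the difference, as a sketch assuming POSIX awk. The function `pick_editor` and its two arguments are made up; they stand in for the config value and `ENVIRON["EDITOR"]`.

```shell
# $1: configured editor (may be empty), $2: EDITOR value (may be empty)
pick_editor() {
    awk -v configured="$1" -v env_editor="$2" '
        BEGIN {
            naive = configured || env_editor || "vi"   # always 1 here
            editor = configured
            if (!editor) editor = env_editor ? env_editor : "vi"
            print naive, editor
        }'
}

pick_editor "" "nano"    # prints "1 nano"
```

The `||` chain collapses to 1 because the constant "vi" is a non-empty string and therefore true, while the `?:` form keeps the actual value.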


See also awk-jvm [1], a toy JVM in awk. Though, inexcusably, that one also uses gawkisms rather than awk proper, when portability is the one thing awk has over other mini langs ;)

[1]: https://news.ycombinator.com/item?id=23612910


wait awk has functions?


Fun fact: early awk did not. The Lisp interpreter in awk which I linked in another comment was from those days and worked entirely as a big nested loop, using no functions besides the built-ins.


Weird flex(sic) but OK.




