TXR: A Programming Language for Convenient Data Munging (nongnu.org)
91 points by hashx on Oct 4, 2014 | 38 comments



I wish their page included something along the lines of "Why do I care?"

Maybe a few examples of "data munging" tasks which the authors view as poor fits for [language X] and how their stuff solves the problem better.

Maybe something like "why is our language better than regexps in whatever language environment you already know?"


There is a page with a navigation frame giving Rosetta Code examples, syntax-colored, with back links to Rosetta:

http://www.nongnu.org/txr/rosetta-solutions.html

TXR has regexps if you need them. The regex engine is geared in a different direction from mainstream regex engines: it doesn't have anchoring, register capture or Perl features like lookbehind assertions. On the other hand, it has intersection and negation (without backtracking).
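
For instance, negation gives you grep -v style filtering right inside a pattern. A rough, untested sketch (the /~.*ERROR.*/ regex uses the ~ complement operator to match only lines that do not contain "ERROR"):

  @(repeat)
  @{line /~.*ERROR.*/}
  @(do (put-line line))
  @(end)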

TXR translations of Clojure, Common Lisp and Racket solutions to the same problem:

http://www.nongnu.org/txr/rosetta-solutions-main.html#Self-r...


I saw those; what I miss is "This is why I think this new way is better".

If it's supposed to be obvious by inspection, well... I guess I'm too unenlightened.


I'd say its multi-line approach sets it apart from, say, sed or awk.


TXR looks rather like the CRM114 language, which has been used to implement some amazingly accurate text classifiers (some better than most people sorting their own mail), though TXR is a bit less bizarre and, I think, more accessible: http://crm114.sourceforge.net/docs/INTRO.txt

CRM114 too treats pattern matching as the fundamental construct and has blazing performance for it and for certain kinds of number crunching (it has to), but I don't think it's nearly as useful for the average hacker trying to munge a couple of text files. Still, worth a look, both for users and possibly for language implementors.


Reminds me of a trick I do with mustache.java. Templates can not only be used to generate output; because of the declarative nature of the mustache language, they can also be used to parse output back into the data that, in combination with the template, would generate that output. Makes for pretty intuitive parsers. In my case, all text that isn't a templating declaration is a regex.
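
Which is essentially the duality TXR is built around: the very same template line can both match input and regenerate it. A minimal, untested sketch (field names made up):

  Hello @name, you have @count messages.
  @(output)
  Hello @name, you have @count messages.
  @(end)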



As more and more of my data-related work is consumed by data munging/cleaning, I'm convinced that a language/framework devoted to data munging is at least as important as those devoted to data visualization.


It looks ugly and awkward to type. It doesn't seem like it would be a pleasure to write programs in.


Oh it can be. Typically, if I need to do some text transformation or extraction, I start by getting sample data and renaming it to a .txr suffix. Then just generalize that data into the TXR pattern that matches it and gets out what is needed.

As an example, I was doing some kernel work and needed patches to conform to the kernel's "checkpatch.pl" script. Unfortunately, this thing outputs diagnostics in a way that Vim's quickfix doesn't understand; I wanted to be able to navigate among the numerous sources of errors in the editor.

First I looked at the checkpatch.pl script hoping that of course they would have the diagnostic output in one place, right? Nope: formatting of messages is scattered throughout the script by cut-and-paste coding.

TXR to the rescue. First, some sample output from checkpatch.pl:

   WARNING: line over 80 characters
   #279: FILE: arch/arm/common/knllog.c:1519:
   +static void knllog_dump_backtrace_entry(unsigned long where, unsigned long from

   WARNING: line over 80 characters
   #321: FILE: arch/arm/include/asm/unwind.h:50:
   +extern void unwind_backtrace_callback(struct pt_regs *regs, struct task_struct

   WARNING: line over 80 characters
   #322: FILE: arch/arm/include/asm/unwind.h:51:
   + void dump_backtrace_entry_fn(unsigned long where,

   WARNING: line over 80 characters
   #323: FILE: arch/arm/include/asm/unwind.h:52:
   + unsigned long from,

And the quick and easy TXR query:

  @(repeat)
  @type: @message
  #@code: FILE: @path:@lineno:
  @(output)
  @path:@lineno:@type (#@code):@message
  @(end)
  @(end)

Result (redirected into errors.err, loads with vim -q):

  arch/arm/common/knllog.c:1519:WARNING (#279):line over 80 characters
  arch/arm/include/asm/unwind.h:50:WARNING (#321):line over 80 characters
  arch/arm/include/asm/unwind.h:51:WARNING (#322):line over 80 characters
  arch/arm/include/asm/unwind.h:52:WARNING (#323):line over 80 characters
  arch/arm/include/asm/unwind.h:53:WARNING (#324):line over 80 characters
  arch/arm/kernel/unwind.c:352:ERROR (#337):inline keyword should sit between storage class and type

The nice thing is that we know what the above does when we revisit it six months later.


It's a very ugly language; I don't think anyone is going to disagree there.

That being said, it has some intriguing features that I'm not going to dismiss. I work with COBOL on a daily basis, so I'm not going to say no to a new language just because it's ugly. There seems to be a lot of utility here.


Purely out of curiosity: what is it that you do that forces you to work with COBOL on a daily basis?


I mostly do legacy code conversion for the banking sector. It's all contract work, so it varies, but 95% of the time that's my deal.


Likewise! I'm curious too. Why is COBOL your main language? Bank legacy servers?


My first reaction was also that it didn't look very clean. But after some admittedly cursory comparison of how you'd do something in TXR to a few existing scripts I have (some Perl, some awk, and some chaining Unix utilities), it doesn't look terrible, and maybe even good. I should emphasize this is based on like 30 minutes of looking at it though, not serious knowledge of how TXR works.

One part that seems nice is that it handles multi-line constructs in a way that isn't horrible. Perl and awk have a big complexity jump once you go past one-line records, and most of the traditional Unix utilities just don't handle them at all (stuff like cut/join/sort only works on single-line, delimited records). Since constructs like Perl's while(<INPUT>) stop automatically doing the Right Thing once you get multi-line records, the usual next stop is that you're manually maintaining a state machine.
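
As a sketch of how TXR sidesteps that, a two-line record is literally written as two lines (hypothetical fields, untested):

  @(collect)
  Name: @name
  Age: @age
  @(end)
  @(output)
  @(repeat)
  @name is @age years old.
  @(end)
  @(end)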


Yes, all those at-signs make it look like a combination of lisp and perl, which probably won't excite too many people.

But I'd say that data munging is inherently ugly. I don't really see myself using this as the next tool to write clever algorithms that will stand the test of time, but if you offer me this as a stand-in for the usual shell-script/awk/sed/perl/printf/regexp mess you need for ad-hoc file transformations, I'm suddenly listening.


The at signs in the TXR pattern language are there because everything else is literal text: TXR can match reams of it.

This is hard to show in small examples, so small examples become dense with the notation. Just like, say, tiny examples of HTML become a dense soup of tags.

Note that TXR Lisp doesn't have the at signs. You can write a pure TXR Lisp program by wrapping the whole file with @(do ... ).
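
For instance, a file like this is a pure TXR Lisp program (a minimal sketch):

  @(do
    (each ((line (get-lines)))
      (put-line (upcase-str line))))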

TXR looks a lot better with syntax highlighting; unfortunately, this only exists for Vim. On the other hand, the syntax highlighting definition file for Vim is quite good.


TXR has some really cool features and seems very well suited to the domain. If you're going to dismiss a perfectly good tool just because it "looks ugly", then you're just being unprofessional. And if it's "awkward to type", you can always write a transpiler if you really need one, or more likely a couple of macros/snippets for your editor.


Cool ideas. I really like that it has support for grammars. What's the performance like compared to Perl on similar tasks?


Kaz Kylheku is one of the kooks from comp.lang.lisp where lisp is the One True Language. The funny thing is that TXR is written in C!

Kaz: How come you didn't write TXR in lisp?


Because I'm also one of the kooks from comp.lang.c where C is the One True Language.

But seriously, TXR is built on its own Lisp: an infrastructure which provides the managed environment and data representations which also support the TXR Lisp dialect.

This is no different from any Lisp implementation based on a C kernel, like CLISP, GNU Emacs, ...

If you do it from scratch, you lose a lot: you don't have a mature, optimized dynamic language implementation. But, by the same token, you can experiment in ways that you normally wouldn't. You get to dictate things like, oh, what is a cons cell. I have lazy conses that look like ordinary conses: they satisfy consp, and work with car, cdr, rplaca and rplacd. You can invent new evaluation rules. I came up with a way to have Lisp-1 and Lisp-2 in a single dialect, seamlessly, with the conveniences of both. I have Python-like array access. I made traditional Lisp list operations work with vectors and strings: you can mapcar through a string and so on.

Sequences and hashes are functions. orf is a combinator that combines functions analogously to the Lisp or operator. If hash1 and hash2 are hash tables, you can do something like [orf hash1 hash2 func] to create a one-argument function that looks its argument up in hash1; if that returns nil, it tries hash2, and if that returns nil, it passes the key to func and returns whatever that returns. Or ["abc" 1] returns the character #\b. [mapcar "abc" '(2 0 1)] yields "cab": the numeric indices are mapped through "abc", as if it were an index-to-character function. Fun things like this are good reasons to experiment with your own dialect.
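
Collecting those examples into one snippet (a sketch; the hash contents and the :default fallback are made up):

  ;; hashes as functions, combined with orf
  (let ((h1 (hash))
        (h2 (hash))
        (fallback (lambda (k) :default)))
    (set [h1 'a] 1)
    (set [h2 'b] 2)
    (let ((look [orf h1 h2 fallback]))
      (list [look 'a] [look 'b] [look 'c]))) ;; -> (1 2 :default)

  ;; strings as sequences and as functions
  ["abc" 1]               ;; -> #\b
  [mapcar "abc" '(2 0 1)] ;; -> "cab"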

I believe TXR is a great companion if you're a Lisper working in ... one of those other environments.

Ah, one more thing. Well, two, or maybe three. Part of why I used C was to create a project whose tidy, clean internals stand in stark contrast to some of the popular written-in-C scripting languages. You know, to sock it to them! See, there is a hidden agenda: the call of "I can do this better". If you use C, then a more direct comparison is possible. Secondly, people widely understand C. Give them a cleanly written project in C, and maybe they will hack on it, and from there understand something about Lisp too. C also means low dependencies from the point of view of packaging: easy porting with just a basic shell environment, make and a C compiler. Cross-compiling for ARM or whatever is a piece of cake. Easy work for package maintainers, ...


I don't buy it.

TXR is not built "on its own Lisp", it's built on C. If you believe that lisp is so great, then why didn't you just use ANSI Common Lisp? Why is TXR even necessary when I can do all the same data processing stuff in Perl, which is far more versatile and ubiquitous?

And all this nonsense about writing TXR in C because it's "more widely understood", "low dependencies", "easily packaged" - after 15-some years of advocacy in comp.lang.lisp, it's laughable that defsystem, asdf, and SBCL/CLISP/CMUCL aren't good enough for you.

Lisp is either as good as all the Naggums, Tiltons, and Pitmans of c.l.l. proclaim, or it's not. By writing TXR in C, you've just proved that it's not.


Oh shit.

SBCL's runtime contains traces of C. https://github.com/sbcl/sbcl/tree/master/src/runtime

CLISP is written on top of C.

CMUCL's runtime contains traces of C.

Now we are fucked...

I'm so glad that at least my Lisp Machine has no C. Oh wait, it has a C compiler...


The point is that lisp advocates rarely seem to use any of these lisp implementations to do anything noteworthy or useful. They always seem to fall back on C, or some other language that's more "widely available" or "has minimal dependencies" or "has more potential contributors" or "can be more easily compared with other similar programs".

I find this hypocrisy to be quite intriguing.


> The point is that lisp advocates rarely seem to use any of these lisp implementations to do anything noteworthy or useful.

That's possible. There are many Lisp dialects and implementations which have few applications. That's true for a lot of other language implementations, too. There are literally thousands of implementations of various programming languages with very few actual applications. Maybe it is fun to implement your own language from the ground up. Nothing that interests me, but it does not bother me.

If he wants to implement a small new Lisp dialect, it's perfectly fine to implement it in C or similar.

> They always seem to fall back on C, or some other language that's more "widely available" or "has minimal dependencies" or "has more potential contributors" or "can be more easily compared with other similar programs".

Some new dialect is written with the help of C? That bothers you?

Wow.

Actually 95% of all Lisp systems contain traces of C and some are deeply integrated in C or on top of C (CLISP, ECL, GCL, CLICC, MOCL, dozens of Scheme implementations and various other Lisp dialects). There are various books about implementing Lisp in C.

Really nobody in the Lisp community loses any sleep that somebody implements parts of Lisp in C.

> I find this hypocrisy to be quite intriguing.

Because some random guys implement their own language in C? Why do we have Python, Ruby, Rebol? There was already PERL or AWK or ... Somebody decided to write their own scripting language. So what?


> Because some random guys implement their own language in C? Why do we have Python, Ruby, Rebol? There was already PERL or AWK or ... Somebody decided to write their own scripting language. So what?

When a Python advocate wants to do some data processing, do they first write their own Python implementation in C? No. When a Ruby advocate wants to make a Rails website, do they first write their own implementation of Ruby in C? No.

Several fine implementations of lisp already exist that compile down to machine code and, if the lisp community is to be believed, have performance "close to C". So why does a lisp advocate feel the need to re-write lisp in C for a project that didn't actually need it? The lisp community would have us all believe that lisp is the "programmable programming language", and all the other rhetoric about how every other language has just stolen ideas from lisp, etc., etc.. They all truly seem to believe that lisp is something special. That's why I find it laughable that someone like Kaz Kylheku, a 15 year veteran of comp.lang.lisp, decided not to implement TXR by using a pre-existing lisp implementation.


> When a Python advocate wants to do some data processing, do they first write their own Python implementation in C?

They write it in C. Check out the Python world sometime.

* CrossTwine Linker - a combination of CPython and an add-on library offering improved performance (currently proprietary)

* unladen-swallow - "an optimization branch of CPython, intended to be fully compatible and significantly faster", originally considered for merging with CPython

* IronPython - Python in C# for the Common Language Runtime (CLR/.NET) and the FePy project's IronPython Community Edition

* 2c-python - a static Python-to-C compiler, apparently translating CPython bytecode to C

* Nuitka - a Python-to-C++ compiler using libpython at run-time, attempting some compile-time and run-time optimisations. Interacts with CPython runtime.

* Shed Skin - a Python-to-C++ compiler, restricted to an implicitly statically typed subset of the language for which it can automatically infer efficient types through whole program analysis

* unPython - a Python to C compiler using type annotations

* Nimrod - statically typed, compiles to C, features parameterised types, macros, and so on

and so on...

> So why does a lisp advocate feel the need to re-write lisp in C for a project that didn't actually need it? The lisp community would have us all believe that lisp is the "programmable programming language"

Why don't you understand the difference between 'a lisp advocate' and 'the lisp community'?

> and all the other rhetoric about how every other language has just stolen ideas from lisp, etc., etc..

Nonsense.

> That's why I find it laughable that someone like Kaz Kylheku, a 15 year veteran of comp.lang.lisp, decided not to implement TXR by using a pre-existing lisp implementation.

I find it laughable that you find it laughable...


Every single Python project you cited simply proves my point. They are Python compilers of some sort. TXR, on the other hand, is a data processing language implemented in its own lisp, which is implemented in C. In other words, TXR is an application of lisp, not just a compiler or interpreter like those Python projects you listed. So, all your examples are irrelevant.

TXR didn't need its own dialect of lisp. So, the question remains: why didn't Kaz use SBCL or CLISP? They're good enough for c.l.l. kooks like him to recommend to everyone else, but why're they not good enough for him to use?


The kook here is you, and I can prove it: you have a bizarre view that developers should be divided into political parties based on programming language, and code strictly to the party lines. Bizarre views make the kook.

TXR does need its own dialect of Lisp because Common Lisp isn't suitable for slick data munging: not "out of the box", without layering your own tools on top of it.

This is a separate question from what TXR is written in. Even if TXR were written using SBCL, it would still have that dialect; it wouldn't just expose Common Lisp.

That dialect is sufficiently incompatible that it would still require writing a reader and printer from scratch, and a complete code walker to implement the evaluation rules of the dialect. Not to mention a reimplementation of most of the library. The dialect has two kinds of cons cells, so we couldn't use the host implementation's functions, which understand only one kind of cons cell. So, whereas some things in TXR Lisp could be syntactic sugar on top of Common Lisp, others could not be.

Using SBCL would have many advantages in spite of all this, but it would also reduce many opportunities for me to do various low-level things from scratch. I don't have to justify to anyone that I feel like making a garbage collector or regex engine from scratch.

So, the reasons for not using "SBCL" have nothing to do with "good enough". It's simply about "not mine".

TXR is a form of Lisp advocacy.

TXR is also (modest) Lisp research; for instance I discovered a clean, workable way to have Lisp-1 and Lisp-2 in the same dialect, so any Lispers who are paying attention can stop squabbling over that once and for all.

It pays to read this:

http://www.dreamsongs.com/Files/HOPL2-Uncut.pdf

Why we have Lisp today with all the features we take for granted is that there was a golden era of experimentation involving different groups working in different locations on their own dialects. For example, the MacLisp people hacked on MacLisp, and it wasn't because Interlisp wasn't good enough for them. Or vice versa.

That experimentation should continue.


> So, the reasons for not using "SBCL" have nothing to do with "good enough". It's simply about "not mine".

Kaz, the C programming language isn't yours either. My point is that Common Lisp is supposed to be a general purpose programming language with power far greater than a primitive language like C, but you chose to implement TXR in C simply because C makes it much easier for you to accomplish your goal than Common Lisp. I'm just trying to point out the obvious, which nobody from c.l.l. seems willing to admit.


Bizarre.


It's a tool with some embedded kind of Lisp dialect. There are zillions of it.

> why didn't Kaz use SBCL or CLISP?

Why should he? He can do whatever he wants. I personally don't care at all about what he does. Why do you? Kind of a strange obsession with comp.lang.lisp. Are you one of the trolls posting there?

> They're good enough for c.l.l. kooks like him to recommend to everyone else, but why're they not good enough for him to use?

Probably he did it to annoy real programmers like you?


> > why didn't Kaz use SBCL or CLISP?

> Why should he?

Kaz invested a bunch of time implementing a whole new backquote implementation for CLISP, but it's still not good enough for him to use CLISP to implement TXR? It doesn't make any sense!

Any right-thinking programmer should care about inconsistencies such as this. If I'm evaluating a programming language, and I see someone in its community writing their own language implementation to support an application that could've easily been written using one of the standard language implementations, then it looks to me like the standard implementations aren't mature enough or trustworthy enough for me to use for my application. Not only that, but it suggests that maybe this particular language isn't as good as its advocates claim, especially if I have to drop back down to C in order to meet certain requirements (e.g., portability, speed, wider understanding, etc.).

But any right-thinking programmer already knows that lisp is not worth wasting any time on. It's dead, and people like Kaz, and projects like TXR, are going to make sure it stays that way.


I am not convinced that CLISP can be used to write another programming language which is itself completely BSD-licensed.

See here:

http://sourceforge.net/p/clisp/clisp/ci/default/tree/COPYRIG...

CLISP's licensing is somewhat confusing and appears to dictate the license to the application. So, for example, I probably wouldn't use it for a commercial, closed-source application. For the same reasons, it cannot be used for a BSD-licensed application.

(However, I did use CLISP for the licensing back-end of such an application: that back-end runs on a server and isn't redistributed. Things you don't distribute to others cannot run afoul of the GPL.)

CLISP's license lets you make compiled .fasl files, and these are not covered by its copyright (unless they rely on CLISP internal symbols). However, that is where it ends. Memory images saved with CLISP are under the GPL. (Memory images are the key to creating a stand-alone executable with CLISP!) If you have to add libraries to CLISP itself, you also run into the GPL. I believe that this would cause issues for the users of TXR which they do not have today. For a user to be able to run the .fasl files, they need CLISP, and of course that has to be distributed to them under the GPL terms, and you can't add C libraries to that CLISP without tainting them with the GPL.

You can wrap TXR entirely in a proprietary application, including all of its internals: the whole image, basically. This wouldn't be possible if some of its internals were the CLISP image.

Regarding the GPL, I do not believe in it any more. I will not use this license for any new project. It is not a free software license in my eyes. Free really means you can do anything you want; any restriction logically means "not entirely free". Proprietary products that use free code do not take away anyone's ability to use the original. The problem with the FSF people is that they regard the mere existence of something as offensive: "It's not enough that there is a free program; we must litigate to death the non-free program which is based on the same code before we can be happy."


I think your views on the GPL are spot on. At least we can agree on that.


the 'right-thinking programmer'. How bizarre.


At first I thought this was TXL, the language for source code transformation: http://www.txl.ca/


Hmm, this looks more like parsing than munging to me, but then I guess "munging" is not exactly scientific terminology.

My own take on easy data transformations, if you'll allow me the plug: https://github.com/stdbrouw/refract



