How Lisp-family languages facilitate building bioinformatics applications (oxfordjournals.org)
118 points by abrax3141 on Jan 9, 2017 | 64 comments



When I started working in bioinformatics I was excited to use Lisp for my work. I wrote a considerable amount of library code and programs at first. But I always found resistance when it came to other people using my programs and even more if the program was to be published and released for general use.

There is a large amount of really, really bad Perl code still in use in biology. Trust me when I say to those outside of the field: I've seen things you people wouldn't believe. I try to discourage Perl use because it makes it so easy for biologists to write terrible code.

I've found a compromise in Python. People will happily use it because it's trendy and ubiquitous. And I think it makes it easier to write OK code. But I miss using Lisp. The style of programming when you have a REPL is perfect for this kind of thing. The ability to quickly write prototypes which can later be improved is an advantage which cannot be stressed enough. C is fantastic when you have a good idea of the algorithms and data structures you want to implement and are concerned with implementing them efficiently. But that's rarely the case when writing bioinformatics code. I need to know if the damn thing works first before I spend time making it work fast. And why would you not use a language where both things are equally easy?

I'm happy to see this paper because it means that I have a way to justify my usage of Lisp in the future. Big thanks to the authors.


A benefit of introducing LISP in bioinformatics is the cultural element that tools are expected to work on the command line using simple Unix streams for I/O. (As well as a systemic allergy to dependencies, preferring to do everything from scratch.) You can make drop-in Lisp replacements of shitty Perl tools and no one will be the wiser, as long as you are producing bit-compatible results. I imagine establishing trust in your tool will be harder if you provide them with Lisp vs. Perl source.
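To make the "drop-in stream filter" idea concrete, here is a minimal sketch in Python (the thread's common ground rather than Lisp, so it can be shown runnable); the FASTA parsing and the sequence-length output are illustrative, not any specific tool from the thread:

```python
import sys

def read_fasta(stream):
    """Yield (header, sequence) pairs from a FASTA-formatted stream."""
    header, chunks = None, []
    for line in stream:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def main():
    # Read FASTA on stdin, write tab-separated lengths on stdout,
    # so the tool composes with pipes like any other Unix filter.
    for header, seq in read_fasta(sys.stdin):
        print(f"{header}\t{len(seq)}")

if __name__ == "__main__":
    main()
```

The same shape works in any language that can read stdin and write stdout, which is exactly why the implementation language is invisible to downstream users.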

Other communities (e.g. statistics, machine learning) have a culture of ramming all their tools into a toolbox-library/framework (R, sklearn). This makes for a much more tightly coupled environment, making it much harder to introduce Lisp work.


I think that IPython is one of the best tools for scientists. Jupyter Notebooks that allow reproducible results, a lot of data-wrangling libraries, visualizations, distributed computing. I wish more scientists knew of IPython.


yes - and the thing that makes python approachable is the syntax and standard formatting. I love LISP; I'm not confused or annoyed by the parens, but many people are. LISP would be more popular if there were a good (browser-based) hybrid structural editor with the keyboard-navigation familiarity of a text editor, in a notebook format where the LISP code is displayed like Python (the parens hidden, and handled automatically, but serialized as normal LISP code behind the scenes). This is not that hard to do, and we're nearly there:

* http://gorilla-repl.org/

* https://github.com/kovasb/session

(Also my own early abandoned experiments: http://celeriac.net/iiiiioiooooo/ http://celeriac.net/iiiiioiooooo/public/ http://celeriac.net/iiiiioiooooo-dom/ http://celeriac.net/io/public/)


GorillaREPL is very nice. There is also a Proto-REPL that looks very promising (https://github.com/jasongilman/proto-repl)


See Wisp Scheme: http://srfi.schemers.org/srfi-119/srfi-119.html

It doesn't address the browser-based point.


Also in bioinformatics, and I use Python too, but not for the same reasons. Among languages with mature statistics/ML/numeric libraries, your choices are R and Python (MATLAB/Mathematica if you're willing to consider non-open-source). R is just terrible, so Python is all that's left if you want a sane language with batteries.

I used to write a decent amount of Clojure. I would be surprised if you would have problems with reviewers complaining about a nonstandard language -- unless you're publishing a library, then you might rightfully get dinged because the target audience would be small. But if you're just running a routine analysis, I have always found the reviewers don't care what language it's in.

Anyway, I dropped Clojure because the numeric support is just terrible. There are Java libraries, of course, but you lose the conciseness and the functional aspect. You just can't performantly use arrays or matrices as if they were generic sequences, as appealing as that appears from a distance.

I considered Racket, but lazy evaluation is a must for me (a la Python's generators), and I could never figure out how to use Racket's lazy mode.
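For what it's worth, the "lazy a la Python's generators" point can be sketched in a few lines; the sequence-length pipeline here is purely illustrative:

```python
def read_lengths(lines):
    """Lazily map raw lines to sequence lengths; nothing is
    computed until a consumer pulls a value."""
    for line in lines:
        yield len(line.strip())

def long_seqs(lengths, cutoff):
    # Generator expressions compose without materialising
    # intermediate lists, so a pipeline like this can stream
    # through files larger than memory.
    return (n for n in lengths if n >= cutoff)

lengths = read_lengths(["ACGT", "AC", "ACGTACGT"])
print(list(long_seqs(lengths, 4)))  # [4, 8]
```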

So, Python it is... Maybe choosing a programming language is like marriage: better to have something you can tolerate for a long period with all its warts than something you're infatuated with.

> I have a way to justify my usage of Lisp in the future

Haha, OK. I found this paper really unconvincing. No mention of the terms "numeric" or "array", and the only mentions of "statistics" and "machine learning" are in the form of "Lisp would be a really great foundation for stats/ML if someone would just write a library for it".


Why would one need to mention numeric arrays? Lisp has all the usual capabilities in this area, and with a little effort is as fast as C, sometimes faster. See stats comments elsewhere in this thread.


Building a community around lisp is tricky. While there are many criticisms of this essay [1], some elements of the "Lisp Curse" ring true - each programmer tends to rewrite his/her own libraries, so a collaborative framework seldom emerges even when many programmers work in the same domain. This can be a disadvantage in a domain like the natural sciences where deep collaborations are necessary.

Furthermore, the disconnect with traditional mathematical syntax has also turned off many users. There was a predecessor to R called Lisp-Stat [2] which never achieved the acceptance that S/R did, and the Lisp-based syntax of Lisp-Stat was cited as one of the reasons.

Following the example of a high-profile Lisp supporter like Peter Norvig, who accepted Python as a "Lispy" alternative early on, I think there are a lot of us who have come to accept Python as a necessary compromise. While Python is not homoiconic, encourages functional programming through its libraries rather than its core language definition, and lacks the power of macros, its sheer popularity and all the benefits that come with it are tough to beat.

[1] http://winestockwebdesign.com/Essays/Lisp_Curse.html

[2] http://homepage.divms.uiowa.edu/~luke/xls/xlsinfo/xlsinfo.ht...


> each programmer tends to rewrite his/her own libraries and so a collaborative framework seldom emerges even while many programmers may work in the same domain

So, exactly like bioinformatics, then. Trivial tasks are supported by packages of libraries like BioPerl, but all the original research is sui generis, with the attitude in general being that each problem is unique and only results get published, so who cares if the code is a write-only mess as long as it does the one job after which no one will care about it anyway. In such an environment, the "Lisp Curse" already exists anyway, so why not get the benefit of a more powerful and expressive language when you're already paying its cost?

(Granted, Lisp for bioinformatics is never going to become a generally accepted thing. But if you're not releasing code anyway, no one cares what language you write it in...)


> who cares if the code is a write-only mess as long as it does the one job after which no one will care about it anyway

This is exactly the kind of attitude that projects like 'Software Carpentry' (https://software-carpentry.org) are pushing back against.


What about Bioconductor and Openbabel/Pybel?


R and especially Python didn't see a lot of use where I was; the former was regarded mainly as a chart generator, while the latter had one diehard fan (an engineer like me, not an investigator) and no further uptake. It has been a few years since I was there, though, and I haven't kept closely in touch, so perhaps things have changed in the interim.


> and the Lisp-based syntax of Lisp-Stat was cited as one of the reasons.

The syntax referred to is S-expressions. They are a simple and elegant way of writing trees. Lisp was originally supposed to have M-expressions, which would be more like mathematical notation, but it was found that Lisp users actually liked S-expressions.

The reason becomes obvious after some amount of Lisp exposure. S-expressions are so simple that before long one does not even see them any more. The parentheses give way and one can just look straight at the Lisp code itself. For this reason many Lisp programmers do not even edit the S-expressions. They instead use something like paredit to edit the underlying tree directly.

It's similar to how readers don't really "see" the letters that make up a word in many natural languages. And, further, they don't see the words that make up certain expressions. We have this fantastic ability to internalise language but so many are choosing not to use it when it comes to programming.


Until children are taught S-expressions in elementary school math, infix math notation will always be more familiar to the general population, and there will always be a learning barrier that must be overcome to gain similar skill with S-expressions, no matter how superior they might be.


But we are all using S-expressions without noticing. Which do you say: "Add 3 and 4" or "Take 3 and add 4"?

That is the same with most things, we say the verb before the object(s).


Your English notation is not really an S-expression. I would say it is similar in that the operation comes first. Even though people understand such notation for simple expressions, most people are not used to doing advanced (beyond elementary school) math using such notation.

I agree that the concepts behind S-expressions are simple and logical, but expressing an entire program using S-expressions is an additional thing that must be learned. Other languages chose a syntax that is closer to what people are used to and that helps with the adoption of the language, even if the language loses other capabilities.


Yes, that's right. And moreover, much math isn't infix: factorial is postfix. Sum, Product, and Integral are prefix. Infix is actually the exception, and is not semantically unique without parens anyway (languages have different orders of operation!), so folks should just get over the mindless prefix hate.


"What is 6+1*0-2/2?" memes/threads on Facebook have taught me that infix math notation is more or less completely unfamiliar to the general population. At least sexprs are fully parenthesized.


If you are talking about order of operations, all of the sixth graders that I teach know how to evaluate that expression. While it's true that I had to teach them * means multiply and / means divide, they can quickly pick that up. Our school is all honors and the students are above average intelligence, but they are not all math or tech geeks.

On the other hand, I have taught courses at the high school level at the same school based on "How to Design Programs" [1] using the Racket language. Although I can explain the concept of (+ 6 (* 1 0) (/ -2 2)) rather easily, the students make mistakes writing expressions for several weeks.

1: http://www.ccs.neu.edu/home/matthias/HtDP2e/
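To make the prefix-notation discussion concrete, here is a toy evaluator in Python that treats nested tuples as S-expressions (a sketch, not a real Lisp reader; the tuple encoding is my own stand-in for parenthesized syntax):

```python
import operator
from functools import reduce

# Map operator symbols to their Python functions.
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def evaluate(expr):
    """Evaluate a nested-tuple S-expression like ("+", 6, ("*", 1, 0)).

    Numbers evaluate to themselves; a tuple is (operator, *operands),
    folded left over the recursively evaluated operands.
    """
    if isinstance(expr, (int, float)):
        return expr
    op, *args = expr
    return reduce(OPS[op], (evaluate(a) for a in args))

# (+ 6 (* 1 0) (/ -2 2))  =>  6 + 0 + (-1.0)  =>  5.0
print(evaluate(("+", 6, ("*", 1, 0), ("/", -2, 2))))
```

Because the expression is fully parenthesized, there is no order-of-operations table anywhere in the evaluator; the tree itself is the order of operations.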


I wasn't entirely serious, and I believe sixth-graders can evaluate infix expressions with correct order of operations, but it was adults I was referring to, many of whom have long forgotten.


I can't take the infix arithmetic argument anymore. In any interesting piece of software, such arithmetic will only be 5% of the problem. You'll deal with many other concepts, combinators, morphisms, trees, vector spaces, geometry ... which are often functional in syntax (Ker <morphism>) ...

Plus, when you're used to trees, linearization and parsing, you can *fix the way you see fit: pre-, post-, or in- ...


Lisp-Stat (or XLispStat) didn't just use Lisp syntax, it was a Lisp.

I've used XLispStat, S-PLUS (a commercial version of S, the language R reimplements), and SAS. SAS had the most comprehensive statistical libraries, but I never liked the language. I liked XLispStat, but its statistical libraries weren't as good as those of SAS or S-PLUS. For data mining, I settled on S-PLUS, which was sufficiently lispy and had a REPL. So it wasn't just the syntax (or lack of it).

Beyond being dynamically typed and having a REPL, Python isn't particularly Lispy: it's less functional, and it has no (linked) lists or symbols. It's quicker to develop in than most of the non-Lisp alternatives, though.


You might be interested in Hy [1], which is a Lisp implemented in Python that can import Python libraries. I'm not sure if it cross-compiles to Python or just to Python bytecode, though.

1: https://github.com/hylang/hy


For a moment you had me wondering if there would be a way to use Hy and google's new Grumpy together, and then I dismissed the idea as probably utter madness. For a long time I've been pining for a Clojure-like lisp that can compile to machine code.


Why wouldn't it work?


I'm not certain that it wouldn't work, but it would have multiple compile steps: Hy => Python => Go => Native, which would either cause a lot of pain or make nigh impossible many of the advantages of using a lisp, like connecting a repl to a running process and being able to evaluate and rewrite code from there.


You could prototype in the Hy REPL then compile after.


IIRC it compiles S-expressions to the Python AST.

There's also Pixie, a python lisp to LLVM system, that was specifically made for high performance.



It seems so.. which makes my LLVM mention false. There's a talk by the author online http://www.youtube.com/watch?v=1AjhFZVfB9c

worth it


> I've found a compromise in Python. .... But I miss using Lisp. The style of programming when you have a REPL

95% of my python programming time these days is spent inside a repl.


What Lisp did you use?


Ira Kalet (RIP, 1944-2015) wrote a book using Lisp for the examples:

Principles of Biomedical Informatics, Second Edition, 2013 https://www.amazon.com/Principles-Biomedical-Informatics-Sec...


Surprised that Racket didn't get a mention in the paper, especially given the authors' interest in building DSLs

EDIT: (I consider Racket separate from Scheme)


What an awesome paper! It was during my time as a sysadmin at a DNA lab that I became really enamoured with lisp (and emacs), and functional, homoiconic languages as a whole, for many of the reasons listed. Great to read them all, and more, articulated so well.

BioLisp would be a great project.


I asked Bioinformatics MOOC teachers about non Python/Java languages in the field, naming Common Lisp and Haskell. I got a large NOPE from them, saying it was mostly for performance reasons. I specifically asked about the paradigm too, because the MOOC spent a large amount of time talking about control flags and temporary variables, which to me obscured the overall logic.. but alas.


Using lisp always seems attractive.

However, in bioinformatics, the reality is that many users are still using Perl. It would be great to have them migrate to modestly more readable languages.

Outside of scripting, alignment and other performance sensitive code is usually written in C (or C++). As with many academic projects, the code quality unfortunately is quite low.


I work in a lab (academic/genetics). We have a public facing website, so when we develop tools they go there. So we're using Perl, Java, R, Python, and PHP, apart from the stuff we compile from C.

Stringy code that's hard to decipher can be written in any language, I have discovered.

My take, Perl is kinda on the way out, though it was used extensively. BioPerl is a nice package.

Biologists like R; it's pretty quick and feels like they aren't programming. It can make graphs nicely.

Python seems to be the go-to compromise. But people start to want to use it for big things, and getting it to perform adequately requires a lot. There is confusion that scares researchers away: between Python 2/3, NumPy, PyPy. The BioPython packages are pretty excellent.

If it needs to be fast and crunch large sets of data (fairly common) the tool is in C or C++. We should start using Rust..

We use PHP (Silex) to deliver a front end and some quick database lookup and display. It's replacing Perl for this.


I've been in the field since 2001. I have seen exactly two Perl files in that time.


It really depends on the subfield. Sequence-based genomics is very highly Perl-dependent even today. BioPerl is pretty much the standard library for that even though BioPython, etc. are beginning to take over. Other subfields such as differential sequence abundance which are more about the statistics rather than the raw sequence tend to be R-centered thanks to Bioconductor.


I think by your definitions there are sub-subfields then ;) In my neck of the woods sequence based stuff has java & c doing a lot of the heavy lifting.


>modestly more readable languages.

This is definitely all relative, I would argue lisp is easier to read given that the philosophy of its syntax is that there is no syntax...

Though if you've only ever read classical languages I can see how lisp would seem alien


Lisp someone else wrote is harder to read than Perl you wrote, but any Lisp is easier to read than Perl someone else wrote.


My experience with Perl was that me-a-week-ago counts as someone else. That's not flippant; it's an accurate reflection of my experience that Perl tends to be write-only.

There are some people who do write beautiful Perl, but they are few and far between (again, in my experience).


> many users are still using Perl. It would be [hard] to have them migrate

I spent a year as a staff member of a genomics institute, and researchers occasionally came to me for help getting the local sui generis tooling stack set up and working sanely. (Well, I say "stack"; jwz's bookcase-from-mashed-potatoes metaphor springs to mind.)

Having seen the miseries they went through, I don't think it would necessarily have to be all that hard to make migration look good, especially to a language whose syntax is other than wildly irregular and whose performance is other than usually abysmal.


Java and .NET are also used a lot.


Python more (though I have had to refactor completely incomprehensible Python written by bioinformaticians).


Depends on the company.

We have some customers with zero lines of Python code, but they do use R and Tableau.

All the software used to talk to the devices and do ETL processing or graphical analysis is then written in Java or .NET, depending on the department and their set of OSes.


Usability matters.

Functional versus procedural programming is first and foremost a usability problem for developers. The two are interchangeable as far as Turing completeness goes.

I just spent a week at one of Jakob Nielsen's education courses on usability. I asked the executive vice-president at the company if they had branched into software development itself; sadly, the answer is still no.

Functional programming for the majority of developers is not usable. The article mentions they are targeting DSLs (domain-specific languages), which is fine. For example, Haskell seems to have found a boutique community in the async/MQ world. But from a usability perspective, one is severely limiting one's pool of developers by choosing a functional language.

It is fascinating from a human nature perspective that the brains of people who think in functional programming seem to be wired differently from those of procedural programmers. There is a very visceral reaction when either camp is asked to program in the manner they find least usable.

Ultimately I think compilers will get to a development stage where independent of functional programming or procedural programming approaches the same optimized code gets implemented under the hood.

Postgres was originally written in LISP. Ultimately Stonebraker had to make the call that if Postgres was going to get adopted widely, LISP had to go and it was rewritten in C/C++ before being open sourced. As a point of research you all might want to look into the Postgres experience.


The original paper is here: http://db.cs.berkeley.edu/papers/ERL-M90-34.pdf

There are other cases where an original Lisp implementation has been rewritten: ViaWeb and Reddit. I think the reason is that the teams which took over the projects were unfamiliar with Lisp.

My experience is different, though I should mention that I'm a sole developer. I'm roughly equally experienced in C and Lisp, but find I'm far more productive in Lisp. My first language was FORTRAN. Unlike C, Lisp is memory safe (you don't have to manage the memory yourself), interactive, and strongly typed; you never need to write a parser, the built-in symbol and list types are incredibly useful, and there are fewer "gotchas".

Lisp is multi-paradigm rather than purely functional, so like the Algol derivatives it has loops and destructive assignment.


Not trying to be negative, but Perl is still used far more widely than Lisp in bioinformatics; Lisp is negligible by comparison.

Check the status of BioPython and BioPerl. BioLisp? It doesn't even exist.



This is obviously an opinion piece disguised as a scholarly article. I'm not sure what the hidden agenda is, and the HN comments don't clarify matters much.

For example, this nonsense statement reveals the author's bias: "Clojure is a rising star language in the modern software development community." Huh?


In my opinion, it would have been better published as a blog post.


"Gat [3] compared the run times, development times and memory usage of 16 programs written by 14 programmers in Lisp, C/C++ and Java. Development times for the Lisp programs ranged from 2 to 8.5 h, compared with 2 to 25 h for C/C++ and 4 to 63 h for Java (programmer experience alone does not account for the differences). The Lisp programs were also significantly shorter than the other programs."

Could selection bias be skewing these results?

Proficient Lisp programmers can certainly create shorter and faster programs with Lisp. Who would ever contest that? Average programmers, on the other hand, can probably develop in a similar amount of time and write faster programs with C++ (given the amount of libraries and information available - in comparison to Lisp - and the ubiquity of tooling).


How about Hy? It's a Lisp that runs on the Python interpreter, and interoperates with Python. It's not extremely mature yet, but it is really cool!


It's pretty wild using numpy or pandas in Lisp. Hy is definitely cool.


   on average, the Lisp programs ran significantly faster
   than the C/C++ programs and much faster than the Java
   programs (mean runtimes were 41 s for Lisp versus 165 s
   for C/C++).
The only thing I can say is that their C/C++ code has serious problems.


Perhaps they didn't optimize.


As a Lisp fan myself, I wish they would show an example of how WITH-GENES or MAP-GENES encouraged encapsulation vs. objects or passing closures.


The section of the paper where this is mentioned offers a novel (at least to me) argument in support of homoiconicity, to wit, that you build DSLs directly into the base language, which (it is asserted) makes the DSL development process more flexible (or something), and that this is important for 'living' domains, where you're trying to work out the domain semantics, like biology. So, you're not really so much building a specialized language for X, as slowly turning (almost in the sense of wood turning, as on a lathe) Lisp into X-Lang.
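In the absence of the paper's actual code, a hypothetical Python analogue of a WITH-GENES-style construct (all names and the toy loader below are made up for illustration) can at least suggest how such a form encapsulates setup and teardown away from the caller:

```python
from contextlib import contextmanager

def load_genes(genome):
    # Toy loader; a real one would parse an annotation file.
    return [f"{genome}:gene{i}" for i in range(3)]

# Hypothetical stand-in for a WITH-GENES macro: the caller never
# touches the underlying resource lifecycle, so acquisition and
# cleanup live in exactly one place.
@contextmanager
def with_genes(genome):
    genes = load_genes(genome)   # acquire
    try:
        yield genes
    finally:
        genes.clear()            # release, even on error

with with_genes("ecoli") as genes:
    print(len(genes))  # 3
```

A Lisp macro can go further than a context manager (rewriting the body's code, not just wrapping it), which is the flexibility the paper is gesturing at.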


A concrete example.




