Transpiling between any programming languages (2019) (mongodb.com)
71 points by chrisseaton on Feb 16, 2020 | 50 comments



While it seems that their transpiler supports just simple JSON-like expressions, plus constructors for non-native literals (like big integers), and I don't think it could generalize to arbitrary code translation, I still appreciate the work they've done. Transpilation between arbitrary high-level languages is a very hard task, and I doubt anyone will achieve a good solution for it anytime soon.

I have a different solution in mind: designing a language specifically for transpilation, so that it could be used to generate high-quality, readable code for various programming languages (I call it "idiomatic translation"). I've recently started exploring that topic as part of the Zygote project [1].

[1] https://github.com/krcz/zygote


You should check out Haxe: it compiles to 10 source-level languages and a couple of VMs.

https://haxe.org/


The targets Haxe compiles to are all rather similar (object-oriented, mutable-state, imperative VMs). While that's cool and interesting, it's far from any language. I imagine adapting it to compile to something like Haskell would be less than straightforward, at least in a way that produces decent-quality, performant Haskell code.


Thank you, Haxe definitely has a place on my list of things to check out. What worries me, however, is that there are no examples of output code on their website.


The output is pretty slim; it's one of the reasons I prefer using Haxe over TypeScript when targeting the web. Haxe has better dead-code elimination, so the output JS tends to be smaller and more heavily optimised. Check out this example to see what I mean: https://try.haxe.org/#4f46D


True. I don't think the output will be particularly pretty, and it does assume the presence of its own stdlib. But hopefully it's still fun to see how exactly they do things under the hood.


The best solution is to make every language a compile target for a virtual machine or some CLR flavor.

Creating yet another "high level" language that compiles to all other languages would mean designing something that is the least common denominator of all languages. That's OK for a language that is not supposed to be written by hand, i.e. bytecode, but it's not OK for a language developers are supposed to write.


But in order to make it work you would need all the programs to start using the CLR. It's all or nothing. And if one of the languages stops being supported - e.g. stops receiving bindings for new libraries - you need to rewrite the whole codebase in something different. In the case of the "high level" language, you just need to transpile it to one of the supported languages and continue development in the output language.

> designing something that is the least common denominator between all languages

My vision is rather to find shared patterns and higher abstractions across various languages: e.g. an imperative loop summing the squares of all odd elements is semantically similar to a filter -> map -> fold functional chain. Building a whole language on such patterns will definitely be hard, but I believe it is possible and worth pursuing.
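
To make the analogy concrete, here is a minimal TypeScript sketch (the array xs and both variants are just illustrative, not taken from Zygote) of the two semantically equivalent forms:

    const xs = [1, 2, 3, 4, 5];

    // Imperative: loop, test, accumulate
    let sum = 0;
    for (const x of xs) {
      if (x % 2 !== 0) {
        sum += x * x;
      }
    }

    // Functional: filter -> map -> fold (reduce)
    const sum2 = xs
      .filter(x => x % 2 !== 0)        // keep the odd elements
      .map(x => x * x)                 // square them
      .reduce((acc, x) => acc + x, 0); // sum them up

    console.log(sum, sum2); // 35 35

An "idiomatic" transpiler would need to recognise the first form and emit the second when targeting a functional style, and vice versa.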


Is that so different? The natural next step is implementing the reverse, other languages to yours, and then suddenly yours is an intermediate language that can be used to go from one language to another.


That might not be easy, for the same reason you can easily transform vector graphics into raster graphics but not the other way around, or why decompiler output doesn't look nice: you lose information during the translation process. You can easily transpile TypeScript to JavaScript, but that's not true for the reverse process.


The reverse process should be easy! Since TypeScript is a superset of JavaScript, transpilation is an identity transform. :)


I did something similar in my Bachelor's final project. A manufacturer of industrial printers wanted to generate API bindings in multiple languages from a single source of truth.

We started with just transpiling data definitions based on Apache Thrift [1], but later expanded it to also be able to transpile code checking invariants on the data and integration tests.

JetBrains' Meta Programming System (MPS) [2] was a really great tool for this, as it does a lot of what is mentioned in the article for you. It makes it really easy to define new nodes, syntax rules, parsing rules from text, AST-to-AST transformations, etc.

[1] https://thrift.apache.org/

[2] https://www.jetbrains.com/mps/


This seems like a really hard problem to solve - but it's some pretty cool progress. I do wonder how Mongo decided building a high-level transpiler was worth their time though. Is the business value of translating those queries really that high?


I think potentially very high.

There's a subset of operators that are similar (but not identical) between programming languages. For instance you can add numbers like

   const a = b + c
in javascript

   int a = b + c
in java, etc, you can even write

   (b+c) as a
in the select statement of a SQL query. If you look at basic operations on numbers and strings it is not hard to translate expressions between conventional languages and query languages. Now query languages have different ideas about control structures, but working at the expression level you can mostly avoid that.
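
As a rough sketch of how small the gap is at the expression level, here is a toy TypeScript printer (the Expr type and emit function are invented for illustration, not from the article) that emits the same addition for both a JavaScript and a SQL target:

    // A toy expression AST covering just variable references and addition.
    type Expr =
      | { kind: "add"; left: Expr; right: Expr }
      | { kind: "ref"; name: string };

    // For this shared subset the same pretty-printer works for both targets.
    const emit = (e: Expr): string =>
      e.kind === "add" ? `(${emit(e.left)} + ${emit(e.right)})` : e.name;

    const sum: Expr = {
      kind: "add",
      left: { kind: "ref", name: "b" },
      right: { kind: "ref", name: "c" },
    };

    console.log(`const a = ${emit(sum)};`);   // JavaScript: const a = (b + c);
    console.log(`SELECT ${emit(sum)} AS a;`); // SQL: SELECT (b + c) AS a;

The differences only show up once you leave that shared expression subset.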

Now there are issues around data types (are we adding ints or floats, what to do about overflow, and when you index the Nth element of a string, is it 8 bits (maybe part of a Unicode char), a basic-plane Unicode char, an arbitrary codepoint, or something else...), but those problems don't turn up 100% of the time when you try to translate.


The data type and edge case issues have to be considered 100% of the time unless the transpiler can reliably determine through static analysis that they aren't a factor.


As a Google Summer of Code project, I worked on transpiling Maxima CAS to Python, both high-level languages. The approach I followed involved converting to an internal, custom-defined IR, and then converting that to Python code.

The full project report: https://gist.github.com/LakshyAAAgrawal/33eee2d33c4788764087...


I’m wondering whether, given a large enough corpus, you could use the same techniques used for NLP translation.


If you heavily modify them, that might work quite well. Every machine translation model contains language models for the source and target language that specify what a valid sentence in each language looks like. In NLP translation, they are approximations created by ML. Our models for programming languages are much better: we have the exact syntax, and type systems should also help.

However, AFAIK modern MT systems are pretty much black boxes, and the language models cannot be separated from the rest of the model. In any case, it wouldn't be as easy as plugging a compiler into the system to check whether the output compiles.


Of course. I hope you don't want to use the output, though.


Convert the tests by hand, then convert the language by NLP. Then go back and fix breakages in the test suite...


I came close to this when I wrote "Running PHP in Javascript" [1]. I'm so happy this article includes images and more in-depth technical explanations, since I basically hacked it together without understanding half of it (for fun!).

[1] https://francisco.io/blog/running-php-in-javascript/


If you believe that “transpilation between any languages” is a thing, chances are you’re observing your local blub paradox. That said, most practical and widely practiced languages are really interchangeable, because they bring nothing new beyond the syntax, minor runtime API differences, and fresh names for old subroutines. Heck, this makes me want to work on my custom best language again.


Looks like they read the Watt & Brown book.



That.


What is the use-case for these transpilation features? Who would want to convert queries from an arbitrary language to another one?


General transpilation is logically impossible.


Any language that can be compiled for x86 can be transpiled to any language in which an interpreter of x86 opcodes could be written. The question is what level of abstraction can be preserved; if it requires lowering that far, it's not likely to be worth it.


Well, for that matter any Turing complete language can embed an interpreter/compiler for any other Turing complete language. If that's all that is meant by transpiling, it is no different than something like the JVM.

My understanding of 'transpiling' is that it is meant to convert, say, idiomatic Python into idiomatic Java.


You might as well call JNI and extern calls transpilation too.


Why? (It sounds like the sort of thing you'd try to reduce to the halting problem, but I'm not sure how.)


Isn't it a bit like when the jokes go missing in a translation of human language?


That’s a practical difficulty, not a logical impossibility.


It's logically impossible if the joke is a Gödel sentence for that language!


Here's an example in Python (gen_expression here is just a stand-in; in general it could be arbitrarily complex code):

    def gen_expression():
        # stands in for arbitrary code that builds source text at runtime
        return 'print("Hello")'

    print_hello = gen_expression()
    eval(print_hello)

The result of the program is "Hello", but the only way a transpiler could reproduce this is by knowing the expression generated by 'gen_expression' and converting that expression to the target language (assume the target language also has an equivalent of 'eval'). But then the transpiler needs to know that 'gen_expression' will halt, and we have the halting problem.


If the target language has its own eval, why would the transpiler need to run gen_expression at transpile time? It sounds like the transpiler in question is trying to specialize the eval invocation for its input, which is indeed not possible in general. Normal transpilation (as with compilation in general) just converts code that may or may not halt to other code that may or may not halt, without needing to determine AOT what's going to result at runtime. The straightforward way to transpile evalX(langX)->evalY(langY) is to emit evalY(transpileXY(langX)), with the transpiler available at runtime along with the evaluator.


But then aren't you making the target version aware it has been transpiled? Once you go that route, it becomes a back-and-forth of hacking around whatever trick is used for that situation. I cannot think of a general fix that doesn't result in having to solve the halting problem.

As I said, it is logically impossible in general, but you can perhaps restrict the problem to a narrower, practical case in which transpilation is always possible.


> But then aren't you making the target version aware it has been transpiled?

I'm not sure what this means.

If you take some code that generates langX at runtime, and compile/transpile that code to langY, you get some langY code that generates the same langX code value. So you need an eval(langX) that can be invoked from langY. Given an X-to-Y transpiler available at runtime, you can create the needed evaluator by composing the transpiler with langY's eval(langY) function, using ordinary function composition.

I may be explaining this poorly (or misunderstanding you?), but I'm quite sure about the conclusion I'm trying to express. As a concrete example, you can generate some TypeScript at runtime, transpile it, and pass it to the JS eval function. The transpiled-at-compile-time code halts iff the pre-transpilation version would halt, as does the transpiled-at-runtime code; the transpiler itself never needs to decide halting. https://stackoverflow.com/questions/45153848/evaluate-typesc...
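
For instance, a minimal sketch of the evalY(transpileXY(langX)) shape, assuming the typescript compiler package is available at runtime (the generated source string is just an example):

    import * as ts from "typescript";

    // langX: TypeScript source produced at runtime (stands in for gen_expression()).
    const tsSource = `const greet = (name: string) => "Hello, " + name; greet("world")`;

    // evalY(transpileXY(langX)): transpile to JavaScript, then hand the result to JS eval.
    const jsSource = ts.transpile(tsSource);
    console.log(eval(jsSource)); // "Hello, world"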


Let's say we have a language where each function can return a string representation of itself, e.g. print.str() = "print", and gen_exp() uses this feature for at least some of its operation. Once we transpile, the resulting expression will no longer run in the target language, even with the inner call to transpile. Maybe there is an easy solution to this issue too, but it seems I can keep coming up with new complications the transpiler will need to account for, so I can't see how transpilation can be guaranteed in general. We would need a proof that no counterexample is possible for a given transpilation scheme to guarantee generality.
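
For comparison, JavaScript/TypeScript already has such a feature via Function.prototype.toString; here is a minimal sketch (the transpileToTarget helper mentioned at the end is hypothetical):

    function greet() {
      console.log("hello");
    }

    // The function can hand back its own source text...
    const source: string = greet.toString();

    // ...and re-run it through eval in the same language:
    eval(`(${source})()`); // prints "hello"

    // If greet's body were transpiled to another language, the string returned by
    // the toString() analogue would be source in that other language, so it would
    // have to be transpiled again at runtime before it could be evaluated there,
    // e.g. eval(transpileToTarget(source)) with a hypothetical runtime transpiler.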


A transpiler is a compiler. There's nothing that can be compiled but can't be transpiled, because transpiling is just what we call compiling when the target is subjectively deemed a high-level language. So the problem you're describing isn't with transpilation; it sounds like it's with compiling a dynamic language with reflection. That's a tractable problem; you just can't solve it as statically as you would for an AOT-compiler for a more analyzable language. Reflection has to be as late-bound as the rest of the language.


Right, the problem applies to compilation as well, which is why not all languages can be compiled. Compilation in general is undecidable.


Sounds great, doesn't work. If it worked, we would have perfect translations for spoken languages already. And Google (Translate), with its massive computing, financial, and brain power, still hasn't figured it out.


Spoken languages have the problem of ambiguity, context, and meanings that shift over time and between people. Programming languages are pretty structured and well defined.


First we'll figure out translating Java to Ruby, then English to Japanese, and only then we'll be ready to translate Perl :D


On the other hand, most spoken languages use similar underlying concepts (with some exceptions, e.g. translation to a language without subjective direction terms like "left" or "right" might be tricky). Programming languages can have completely different execution models, especially when you compare across paradigms. Translating an imperative for loop into a map/filter/fold chain in a functional language would have to be very non-local.


Yes, it would probably produce a lot of unidiomatic code (see for example c2rust [1], which generates semantically equivalent Rust code, but with raw pointers and unsafe everywhere). In the worst case you could even just emulate the source language in the target language; everything is Turing complete, after all.

With enough effort and smart transformation passes, however, I think you could come pretty close to something resembling "native" code. It might not always be worth spending that development time, though.

[1] https://c2rust.com/


On the contrary, it's trivially possible due to Turing Completeness.

Programming languages aren't languages. They're arithmetic calculations wearing linguistic clothing.

Making it idiomatic and fast is very, very difficult. Achievable for a subset focused on querying MongoDB? Absolutely.


Compilers obviously don't work, because, if they did, then we could automatically turn written English into machine language.


Programming languages and natural spoken languages are very different.


BTW, any programming language construct can be expressed in natural language (maybe unidiomatically). The reverse isn't true.

Maybe one day the field of literate programming will come to maturity.



