JavaScript-Is-Weird as a compressor (github.com/mgarciaisaia)
111 points by mgarciaisaia on Oct 16, 2023 | 41 comments



Just for fun, I decided to pass the output.js file through Google Closure Compiler's advanced optimizations. It does a surprisingly good job at reconstructing part of the strings.

  % npx google-closure-compiler -O ADVANCED --js output.js

  (()=>{})["co"+(1/0+[])[4]+"structor"]("co"+(1/0+[])[4]+...
Not pasting the full thing. But it reduces the output.js file from ~118 KiB to ~9.92 KiB, which is pretty good!

There is technically not much stopping the compiler from inferring that 1/0 === Infinity, recognizing that (1/0+[])[4] is free of side effects, and eventually concluding it's safe to substitute the whole expression with "n". Google Closure already has optimizations for string concatenation, so if it could perform an optimization pass over the Infinity value, it would also be able to emit the string "constructor" instead of "co"+(1/0+[])[4]+"structor".
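
For reference, the folding it would have to prove safe is just ordinary JS evaluation (a quick sketch you can paste into Node or a browser console):

  1/0;                                // Infinity
  1/0 + [];                           // "Infinity" (the array coerces to "", so this is string concatenation)
  (1/0 + [])[4];                      // "n", the fifth character of "Infinity"
  "co" + (1/0 + [])[4] + "structor";  // "constructor"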


  $ wc -c output.js
  117708 output.js
  $ uglify-js output.js -c unsafe | node
  Hello world!
  $ uglify-js output.js -c unsafe | wc -c
  2740
  $ uglify-js output.js -c unsafe | cut -c 1-60
  (()=>{}).constructor("conso"+"".constructor["from"+(()=>{}).
  $ uglify-js output.js -c unsafe | gzip | wc -c
  192


Interesting. I wonder if an even more effective approach to deobfuscation would be an "anti-JIT" compiler; since what we're interested in is the actual execution flow, can we leverage the browser's optimization engine to pull that out for us?

Do the various JIT engines use an intermediate representation (IR) and what does it look like?


V8 starts by interpreting bytecode (Ignition), then hot code gets tiered up to a non-optimising JIT without an IR (Sparkplug), and even hotter code goes to an optimising JIT with an IR (TurboFan).

That way it can start executing quickly and not waste time compiling/optimising things that only run a few times.
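
You can peek at the Ignition tier from Node, which passes V8 flags through (a rough example; the dump is verbose and its format changes between V8 versions):

  % node --print-bytecode -e 'function add(a, b) { return a + b }; add(1, 2)'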


> npx google-closure-compiler -O ADVANCED --js output.js

Had no idea about this command!


is () syntactically important or could you use _=> to save a char?


I'm actually surprised at how insanely terrible the "compression" is.

Looking at the table at the end, I'm not surprised at all that the "weird" obfuscated code is ~2000x the size of the original source.

But I am surprised that the gzipped weird code is still ~25x the size of the original source, as opposed to ~0.25x for gzipping the original source.

After all, the amount of information in the weird code should still ultimately be approximately the same as in the original source code, right? Or maybe double, or something like that. I'm very surprised it's twenty-five times as much.

The only reason I can guess is that the "weird" process produces information structures that are represented in an extremely hierarchical way, while gzip is built for stream compression and is unable to find/represent/compress hierarchical structures?

And if that's the case, it makes me wonder if there are any compression algorithms that can handle that better? Ones based less on "dictionary words/sequences" and more on finding "nestable/repeatable syntax patterns"?


Compression is all about detecting patterns in the data, usually with some kind of sliding window of past data, used to predict future data. For gzip in particular, this sliding window has a fixed size of 32KB.

If the "weird" version is ~2000× the size of the original source, gzip would need a sliding window of ~2000× the size (about 64MB) to obtain equivalent compression.


Also, an encoder like this can inject any amount of entropy (e.g. replace "0" with "100500999-100500999"), which gzip can reduce to "100500999-same" but, without knowledge of JS semantics, has no chance of reducing to either "0" or the nearby "172913211-172913211".


> For gzip in particular, this sliding window has a fixed size of 32KB.

I'm surprised people keep reaching for gzip for this sort of comparison. LZ4 and zstd compress better at the same compression speed, and zstd and brotli compress much better overall. Brotli is also supported in all modern browsers.

There's basically no reason to use gzip in new software. (With the exception of compressing for the web - at least until zstd makes it into browsers.)


In this case, OP explicitly compiled JavaScript for use in web browsers, so in my opinion reaching for gzip is not so strange. There is no browser support for zstd yet except behind a feature flag in Chrome (and in curl, but that's not a browser). Also, zstd was explicitly designed to compress as well as deflate but much faster. You can also tune zstd to compress 'better', but that costs some CPU time again, and that is mostly a different use case.


Brotli is a much better choice for compressing JavaScript for the web. It’s supported by every browser (and curl I think), and the compression ratio is way better. The only downside is that you probably want to compress once instead of on every request, because brotli compression is slow. Easy enough to set up though.
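
Node exposes brotli through the same zlib module, so comparing it with gzip on the weird output takes a few lines (a sketch, assuming output.js sits in the current directory):

  const fs = require('fs'), zlib = require('zlib');
  const src = fs.readFileSync('output.js');
  console.log('gzip  :', zlib.gzipSync(src).length, 'bytes');
  console.log('brotli:', zlib.brotliCompressSync(src).length, 'bytes');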


It would be interesting to see what better compression schemes could do here.


> After all, the amount information in the weird code should still ultimately be approximately the same as the original source code, right?

Taking a very high-level view of this: the compressor can't know the rules that distinguish a .weird file that is valid .js from a byte sequence that is not, and neither can it distinguish byte sequences that are valid .js, look like .weird, and actually encode some .js file from those that do not.

The compressor builds a more or less accurate model of observed probabilities, but even the most optimal encoding would still have to keep escape hatches around for all other bit sequences. Those escape hatches are not free.


~100x compression (from weird to weird.gz) is really good, no matter the input. Simple compressors like gzip are limited in what they can do: they have size-constrained dictionaries, cannot emit symbols of less than one bit, etc. Gzip works fine with real-life data, and does it in a reasonable amount of time using a reasonable amount of memory, but it may not fare well with a torture test like this.

New, fancy, and really slow compressors with their contexts, arithmetic coders and neural networks will give a much better result, maybe close to ideal, but it will take orders of magnitude more time and use up GBs of memory doing so.


For comparison, 7-zipped modernizr.weird.js is only 17.85 KB, which is almost exactly 2½x the size of the original.


Ah thank you! That feels far more sane.

I don't think I've ever seen such a dramatic difference in compression rates before. It's fascinating to see.


1) The title is clickbait, but

2) Thanks for leading with an example of a negative result! That's what any researcher faces every day, unlike what gets published, after all


I won’t quote Edison here, but I totally agree.


My view of the author changed somewhat when I reached the section where he had to ask ChatGPT how to read command line parameters.


Author here :)

I went that route since reading the CLI parameters was something I needed to do but wasn't actually part of the issue I was trying to solve. I could have spent a few cycles of googling+testing+fixing/refactoring, but it was a trivial task (one I just didn't remember how to do off the top of my head), and that would have distracted me from the main task.


I don't remember it either, because I rarely write command line utilities.

Asking ChatGPT is just like looking up documentation, just faster ...


Disagree. I can be more or less certain that documentation authors or write-ups have a reasonable degree of knowledge about the topic I'm researching, because 1) they need accurate and efficient code that makes sense, to establish a relationship with the reader and compete with other solutions, and 2) humans are bad at explaining things they don't know about, as opposed to executing or managing unknowns.

ChatGPT? None of these. It's a giant probability calculator that wants my prompts and its own answers to feed itself, which also nurtures its inaccuracies and makes it easier for it to confidently lie. I hardly find any reasonable use for ChatGPT other than quick completion suggestions of at most 2-3 lines, because that's what it's designed for.


You know that the GPT is trained on data including that documentation, right?

If you don't like ChatGPT, that's fine. But it's clearly useful to many people, and you aren't entitled to feel superior for not using it.


> you aren't entitled to feel superior for not using it

I will break the HN spirit, but you kind of wanted to say that there. I didn't imply that at all, nothing near it. The dataset is just, IMO, too big to give extremely accurate (or more-accurate-than-human-thought) answers to specific questions.


I read that as satire, given the current state of LLMs like ChatGPT.


But also, this is the area where an LLM shines: tasks that you technically could solve yourself but are so boilerplate that you should spend your brain power elsewhere. Then again, here it looks like not much brainpower got put into this project at all (and that's OK, sometimes it be like that).


Or the command line is too complex and/or poorly documented, and ChatGPT was quicker at solving it?


It's not a huge surprise that gzip as a general compression algorithm didn't compress this down any further. I do wonder about a format that was specifically trained on these specific characters though, and the patterns that tend to emerge from the weird compiler. Maybe the chunks at a certain scale would be predictable and thus compressible.

Of course at that point you're probably more interested in a common binary format, and should start thinking about wasm instead.


The information content of the weird program is the same as that of the original cleartext, so I wouldn't expect zipping a transpiled program to compress better than compressing the original.


Another comment pointed out that there is more information or entropy in the weird version if you don't happen to be a js interpreter.

The compression algorithm knows that 165427-165427 is really 165427-same, but does not know that it's really 0 and the same as all the other things that resolve to 0.

There must be a lot of similar things like that in this particular case that relies on knowledge of the rules of js.

I guess it's tempting to think next about adding AI to a compressor so that it could know the actual rules of JS and refactor it, but I just think of the tragedy that is JBIG2, which does something like OCR as part of the compression, except it sometimes gets it wrong, and the original data is lost forever without a trace. And it's built right into some scanners and happens before the compressed output even leaves the device. The user never sees anything else; there is no better uncompressed reference version anywhere if the user didn't know about the problem and override some default settings.


Information content isn't really a well-defined concept for finite-length strings.


Maybe OT, but does anyone know if there is runtime analysis tooling for JS and/or wasm?

Obfuscation is a serious threat to the open web, and things like fingerprinting can be incredibly invasive.

Web browsers typically only support static "prettifying" (i.e. auto-indent). I've seen websites probe for Chrome extensions, canvas, and all kinds of APIs. Deobfuscators are often not enough to restore legibility. (I assume disassemblers are similar, but I've never tried.)

I would love to have a trace/intercept/breakpoints of any external APIs called in order to restore a sense of control over what code websites run. Ideally, integrated in the browser. With WASM gaining popularity, this will become (much) more important, imo.
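
Until that exists in the browser itself, a crude version is to wrap the APIs you care about before the page's scripts run (a sketch for a console or user script; here one of the canvas readback calls that fingerprinting code typically relies on):

  // Log every canvas pixel readback and keep the original behaviour.
  const origToDataURL = HTMLCanvasElement.prototype.toDataURL;
  HTMLCanvasElement.prototype.toDataURL = function (...args) {
    console.trace('canvas toDataURL intercepted', args);
    return origToDataURL.apply(this, args);
  };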


Author here! I've just enabled Issues & PRs in the repo if you want to chime in there.


It might be that gzip isn't actually a good format with which to try to compress this data. I would think a compression algorithm that expects a rather large 8-bit character space wouldn't be very suitable for a roughly 4-bit space.
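
That's testable: pack the weird output's tiny alphabet into 4-bit codes before gzipping (a sketch, assuming the output really uses 16 or fewer distinct characters; the nibble packing here is my own illustration, not anything from the article):

  const fs = require('fs'), zlib = require('zlib');
  const src = fs.readFileSync('output.js', 'utf8');
  const alphabet = [...new Set(src)];
  if (alphabet.length > 16) throw new Error('alphabet too big for 4-bit codes');
  const code = new Map(alphabet.map((ch, i) => [ch, i]));
  const packed = Buffer.alloc(Math.ceil(src.length / 2));
  for (let i = 0; i < src.length; i++) {
    // two characters per byte: even index in the high nibble, odd in the low
    packed[i >> 1] |= code.get(src[i]) << (i & 1 ? 0 : 4);
  }
  console.log(zlib.gzipSync(Buffer.from(src)).length, zlib.gzipSync(packed).length);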


I got a little nerd-sniped by this. I wrote a program that takes an input (/usr/share/dict/words on Debian 12, specifically) and compresses it with gzip and Zstandard, and also expands its binary representation (by replacing each 1 bit with A and each 0 bit with B) and compresses that. The result is:

    gzip    expgz   zstd    expzstd original
    263120  421278  252449  389004  985084
So unlike in the article, expansion + compression is at least better than the original. The ratios (smaller is better; the opposite of how most compression algorithms advertise their performance, but what the original article used) are: 0.27, 0.42, 0.25, 0.39. gzip and Zstandard aren't a lot different in either case. Whatever patterns the JavaScript weirdifier produces are less obvious to compression algorithms in general than plain bitwise substitution.

Here's the program if you want to look for bugs: https://go.dev/play/p/wwNXVzO2TO-
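
For anyone who doesn't want to click through, the expansion step described above looks roughly like this in Node (a sketch, not the linked Go program; assumes a machine with /usr/share/dict/words):

  const fs = require('fs'), zlib = require('zlib');
  const data = fs.readFileSync('/usr/share/dict/words');
  let bits = '';
  for (const byte of data) {
    // every 1 bit becomes "A", every 0 bit becomes "B" (8x expansion)
    bits += byte.toString(2).padStart(8, '0').replace(/1/g, 'A').replace(/0/g, 'B');
  }
  console.log(data.length, zlib.gzipSync(data).length, zlib.gzipSync(bits).length);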


This is way more fun than the "base64->gzip" algorithm that so many of us have tried upon learning about compression...


Using `xz -9 -e` results in pretty reasonable compression, in comparison:

     19648  dommy-2.0.js
     28092  dommy-2.0.weird.js.xz
    115023  lodash-4.17.15.js
    138808  lodash-4.17.15.weird.js.xz
      7114  modernizr-custom.js
     11148  modernizr-custom.weird.js.xz


> So, yeah - this isn't a good idea. If the Weird transpiler only changes the encoding of each character with a really weird equivalent, it makes a lot of sense that it doesn't compress better than the source one - the ideal scenario would be to compress the same.

If your results disagree with your premise in unsurprising and obvious ways - please start your article with that so I can stop reading it.


and Golang is a clever way to build up an AI


Ahaha, downvoted out of sight, but I'm pretty sure I read this sentiment in some other comment on this site.



