Reverse engineering programs with unknown instruction sets (2012) [pdf]

tempodox · on Jan 27, 2023

Stuff like that is definitely fun. In the 1990s I bought a Sharp PC-E500S pocket computer and hacked the CPU's instruction set. With no internet and no documentation about the processor, I invented my own assembler syntax for the instructions. The assembler, disassembler, and hex monitor (all written in Basic) are still working to this day.

intelVISA · on Jan 27, 2023

Lovely, you should document your stories; that sounds impressive!

fallat · on Jan 27, 2023

Please please write about the whole process :) I'd love to read it!

lloydatkinson · on Jan 27, 2023

You should post that online. I'm sure people, including me, would love to read it.

tempodox · on Jan 27, 2023

All my notes at the time were made with pencil on paper. Even if I could find them, I'm not sure they would still be readable. The Basic programs could only be copied by re-typing them manually on a contemporary computer. Presenting this pre-internet stuff on a website would just be too much work, sorry.

codetrotter · on Jan 27, 2023

I understand and sympathise with that.

If you do find the documents though, please consider just scanning them and uploading them to Internet Archive and posting the links to HN. That way someone else in the future can find it and decide if they want to do the manual re-typing etc themselves :)

hasmanean · on Jan 27, 2023

That just makes it a meta challenge…for some unknown engineer who wants to reverse-engineer an engineer’s program that reverse-engineered a program with an unknown instruction set.

msm_ · on Jan 27, 2023

Shout out to the CPUAdventure challenge from DragonCTF 2019, which was basically this. If you like the slides, you should find this writeup entertaining: https://www.robertxiao.ca/hacking/dsctf-2019-cpu-adventure-u...

thrdbndndn · on Jan 27, 2023

Thanks, this is much easier to understand than a slide (without presenter).

Dr_Jefyll · on Jan 28, 2023

Probably the second-best fun I ever had was reverse engineering a discrete-TTL processor and the firmware written for it. These were embedded in some Xerox Diablo daisy-wheel printers dating from the latter half of the 20th Century. And the best fun I ever had was hacking that code to better suit the unique needs of my customer!

I wrote about the Diablos and their multi-axis realtime motion control here [1]. The good stuff about the hacking starts just over halfway down the page, "the Diablo proprietary processor."

HN has honored me in the past by recognizing other items on the site, such as "One-Bit Computing at 60 Hertz" [2] and "the KK Computer - a radical 6502 redesign" [3].

[1] https://laughtonelectronics.com/oldsite/comm_mfg/commercial_...
[2] https://laughtonelectronics.com/Arcana/One-bit%20computer/On...
[3] https://laughtonelectronics.com/Arcana/KimKlone/Kimklone_sho...

kijiki · on Jan 27, 2023

Also enjoyable, reverse engineering the Transmeta Crusoe's internal VLIW instruction set: https://www.realworldtech.com/crusoe-intro/

I suspect the Anonymous author might have gotten a tip or two from a friendly Transmeta hardware or software engineer.

skissane · on Jan 27, 2023

I wonder what the mystery instruction set in the slides actually is? (Assuming it is a real instruction set and not just something made up to demo the idea.)

gwern · on Jan 27, 2023

It's a reverse-engineering conference presentation by 2 Russian authors who highlight that they aren't providing any details about the context despite the obvious extreme relevance, and where their solution does not handle any obfuscation at all. So they are probably not decompiling APT malware running in nested VMs, but I'm going to guess reverse-engineering old highly-secret Russian military hardware where the only docs are high-level ones about the usage and repair, not what the chips are doing, and where the contractor wants to bugfix or develop new versions but needs to understand all the inner logic and what empirical ad hoc corrections it might be incorporating through the wisdom of long-dead Russian mega-brain engineers.

olivierduval · on Jan 27, 2023

Amazing!!! Looks a lot like breaking a cypher, with the added specifics of processor knowledge!

egberts1 · on Jan 27, 2023

I once wrote a detector of 38 known machine languages.

Akin to an expansion of the UNIX file command.

It would list known machine code(s) encountered at least within 4 bytes (in probability order).

Good times, good times.

(oh, sadly, not open source, but proprietary; I still do wish I could release this gem.)

unwind · on Jan 27, 2023

In what context was that used, if you can elaborate?

egberts1 · on Jan 27, 2023

Like the UNIX file command, it lists out what the file content probably is/are.

It can also break down the file in question by regions and group such data content into most probable types … for each region.
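The idea of ranking likely machine languages per region can be illustrated with a toy sketch. This is not the tool described above, which is proprietary; it is only a guess at the general approach, assuming a handful of well-known prologue and return byte patterns as signatures. The names `SIGNATURES`, `rank_architectures`, and `rank_by_region` are hypothetical.

```python
from collections import Counter

# A few common prologue/return encodings. A real detector would need far more
# patterns plus statistical scoring; this table is only illustrative.
SIGNATURES = {
    "x86 (32-bit)": [bytes.fromhex("5589e5"), bytes.fromhex("558bec")],  # push ebp; mov ebp, esp
    "x86-64": [bytes.fromhex("554889e5")],                               # push rbp; mov rbp, rsp
    "ARM32 (little-endian)": [bytes.fromhex("1eff2fe1")],                # bx lr
    "MIPS (big-endian)": [bytes.fromhex("03e00008")],                    # jr $ra
}


def rank_architectures(data: bytes):
    """Return (architecture, hit count) pairs, most hits first, zero-hit entries dropped."""
    hits = Counter()
    for arch, patterns in SIGNATURES.items():
        hits[arch] = sum(data.count(p) for p in patterns)
    return [(arch, n) for arch, n in hits.most_common() if n > 0]


def rank_by_region(data: bytes, region_size: int = 4096):
    """Split data into fixed-size regions and rank each one separately."""
    return [
        (offset, rank_architectures(data[offset:offset + region_size]))
        for offset in range(0, len(data), region_size)
    ]


# Hypothetical blob: one 16-byte x86-64 region followed by one 16-byte MIPS region.
blob = bytes.fromhex("554889e5") + b"\x90" * 12 + bytes.fromhex("03e00008") * 4
for offset, ranking in rank_by_region(blob, region_size=16):
    print(hex(offset), ranking)
```

Fixed-size regions are the simplest choice, but a signature that straddles a region boundary is missed, and short patterns will produce false positives in arbitrary data.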

As to its final application, that is not in my contract/task description.

unwind · on Jan 27, 2023

Okay, thanks.

Yeah I'm very familiar with 'file', I just wondered in what context one needs the ability to identify 38 machine languages, i.e. why does an organization deal with files containing unknown machine code, and have the need to identify them?

Sounds like maybe reverse engineering/security "research"-oriented work, perhaps.

egberts1 · on Jan 27, 2023

I was basically leveraging my eidetic memory of opcodes and operands and their bitfields.

It all got started with writing pure assembly for the MOS 6502 (for arcades) and PDP-11, then eventually ended with ARM/RISC/MIPS. The most esoteric one is the Transmeta VLIW (TMS3200-02).

And someone asked for one (internally).

tom_ · on Jan 27, 2023

Previously on HN, possibly not unrelated: https://news.ycombinator.com/item?id=25115916

stuckkeys · on Jan 27, 2023

Is the site decompilation.info down? Cannot access it.

serhack_ · on Jan 27, 2023

It seems so

amelius · on Jan 27, 2023

But what if the CPU assumes the instruction stream is compressed?

gus_massa · on Jan 27, 2023

In slide 9, they show the frequency of each 16-bit value. In compressed code, the frequency of each value should be almost equal.
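One way to put a number on that observation is the Shannon entropy of the 16-bit word histogram: skewed frequencies give low entropy, near-equal frequencies give high entropy. The following is a minimal sketch, not the method from the slides. It assumes little-endian 16-bit words, and the "code" sample is made up of eight invented opcode values with skewed weights.

```python
import math
import random
import struct
import zlib
from collections import Counter


def word_entropy(data: bytes) -> float:
    """Shannon entropy, in bits per word, of the little-endian 16-bit words in data."""
    n = len(data) // 2
    if n == 0:
        return 0.0
    words = struct.unpack(f"<{n}H", data[: n * 2])
    counts = Counter(words)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


# Hypothetical stand-in for uncompressed code: eight invented opcodes, skewed usage.
rng = random.Random(0)
opcodes = [0x1000, 0x2000, 0x3000, 0x4000, 0x5000, 0x6000, 0x7000, 0x8000]
weights = [40, 20, 15, 10, 6, 4, 3, 2]
words = rng.choices(opcodes, weights=weights, k=20000)
plain = struct.pack(f"<{len(words)}H", *words)

# The same bytes after compression: word frequencies flatten out.
packed = zlib.compress(plain)

print(f"plain:  {word_entropy(plain):.2f} bits per word")
print(f"packed: {word_entropy(packed):.2f} bits per word")
```

For short inputs the entropy is capped by the number of words in the sample, so it is safer to compare two measurements than to rely on an absolute threshold.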

10 or 20 years ago, when reverse engineering any unknown file, it was good to assume it was not compressed, and you could get some insight by looking at it in a hex editor and hoping for the best. Now many are compressed, so a good first step is to change the extension to .zip and try WinRAR (or look for a header if you are not lazy).

I assume that with compressed code you can use the same strategy: assume it's using a well-known compression algorithm, and cross your fingers.
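The "look for a header, then try the usual suspects" strategy can be sketched with the Python standard library. This is a rough illustration under stated assumptions: the magic numbers are the standard ones for each container, the set of formats is only what the standard library can decompress, and the helper names `sniff_header` and `try_decompress` are hypothetical.

```python
import bz2
import gzip
import lzma
import zlib
from typing import Optional, Tuple

# Well-known magic numbers found at the start of common compressed containers.
MAGIC = {
    b"PK\x03\x04": "zip",
    b"\x1f\x8b": "gzip",
    b"BZh": "bzip2",
    b"\xfd7zXZ\x00": "xz",
    b"7z\xbc\xaf\x27\x1c": "7z",
    b"Rar!\x1a\x07": "rar",
}

# Decompressors to try blindly, in order. xz is pinned to FORMAT_XZ so that
# headerless legacy .lzma guessing does not kick in.
DECOMPRESSORS = {
    "zlib": zlib.decompress,
    "gzip": gzip.decompress,
    "bzip2": bz2.decompress,
    "xz": lambda d: lzma.decompress(d, format=lzma.FORMAT_XZ),
}


def sniff_header(data: bytes) -> Optional[str]:
    """Return the container name if data starts with a known magic number."""
    for magic, name in MAGIC.items():
        if data.startswith(magic):
            return name
    return None


def try_decompress(data: bytes) -> Optional[Tuple[str, bytes]]:
    """Return (algorithm, payload) for the first decompressor that accepts data."""
    for name, decompress in DECOMPRESSORS.items():
        try:
            return name, decompress(data)
        except Exception:
            continue  # wrong format; fall through to the next candidate
    return None


sample = bz2.compress(b"mystery firmware image")
print(sniff_header(sample), try_decompress(sample))
```

Compressed data embedded in firmware often starts at an unknown offset, so in practice the same checks would be repeated at every candidate offset rather than only at the start of the file.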

anthk · on Jan 27, 2023

7zip, unar, innoextract...

And, of course, upx.