Came here to say this too. This kind of in-depth investigative work, which has genuinely taken time and smarts to achieve, is exactly what I come to HN to read. All credit to the author.
This is a great first step, but we're not done yet. It proves the binaries are built from the published code, but only once the published code has been thoroughly vetted can we conclude there is no backdoor.
Also, the build is not fully deterministic yet: even on his own machine he got differences in truecrypt.sys between successive builds:
"Using the same source and same project directory results in the same pattern of difference in the block starting at 0002CBAC, as the pattern shown between my build from the correct project directory and the original file. This means that this difference is a normal result of the compilation process, and can be considered harmless from our point of view"
The disassembly for both versions of the file was a 100% match, though, which is a pretty good indicator that the binary difference must be something unimportant and not related to the actual code.
It's possible that the difference is in the actual code: x86 machine code sometimes has multiple different encodings for the same assembly instruction.
These would show up as single-byte differences in the binary.
See http://www.strchr.com/machine_code_redundancy for some examples.
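To see the redundancy concretely, here is a small C sketch (my own illustration, nothing from the article): both byte arrays below decode to the very same instruction, mov eax, ebx, so a tool could pick either form without changing behaviour.

    #include <stdio.h>

    /* Two byte sequences that both decode to "mov eax, ebx".
       Opcode 0x89 is MOV r/m32,r32 and 0x8B is MOV r32,r/m32;
       with two register operands, either direction can express
       the same move. */
    static const unsigned char form1[] = { 0x89, 0xD8 };
    static const unsigned char form2[] = { 0x8B, 0xC3 };

    int main(void) {
        printf("form1: %02X %02X\n", form1[0], form1[1]);
        printf("form2: %02X %02X\n", form2[0], form2[1]);
        return 0;  /* same semantics, different bytes: one free bit of "storage" */
    }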
Compilers might use this to steganographically hide additional data in the binary. Printer manufacturers already do something similar (https://en.wikipedia.org/wiki/Printer_steganography). I wouldn't be surprised if MS compilers embedded hidden data in binaries -- it could be useful for tracking down malware authors, or for identifying software created with pirated MS tooling.
Yes, that is a good step towards a fully deterministic build.
There are ways to manipulate the PE file, though, to achieve slightly different behaviour.
If you want to be thorough and make sure that your binary was not infected by malware, for example, then comparing the binaries makes sense.
I share the author's belief that it's probably benign, and on the assumption that the source code contains no crazy obfuscated magic, it would be pretty hard to actually hide malicious code or behaviour there.
Still, if they want to be thorough (and they should), I'd have liked a bit more explanation than waving it away with a "well, these bytes seem to change all the time, so that's okay then". That's not really a justification, it's just an explanation. And it's not okay.
One possible explanation is that the compiler and linker don't produce 100.00% identical output because somewhere they have uninitialized space which gets written to disk. That's why the disassembly can still match 100.00%: the bytes that differ don't contribute to any observable functionality. If you know C, you can imagine how it happens:
You have somewhere char buff[ 32 ] which you don't set all to zeros, then you do strcpy( buff, "something" ), then you write all 32 bytes of buff to the file. What's behind "something"? It can be different on every run.
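A minimal, runnable sketch of exactly that scenario (the output file name is hypothetical):

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        char buff[32];                     /* bytes 10..31 are never initialized */
        strcpy(buff, "something");         /* sets only the first 10 bytes (incl. the NUL) */
        FILE *f = fopen("out.bin", "wb");  /* hypothetical output file */
        if (!f) return 1;
        fwrite(buff, 1, sizeof buff, f);   /* stale stack bytes land on disk */
        fclose(f);
        return 0;
    }

Run it twice and the trailing 22 bytes of out.bin may differ between runs, while the program's observable behaviour stays identical.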
See the "RSDS"? It's some initial value, just like the c:\truec... that follows some bytes later. And what's behind it? A small sequence of random bytes. I've just checked and confirmed that it appears in the ".rdata" (read-only data) section of the executable.
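For reference, "RSDS" is the magic of a CodeView PDB 7.0 debug record, so assuming that's what this block is (an assumption based on the magic; I haven't verified it against the TrueCrypt binary itself), its layout is roughly:

    #include <stdint.h>

    /* Sketch of a CodeView "RSDS" (PDB 7.0) debug record; an assumption
       based on the magic, not verified against the binary in question. */
    struct rsds_record {
        char     signature[4]; /* "RSDS" */
        uint8_t  guid[16];     /* differs between builds; would account for
                                  the small sequence of seemingly random bytes */
        uint32_t age;          /* bumped on incremental links */
        char     pdb_path[];   /* e.g. the "c:\truec..." path visible nearby */
    };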
Now, the fact that this doesn't happen in many more places also means that somebody at Microsoft obviously does fully clean the code from time to time; that is, somebody worries about such effects too, but it seems that now and then some "late" fix slips through and misbehaves.
Another explanation would be that there is something in the build code or scripts that produces different values in the initialization area. That could then be confirmed by source inspection.
The third explanation would be some kind of "unique id" generated by the compiler or linker. Then this effect should be observed in all binaries, even when the source is completely different (e.g. compile some program which generates Vogon poetry and observe the same effect). This hypothesis doesn't match the observations, according to the pictures presented.
So I believe the first assumption is the most likely to be true.
It seems like if you can have the tools alter the embedded timestamp (even as a post-processing step), you can match the binaries pretty closely.
The last bit is the signature, which you can't duplicate, but you can also just take it out, as it's not code. Or, if you're paranoid about that, zero out or remove the signature after verifying it.
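As a sketch of the timestamp-alteration step mentioned above (my own toy, not the author's procedure; the offsets come from the PE/COFF spec), zeroing the header timestamp in place could look like this:

    #include <stdio.h>
    #include <stdint.h>

    /* Zero IMAGE_FILE_HEADER.TimeDateStamp so two otherwise identical
       builds compare equal. Assumes a little-endian host and a
       well-formed PE file. */
    int main(int argc, char **argv) {
        if (argc != 2) { fprintf(stderr, "usage: %s file.exe\n", argv[0]); return 1; }
        FILE *f = fopen(argv[1], "r+b");
        if (!f) { perror("fopen"); return 1; }

        uint32_t e_lfanew = 0, zero = 0;
        fseek(f, 0x3C, SEEK_SET);                /* DOS header field: offset of PE header */
        fread(&e_lfanew, sizeof e_lfanew, 1, f);
        fseek(f, e_lfanew + 8, SEEK_SET);        /* "PE\0\0" + Machine + NumberOfSections */
        fwrite(&zero, sizeof zero, 1, f);        /* clobber TimeDateStamp */
        fclose(f);
        return 0;
    }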
GPG doesn't do disk encryption (full or otherwise), and LUKS is Linux-only. That's what keeps people from "just migrating away"...
It doesn't matter what you or I personally prefer: if you want what at least seems to be trustworthy and secure[1] cross-platform cryptography, TrueCrypt is what you want.
[1]: Obviously, that is the part that is slowly being evaluated and tested. We'll see what'll happen.
What's the point of implementing cryptography on a closed-source OS? We'll audit TrueCrypt and then have people say "there can still be backdoors in Windows or Mac OS." When it comes to security and cryptography, Linux (and similar) are the only things that matter.
Why do you think that a binary distribution of LUKS deserves less suspicion than a binary distribution of TrueCrypt?
If everyone were using LUKS instead, one would hope that people would attempt to verify that trust in binary distributions of LUKS was well-placed as well.
Deterministic builds are hard...really hard. The combined power of the Debian community has trouble getting deterministic builds, as does the Tor project.
When I think of deterministic builds, I think of the ROMs on the old GameBoy. Apparently some of the ROMs had whatever empty space was left in the fixed-size image padded with random data pulled straight out of the build host's memory. As if consistent tool versions weren't already enough of a challenge!
I'd have thought that deterministic builds are really simple unless your toolkit ecosystem is FUBAR. After all, a compiler is a simple function from input to output (unless the FUBAR ecosystem syndrome arises, as I said).
In short, if you ignore certain things when comparing the binaries and make sure you build from an absolute path of the same length(!), you can tell the binaries are functionally equivalent.
They also use a unique license that is incompatible with most other open source projects. They seem to have strong reasons for doing things the way they do, and aren't fond of explaining them.
Considering they document their build process very carefully, I can only imagine they wound up standardizing on that library a long time ago and keep using it to this day.
Yes, even if there is some recent open-source alternative for building 16-bit code, selecting the 20-year-old compiler binaries (http://en.wikipedia.org/wiki/Visual_C++), for which it is easiest to be sure they haven't changed (there are still enough CDs around that were produced at that time; I believe I still have one too), is a damn good decision.
Hmm... 20-year-old binaries? Has anyone here read "Reflections on Trusting Trust"? Perhaps they don't trust newer code from MS? Think about it: MS can't backdoor an old binary.
I am just shooting in the dark, but wouldn't it be easier to compile it twice and diff the outcomes to find out which parts change, so those can be ruled out?
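Something like this would do for a first pass (nothing TrueCrypt-specific about it): print every offset where two successive builds disagree, so the noisy regions can be catalogued.

    #include <stdio.h>

    /* Print each offset at which two builds differ; files of unequal
       length are compared only up to the shorter one. */
    int main(int argc, char **argv) {
        if (argc != 3) { fprintf(stderr, "usage: %s build1 build2\n", argv[0]); return 1; }
        FILE *a = fopen(argv[1], "rb"), *b = fopen(argv[2], "rb");
        if (!a || !b) { perror("fopen"); return 1; }

        long off = 0;
        int ca, cb;
        while ((ca = fgetc(a)) != EOF && (cb = fgetc(b)) != EOF) {
            if (ca != cb)
                printf("%08lx: %02x != %02x\n", off, ca, cb);
            off++;
        }
        fclose(a);
        fclose(b);
        return 0;
    }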
It seems to me that the relaxed GPG key verification the author uses doesn't give us any more assurance regarding the authenticity of the source than a simple hash offered on the website would. I think in this situation, if the author did not intend to attempt more rigorous verification of the TrueCrypt PGP key, at least cross-checking that the key offered on the site matches the one offered on a public key server (pgp.mit.edu, for example: gpg --keyserver pgp.mit.edu --recv-keys <key id>, then compare the output of gpg --fingerprint) would be prudent before signing the TrueCrypt key with your own.
Import the .asc file into the keyring (File > Import certificates). Now you should mark the key as trusted: right-click on the TrueCrypt Foundation public key in the list under the Imported Certificate tab > Change Owner Trust, and set it to "I believe checks are casual". You should also generate your own key pair to sign this key, in order to show you really trust it and get a nice confirmation when verifying the binary.
I think there are two concerns here. One is that the source was not tainted by a third party during or before the download. The second concern (arguably much more important in this case) is that the compiled binary matches the source. The second is addressed well by the author as far as I can tell, but I think there is room for improvement regarding the first. I assume they have thought about this and do have at least some concerns, because it is mentioned that
The PGP signature of the binary can be downloaded
through the button PGP Signature, which makes you
download TrueCrypt Setup 7.1a.exe.sig over HTTPS
(*although with the NSA in the middle, it might not
mean much*).
[emphasis mine]
Cross-referencing the PGP signature with at least one other (public) source would go a long way toward allaying those concerns (that the HTTPS might not mean much).
This criticism is in no way meant to detract from the rest of the work; I mean only to refer to PGP signature verification best practices here.
But you can easily do this yourself and make sure! Verify with GPG yourself, then calculate the SHA checksums of the binaries and compare them with his text. He published the checksums he worked with at several points of his analysis.
I get the point regarding verifying the build on Windows, but wouldn't the same verification on an open-source platform allow for even easier (maybe even automatic) verification?
How about a VMware/VirtualBox image set up explicitly for that purpose? Not feasible for Windows due to licensing issues, I guess.
Also, huge kudos for the effort going into this work. Thanks!
Why not? If you're being so paranoid about the origin of a binary, you have to at least acknowledge the fact that you're trusting the compiler in making this comparison.
So let me throw out an idea that might help justify this trust too: compile the same TrueCrypt sources with a totally different compiler, then use both binaries in a deterministic way and compare the raw encrypted results. (I'm assuming here that the same encryption keys and data will give the same result, but I don't know for sure whether that's true.)
Create a piss-poor, barebones compiler (compiler A) by hand (literally, if possible: punch cards would let you hand-verify the contents of the program in a way that is not susceptible to Thompson's attack, since there is no risk that your eyes were built with a compromised compiler) that suffices to compile the compiler you want to verify (compiler B). Compiler A should run on hardware that you can trust (it need not be x86).
(= ((a-binary b-source) b-source) b-binary)
If you trust a-binary, and you trust b-source, then if that returns true you should be able to trust anything created with b-binary. (a-binary b-source) will not equal b-binary byte-for-byte, because compiler A generates different code than compiler B does; but ((a-binary b-source) b-source) should, since the first-stage build is functionally compiler B even though its bytes differ, so compiling b-source with it reproduces b-binary exactly (assuming B's output is deterministic).
If the hardware is not executing binaries as described, then all bets are off. Compiler A could, in theory, be built and run on a homebrew CPU (http://www.homebrewcpu.com/), but if you are putting that much effort into this, it had better be your hobby...
Let's not forget that you need to verify by hand the electrical characteristics & functionality of every logic chip you use to build that homebrew CPU, because the supply chain of basic logic chips is already known to be compromised (primarily a QA issue, but clearly a potential vector).
Don't forget the power company. They can cause "spurious" bit flips by momentarily dropping the voltage at just the right instant. Best to use a battery made from potatoes you've grown yourself.
You also need to ensure either the device, or the location you operate the device is rad-hardened. We cannot rule out the possibility that They can control the phases of the sun and introduce errors to your circuit at-will with solar flares.
True, though I suspect you are going to be doing that already since you will undoubtedly fuck up your homebrew CPU in novel ways many times while building it, and spend many hours debugging each piece of it. ;)
If all of those compilers share an ancestor compiler (in other words, if we think GCC is compromised and Clang was originally bootstrapped with a compromised GCC), then I don't think that would be effective (although code that 'infects' not just future compilers of the same family, a GCC that 'infects' future GCCs, but any future compiler would be incredibly clever, to put it mildly).
Even if that were not the case, if the hypothesis is that GCC was compromised at some point in the past by a shadowy organization, then you have to consider the possibility that this shadowy organization also got to the other compilers. I think that is where probability steps in though; how confident are you that at least some of the compilers are still safe (or perhaps, at least compromised in conflicting ways)?
The TCC binary is small enough that it is eminently tractable to inspect it all by hand (or with IDA Pro if you are the rich kind of hacker). Binaries aren't black boxes, they're just code, only like it's written by a demented cowboy coder with really bad taste in variable names.
The problem is that hypothetically any tools you use on a computer could be compromised (by their compiler, or otherwise) to not show you truthful results on your screen. IDA Pro (and other tools at your disposal) may recognize certain patterns in binaries and know to show you a transformation of those patterns instead. This transformation would essentially be the reverse of the transformation that the compiler performs.
If you are able to inspect the actual contents of the program, not the output of a program that itself inspects the actual contents of the program, then this problem disappears. You have to examine the machine code without an intermediary program that could lie to you.
(Of course it is very unlikely that IDA Pro, objdump, or even 'od' is compromised in this way, but I would say this class of attack is largely hypothetical and implausible already...)
Edit:
From wikipedia: "What's worse, in Thompson's proof of concept implementation, the subverted compiler also subverted the analysis program (the disassembler), so that anyone who examined the binaries in the usual way would not actually see the real code that was running, but something else instead."
> If you're being so paranoid about the origin of a binary, you have to at least acknowledge the fact that you're trusting the compiler in making this comparison.
Which the author does, in fact:
"Of course, we need to trust the compiler, but in this case, it is independent of TrueCrypt."
You wouldn't just have to verify if it produces the same encrypted output, but also if all the steps along the way are carried out in precisely the same manner. A compromised version of TC may correctly encrypt the volume as expected, but also leak the key or the encryption password on the sly.
If I wanted to compromise TrueCrypt via a secret compiler-injected vulnerability, I'd replace the key generation logic with something that used maybe 64 actually random bytes as the input to an unpublished high-quality PRNG (the NSA almost certainly has a few of those hanging around). I don't think you could detect that by your method.
Because TrueCrypt is several orders of magnitude more high-profile of a target (being actual cryptography instead of a compiler) and probably also several orders of magnitude easier to compromise in a useful fashion & spread.
Paranoia is without bound. Who knows, you yourself might be a sleeper agent and you just don't know it yet! Maybe your eyes really were compromised in development! So you approach the problem from the point of view of what is likely, and what is not. You cannot guard against every single paranoia, but you can guard against ones you deem more likely.
You know that, at a minimum, some manager at the NSA/CIA read that seminal paper and salivated at the prospect of a compromised compiler. Whether they are out there or not, I'm certain millions have been spent attempting it.
I seem to remember retrieving from a BBS, way back when, an MS-DOS shareware Pascal or C compiler of some kind that would leave behind a serial-number footprint in executables, which the author said he could use to prove that an unregistered version of his product had been used. I wish I could remember it now, though.
Dropping a 4-16 byte identifier string into an unused portion of a binary is worlds away from reading source code, solving the halting problem to determine exactly which part you want to backdoor, and then outputting a backdoored binary. Shit, gcc embeds its version string in every object file; that doesn't mean it's a trap.
You don't need to solve the halting problem. There are plenty of well-known, fixed points in any program that you can attack: patch the main entry point, patch the exit point, patch the memory-allocation routine, patch any function entry, etc. All you need is a few bytes of jump instruction to jump to the embedded compromised code, which can then download further code tailored to the specific program, given its signature.
Since the compiler is in charge of generating the layout of the executable, it's in the perfect position to alter it ever so slightly to patch in a backdoor.
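To show how "well known" these points are, here's a hedged little sketch (offsets straight from the PE/COFF spec, and it only reads) that pulls AddressOfEntryPoint out of any PE executable -- the very field a hypothetical patcher would redirect:

    #include <stdio.h>
    #include <stdint.h>

    /* Read AddressOfEntryPoint from a PE file: a single, spec-defined
       location that every loader (and every would-be patcher) knows.
       Little-endian host assumed; no validation of the signatures. */
    int main(int argc, char **argv) {
        if (argc != 2) { fprintf(stderr, "usage: %s file.exe\n", argv[0]); return 1; }
        FILE *f = fopen(argv[1], "rb");
        if (!f) { perror("fopen"); return 1; }

        uint32_t e_lfanew = 0, entry = 0;
        fseek(f, 0x3C, SEEK_SET);
        fread(&e_lfanew, sizeof e_lfanew, 1, f);
        /* the optional header starts 24 bytes into the PE header;
           AddressOfEntryPoint sits 16 bytes into the optional header */
        fseek(f, e_lfanew + 24 + 16, SEEK_SET);
        fread(&entry, sizeof entry, 1, f);
        printf("AddressOfEntryPoint: 0x%08X\n", (unsigned)entry);
        fclose(f);
        return 0;
    }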
In order for your compiler to propagate the backdoor into my compiler and my compiler's output, it needs to recognize that it's compiling a compiler and insert the appropriate backdoor. It needs to identify the parts of my compiler that output binary code as opposed to an XML dump of the AST. That's hard.
Let me say it again: you can patch the WELL KNOWN points of any program.
I don't care where your compiler's AST or code generation is. For any compromised program (including a compiler), all I need to do is monitor the files it generates (patch file_open); for any executable output file, patch its main entry point and add a payload.
When a compromised compiler generates your compiler, it will patch your compiler's entry point and add an extra payload. When your compiler compiles another compiler, it will do the same thing, and so on for any other programs it generates.
In the wise words of capitalists: show me the money. No one is going to develop software just to prove a point in an internet-forum argument. You put up the money to commission a project at the going rate and I'll show you the code.
I'm not wasting money to try to prove an unprovable point.
It's very easy to play "specialist" and come up with theoretical scenarios, like the idiots who think it's possible to attack git using SHA1 collisions.
In the purely theoretical sense, RSA is also broken, since you "only" need to gather a lot of computers to factor a key.
It's also very easy to make an empty one-liner, especially borrowing from some authority to make it appear important.
If you are not willing to waste money on proving a point, why would you expect me to waste substantial effort to write code to prove my point to you?
And if you are not willing to put money behind your statement, your one-liner talking point is exactly what it says, "talk is cheap."
I at least put in the effort to build a detailed case rebutting the previous poster's point and showed how it could be done. If you think my point was wrong, build a detailed case to rebut it. Then we can have a meaningful discussion; otherwise, it's just cheap empty talk.
BTW, what I talked about is not theoretical; that's how viruses are written. You don't have to believe me, but again, it's not my job to convince everyone.
It makes secure use of git a pain in the ass. You can't even fetch objects from a source that isn't fully trusted, because they could override objects from a trusted repo.
I'm pretty sure you're wrong. Sometimes an argument is stronger motivation than money. Also sometimes better than money: knowledge, friendship, one-upping random comments, passion, aspiration, etc. Linus wrote Linux basically for reasons you say wouldn't motivate anyone to write code.
Look, I wasn't making a universal statement. My reply was aimed specifically at your GP, whose smartass statement appealing to authority added nothing to the discussion. His statement embodies exactly what he was saying: "talk is cheap." And he wanted me to put in substantial effort in response to his one-liner? I wanted him to put some skin of his own in the game: put up the money to show that his statement is not just cheap talk.
The good thing about his analysis: It has all the information you need to reproduce it and form your own conclusion. So even if he was working for the NSA, by following his steps, you will either come to the same result or not.
Well then, what are the differences, and why? I mean, if there's an arbitrary data block somewhere, then the "matching disassembly" can have wildly different behavior by simply copying & executing parts of that block.
I can't wait until the source audit uncovers a funny little subroutine that loads the certificate from the .EXE, decodes the public key into RAM, and then starts executing it. :)
edit: not that this seems like a realistic method of injecting malicious code. If you could get away with that in an open source project, you could probably get away with just hiding the malicious code in the app directly.
I got the impression from the article that the disassembly was what he did to explain the binary differences that remained AFTER he corrected for timestamps/certificates/etc.
Your impression was wrong. He showed all of the differences in the screenshots; that's how few there were. Not a single bit of the code portions was different, only a handful of metadata bytes plus the appended certificate.
The disassembly was only there to be cute and emphasize the point.