How I compiled TrueCrypt 7.1a for Win32 and matched the official binaries (concordia.ca)
346 points by maqr on Oct 24, 2013 | hide | past | favorite | 105 comments


That's amazing work. Well done to the author.


Came here to say this too. It's this kind of in-depth investigative work, which has genuinely taken time and smarts to achieve, that I come to HN to enjoy reading. All credit to the author.


This is a great first step but we're not done yet. It proves the binaries are built from the published code, but only when the published code has been thoroughly vetted can we conclude there is no backdoor.


Also, the build is not deterministic yet: even on his own machine he got some differences in truecrypt.sys between successive builds, so by definition the build is not (yet) fully deterministic:

"Using the same source and same project directory results in the same pattern of difference in the block starting at 0002CBAC, as the pattern shown between my build from the correct project directory and the original file. This means that this difference is a normal result of the compilation process, and can be considered harmless from our point of view"


The disassembly for both versions of the file was a 100% match, though, which is a pretty good indicator that the binary difference must be something unimportant and not related to the actual code.


It's possible that the difference is in the actual code: x86 machine code sometimes has multiple different encodings for the same assembler instruction. These would show up as single-byte differences in the binary. See http://www.strchr.com/machine_code_redundancy for some examples. Compilers might use this to steganographically hide additional data in the binary. Printer manufacturers already do something similar (https://en.wikipedia.org/wiki/Printer_steganography). I wouldn't be surprised if MS compilers embedded hidden data in binaries -- it could be useful to track down malware authors, or to identify software created with pirated MS tooling.
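To make the redundancy concrete (a hypothetical illustration, not something actually found in the TrueCrypt binaries): here are two distinct byte sequences that both disassemble to the very same instruction, `mov eax, ebx`, so picking one or the other could smuggle one bit per occurrence:

```python
# Two different x86 encodings of the same instruction, "mov eax, ebx".
# Opcode 0x89 is MOV r/m32, r32; opcode 0x8B is MOV r32, r/m32 with the
# operand roles swapped in the ModRM byte. A disassembler prints both
# identically, yet the bytes differ.
enc_a = bytes([0x89, 0xD8])  # 89 /r, ModRM 0xD8: dest=eax, src=ebx
enc_b = bytes([0x8B, 0xC3])  # 8B /r, ModRM 0xC3: dest=eax, src=ebx
print(enc_a.hex(), enc_b.hex())  # same instruction, different bytes
```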


Perhaps this could even be used by an organization to differentiate official binaries of closed-source software from leaked binaries.


If you read the whole article, you'd see that every single bit difference is accounted for; none of the bit differences occur in the actual code.


Yes that is a good step towards fully deterministic build. There are ways to manipulate the PE file though to achieve slightly different behaviour. If you want to be thorough and make sure that your binary was not infected by malware for example then comparing the binaries makes sense.


Indeed, that part bothers me a little as well.

I share the author's belief it's probably benign, and on the assumption that the source code contains no crazy obfuscated magic, it'd be pretty hard to actually hide malicious code or behaviour there.

Still if they want to be thorough (and they should), I'd have liked a bit more explanation than waving it away with a "well these bytes seem to change all the time, so that's okay then". That's not really a justification, it's just an explanation. And it's not okay.


One possible explanation is that the compiler and linker don't produce 100.00% identical output because somewhere they have uninitialized space which gets written to disk. That's why the disassembly can still match 100.00%: the bytes that differ don't contribute to any observable functionality. If you know C, you can imagine how it happens:

You have somewhere char buff[ 32 ] which you don't set all to zeros, then you do strcpy( buff, "something" ), then you write all 32 bytes of buff to the file. What comes after "something"? It can be different in every run.
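A minimal simulation of that C scenario (hypothetical; os.urandom stands in for the leftover stack garbage a compiler might write out):

```python
import os

RECORD = 32  # size of the hypothetical char buff[32]

def write_record(s: bytes) -> bytes:
    # Only the string and its NUL terminator are "initialized";
    # os.urandom simulates the stack garbage that follows them.
    tail = os.urandom(RECORD - len(s) - 1)
    return s + b"\x00" + tail

a = write_record(b"something")
b = write_record(b"something")
print(a[:10] == b[:10])  # the meaningful bytes always match
print(a == b)            # the full 32 bytes almost certainly don't
```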

Now look at the picture: https://madiba.encs.concordia.ca/~x_decarn/truecrypt-binarie...

See the "RSDS"? It's some initialized value, just like the c:\truec... that follows some bytes later. And what comes after? A small sequence of random bytes. I've just checked and confirmed that it appears in the ".rdata" (read-only data) section of the executable.

Now, the fact that this doesn't happen in many more places also means that somebody at Microsoft obviously does fully clean up the code from time to time, that is, somebody worries about such effects too; but it seems that occasionally some "late" fix slips through, misbehaving.

Another explanation would be that there is something in the build code or scripts that produces different values in the initialization area. That can then be verified by source inspection.

The third explanation would be some kind of "unique id" generated by the compiler or linker. Then this effect should be observed in all binaries even when the source is fully different (e.g. compile some program which generates Vogon poetry, observe the same effect). This hypothesis doesn't match the observations, according to the pictures presented.

So I believe the first explanation is the most likely one.


RSDS is part of the debug information section. The structure looks like (using the construct library's syntax):

  CV_RSDS_HEADER = Struct("CV_RSDS",
      Const(Bytes("Signature", 4), "RSDS"),
      GUID("GUID"),
      ULInt32("Age"),
      CString("Filename"),
  )
So it's a GUID, and one property of GUIDs is that they're "Globally unique", hence regenerated each time. Mystery solved.

Edit: Here's a source [1] so you don't have to take my word for it.

[1] http://www.godevtool.com/Other/pdb.htm
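For the curious, a CodeView RSDS record is easy to parse by hand with Python's standard library (the GUID value and PDB path below are made-up examples):

```python
import struct
import uuid

def parse_rsds(data):
    """Parse a CodeView RSDS debug record: 4-byte "RSDS" signature,
    16-byte GUID (little-endian layout), 4-byte Age, then the
    NUL-terminated PDB path."""
    assert data[:4] == b"RSDS", "not an RSDS record"
    guid = uuid.UUID(bytes_le=data[4:20])
    (age,) = struct.unpack_from("<I", data, 20)
    path = data[24:data.index(b"\x00", 24)].decode("ascii")
    return guid, age, path

# A made-up record for illustration; in a real binary the GUID is the
# part the linker regenerates on every build.
rec = (b"RSDS"
       + uuid.UUID(int=0x1234).bytes_le
       + struct.pack("<I", 2)
       + b"c:\\projects\\driver.pdb\x00")
guid, age, path = parse_rsds(rec)
print(guid, age, path)
```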


It seems like if you can have the tools alter the embedded timestamp (even as a post-processing step), you can match the binaries pretty closely.

The last bit is the signature, which you can't duplicate, but you can also just take it out, as it's not code. And if you're paranoid about that, zero out or remove the signature after verifying it.
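A sketch of that timestamp post-processing idea (assuming a well-formed PE file; per the PE/COFF layout, the COFF TimeDateStamp sits 8 bytes past the "PE\0\0" signature, whose offset is stored at 0x3C; the demo bytes are made up):

```python
import struct

def zero_pe_timestamp(image: bytearray) -> None:
    # e_lfanew at offset 0x3C points at the "PE\0\0" signature; the
    # COFF header's TimeDateStamp field is 8 bytes past the signature.
    (pe_off,) = struct.unpack_from("<I", image, 0x3C)
    assert image[pe_off:pe_off + 4] == b"PE\x00\x00", "not a PE file"
    struct.pack_into("<I", image, pe_off + 8, 0)  # TimeDateStamp := 0

# Demo on a fake, minimal image rather than a real binary.
img = bytearray(0x100)
struct.pack_into("<I", img, 0x3C, 0x80)        # e_lfanew -> 0x80
img[0x80:0x84] = b"PE\x00\x00"
struct.pack_into("<I", img, 0x88, 0x52691234)  # some build timestamp
zero_pe_timestamp(img)
print(struct.unpack_from("<I", img, 0x88)[0])  # now 0
```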


Why are people so adamant about auditing TrueCrypt instead of just migrating to an alternative?


TrueCrypt is highly trusted, they are verifying that trust is well placed. What alternative were you proposing?


LUKS, GPG as examples only. I feel like time would be better spent switching to something less suspicious than trying to prove the suspicions false.


GPG doesn't do disk encryption (full or any), and LUKS is linux-only. That's what keeps people from "just migrating away"...

It doesn't matter what you or I personally prefer, if you want what at least seems to be trustworthy and secure[1], cross-platform cryptography, Truecrypt is what you want.

[1]: Obviously, that is the part that is slowly being evaluated and tested. We'll see what'll happen.


What's the point of implementing cryptography on a closed-source OS? We'll audit TrueCrypt and then have people say "there can still be backdoors in Windows or Mac OS." When it comes to security and cryptography, Linux (and similar) are the only things that matter.


Why do you think that a binary distribution of LUKS deserves less suspicion than binary distributions of Truecrypt?

If everyone were using LUKS instead, one would hope that people would attempt to verify that trust in binary distributions of LUKS was well placed as well.


It's important for Windows users to have access to trusted encryption programs.


I don't think it is. Windows and Mac users will never be safe, period. If you care about security, stop using those OSes.


"TrueCrypt is a project that doesn't provide deterministic builds."

Why? What is the benefit of doing so when everyone wants a deterministic build?


Deterministic builds are hard...really hard. The combined power of the Debian community has trouble getting deterministic builds, as does the Tor project.


When I think of deterministic builds, I think of the ROMs on the old GameBoy. Apparently some of the ROMs had whatever empty space was left in the fixed-size image padded with random data pulled straight out of the build host's memory. As if consistent tool versions weren't already enough of a challenge!


Just in case someone (like me) is not familiar with the concept of deterministic builds, this is a good read:

https://blog.torproject.org/blog/deterministic-builds-part-o...

Synopsis:

"deterministic builds" -- packages which are byte-for-byte identical no matter who actually builds them, or what hardware they use.


Aren't Nix and Guix a kind of solution to the deterministic build problem?

They produce builds with hashes of the source and all dependencies and tools used, or so I believe after skimming their manuals.


They're not really a solution if the official binaries are compiled using VC, and you want to compare to that.


Oh! I thought that was the norm. The author made it sound like TrueCrypt doesn't and others do, the way he said it.


Deterministic builds are hard...really hard.

I'd have thought that deterministic builds are really simple unless your toolkit ecosystem is FUBAR. After all, a compiler is a simple function from input to output (unless the FUBAR ecosystem syndrome arises, as I said).


Here is a not-in-depth look at what's involved in trying to generate byte-for-byte identical binaries with MSVC:

http://stackoverflow.com/questions/1180852/deterministic-bui...

In short, if you ignore certain things when comparing binaries and make sure you build everything at an absolute path of the same length(!), you can tell the binaries are functionally equivalent.


Well, the parent poster would say that if binaries depend on the path length of build location, then that's evidence of a FUBAR toolchain :)


I don't understand; what is the problem if we get a different compiled binary?


From my understanding, if you can achieve deterministic builds, it makes it easier to detect tampering.


I wish the term "reproducible builds" were the more popular one. I think that's more accurate.


For what it is worth, that is what Debian[0] seems to call it as well.

[0] https://wiki.debian.org/ReproducibleBuilds


They also use a unique license that is incompatible with most other open source projects. They seem to have strong reasons for doing things the way they do, and aren't fond of explaining them.


>and aren't fond of explaining them.

Who tried to poke them how, and what happened?


Just a slightly off-topic question, but WTF does TC require VC 1.52 for?


It's the easiest way to build C or C++ targeting 16-bit real mode with a Windows host/toolchain, and they use it to compile a boot loader.


It's the last version that can produce 16-bit code?


Considering they document their build process very carefully, I can only imagine they wound up standardizing on that library a long time ago and keep using it to this day.


Yes. Even if there is a recent open source alternative for building 16-bit code, selecting the 20-year-old compiler binaries (http://en.wikipedia.org/wiki/Visual_C++), for which it's easiest to be sure they haven't changed (there are still enough CDs around produced at that time; I believe I still have one too), is a damn good decision.


"Reflections on trusting trust" is 30 years old. 20 year old binaries should still be more trustable than newer ones though, I suppose.


Truecrypt didn't exist in 1993.


Hmm... 20-year-old binaries? Anyone ever read "Reflections on Trusting Trust"? Perhaps they don't trust newer code from MS? Think about it: MS can't backdoor an old binary.


I am just shooting in the dark, but wouldn't it be easier to compile it twice and diff the outcomes to find out what parts change, so those can be ruled out?


He did that further down in the article, when he noticed that the file path was encoded in the TrueCrypt.sys file.


it seems to me that the relaxed gpg key verification that the author uses doesn't give us any more assurances regarding the authenticity of the source than a simple hash offered on the website would. i think in this situation, if the author did not intend to attempt more rigorous verification of the truecrypt pgp key, at least cross-checking that the key offered on the site matches the key offered on a public key server pgp.mit.edu for example would be prudent before signing the truecrypt key with your own.

  Import the .asc file in the keyring (File > Import certificates).
  Now you should mark the key as trusted: right click on the TrueCrypt Foundation public key 
  in the list under Imported Certificate tab > Change Owner Trust, and set it as I believe checks are casual.
  You should also generate your own key pair to sign this key in order to show you really trust 
  it and get a nice confirmation when verifying the binary.


You can do your own checks and compare with the author's! It's a very fast thing to do for the checks you mention. The more people repeat it, the better.


i think there are two concerns here. one is that the source is not tainted by a third party during or before the download. the second (arguably much more important in this case) concern is that the compiled binary matches the source. the second concern is addressed well by the author as far as i can tell, but i think that there is room for improvement in their concerns about the former. i assume they have thought about this and do have at least some concerns because it is mentioned that

  The PGP signature of the binary can be downloaded 
  through the button PGP Signature, which makes you 
  download TrueCrypt Setup 7.1a.exe.sig over HTTPS 
  (*although with the NSA in the middle, it might not 
  mean much*).
[emphasis mine]

cross-referencing the pgp signature with at least one other (public) source would go a long way toward allaying those concerns (that the HTTPS might not mean much).

this criticism is in no way meant to detract from the rest of the work, and i mean only to refer to pgp sig verification best practices here.


But you can easily do this yourself and make sure! Do it with GPG, then calculate the SHA of the binaries and compare with his text. He published the checksums he worked with at several points of his analysis.


I get the point regarding verifying the Windows build, but wouldn't the same verification on an open source platform allow for even easier (maybe even automatic) verification?

How about a vmware/vbox image set up explicitly for that purpose? Not feasible for Windows due to licensing issues, I guess.

Also, huge kudos for the effort going into this work. Thanks!


> TrueCrypt is not backdoored in a way that is not visible from the sources

... as long as you also trust the compiler not to introduce any backdoor... (cf. Reflections on Trusting Trust)


Oh god, not this conversation again.


Why not? If you're being so paranoid about the origin of a binary, you have to at least acknowledge the fact that you're trusting the compiler in making this comparison.

So let me throw out an idea that might help justify this trust too. Compile the same TrueCrypt sources with a totally different compiler, then use both binaries in a deterministic way and compare the raw encrypted result. (I'm assuming here that the same encryption keys and data will give the same result, but don't know for sure if that's true.)


Create a piss-poor barebones compiler (compiler A) by hand (literally, if possible; punch cards would allow you to hand-verify the contents of the program in a way that is not susceptible to Thompson's attack, since there is no risk that your eyes were built with a compromised compiler) that suffices to compile a compiler that you want to verify (compiler B). Compiler A should run on hardware that you can trust (need not be x86).

  (= ((a-binary b-source) b-source) b-binary)
If you trust a-binary, and you trust b-source, then if that returns true you should be able to trust anything created with b-binary. (a-binary b-source) will not equal b-binary, but ((a-binary b-source) b-source) should.

If the hardware is not executing binaries as described, then all bets are off. Compiler A could, in theory, be built and run on a homebrew CPU (http://www.homebrewcpu.com/), but if you are putting that much effort into this, it had better be your hobby...


Let's not forget you need to verify by hand the electrical characteristics & functionality of every logic chip you use to build that homebrew CPU, because the supply chain of basic logic chips is already known to be infected (primarily a QA issue, but clearly a potential vector)


Don't forget the power company. They can cause "spurious" bit flips by momentarily dropping the voltage at just the right instant. Best to use a battery made from potatoes you've grown yourself.


You also need to ensure either the device, or the location you operate the device is rad-hardened. We cannot rule out the possibility that They can control the phases of the sun and introduce errors to your circuit at-will with solar flares.


I think it's time for the crowbar.


This is what capacitors are for.


Yeah, but can you trust capacitors nowadays?


True, though I suspect you are going to be doing that already since you will undoubtedly fuck up your homebrew CPU in novel ways many times while building it, and spend many hours debugging each piece of it. ;)


Good reference on this technique: http://www.dwheeler.com/trusting-trust/


Do you suggest applying Bayes' theorem to verify the result's validity? As compilers, TCC, LLVM Clang and GCC come to mind.


If all of those compilers share an ancestor compiler (in other words, if we think GCC is compromised and if Clang was originally bootstrapped with a compromised GCC), then I don't think that would be effective (although code that 'infects' not just future compilers of the same family (GCC that 'infects' future GCC's) but also any future compiler would be incredibly clever, to put it mildly).

Even if that were not the case, if the hypothesis is that GCC was compromised at some point in the past by a shadowy organization, then you have to consider the possibility that this shadowy organization also got to the other compilers. I think that is where probability steps in though; how confident are you that at least some of the compilers are still safe (or perhaps, at least compromised in conflicting ways)?


The TCC binary is small enough that it is eminently tractable to inspect it all by hand (or with IDA Pro if you are the rich kind of hacker). Binaries aren't black boxes, they're just code, only like it's written by a demented cowboy coder with really bad taste in variable names.


The problem is that hypothetically any tools you use on a computer could be compromised (by their compiler, or otherwise) to not show you truthful results on your screen. IDA Pro (and other tools at your disposal) may recognize certain patterns in binaries and know to show you a transformation of those patterns instead. This transformation would essentially be the reverse of the transformation that the compiler performs.

If you are able to inspect the actual contents of the program, not the output of a program that itself inspects the actual contents of the program, then this problem disappears. You have to examine the machine code without an intermediary program that could lie to you.

(Of course it is very unlikely that IDA Pro, objdump, or even 'od' is compromised in this way, but I would say this class of attack is largely hypothetical and implausible already...)

Edit:

From wikipedia: "What's worse, in Thompson's proof of concept implementation, the subverted compiler also subverted the analysis program (the disassembler), so that anyone who examined the binaries in the usual way would not actually see the real code that was running, but something else instead."


> If you're being so paranoid about the origin of a binary, you have to at least acknowledge the fact that you're trusting the compiler in making this comparison.

Which the author does, in fact:

"Of course, we need to trust the compiler, but in this case, it is independent of TrueCrypt."

He even links to the "trusting trust" article.


Ah, indeed, I didn't read closely enough to see that mention.


You wouldn't just have to verify if it produces the same encrypted output, but also if all the steps along the way are carried out in precisely the same manner. A compromised version of TC may correctly encrypt the volume as expected, but also leak the key or the encryption password on the sly.


If I wanted to compromise TrueCrypt via a secret compiler-injected vulnerability, I'd replace the key generation logic with something that used maybe 64 actually random bytes as the input to an unpublished high-quality PRNG (the NSA almost certainly has a few of those hanging around). I don't think you could detect that by your method.


Why not?

Because TrueCrypt is several orders of magnitude more high-profile of a target (being actual cryptography instead of a compiler) and probably also several orders of magnitude easier to compromise in a useful fashion & spread.

Paranoia is without bound. Who knows, you yourself might be a sleeper agent and you just don't know it yet! Maybe your eyes really were compromised in development! So you approach the problem from the point of view of what is likely, and what is not. You cannot guard against every single paranoia, but you can guard against ones you deem more likely.


See: https://news.ycombinator.com/item?id=6608922

The compiler backdoor would have to be very old to exist.


You know that, at a minimum, some manager at the NSA/CIA read that seminal paper and salivated at the prospect of a compromised compiler. Whether they're out there or not, I'm certain millions have been spent attempting it.


I'm with you.

Let's all just stop right here. No, don't mention cross-compiling with different compilers and comparing. Just everyone stop.


It's not crazy talk. :)

I seem to remember retrieving from a BBS, way back when, an MS-DOS shareware Pascal or C compiler of some kind that would leave behind a serial number footprint in executables, which the author said he could use to prove that an unregistered version of his product was used. I wish I could remember it now, though.


Dropping a 4-16 byte identifier string into an unused portion of a binary is worlds away from reading source code, solving the halting problem to determine exactly which part you want to backdoor, and then outputting a backdoored binary. Shit, gcc embeds its version string in every object file; that doesn't mean it's a trap.


Don't need to solve the halting problem. There are plenty of well known, fixed points in any program you can attack. Patch the main entry point, patch the exit point, patch the memory allocation point, patch any function entry, etc. All you need is a few bytes of jump instruction to jump to the embedded compromised code. That can further download any specific code tailored to the specific program given its signature.

Since the compiler is in charge of generating the layout of the executable, it's in the perfect position to alter it ever so slightly to patch in a backdoor.
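A sketch of the layout-level part of that attack, for an ELF64 header only (the header bytes and addresses below are made up; e_entry lives at offset 24 per the ELF spec):

```python
import struct

# ELF64 header: e_ident[16] + e_type(2) + e_machine(2) + e_version(4),
# so the 8-byte e_entry field starts at offset 24.
E_ENTRY_OFF = 24

def read_entry(hdr):
    assert hdr[:4] == b"\x7fELF"
    return struct.unpack_from("<Q", hdr, E_ENTRY_OFF)[0]

def patch_entry(hdr, payload_addr):
    # Redirect the entry point at the payload; a real infector's payload
    # would jump back to the original entry once it has run.
    patched = bytearray(hdr)
    struct.pack_into("<Q", patched, E_ENTRY_OFF, payload_addr)
    return bytes(patched)

hdr = bytearray(64)
hdr[:4] = b"\x7fELF"
struct.pack_into("<Q", hdr, E_ENTRY_OFF, 0x401000)  # original entry
evil = patch_entry(hdr, 0x405000)                   # made-up payload address
print(hex(read_entry(hdr)), hex(read_entry(evil)))
```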


In order for your compiler to propagate the backdoor into my compiler and my compiler's output, it needs to recognize that it's compiling a compiler and insert the appropriate backdoor. It needs to identify the parts of my compiler that output binary code as opposed to an XML dump of the AST. That's hard.


Let me say it again: you can patch the WELL KNOWN points of any program.

I don't care where your compiler's AST or code generation is. For any compromised program (including a compiler), all I need to do is monitor the files it generates (patch file_open); for any executable output file, patch its main entry point and add in a payload.

When a compromised compiler generates your compiler, it will patch your compiler's entry point and add in an extra payload. When your compiler compiles another compiler, it will do the same thing, and so on for any other programs it generates.

It's virus writing 101.


How do you identify an executable output file?


File that contains _main? File that ends in .exe?


Watch for an ELF/COFF/PE/etc. header.
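Which amounts to a few bytes of magic-number sniffing; a sketch (the Mach-O magics are included for completeness):

```python
def classify(data: bytes):
    # Classify a file by its magic bytes, the way a hypothetical
    # compiler-resident infector could spot freshly written executables.
    if data[:4] == b"\x7fELF":
        return "ELF"
    if data[:2] == b"MZ":  # DOS stub; the real PE signature follows e_lfanew
        return "PE/MZ"
    if data[:4] in (b"\xfe\xed\xfa\xce", b"\xce\xfa\xed\xfe",
                    b"\xfe\xed\xfa\xcf", b"\xcf\xfa\xed\xfe"):
        return "Mach-O"
    return None

print(classify(b"\x7fELF" + bytes(12)))  # ELF
print(classify(b"MZ\x90\x00"))           # PE/MZ
print(classify(b"#!/bin/sh\n"))          # None
```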


In the wise words of Linus Torvalds: Talk is cheap, show me the code


In the wise words of capitalists: show me the money. No one is going to develop software to prove a point for an argument on an internet forum. You put up the money to commission a project at the going rate and I'll show you the code.


I'm not wasting money to try to prove an improvable point.

It's very easy to play "specialist" and come up with theoretical scenarios, like the idiots that think it's possible to attack git using SHA1 collisions

In the purely theoretical sense, RSA is also broken, since you "only" need to gather a lot of computers to factor a key.


It's also very easy to make an empty one-liner, especially borrowing from some authority to make it appear important.

If you are not willing to waste money on proving a point, why would you expect me to waste substantial effort to write code to prove my point to you?

And if you are not willing to put money behind your statement, your one-liner talking point is exactly what it says, "talk is cheap."

I at least put in the effort to build a detailed case rebutting the previous poster's point and showed how it can be done. If you think my point was wrong, build a detailed case to rebut it. Then we can have a meaningful discussion; otherwise, it's just cheap empty talk.

BTW, what I talked about was not theoretical. That's how viruses are written. You don't have to believe me, but again it's not my job to convince everyone.


Forging SHA1 collisions is not sufficient to attack git.


It makes secure use of git a pain in the ass. You can't even fetch objects from a source that isn't fully trusted, because they could override objects from a trusted repo.


I'm pretty sure you're wrong. Sometimes an argument is stronger motivation than money. Also sometimes better than money: knowledge, friendship, one-upping random comments, passion, aspiration, etc. Linus wrote Linux basically for reasons you say wouldn't motivate anyone to write code.


Look, I wasn't making a universal statement. My reply was specifically aimed at your GP, whose smartass statement appealing to authority added nothing to the discussion. His statement embodies exactly what he's saying: "talk is cheap." And he wanted me to put in substantial effort over his one-liner? I wanted him to put some skin of his own in the game: put up the money to show that his statement is not just cheap talk.


Oh yeah! I remember this too.

I think we're thinking of the A86 shareware assembler:

https://en.wikipedia.org/wiki/A86_%28software%29


I entered just to say it's incredible work done by this guy... it's been years since I analyzed a file in hex mode (from Norton Commander, hehe).


Please God, don't let the author be working for the NSA. These days I get suspicious at every "it's all good" piece of news.


The good thing about his analysis: It has all the information you need to reproduce it and form your own conclusion. So even if he was working for the NSA, by following his steps, you will either come to the same result or not.

That's how science works.


Coolest post I've read today! Good work!


Kudos for effort.


Tldr: Binaries didn't match, here's some handwaving at the differences.


I think that's inaccurate. The disassembly of them matched perfectly.


Well then, what are the differences, and why are they there? I mean, if there's an arbitrary data block somewhere, then the "matching disassembly" can have wildly different behavior by simply copying & executing parts of that block.


It's explained in the article. Timestamps, file paths, certificates and oddities of the PE format.


I can't wait until the source audit uncovers a funny little subroutine that loads the certificate from the .EXE, decodes the public key into RAM, and then starts executing it. :)

edit: not that this seems like a realistic method of injecting malicious code. If you could get away with that in an open source project, you could probably get away with just hiding the malicious code in the app directly.


I got the impression from the article that the disassembly was what he did to explain the binary differences that remained AFTER he corrected for timestamps/certificates/etc.


Your impression was wrong. He showed all of the differences in the screenshots. That's how few there were. Not a single bit of the code portions was different. Only a handful of metadata bytes, plus appended certificate.

The disassembly was only there to be cute and emphasize the point.



