OpenBSD's safe, new file(1) implementation (marc.info)
124 points by zdw on April 25, 2015 | 84 comments



You may find the code for the current implementation in their CVS: http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.bin/file/?s...

You can also find a git mirror of the referenced commit at: https://bitbucket.org/braindamaged/openbsd-src/commits/c421d...


OpenGrok on BSD Cross Reference is probably more useful in looking at it.

http://BXR.SU/OpenBSD/usr.bin/file/file.c.


Thanks for mentioning Super User's BSD Cross Reference (bxr.su)! Apparently there are many more OpenGrok installations nowadays than there were 5 years ago.

https://github.com/OpenGrok/OpenGrok/wiki/OpenGrok-installat...

EDIT: Not all of them are working, though.


Never heard of that website, seems useful. Unless I am missing something, however, it seems that while the OpenBSD, NetBSD and DragonFlyBSD sources are up to date, the FreeBSD sources are out of date, last updated 09-Dec-2013.


http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.bin/file/fi...

I'm not so familiar with FreeBSD coding standards and design practices but it looks like that could still be refactored even more by getting rid of all the extra memory allocations (and deallocations), making future changes less likely to introduce bugs. For example it's allocating one structure for each file given as input, and making copies of all the paths... which I'm pretty sure isn't necessary as file tests each of its input files independently; the code misleadingly implies, however, that they are somehow dependent. The old version was better in this respect.

That said, compared to the much more complex old version, the rest of it is an improvement.
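To make the suggestion concrete, here is a minimal sketch of the kind of simplification meant above (mine, not the OpenBSD code, and the helper name is made up): since each input is tested independently, per-file state can live on the stack and the argv pointers can be used directly, with no heap allocation or path copying.

    /*
     * Hypothetical sketch, not the actual implementation: per-file
     * state lives on the stack and paths are borrowed from argv,
     * so there is nothing to allocate, copy, or free.
     */
    #include <stdio.h>

    struct input {
        const char *path;   /* borrowed from argv, never copied */
        FILE       *fp;
    };

    static int test_one(const char *path)
    {
        struct input in = { .path = path, .fp = fopen(path, "r") };

        if (in.fp == NULL) {
            fprintf(stderr, "cannot open %s\n", in.path);
            return 1;
        }
        /* ... run the magic tests against in.fp and print the result ... */
        fclose(in.fp);
        return 0;
    }

    int main(int argc, char *argv[])
    {
        int ret = 0;

        for (int i = 1; i < argc; i++)
            ret |= test_one(argv[i]);
        return ret;
    }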


OpenBSD != FreeBSD

This[0] is a good place to start. Other than that, just read the code.

[0] http://www.openbsd.org/cgi-bin/man.cgi/OpenBSD-current/man9/...


Getting that wrong shows how much I work with the different variants of *BSD...

That's a syntactic style guide; I was referring to higher-level issues at the design level, like "avoid dynamic allocation unless absolutely necessary", which sit between syntax rules and the very high-level "follow the UNIX philosophy" type of guidelines.

To answer the sibling comments about preparing a patch and sending it in: this was just an observation from a brief glance at the code. I don't have the resources to dig in and find out whether this is actually something they would want to change, whether there is a reason for doing it this way, or whether it is simply the norm for the project (e.g. the Java community largely has a very different notion of what it means to "simplify" code), so if someone wants to take these ideas and follow through, they are more than welcome to.


Please take the time to send diffs to tech@ instead of writing code review comments on Hacker News, which will most likely not be read by the relevant people and therefore have no effect whatsoever. Thanks.


Is "I'm pretty sure" enough to warrant a report?


There's nothing stopping you from submitting a patch for a small sample of the issues and a friendly question on their mailing list.


There are many inherent problems with utilities like file(1); however, if people are going to continue using them, then we should try to make the best implementation possible.

This new implementation was created from scratch, carefully, with modern coding practices by a very proficient programmer.


I do think a cleaner, more careful file(1) implementation is a step in the right direction. But it's still a bit unnerving to me that a bug in a utility like file(1), which needs to do nothing but read a file and output text to stdout, could possibly do things like overwrite other files or write to the network, given a parsing bug and a maliciously crafted file. A parsing bug in a utility that does nothing but read files and parse them should not be able to do anything worse than force a misparse! The file(1) binary simply does not need permission to do things like write to the network or disk. This seems like a shortcoming of the traditional Unix permissions model. A promising direction imo is the more fine-grained permission model of Solaris/Illumos, which FreeBSD's Capsicum project is also moving toward.


While capability systems are totally awesome, for this sort of application, you don't really need it. Open the target file read-only, open the magic database read-only, keep stdout open, and then even something as restrictive as Linux's seccomp mode 1 sandbox (read(2), write(2), exit(2), nothing else) should be enough to let you print out an answer.
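Roughly like this (a hedged sketch of that idea on Linux, not the OpenBSD code; the magic database path is just illustrative):

    /*
     * Sketch only: acquire every descriptor up front, then enter
     * seccomp "strict" mode, after which only read(2), write(2),
     * _exit(2) and sigreturn(2) are permitted. A parsing bug can
     * still misparse, but it cannot open files or sockets.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/prctl.h>
    #include <linux/seccomp.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s file\n", argv[0]);
            return 1;
        }

        int target = open(argv[1], O_RDONLY);
        int magic  = open("/etc/magic", O_RDONLY);  /* path is illustrative */
        if (target == -1 || magic == -1) {
            perror("open");
            return 1;
        }

        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) == -1) {
            perror("prctl");
            return 1;
        }

        /* ... read target/magic, classify, write the answer to stdout ... */

        /* exit_group(2) is not on the strict-mode whitelist, so end with _exit(2). */
        _exit(0);
    }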


Or object capability systems in general ... like the material at http://erights.org/ and the security kernel work from long long ago: http://mumble.net/~jar/pubs/secureos/ and http://mumble.net/~jar/pubs/secureos/secureos.html


I thought this was basically AppArmor:

https://wiki.ubuntu.com/AppArmor


SELinux does exactly this already.


> This new implementation was created from scratch, carefully, with modern coding practices by a very proficient programmer.

Cool. What happens 10 years from now, when we've wanted all the missing features back, and when writing C code (let alone C code without tests) to run on untrusted input is no longer "modern coding practices" but firmly legacy?

This seems like a great way to have a version of file(1) that works great in 2015, on the subset of filetypes that are in-scope for the initial version, on this version of OpenBSD, but I'd like to know how it becomes sustainable.


Nobody is saying this is a perfect implementation; however, it has been written in line with existing best coding practices that any OpenBSD developer should be comfortable with, and hence maintainable until it is deemed not to be. If everyone suddenly stopped depending on file(1), it would happily be removed from the tree.


Yes, to be clear, this is a huge improvement.

But it's sort of like living in a world where modern medicine has been discovered, and celebrating that one city has started moving away from leeching to treat fevers, while every other city is still practicing it, and that one city still uses leeching to treat many other diseases. Strictly speaking, it is an improvement, but....


"This solution isn't perfect therefore we shouldn't do it"?


That doesn't really seem like the right response to a culture of bloodletting, does it?

I was going for, let's quit bloodletting, but let's also not celebrate how we're occasionally no longer bloodletting, because it's still embarrassing how far behind modern medicine we are.


Bloodletting? That's just too crazy for words. The OpenBSD people produce first-class code that solves real problems in real production, and you compare it to witchcraft?

If you absolutely need to speak in parables, do it with something that represents an earlier common practice, for example building houses with hammers and nails instead of modern pre-fab concrete blocks. But even that is stretching it a bit too far; the productivity difference simply isn't comparable.


Bloodletting isn't witchcraft. It removes iron from the body so that viruses and bacteria trying to reproduce would have problems; it makes the host environment more hostile to the invading species. It has a lot of negative side effects and wasn't very effective, but it did have an effect.

Of course, some people did bloodlet for reasons that didn't help, but without knowledge of bacteria, many were just trying their best with what limited knowledge they did have.


I am sorry, I'm not following your extended analogy.

Could you please elaborate on what "bloodletting" and "modern medicine" represent here?


Not him, but my interpretation is that we're writing C with little testing that conforms to modern standards (treating a few fevers) instead of C with little testing that conforms to old standards (leeching).

We have languages like Rust and we have proof tools like Coq which can be leveraged to become more memory safe or more provably correct... using those tools would be like using modern medicine.

I don't think the analogy is really that great because those tools are still incredibly immature in terms of actual usability.


I posted a portable version on Twitter a few hours ago, but this is just something I put together quickly without much testing, so "YMMV".

https://twitter.com/canadianbryan/status/591997995459051520


I've updated the diff/tgz to include the new privsep functionality; systrace(4)-based syscall whitelisting was conditionalized on OpenBSD.
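For anyone curious what "privsep" means in this context, here is a rough, hypothetical sketch of the general pattern (not the posted diff; the chroot directory and user name are illustrative): the parent keeps its privileges and handles the file descriptors, while the untrusted parsing happens in a child that has chrooted and dropped to an unprivileged uid/gid.

    /*
     * Generic privilege-separation sketch, not the posted diff:
     * fork a child, confine it, and only let it see file contents
     * over a socketpair.
     */
    #define _GNU_SOURCE          /* for setresuid/setresgid on glibc */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <err.h>
    #include <grp.h>
    #include <pwd.h>
    #include <unistd.h>

    static void drop_privileges(const char *user)
    {
        struct passwd *pw = getpwnam(user);

        if (pw == NULL)
            errx(1, "no such user: %s", user);
        /* chroot requires root; directory and user are illustrative. */
        if (chroot("/var/empty") == -1 || chdir("/") == -1)
            err(1, "chroot");
        if (setgroups(1, &pw->pw_gid) == -1 ||
            setresgid(pw->pw_gid, pw->pw_gid, pw->pw_gid) == -1 ||
            setresuid(pw->pw_uid, pw->pw_uid, pw->pw_uid) == -1)
            err(1, "cannot drop privileges");
    }

    int main(void)
    {
        int sp[2];

        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sp) == -1)
            err(1, "socketpair");

        switch (fork()) {
        case -1:
            err(1, "fork");
        case 0:                        /* child: unprivileged parser */
            close(sp[0]);
            drop_privileges("nobody");
            /* ... read file contents from sp[1], parse, reply ... */
            _exit(0);
        default:                       /* parent: opens files, prints results */
            close(sp[1]);
            /* ... open input files, send their contents over sp[0] ... */
            break;
        }
        return 0;
    }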


The commit message is:

    New implementation of the file(1) utility.
    This is a simplified, modernised version
    with a nearly complete magic(5) parser
    but omits some of the complex builtin
    tests (notably ELF) and has a reduced set
    of options.
    
    ok deraadt
What is the significance beyond the obvious?

What options are now missing?

What are the tradeoffs? Beyond a simplified API, are there new capabilities, functionality/use-cases, and/or improved performance?


The current de facto file(1) implementation was originally written in the 1980s, which was a very different time. It has been continually maintained since then, but it was still holding onto a lot of legacy idioms that have led to some pretty significant security vulnerabilities in recent times.

This is a simple implementation that includes a safer new magic(5) parser and excludes many of the "built-in" parsers that the contemporary file had, e.g. for executable formats, which are probably better examined with specialized tools like objdump(1).

The original author shared this very fitting sentiment on the news of the new file(1)...

http://marc.info/?l=openbsd-cvs&m=142989483913635&w=2

"The Albatross fell off, and sank Like lead into the sea." — Ian Darwin

  The Rime of the Ancient Mariner


brynet, who posted below, tweeted a pdf-rendered man page for the new openbsd file command:

http://brynet.biz.tm/pub/file.1.pdf

It seems to have all the options I'd ever use. In fact I've never used any options to file (though I do use it to quickly identify a handful of files, often enough).

The only reason OpenBSD does this kind of thing is that they got annoyed with the vulnerabilities in the classic file/libmagic source which everyone seems to use. They probably disliked the code style/organization of the source enough that they didn't want to patch it up (which they also do fairly often for various ports). That said, I've never personally looked at the source.


> That said I've never personally looked at the source.

I'd say very few people do, which is kind of scary for a utility like file(1) that people feed any random file without giving it a second thought.


Seeing as it's OpenBSD, it's probably much safer than existing versions.


There's only so much safety we should ever expect from code written in C by a human being.


The seL4 microkernel was written by hand, in C by fallible humans [1]. However, the authors were careful to do so in such a way that it was feasible to subsequently prove its equivalence to a Haskell program that was proven consistent with the seL4 specifications.

My point is, the degree to which you can expect "safety" from a program (whatever that word means to you) is predicated on the degree to which you can trust the competence and practices of the developers, and the degree to which your definition of "safety" applies to the program's specifications. Remember, all programming languages are Turing-complete, which means you can blow your foot off with any of them if you're not careful.

[1] http://www.nicta.com.au/pub-download/full/7371


That's sort of what I was thinking of. I wouldn't say seL4 was written by hand, more cross-compiled in a way they couldn't fully automate, only because they didn't have a Haskell code generator that optimizes well when you don't need laziness or GC.


While this simplification of tools is great, I'm a little disappointed that it's a system integrated app and it happened in OpenBSD fairly quietly. If it first happened in something like RH or Debian, the package would be separate from core and we would have the opportunity for the following:

- using language with more safety guarantees than C

- including seccomp (either bpf, or even just mode 1) by default (a rough sketch follows at the end of this comment)

- in the announcement saying: tested with XXX, YYY, ZZZ; fuzzed with ABC, DEF

It would be a great opportunity to create a poster child people could point to and say: this is the way we should be (re)writing secure software, and these were the awesome tools used to help do it. For OpenBSD the first two couldn't happen (it's part of the base system, so C; there's no seccomp support), and the third could have but didn't, at least not publicly (valgrind, clang-analyzer, ASan/MSan, afl, etc. could all be listed).

Maybe the next time something important is broken.
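To make the seccomp point concrete, here is a rough sketch of my own (assuming Linux seccomp filter mode; a production filter would also validate seccomp_data.arch) of a bpf whitelist that allows only read, write and exit:

    /*
     * Hypothetical seccomp-bpf whitelist: read/write/exit(_group)
     * are allowed, everything else kills the process.
     */
    #include <stddef.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/prctl.h>
    #include <sys/syscall.h>
    #include <linux/filter.h>
    #include <linux/seccomp.h>

    static int install_filter(void)
    {
        struct sock_filter filter[] = {
            /* Load the syscall number. */
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                     offsetof(struct seccomp_data, nr)),
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_read, 3, 0),
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 2, 0),
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit_group, 1, 0),
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit, 0, 1),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
        };
        struct sock_fprog prog = {
            .len = sizeof(filter) / sizeof(filter[0]),
            .filter = filter,
        };

        /* Needed so an unprivileged process may install a filter. */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1)
            return -1;
        return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
    }

    int main(void)
    {
        if (install_filter() == -1) {
            perror("seccomp");
            return 1;
        }
        /* From here on, any other syscall kills the process. */
        write(STDOUT_FILENO, "sandboxed\n", 10);
        _exit(0);
    }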


Don't let the work of a few OpenBSD developers discourage you from creating your own file(1) implementation using your new favourite "secure" language of the week.


Everyone can write their own. Not everyone can substitute the implementation used by a large distribution and make a proper announcement people will care about. I don't understand why you are mocking the language choice point, though. Pretty much any language where you don't manage memory directly would be a good choice. This utility doesn't even have big performance requirements.

Basically I'm saying that we can try very very hard to write secure code in languages which invite issues, or... try to eliminate whole classes of issues at a time. Is it such a terrible idea?


New languages and technologies come and go and you're free to use them. OpenBSD will continue to create new software while adhering to modern C coding practices, even when that means defining them.


> New languages and technologies

You mean decades-old languages like Perl and Python for example? Technologies like the ones we can finally now afford to implement - actual syscall filtering and selective capabilities dropping? Supporting utilities which only matter at development time? How do those go away?

And I wrote in the first message - OpenBSD integrates `file`, has to use C and doesn't implement seccomp. They couldn't become the poster child. But the next `file` reimplementation probably won't hit the news anymore.

People can improve the way we write software right now. But choose a project at random and they don't care or don't use what's available for free. Some promotion would be great when everyone's looking.


You have a lot more faith in the authors of memory management algorithms than I do.


Well "file" is not where I would start, as it is not really security critical for most people, although OpenBSD has good reasons.

Seccomp is almost unusable for most purposes with anything resembling tight filters, as you do not really know what syscalls glibc especially will add to your program. Capsicum in FreeBSD is much more usable, and they are starting to priv-sep programs with it; they are good examples, I would look at them. Allegedly Linux will get Capsicum one day. OpenBSD has many examples of priv-sep, such as openssh, just using separate processes.
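For comparison, a hedged sketch of what the Capsicum route looks like on FreeBSD (my own illustration, not taken from any actual file(1) port): limit each descriptor's rights and then enter capability mode so no new resources can be acquired.

    /*
     * Capsicum sketch (FreeBSD 10+): after cap_enter(), calls like
     * open(2) or socket(2) fail with ECAPMODE, and each descriptor
     * can only be used within the rights granted to it.
     */
    #include <sys/capsicum.h>
    #include <err.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        cap_rights_t rights;

        if (argc != 2)
            errx(1, "usage: %s file", argv[0]);

        int fd = open(argv[1], O_RDONLY);
        if (fd == -1)
            err(1, "open %s", argv[1]);

        /* The input may only be read/stat'ed; stdout may only be written. */
        cap_rights_init(&rights, CAP_READ, CAP_FSTAT);
        if (cap_rights_limit(fd, &rights) == -1)
            err(1, "cap_rights_limit");
        cap_rights_init(&rights, CAP_WRITE);
        if (cap_rights_limit(STDOUT_FILENO, &rights) == -1)
            err(1, "cap_rights_limit");

        if (cap_enter() == -1)
            err(1, "cap_enter");

        /* ... read fd, classify, write the answer to stdout ... */
        return 0;
    }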

Linux is disadvantaged by not having a base system in the same way - you can make a new "file", but how do you know if anyone will use it? You can try to rewrite the GNU tools better, but many are a horrible mess. You probably have to start a new Linux distro...


Corrected headline:

OpenBSD's new, untested file(1) implementation.

The only thing "safe" here is that the author was completely unharmed by the unit tests that he failed to write. Where I work 2500 lines of new C code without a single line of tests would be laughed off the code review system and then probably revisited around performance review time.

Heroic programming is how we got to where we are today, running the world on gigantic piles of untested gotos and questionable mallocs and frees. There are no real heroes in programming, only people who haven't yet figured out how to write tests.


> untested file(1) implementation

Completely unfounded assumption.

> the author was completely unharmed by the unit tests that he failed to write

That's always valid, whether unit tests were written or not, and a lot of unit test lovers write thousands of irrelevant unit tests that let bugs slide.

> Where I work 2500 lines of new C code without a single line of tests would be laughed off the code review system

A lot of software has been written without unit tests, which doesn't mean it doesn't work (despite what detractors might say). In fact, you're using a lot of it right now.

> There are no real heroes in programming, only people who haven't yet figured out how to write tests.

Yes, sure, "Unit tests are the only true way", not to mention unit tests in C are much LESS often feasible, especially in low-level stuff.

I just find it risible how people have accepted this unit test fetishism without questioning it. Uncle Bob fans always make me laugh.


Wait five minutes and it will be the next fad somebody is complaining about. Another plus of OpenBSD: a largely fad-free experience.


I was under the impression that OpenBSD's unit testing files reside in the src/regress directory.

Here is the relevant directory for file: http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/regress/usr.bin...

The files in that directory sure look like tests to me. Am I missing something obvious?


Yes, the obvious thing you miss is that according to the OP the unit tests should contain tests for the bugs detected by recent third-party fuzzing, and "file" should contain the fixes for them. Instead, the most recent test is 6 years old, but there's a brand new implementation of the whole "file" utility.

It does sound CADT-y.

http://www.jwz.org/doc/cadt.html

"It hardly seems worth even having a bug system if the frequency of from-scratch rewrites always outstrips the pace of bug fixing. Why not be honest and resign yourself to the fact that version 0.8 is followed by version 0.8, which is then followed by version 0.8?

But that's what happens when there is no incentive for people to do the parts of programming that aren't fun. Fixing bugs isn't fun; going through the bug list isn't fun; but rewriting everything from scratch is fun (because "this time it will be done right", ha ha) and so that's what happens, over and over again."

Specifically, why didn't somebody just remove the ELF part from the existing code, if the goal was "reducing the attack surface by removing features" (which can be a reasonable strategy for increasing security)?


I agree with you that more specific tests should be included for the recent fuzzing results, but the great-grandparent comment seems to suggest that there are no tests at all, which is clearly not true.

> Instead, the most recent test is 6 years old

I don't feel that this detail is relevant. Because file is a basic utility, and its function is not supposed to change much over time, a test written 6 years ago for file should still be relevant to file today.

> Specifically, why didn't somebody just remove the ELF part from the existing code

I think this was not done because much of the file code followed outdated coding conventions, and would have to be replaced anyways. Take a look at this diff comparison of file.h: http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.bin/file/fi...


I think you also miss that the reason "file" got the attention was the recent fuzzing effort, which discovered bugs obviously not covered by the quoted 6-year-old tests. Otherwise there would have been no discovery at all.

So no, the 6-year-old tests weren't sufficient. And yes, updated tests and a fixed old implementation are a more important step than ""this time it will be done right", ha ha".

Additionally, replacing the #define'd constants in file.h with an enum in another header is hardly a reason for the rewrite.


I think I understand your point better now. More thorough tests in addition to the current set would certainly be an improvement.

However, I don't understand why you suggest that fixing the old implementation to adhere to new programming standards would be better than re-writing the program from scratch: most of the program would have to be re-written anyway, so the developer would barely save any time and effort by fixing rather than re-writing. Also, if more thorough tests were included, then either fixing the program or re-writing it would have the same impact security-wise, as the improved tests would account for any discrepancy in either route.

IMO, in this case, whether to fix the old software or to re-write a new version were both acceptable options.


I neither suggest anything like "fixing the old implementation to adhere to new programming standards" nor "re-writing the program from scratch."

I suggest extending the tests, fixing the problems discovered in the existing implementation and then doing something completely new, somewhere else.


Considering file is supposed to be invoked on unknown files, I fail to see how tests other than fuzzing would give any assurance security-wise. A very anal code review, on the other hand...

And the OpenBSD people are very aware of fuzzing; this new implementation of file is a direct reaction to Michal Zalewski's findings...


A good and systematic approach to incorporate fuzzing findings would be as follows:

1) Identify a harmful input through fuzzing
2) Reduce the input to a minimal testcase
3) Contribute the testcase and fix to the existing code

Note the absence of "rewrite the whole thing without tests" in this process. From-scratch rewrites that may or may not have fewer bugs than their predecessors are known as CADT.

http://www.jwz.org/doc/cadt.html


CADT refers mostly to rewrites in which lots of new features and complexity are being added - the most likely sources of bugs - not to ones like this that go in the opposite direction.


Let me quote from the source of the CADT term:

http://www.jwz.org/doc/cadt.html

"This is, I think, the most common way for my bug reports to open source software projects to ever become closed. I report bugs; they go unread for a year, sometimes two; and then (surprise!) that module is rewritten from scratch -- and the new maintainer can't be bothered to check whether his new version has actually solved any of the known problems that existed in the previous version."

The OP rightly points out that there are no known tests that demonstrate the known problems in the older implementations, and also no proof that the new implementation passes them.


Don't compare the Gnome development process with the OpenBSD development process.

They are very different, especially since one is a core utility and the other a desktop environment.

I agree with that critique of the Gnome development process, but the OpenBSD development process prides itself on security, and that's a first commit; new changes are being added (as can be seen in the log of that file - http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.bin/file/fi... )


Interestingly, the old file(1) was also written by an OpenBSD developer -- Ian Darwin of Toronto. I met him at a BSDCan, nice chap.

http://mdoc.su/f/file.1

http://darwinsys.com/file/

Apparently, Ian no longer considers himself a maintainer, leaving that title to Christos Zoulas of NetBSD and Two Sigma.


Security isn't the only concern. What about correctness? Tests could help there.

I'm sure I'll manage to write a secure file(1) implementation, if it's not required to produce correct output.


Fuzzing is like rolling dice: there's only a chance it'll find the lurking issues. Unit tests enforce contracts and assist future maintenance.

I sincerely hope the lack of unit tests is because it's covered by existing tests.


Well, with tools like AFL (American Fuzzy Loop) it is more like rolling dice to find a path through a maze of numbered doors - adjusting the dice rolled (6-sided, 2-sided...) based on how far one has come and how many doors there are to choose from... ok, it's not really like rolling dice any more at all. Did you read:

http://lcamtuf.blogspot.no/2014/11/pulling-jpegs-out-of-thin...


"American Fuzzy Lop" - it's named after a breed of rabbit where the ears sag down.


And here I thought it was named after some obscure cereal brand ;-)


What makes this file(1) safer than any other?


Lcamtuf notably found several ELF parsing bugs in file that appeared to be exploitable.

The work he has been doing with AFL and Google's big fuzz farm, focusing on utilities that are used daily without thought, is insanely important, imho.


link?



What caught my attention most was the fact that they are using CVS. Nothing against it, I just don't see CVS much today.

Does anyone know how much impact this change can bring? How many systems rely on file?

Personally, every time I've used it, it was as a one-time script.


CVS is brought up in almost every single OpenBSD thread, right before someone complains about Comic Sans on their website. The OpenBSD guys refuse to use git because they feel it is way too complicated for the task it serves. At this point, CVS is almost their barrier to entry: if you're (royal you, not rhapsodyv) going to complain a bunch about the version control tool, you're probably not going to be an active member of their developer community.

Despite using old tools, the OpenBSD guys release their updates every 6 months like clockwork, and have for the past two decades. I know of no other FOSS project with that level of project management.


> At this point, CVS is almost their barrier to entry: if you're (royal you, not rhapsodyv) going to complain a bunch about the version control tool, you're probably not going to be an active member of their developer community.

CVS is frankly hideous. Saying "git is way too complicated" is another way of saying "CVS is way underpowered". However, a good reason to cling to this relic of a bygone age is that, as far as I know, they have everything in CVS, and it must be a lot more convenient to be able to checkout an arbitrary part of their dev tree than messing around with git submodules.

That said, if their attitude is really "if you complain about CVS, you are not worthy", that is bound to turn off many people, and for good reason.


CVS fits their development methodology just fine, so far. There is a central CVS repo, where development is done in "head", which gets branched every 6 months for a release. Contributions are supposed to never break "head", so mostly small, easy-to-review patches are committed. Even big changes are committed on a per-patch basis, working towards a bigger goal. There is also AnonCVS, which mirrors the central CVS repo on dozens of mirrors; the same way, you can mirror the CVS repo for yourself locally. Sure, CVS+AnonCVS+diff(1) could be better (whatever that means), but it does the job. Switching is hard, losing history is bad, and putting off old developers is way more dangerous than discouraging new ones. Using mailing lists and CVS is the price to pay to partake.


> CVS fits their development methodology just fine, so far.

That's what I am saying. However, there is a big difference between "we use CVS for good reasons" and "we are not aware of the limitations of CVS compared to modern DVCS". That's the difference between "sane" engineering conservatism and a "get off my lawn" attitude.


An interesting data point in the OTHER camp is Emacs' decision to move to git. But Emacs has much different goals as a project than OpenBSD.


Just a note that Emacs was on BZR not CVS.


> That said, if their attitude is really "if you complain about CVS, you are not worthy", that is bound to turn off many people, and for good reason.

Perhaps their goal isn't reaching those people.


Moreover, that's not what I said. Being able to come into a project and work within its ecosystem is an extremely valuable skill; it's the exact opposite of the mentality that drives NIH syndrome. OpenBSD is an OS written for high security and high stability. They simply don't have time to quibble with the next big VCS thing, especially when they're not having issues with the old one. They know the edge cases with CVS; migrating an entire project and dev team to git or SVN or Mercurial or whatever would be way more disruptive than just dealing with those quirks.

Side note: I'd be really interested to see an OpenBSD designed VCS. I'd be very curious what the design would look like.


> CVS is frankly hideous.

I don't know. What's wrong with it? I've used it for decades now and it's never really been a problem. Sure, I like git, but the idea that CVS is some kind of show-stopper just never resonated with me.


Why not Subversion (aka SVN)? Almost every open source project moved from the older CVS to Subversion around 2005. About 2-4 years ago some moved on to Git (or Mercurial (Hg)). Subversion has several advantages and fixed issues that CVS had; besides that, it is somewhat similar to CVS, unlike Git/Hg.


Presumably CVS works for them. They had used it successfully for 5 years before Subversion existed and are presumably well acquainted with its foibles. And all systems have foibles, so if you are used to one set of them and it is working for you, then fixing things that aren't would seem to be a much better use of time, rather than following the whims of fashion to court those developers who will only help if they don't have to learn another versioning system.


What is something the OpenBSD team could accomplish with a different VCS that they are not doing now?


CVSWeb is far nicer than anything I've seen for Git, Mercurial or Subversion. I'm not sure that's going to make anyone move back to CVS though :-)


AnonCVS (the idea of a repository that J. Random user could just look at at any time) was the first innovation of OpenBSD. I guess you whipper-snappers don't remember it, but back 20 years ago for the most part unless you were on the team you would only see a project's source with each release. And several projects had an annoying tendency to take patch submissions and mutter, "yeah, um, we've fixed that already in CVS" and quietly apply your patch.



That really is the best statement about this.


The problem is, even if you don't like git, subversion is a "better CVS than CVS." Just to give one example, the OpenBSD guys have to periodically manually delete old versions of various files because CVS has a limit on how many old revisions it can track. Now that is just silly.



