Magika: AI powered fast and efficient file type identification (googleblog.com)
695 points by alphabetting on Feb 16, 2024 | 251 comments



This looks cool. I ran this on some web crawl data I have locally, so: all files you'd find on regular websites; HTML, CSS, JavaScript, fonts etc.

It identified some simple HTML files (html, head, title, body, p tags and not much else) as "MS Visual Basic source (VBA)", "ASP source (code)", and "Generic text document" where the `file` utility correctly identified all such examples as "HTML document text".

Some woff and woff2 files it identified as "TrueType Font Data", others are "Unknown binary data (unknown)" with low confidence guesses ranging from FLAC audio to ISO 9660. Again, the `file` utility correctly identifies these files as "Web Open Font Format".

I like the idea, but the current implementation can't be relied on IMO; especially not for automation.

A minor pet peeve also: it doesn't seem to detect when its output is a pipe and strip the shell colour escapes, resulting in `^[[1;37` and `^[[0;39m` wrapping every line if you pipe the output into a vim buffer or similar.
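For reference, the conventional fix is to check `isatty()` before emitting escapes. A minimal sketch in Python (the `colourise` helper and the specific escape codes are illustrative, not Magika's actual code):

```python
import os
import sys

def colourise(text: str, stream=sys.stdout) -> str:
    # Emit ANSI colour codes only when writing to an interactive terminal,
    # and respect the NO_COLOR convention for opting out.
    if stream.isatty() and "NO_COLOR" not in os.environ:
        return f"\x1b[1;37m{text}\x1b[0;39m"
    return text

print(colourise("text/html"))  # plain when piped, bold white on a TTY
```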


Thanks for the feedback -- we will look into it. If you can share the list of URLs with us, that would be very helpful so we can reproduce the issue; send us an email at magika-dev@google.com if that is possible.

For crawling we have planned a head-only model to avoid fetching the whole file, but it is not ready yet -- we weren't sure what use-cases would emerge, so it is good to know that such a model might be useful.

We mostly use Magika internally to route files for AV scanning, as we wrote in the blog post, so it is possible that despite our best effort to test Magika extensively on various file types it is not as good on font formats as it should be. We will look into it.

Thanks again for sharing your experience with Magika; this is very useful.


Sure thing :)

Here's[0] a .tgz file with 3 files in it that are misidentified by magika but correctly identified by the `file` utility: asp.html, vba.html, unknown.woff

These are files that were in one of my crawl datasets.

[0]: https://poc.lol/files/magika-test.tgz


Thank you - we are adding them to our test suite for the next version.


Super, thank you! I look forward to it :)

I've worked on similar problems recently so I'm well aware of how difficult this is. An example I've given people is in automatically detecting base64-encoded data. It seems easy at first, but any four, eight, or twelve (etc) letter word is technically valid base64, so you need to decide if and how those things should be excluded.
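A sketch of why this is ambiguous (the `plausibly_base64` helper is a hypothetical name for illustration):

```python
import base64
import re

def plausibly_base64(s: str) -> bool:
    # Syntactic check only: base64 alphabet, optional padding,
    # and a length that is a multiple of 4.
    if len(s) % 4 != 0 or not re.fullmatch(r"[A-Za-z0-9+/]+={0,2}", s):
        return False
    try:
        base64.b64decode(s, validate=True)
        return True
    except Exception:
        return False

# Ordinary English words of the right length pass the syntactic test:
plausibly_base64("Test")      # True, yet it's just a word
plausibly_base64("Testing1")  # True
plausibly_base64("foo")       # False: length not a multiple of 4
```

So a detector needs additional signals (entropy, decoded-byte distribution, context) to decide whether a syntactically valid string is actually encoded data.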


Do you have permission to redistribute these files?


LOL nice b8 m8. For the rest of you who are curious, the files look like this:

    <HTML><HEAD>
    <TITLE>Access Denied</TITLE>
    </HEAD><BODY>
    <H1>Access Denied</H1>
     
    You don't have permission to access "http&#58;&#47;&#47;placement&#46;api&#46;test4&#46;example&#46;com&#47;" on this server.<P>
    Reference&#32;&#35;18&#46;9cb0f748&#46;1695037739&#46;283e2e00
    </BODY>
    </HTML>
Legend. "Do you have permission" hahaha.


You are asking what if this guy has "web crawl data" that google does not have?

And what if he says no, he does not have permission.


> You are asking what if this guy has "web crawl data" that google does not have?

No, I'm asking if he has permission to redistribute these files.


Are you attempting to assert that use of these files solely for the purpose of improving a software system meant to classify file types does not fall under fair use?

https://en.wikipedia.org/wiki/Fair_use


I'm asking a question.

Here's another one for you: Do you believe that all pictures you have ever taken, all emails you have ever written, all code you have ever written could be posted here on this forum to improve someone else's software system?

If so, could you go ahead and post that zip? I'd like to ingest it in my model.


Your question seems orthogonal to the situation. The three files posted seem to be the minimum amount of information required to reproduce the bug. Fair use encompasses a LOT of uses of otherwise copyrighted work, and this seems clearly to be one.


I don't see how publicly posting them on a forum is

> the minimum amount of information required to reproduce the bug

MAYBE if they had communicated privately that'd be an argument that made sense.


So you don't think that software development which happens in public web forums deserves fair use protection?


That's an interesting way to frame "publicly posted someone else's data without their consent for anyone to see and download"


I notice you're so invested that you haven't noticed that the files have been renamed and zipped such that they're not even indexable. How you'd expect anyone not participating in software development to find them is yet to be explained.


[flagged]


Have fun, buddy!


It's three files that were scraped from (and so publicly available on) the web. That's not at all similar to your strawful analogy.


I'm over here trying to fathom the lack of control over one's own life it would take to cause someone to turn into an online copyright cop, when the data in question isn't even their own, is clearly divorced from any context which would make it useful for anything other than fixing the bug, and about which the original copyright holder hasn't complained.

Some people just want to argue.

If the copyright holder has a problem with the use, they are perfectly entitled to spend some of their dollar bills to file a lawsuit, as part of which the contents of the files can be entered into the public record for all to legally access, as was done with Scientology.

I don't expect anyone would be so daft.


Literally just asked a question and that seems to have set you off, bud. Are you alright? Do you need to feed your LLM more data to keep it happy?


I'm always happy to stand up for folks who make things over people who want to police them. Especially when nothing wrong has happened. Maybe take a walk and get some fresh air?


I share your distaste for people whose only contribution is subtraction but suggest you lay off the sarcasm though. Trolls; don't feed. (Well done on your project BTW)


I don't see any sarcasm from me in the thread. I had serious questions. Perhaps you could point out what you see? Thanks for the supportive words about the project.


Perhaps I misread "Maybe take a walk and get some fresh air?" - no worries though.


I've certainly seen people say similar things facetiously, but I was being genuine. I'm not sure if beeboobaa was trolling or not, I try to take what folks say at face value. They seemed to be pretty attached to a particular point of view, though. Happens to all of us. The thing for attachment is time and space and new experiences. Walks are great for those things, and also the best for organizing thoughts. Einstein loved taking walks for these reasons, and me too. It feels better to suggest something helpful when discussion derails, than to hurl insults as happens all too frequently.


Literally all you did is bitch and moan about someone asking a simple question, lol. Go touch grass.


I already had my walk this morning, thanks! If you'd like to learn more about copyright law, including about all the ways it's fuzzy around the edges for legitimate uses like this one, I highly recommend groklaw.net. PJ did wonderful work writing about such boring topics in personable and readable ways. I hope you have a great day!


no thanks, not interested in your american nonsense laws. lecturing people who are asking SOMEONE ELSE a question is a terrible personality trait btw


181 out of 195 countries and counting!

https://en.wikipedia.org/wiki/Berne_Convention

Look at that map!

https://upload.wikimedia.org/wikipedia/commons/7/76/Berne_Co...

P.S. Berne doesn't sound like a very American name.

You would really learn a lot from reading Groklaw. Of course, I can't make you. Good luck in the world though!


man, you really are putting a lot of effort into justifying stealing other people's content


Thanks for such great opportunities to post educational content to Hacker News! I genuinely hope some things go your way, man. Rooting for you. Go get 'em.


If you can’t undermine someone’s argument, undermine their nationality. American tech culture doesn’t do this as much as it should, perhaps because we know eventually those folks wake up.


Not sure what your point is, but why would i care to learn about the laws of some other dude's country that he's using to support his bizarro arguments?


> why would i care to learn about the laws of some other dude's country

The website you're attempting to police other people's behavior on is hosted in the country you're complaining about. Lol.

Maybe there is a website local to your country where your ideas would be better received?


You're so brave


Thanks!


What is the MIME type of a .tar file; and what are the MIME types of the constituent concatenated files within an archive format like e.g. tar?
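One common answer: the container itself is `application/x-tar`, and each member is typed separately. A rough sketch using only Python's stdlib (the `member_types` helper is an illustrative name, and it naively guesses by extension; real tools would sniff member contents):

```python
import mimetypes
import tarfile

def member_types(path: str) -> dict:
    # The archive as a whole is application/x-tar; each regular-file
    # member gets its own guess via the extension-based mimetypes table.
    with tarfile.open(path) as tf:
        return {
            m.name: mimetypes.guess_type(m.name)[0] or "application/octet-stream"
            for m in tf.getmembers()
            if m.isfile()
        }
```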

hachoir/subfile/main.py: https://github.com/vstinner/hachoir/blob/main/hachoir/subfil...

File signature: https://en.wikipedia.org/wiki/File_signature

PhotoRec: https://en.wikipedia.org/wiki/PhotoRec

"File Format Gallery for Kaitai Struct"; 185+ binary file format specifications: https://formats.kaitai.io/

Cross-reference table: https://formats.kaitai.io/xref.html

AntiVirus software > Identification methods > Signature-based detection, Heuristics, and ML/AI data mining: https://en.wikipedia.org/wiki/Antivirus_software#Identificat...

Executable compression; packer/loader: https://en.wikipedia.org/wiki/Executable_compression

Shellcode database > MSF: https://en.wikipedia.org/wiki/Shellcode_database

sigtool.c: https://github.com/Cisco-Talos/clamav/blob/main/sigtool/sigt...

clamav sigtool: https://www.google.com/search?q=clamav+sigtool

https://blog.didierstevens.com/2017/07/14/clamav-sigtool-dec... :

    sigtool --find-sigs "$name" | sigtool --decode-sigs
List of file signatures: https://en.wikipedia.org/wiki/List_of_file_signatures

And then also: clusterfuzz/oss-fuzz scans .txt source files with (sandboxed) static and dynamic analysis tools; `debsums`/`rpm -Va` verify that files on disk have the same (GPG-signed) checksums as the package they were supposedly installed from; a file-based HIDS builds a database of file hashes and compares what's on disk in a later scan with what was presumed good; ~gdesktop LLM tools scan every file; there are extended filesystem attributes for label-based MAC systems like SELinux; oh, and NTFS ADS.

A sufficient cryptographic hash function yields random bits with uniform probability. DRBG Deterministic Random Bit Generators need high entropy random bits in order to continuously re-seed the RNG random number generator. Is it safe to assume that hashing (1) every file on disk, or (2) any given file on disk at random, will yield random bits with uniform probability; and (3) why Argon2 instead of e.g. only two rounds of SHA256?

https://github.com/google/osv.dev/blob/master/README.md#usin... :

> We provide a Go based tool that will scan your dependencies, and check them against the OSV database for known vulnerabilities via the OSV API.

...with package metadata, not a (file hash, package) database that could be generated from OSV and the actual package files instead of their manifest of already-calculated checksums.

Might as well be heating a pool on the roof with all of this waste heat from hashing binaries built from code of unknown static and dynamic quality.

Add'l useful formats:

> Currently it is able to scan various lockfiles, debian docker containers, SPDX and CycloneDB SBOMs, and git repositories

Things like bittorrent magnet URIs, Named Data Networking, and IPFS are (file-hash based) "Content addressable storage": https://en.wikipedia.org/wiki/Content-addressable_storage


I’m not sure what this comment is trying to say


File-based hashing is done in so many places; there's so much heat.

Sub-file hashing with feature engineering is necessary for AV, which must take packing, obfuscation, loading, and dynamic analysis into account in addition to zip archives and magic file numbers.

AV (AntiVirus) applications with LLMs: what do you train them on, and what are the existing signature databases?

https://SigStore.dev/ (The Linux Foundation) also has a hash-file inverted index for released artifacts.

Also otoh with a time limit,

1. What file is this? Dirname, basename, hash(es)

2. Is it supposed to be installed at such path?

3. Per its header, is the file an archive or an image or a document?

4. What file(s) and records and fields are packed into a file, and which transforms were applied to the data?


> the current implementation can't be relied on IMO

What's your reasoning for not relying on this? (It seems to me that this would be application-dependent at the very least.)


I'm not the person you asked, but I'm not sure I understand your question and I'd like to. It whiffed multiple common softballs, to the point it brings into question the claims made about its performance. What reasoning is there to trust it?


It had 3 failures. How is that a sign it's untrustworthy? I'm sure all alternatives have more than 3 failures. You might be making assumptions about the distribution of successes and failures (GP didn't say how many files they tested to find those 3) or how "soft" they were. In an extreme case, they might even have been crafted adversarial examples. But even if not, they might have features that really do look more like some other file type from the point of view of the classifier even if it's not easily apparent to a human. Being strictly superior to a competent human is a pretty high bar to set.


> or how "soft" they were.

From the comment: It identified some simple HTML files (html, head, title, body, p tags and not much else) as "MS Visual Basic source (VBA)", "ASP source (code)", and "Generic text document" where the `file` utility correctly identified all such examples as "HTML document text".

That's pretty soft. Nothing "adversarial" claimed either.

> Being strictly superior to a competent human is a pretty high bar to set.

The bar is the file utility.


Those are only soft to a human. I looked at a couple and I picked them correctly but I don't know what details the classifier was seeing which I was blind to. Not to say it was correct, just that we can't call them soft just because they're short and easy for a human.

> The bar is the file utility.

It has higher accuracy than that. You would reject it just because the failures are different even though they're less?


Yes. Unpredictable failures are significantly worse than predictable ones. If file messes up, it's because it decided a ZIP-based document was a generic ZIP file. If Magika messes up, it's entirely random. I can work around file's failure modes, especially if it's one I work with often. Magika's failure modes strike at random and are not possible to anticipate. File also bails out when it doesn't know, a very common failure mode in Magika is that it confidently returns a random answer when it wasn't trained on a file type.


Your original statement was that having a couple of failures brings into question its claims about performance. It doesn't because it doesn't claim such high performance. 99.31% is lower than perhaps 997 out of 1000 or whatever the GP tested. Of course having unpredictable failures is a worry but it's a different worry.


They uploaded 3 sample files for the authors, there were more failures than that, and the failures that GP and others have experienced are of a less tolerable nature. This is the point I was making, that the value added by classifying files with no rigid structure is offset heavily by its unpredictable shortcomings and difficult-to-detect failure modes.

If you have a point of your own to make I'd prefer you jump to it. Nitpicking baseless assumptions like how many files the evil GP had to sift through in order to breathlessly bring us 3 bad eggs is not something I find worthwhile.


The point I'm making is that you drew a conclusion based on insufficient information, apparently by making assumptions about the distribution of failures or the definition of "easy".


> It whiffed multiple common softballs

I must have missed this in the article. Where was this?


...It's in the comment you were responding to. Directly above the section you quoted.


I understand that, but it wasn't clear to me where those examples came from.


It's pretty obvious from the whole comment that they're his own experience. Are you going anywhere with this or are you just saying things?


It provided the wrong file-types for some files, so I cannot rely on its output to be correct.

If you wanted to, for example, use this tool to route different files to different format-specific handlers it would sometimes send files to the wrong handlers.


Except a 100% correct implementation doesn't exist AFAIK. So if I want to do anything that makes a decision based on the type of a file, I have to pick some algorithm to do that. If I can do that correctly 99% of the time, that's better than not being able to make that decision at all, which is where I'm left if a perfect implementation doesn't exist.


Nobody's asking for perfection. But the AI is offering inexplicable and obvious nondeterministic mistakes that the traditional algorithms don't suffer from.

Magika goes wrong and your fonts become audio files and nobody knows why. Magic goes wrong and your ZIP-based documents get mistaken for generic ZIP files. If you work with that edge case a lot, you can anticipate it with traditional algorithms. You can't anticipate nondeterministic hallucination.


Seconding this.

Something like Magika is potentially useful as a second pass if conventional methods of detecting a file type fail or yield a low-confidence result. But, for the majority of binary files, those conventional methods are perfectly adequate. If the first few bytes of a file are "GIF89a", you don't need an AI to tell you that it's probably a GIF image.
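To make the first-pass idea concrete, here is a minimal sketch of conventional signature sniffing (the table is abbreviated and illustrative; real databases like libmagic's cover over a thousand types):

```python
MAGIC_SIGNATURES = {
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"%PDF-": "application/pdf",
    b"PK\x03\x04": "application/zip",  # also docx/xlsx/jar, hence second passes
}

def sniff(path):
    # Read a small prefix and match it against known signatures.
    with open(path, "rb") as f:
        head = f.read(16)
    for magic, mime in MAGIC_SIGNATURES.items():
        if head.startswith(magic):
            return mime
    return None  # unknown: a fallback classifier could take over here
```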


Doesn't seem all that non-deterministic. I tested the vba.html example multiple times and it always said it was VBA. I added a space between </HEAD> and <BODY> and it correctly picked HTML as most likely but with a low confidence.

So I think we can say it's sensitive to mysterious features, not that it's non-deterministic. Still leads to your same conclusion that you can't anticipate the failures. But I don't think you can with traditional tools either. Some magic numbers are just plain text (like MZ) which could legitimately accidentally appear at the beginning of a plain text file, for example.


Where are you getting the non-determinism part from? It would seem surprising for there to be anything non-deterministic about an ML model like this, and nothing in the original reports seems to suggest that either.


Large ML models tend to be uncorrectably non-deterministic simply from doing lots of floating point math in parallel. Addition and multiplication of floats is neither commutative nor associative - you may get different results depending on the order in which you add/multiply numbers.


Addition and multiplication of floats are commutative.


> It would seem surprising for there to be anything non-deterministic about an ML model like this

I think there may be some confusion of ideas going in here. Machine learning is fundamentally stochastic, so it is non-deterministic almost by definition.


Oh man, this brings me back! Almost 10 years ago I was working on a rails app trying to detect the file type of uploaded spreadsheets (xlsx files were being detected as application/zip, which is technically true but useless).

I found "magic" that could detect these and submitted a patch at https://bugs.freedesktop.org/show_bug.cgi?id=78797. My patch got rejected for needing to look at the first 3KB of the file to figure out the type. They had a hard limit that they wouldn't see past the first 256 bytes. Now in 2024 we're doing this with deep learning! It'd be cool if Google released some speed performance benchmarks here against the old-fashioned implementations. Obviously it'd be slower, but is it 1000x or 10^6x?


Co-author of Magika here (Elie). We didn't include the measurements in the blog post to avoid making it too long, but we did take those measurements.

Overall, file takes about 6ms for a single file and 2.26ms per file when scanning multiples. Magika is at 65ms for a single file and 5.3ms per file when scanning multiples.

So in the worst-case scenario Magika is about 10x slower, due to the time it takes to load the model, and 2x slower on repeated detection. This is why we said it is not that much slower.

We will have more performance measurements in the upcoming research paper. Hope that answers the question.


Is that single-threaded libmagic vs Magika using every core on the system? What are the numbers like if you run multiple libmagic instances in parallel for multiple files, or limit both libmagic and magika to a single core?

Testing it on my own system, magika seems to use a lot more CPU-time:

    file /usr/lib/*  0,34s user 0,54s system 43% cpu 2,010 total
    ./file-parallel.sh  0,85s user 1,91s system 580% cpu 0,477 total
    bin/magika /usr/lib/*  92,73s user 1,11s system 393% cpu 23,869 total
Looks about 50x slower to me. There's 5k files in my lib folder. It's definitely still impressively fast given how the identification is done, but the difference is far from negligible.


Do you have a sense of performance in terms of energy use? 2x slower is fine, but is that at the same wattage, or more?


That sounds like a nit / premature optimization.

Electricity is cheap. If this is sufficiently or actually important for your org, you should measure it yourself. There are too many variables and factors subject to your org’s hardware.


The hardware requirements of a massively parallel algorithm can't possibly be "a nit" in any universe inhabited by rational beings.


Totally disagree. Most end users are on laptops and mobile devices these days, not desktop towers. Thus power efficiency is important for battery life. Performance per watt would be an interesting comparison.


What end users are working with arbitrary files that they don’t know the identification of?

This entire use case seems to be one suited for servers handling user media.


File managers that render preview images. Even detecting which software to open the file with when you click it.

Of course on Windows the convention is to use the file extension, but on other platforms the convention is to look at the file contents


> on other platforms the convention is to look at the file contents

MacOS (that is, Finder) also looks at the extension. That has also been the case with any file manager I've used on Linux distros that I can recall.


You might be surprised. Rename your Photo.JPG as Photo.PNG and you'll still get a perfectly fine thumbnail. The extension is a hint, but it isn't definitive, especially when you start downloading from the web.


Browsers often need to guess a file type


Theoretically? Anyone running a virus scanner.

Of course, it's arguably unlikely a virus scanner would opt for an ML-based approach, as they specifically need to be robust against adversarial inputs.


> it's arguably unlikely a virus scanner would opt for an ML-based approach

Several major players such as Norton, McAfee, and Symantec all at least claim to use AI/ML in their antivirus products.


You'd be surprised what an AV scanner would do.

https://twitter.com/taviso/status/732365178872856577


I mean if you care about that you shouldn't be running anything that isn't highly optimized. Don't open webpages that might be CPU or GPU intensive. Don't run Electron apps, or really anything that isn't built in a compiled language.

Certainly you should do an audit of all the Android and iOS apps as well, to make sure they've been made in an efficient manner.

Block ads as well, they waste power.

This file identification is SUCH a small aspect of everything that is burning power in your laptop or phone as to be laughable.


Whilst energy usage is indeed a small aspect this early on when using bespoke models, we do have to consider that this is a model for simply identifying a file type.

What happens when we introduce more bespoke models for manipulating the data in that file?

This feels like it could slowly boil to the point of programs using magnitudes higher power, at which point it'll be hard to claw it back.


That's a slippery slope argument, which is a common logical fallacy[0]. This model being inefficient compared to the best possible implementation does not mean that future additions will also be inefficient.

It's equivalent to saying that many people programming in Ruby is causing all future programs to be less efficient. Which is not true. In fact, many people programming in Ruby has caused Ruby to become more efficient, because it gets optimised as it gets used more (or Python, for that matter).

It's not as energy efficient as C, but it hasn't caused it to get worse and worse, and spiral out of control.

Likewise smart contracts are incredibly inefficient mechanisms of computation. The result is mostly that people don't use them for any meaningful amounts of computation, that all gets done "Off Chain".

Generative AI is definitely less efficient, but it's likely to improve over time, and indeed things like quantization have allowed models that would normally require much more substantial hardware resources (and therefore be more energy intensive) to run on smaller systems.

[0]: https://en.wikipedia.org/wiki/Slippery_slope


That is a fallacy fallacy. Just because some slopes are not slippery that does not mean none of them are.


The slippery slope fallacy is: "this is a slope. you will slip down it." and is always fallacious. Always. The valid form of such an argument is: "this is a slope, and it is a slippery one, therefore, you will slip down it."


No, it isn't.


Yeah. Yeah, it is.


>This feels like it could slowly boil to the point of programs using magnitudes higher power, at which point it'll be hard to claw it back.

We're already there. Modern software is, by and large, profoundly inefficient.


In general you're right, but I can't think of a single local use for identifying file types by a human on a laptop - at least, one with scale where this matters. It's all going to be SaaS services where people upload stuff.


We are building a data analysis tool with great UX, where users select data files, which are then parsed and uploaded to S3 directly, on their client machines. The server only takes over after this step.

Since the data files can be large, this approach avoids having to transfer the file twice: first to the server, and then to S3 after parsing.


This doesn't sound like a very common scenario.


I've ended up implementing a layer on top of "magic" which, if magic detects application/zip, reads the zip file manifest and checks for telltale file names to reliably detect Office files.

The "magic" library does not seem to be equipped with the capabilities needed to be robust against the zip manifest being ordered in a different way than expected.

But this deep learning approach... I don't know. It might be hard to shoehorn in to many applications where the traditional methods have negligible memory and compute costs and the accuracy is basically 100% for cases that matter (detecting particular file types of interest). But when looking at a large random collection of unknown blobs, yeah, I can see how this could be great.
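A sketch of that kind of second-pass layer (marker paths are the standard OOXML part names; the `refine_zip_mime` helper name is illustrative):

```python
import zipfile

OOXML_MARKERS = {
    "word/document.xml": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "xl/workbook.xml": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "ppt/presentation.xml": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}

def refine_zip_mime(path: str) -> str:
    # Look for telltale member names anywhere in the manifest,
    # so the ordering of entries in the zip doesn't matter.
    with zipfile.ZipFile(path) as zf:
        names = set(zf.namelist())
    for marker, mime in OOXML_MARKERS.items():
        if marker in names:
            return mime
    return "application/zip"
```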


If you're curious, here's how I solved it for ruby back in the day. Still used magic bytes, but added an overlay on top of the freedesktop.org DB: https://github.com/mimemagicrb/mimemagic/pull/20


Many commenters seem to be using magic instead of file, any reasons?


magic is the core detection logic of file that was extracted out to be available as a library, so these days file is just a higher-level wrapper around magic.


thanks


From the first paragraph:

> enabling precise file identification within milliseconds, even when running on a CPU.

Maybe your old-fashioned implementations were detecting in microseconds?


Yeah I saw that, but that could cover a pretty wide range and it's not clear to me whether that relies on preloading a model.


> At inference time Magika uses Onnx as an inference engine to ensure files are identified in a matter of milliseconds, almost as fast as a non-AI tool even on CPU.


> They had a hard limit that they wouldn't see past the first 256 bytes.

Then they could never detect zip files with certainty, given that to do that you need to read up to 65KB (+ 22 bytes) at the END of the file. The reason is that the zip archive format allows "garbage" bytes both at the beginning of the file and in between local file headers... and it's actually not uncommon to prepend a program that self-extracts the archive, for example. The only way to know if a file is a valid zip archive is to look for the End of Central Directory record, which is always at the end of the file AND allows for a comment of unknown length at the end (and as the comment-length field takes 2 bytes, the comment can be up to 65K long).
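A sketch of that end-of-file scan (the `has_valid_eocd` helper name is illustrative; real tools would go on to validate the central directory it points at):

```python
import struct

EOCD_SIG = b"PK\x05\x06"  # End of Central Directory signature
MAX_TAIL = 22 + 65535     # fixed EOCD size + maximum comment length

def has_valid_eocd(path: str) -> bool:
    # Read the last 65557 bytes and search backwards for the EOCD record.
    with open(path, "rb") as f:
        f.seek(0, 2)
        size = f.tell()
        f.seek(max(0, size - MAX_TAIL))
        tail = f.read()
    pos = tail.rfind(EOCD_SIG)
    if pos == -1 or pos + 22 > len(tail):
        return False
    # The 2-byte comment-length field (at offset 20 in the record) must
    # account for exactly the bytes remaining after the record.
    comment_len = struct.unpack_from("<H", tail, pos + 20)[0]
    return pos + 22 + comment_len == len(tail)
```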


That's why the whole question is ill-formed. A file does not have exactly one type. It may be a valid input in various contexts. A zip archive may also very well be something else.


FWIW, file can now distinguish many types of zip containers, including OOXML files.


As someone that has worked in a space that has to deal with uploaded files for the last few years, and someone who maintains a WASM libmagic Node package ( https://github.com/moshen/wasmagic ) , I have to say I really love seeing new entries into the file type detection space.

Though I have to say when looking at the Node module, I don't understand why they released it.

Their docs say it's slow:

https://github.com/google/magika/blob/120205323e260dad4e5877...

It loads the model at runtime:

https://github.com/google/magika/blob/120205323e260dad4e5877...

They mark it as Experimental in the documentation, but it seems like it was just made for the web demo.

Also as others have mentioned. The model appears to only detect 116 file types:

https://github.com/google/magika/blob/120205323e260dad4e5877...

Where libmagic detects... a lot. Over 1600 last time I checked:

https://github.com/file/file/tree/4cbd5c8f0851201d203755b76c...

I guess I'm confused by this release. Sure it detected most of my list of sample files, but in a sample set of 4 zip files, it misidentified one.


Made a small test to try it out: https://gist.github.com/moshen/784ee4a38439f00b17855233617e9...

    hyperfine ./magika.bash ./file.bash
    Benchmark 1: ./magika.bash
      Time (mean ± σ):     706.2 ms ±  21.1 ms    [User: 10520.3 ms, System: 1604.6 ms]
      Range (min … max):   684.0 ms … 738.9 ms    10 runs
    
    Benchmark 2: ./file.bash
      Time (mean ± σ):      23.6 ms ±   1.1 ms    [User: 15.7 ms, System: 7.9 ms]
      Range (min … max):    22.4 ms …  29.0 ms    111 runs
    
    Summary
      './file.bash' ran
       29.88 ± 1.65 times faster than './magika.bash'


Realistically, either you're identifying one file interactively and you don't care about latency differences in the 10s of ms, or you're identifying in bulk (batch command line or online in response to requests), in which case you should measure the marginal cost and exclude Python startup and model loading times.


Going by those numbers it's taking almost a second to run, not 10s of ms, and it's doing something massively parallel in that time. So basically all your cores will spike to 100% for almost a second during those one-shot identifications. It looks like GP has a CPU with 12-16 threads, and it is using those while still being 30 times slower than single-threaded libmagic.

That tool needs 100x more CPU time just to figure out some filetypes than vim needs to open a file from a cold start (which presumably includes using libmagic to check the type).

If I had to wait a second just to open something during which that thing uses every resource available on my computer to the fullest, I'd probably break my keyboard. Try using that thing as a drop-in file replacement, open some folder in your favorite file manager, and watch your computer slow to a crawl as your file manager tries to figure out what thumbnails to render.

It's utterly unsuitable for "interactive" identifications.


My little script is trying to identify in bulk, at least by passing 165 file paths to `magika` and `file`.

Though, I absolutely agree with you. I think realistically it's better to do this kind of thing in a library rather than shell out to it at all. I was just trying to get an idea on how it generally compares.

Another note: I was trying to be generous to `magika` here, because for single-file identification it's about 160-180ms on my machine vs <1ms for `file`. I realize quite a bit of that number is going to be Python startup, which is why I didn't go with it when pushing that benchmark up earlier. I'll probably push an update to that gist to include the single-file benchmark as well.


I've updated this script with some single-file cli numbers, which are (as expected) not good. Mostly just comparing python startup time for that.

    make
    sqlite3 < analyze.sql
    file_avg              python_avg         python_x_times_slower_single_cli
    --------------------  -----------------  --------------------------------
    0.000874874856301821  0.179884610224334  205.611818568799
    file_avg            python_avg     python_x_times_slower_bulk_cli
    ------------------  -------------  ------------------------------
    0.0231715865881818  0.69613745142  30.0427184289163


We released the npm package because we created a web demo and thought people might want to use it too. We know it is not as fast as the Python version or a C++ version would be -- which is why we marked it as experimental.

The release includes the Python package and the CLI, which are quite fast and are the main way we expected people to use Magika -- sorry if that wasn't clear in the post.

The goal of the release is to offer a tool that is far more accurate than other tools and works on the major file types, as we hope it will be useful to the community.

Glad to hear it worked on your files.


Thank you for the release! I understand you're just getting it out the door. I just hope to see it delivered as a native library or something more reusable.

I did try the python cli, but it seems to be about 30x slower than `file` for the random bag of files I checked.

I'll probably take some time this weekend to make a couple of issues around misidentified files.

I'll definitely be adding this to my toolset!


Hello! We wrote the Node library as a first functional version. Its API is already stable, but it's a bit slower than the Python library for two reasons: it loads the model at runtime, and it doesn't do batch lookups, meaning it calls the model for each file. Other than that, it's just as fast for single file lookups, which is the most common use case.


Good to know! Thank you. I'll definitely be trying it out. Though, I might download and hardcode the model ;)

I also appreciate the use of ONNX here, as I'm already thinking about using another version of the runtime.

Do you think you'll open source your F1 benchmark?


Can we do the 1600 if known, if not, let the AI take a guess?


Absolutely, and honestly in a non-interactive ingestion workflow you're probably doing multiple checks anyway. I've worked with systems that call multiple libraries and hand-coded validation for each incoming file.

Maybe it's my general malaise, or disillusionment with the software industry, but when I wrote that I was really just expecting more.


> The model appears to only detect 116 file types [...] Where libmagic detects... a lot. Over 1600 last time I checked

As I'm sure you know, in a lot of applications, you're preparing things for a downstream process which supports far fewer than 1600 file types.

For example, a printer driver might call on file to check if an input is postscript or PDF, to choose the appropriate converter - and for any other format, just reject the input.

Or someone training an ML model to generate Python code might have a load of files they've scraped from the web, but might want to discard anything that isn't Python.


Okay, but your one file type is more likely to be included in the 1600 that libmagic supports than in Magika's 116?

For that matter, the file types I care about are unfortunately misdetected by Magika (which is also an important point - the `file` command at least gives up and says "data" when it doesn't know, whereas the Magika demo gives a confidently wrong answer).

I don't want to criticize the release because it's not meant to be a production-ready piece of software, and I'm sure the current 116 types isn't a hard limit, but I do understand the parent comment's contention.


Surely identifying just one file type (or two, as in your example) is a much simpler task that shouldn’t rely on horribly inefficient and imprecise “AI” tools?


It's for researchers, probably.


Yeah, there is this line:

    By open-sourcing Magika, we aim to help other software improve their file identification accuracy and offer researchers a reliable method for identifying file types at scale.
Which implies a production-ready release for general usage, as well as usage by security researchers.


I ran a quick test on 100 semi-random files I had laying around. Of those, 81 were detected correctly, 6 were detected as the wrong file type, and 12 were detected with an unspecific file type (unknown binary/generic text) when a more specific type existed. In 4 of the unspecific cases, a low-confidence guess was provided, which was wrong in each case. However, almost all of the files which were detected wrong/unspecific are of types not supported by Magika, with one exception of a JSON file containing a lot of JS code as text, which was detected as JS code.

For comparison, file 5.45 (the version I happened to have installed) got 83 correct, 6 wrong, and 10 not specific. It detected the weird JSON correctly, but also had its own strange issues, such as detecting a CSV as just "data". The "wrong" here was somewhat skewed by the 4 GLSL shader code files that were in the dataset for some reason, all of which it detected as C code (Magika called them unknown). The other two "wrong" detections were also code formats that it seems it doesn't support. It was also able to output a lot more information about the media files.

Not sure what to make of these tests but perhaps they're useful to somebody.


> The "wrong" here was somewhat skewed by the 4 GLSL shader code files that were in the dataset for some reason, all of which it detected as C code

To be fair though, a snippet of GLSL shader code can be perfectly valid C.


Indeed, which is why I felt the need to call it out here. I'm not certain if the files in question actually happened to be valid C, but whether that's a meaningful mistake regardless is left to the reader to decide.


I'm extremely confused by the claim that other tools have worse precision or recall for APK or JAR files, which have a very regular structure. They should be a valid ZIP file with `META-INF/MANIFEST.MF` present (at least), and an APK would need `classes.dex` as well; at that point there is no other format that can be confused with APK or JAR, I believe. I'd like to see which file was causing the unexpected drop in precision or recall.


People do create JAR files without a META-INF/MANIFEST.MF entry.

The tooling even supports it. https://docs.oracle.com/en/java/javase/21/docs/specs/man/jar...:

  -M or --no-manifest
     Doesn't create a manifest file for the entries


Minecraft mods 14 years ago used to tell you to open the JAR and delete the META-INF when installing them so can’t rely on that one…


The `file` command checks only the first few bytes, and doesn’t parse the structure of the file. APK files are indeed reported as Zip archives by the latest version of `file`.
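Going beyond the leading bytes is cheap with a real ZIP parser; here's a stdlib-only sketch of structure-aware classification (the `classes.dex` / `META-INF/MANIFEST.MF` markers are the common convention discussed above, though as noted elsewhere in the thread, JARs built with `-M` omit the manifest):

```python
import io
import zipfile

def classify_zip(blob: bytes) -> str:
    """Look inside the archive rather than at the first few bytes,
    where APK, JAR, and plain ZIP all start with PK\\x03\\x04."""
    try:
        names = set(zipfile.ZipFile(io.BytesIO(blob)).namelist())
    except zipfile.BadZipFile:
        return "not a zip"
    if "classes.dex" in names or "AndroidManifest.xml" in names:
        return "apk"
    if "META-INF/MANIFEST.MF" in names:
        return "jar"  # JARs built with -M/--no-manifest will fall through
    return "zip"

# build a fake "jar" in memory
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("META-INF/MANIFEST.MF", "Manifest-Version: 1.0\n")
print(classify_zip(buf.getvalue()))  # jar
```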


This is false in every sense for https://www.darwinsys.com/file/ (probably the most widely used version of file). What it checks depends on the magic rules for a specific type, but it can check any part of your file. Many Linux distros are years out of date, so you might be using a very old version.

FILE_45:

    ./src/file -m magic/magic.mgc ../../OpenCalc.v2.3.1.apk
    ../../OpenCalc.v2.3.1.apk: Android package (APK), with zipflinger virtual entry, with APK Signing Block


Interesting! I checked with file 5.44 from Ubuntu 23.10 and 5.45 on macOS using homebrew, and in both cases, I got “Zip archive data, at least v2.0 to extract” for the file here[1]. I don’t have an Android phone to check and I’m also not familiar with Android tooling, so is this a corrupt APK?

[1] https://download.apkpure.net/custom/com.apkpure.aegon-319781...


That doesn't appear to be a valid link. Try building `file` from source and using the provided default magic database.


I also tried this with the sources of file from the homepage you linked above, and I still get the same results.

You could try this for yourself using the same APKPure file which I uploaded at the following alternative link[1]. Further, while this could be a corrupt APK, I can’t see any signs of that from a cursory inspection as both the `classes.dex` and `META-INF` directory are present, and this is APKPure’s own APK, instead of an APK contributed for an app contributed by a third-party.

[1] https://wormhole.app/Mebmy#CDv86juV9H4aRCL2DSJeDw


apks are also zipaligned so it's not like random users are going to be making them either


Wonder how this would handle a polyglot[0][1] that is valid as a PDF document, a ZIP archive, and a Bash script that runs a Python webserver hosting Kaitai Struct's WebIDE, allowing you to view the file's own annotated bytes.

[0]: https://www.alchemistowl.org/pocorgtfo/

[1]: https://www.alchemistowl.org/pocorgtfo/pocorgtfo16.pdf

Edit: just tested, and it does only identify the zip layer
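For a feel of why the zip layer is always recoverable, here is a minimal polyglot sketch: ZIP readers locate the central directory from the end of the file, so arbitrary prepended data is tolerated (this is the same trick self-extracting archives use), and Python's `zipfile` handles it transparently:

```python
import io
import zipfile

# Build a ZIP in memory, then prepend a shell script. The result is both a
# runnable script and a readable archive, because ZIP metadata lives at the
# END of the file and tolerates prepended data.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("hello.txt", "hi from the zip layer\n")
polyglot = b"#!/bin/sh\necho 'hi from the shell layer'\n" + buf.getvalue()

# zipfile skips the prepended bytes when locating the archive:
with zipfile.ZipFile(io.BytesIO(polyglot)) as z:
    print(z.namelist())  # ['hello.txt']
```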


You can try it here: https://google.github.io/magika/

It's relatively limited compared to `file` (~10% coverage); it's more like a specialized classifier for basic file formats, so such cases are really out of scope.

I guess it's more for detecting common file formats with high recall.

However, where is the actual source of the model? Say I want to add a new file format myself.

Apparently only the source of the interpreter is here, not the source of the model nor the training set, which is the most important thing.


Is there anything about the performance on unknown files?

I've tried a few that aren't "basic" but are widely used enough to be well supported in libmagic and it thinks they're zip files. I know enough about the underlying formats to know they're not using zip as a container under-the-hood.


Apparently the Super Mario Bros. 3 ROM is 100% a SWF file.

Cool that you can use it online though. Might end up using it like that. Although it seems like it may focus on common formats.


Yes, I totally agree; it's not what I would qualify as open source.

Do you plan to release the training code along the research paper? What about the dataset?

In any case, it's very neat to have ML-based technique and lightweight model for such tasks!


I don't understand why this needs to exist. Isn't file type detection inherently deterministic by nature? A valid tar archive will always have the same first few magic bytes. An ELF binary has a universal ELF magic and header. If the magic is bad, then the file is corrupted and not a valid XYZ file. What's the value in throwing "heuristics" and probabilistic inference into a process that is black and white by design?
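For binary container formats this really is close to a table lookup; the fuzziness arrives with text, which matches nothing. A toy sketch (signatures and offsets are the well-known public magic values):

```python
# A toy magic-byte table: deterministic for these binary formats, but note
# that plain-text formats (Python, CSV, HTML fragments...) match nothing,
# which is exactly where rule-based tools like `file` get fuzzy.
MAGICS = [
    (0, b"\x7fELF", "ELF binary"),
    (0, b"\x89PNG\r\n\x1a\n", "PNG image"),
    (0, b"PK\x03\x04", "ZIP container (also JAR/APK/DOCX/...)"),
    (257, b"ustar", "tar archive"),  # POSIX ustar magic sits at offset 257
]

def sniff(data: bytes):
    for offset, magic, name in MAGICS:
        if data[offset:offset + len(magic)] == magic:
            return name
    return None  # no magic matched: could be text, corrupt, or anything

print(sniff(b"\x7fELF" + b"\x00" * 12))         # ELF binary
print(sniff(b"import sys\nprint(sys.argv)\n"))  # None
```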


> What's the value in throwing "heuristics" and probabilistic inference into a process that is black and white by design?

I use the file command all the time. The value is when you get this:

    ... $  file somefile.xyz
    somefile.xyz: data
AIUI from reading TFA, magika can determine more filetypes than what the file command can detect.

It'd actually be very easy to determine if there's any value in magika: run file on every file on your filesystem and then for every file where the file command returns "data", run magika and see if magika is right.

If it's right, there's your value.

P.S: it may also be easier to run on Windows than the file command? But then I can't do much to help people who are on Windows.


From elsewhere in this thread, it appears that Magika detects far fewer file types than file (116 vs ~1600), which makes sense. For file, you just need to drop in a few rules to add a new, somewhat obscure type. An AI approach like Magika will need lots of training and test data for each new file type. Where Magika might have a leg up is with distinguishing different textual data files (i.e., source code), but I don't see that as a particularly big use case honestly.


It's not always deterministic, sometimes it's fuzzy depending on the file type. Example of this is a one-line CSV file. I tested one case of that, libmagic detects it as a text file while magika correctly detects it as a CSV (and gives a confidence score, which is killer).
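Python's stdlib ships a hand-written heuristic for exactly this fuzzy case, for comparison:

```python
import csv
import io

# csv.Sniffer answers the same fuzzy question: "is this text actually
# tabular, and with which delimiter?" No magic bytes involved.
sample = "name,age,role\nalice,42,engineer\n"
dialect = csv.Sniffer().sniff(sample)
print(dialect.delimiter)  # ,
rows = list(csv.reader(io.StringIO(sample), dialect))
print(rows[1])            # ['alice', '42', 'engineer']
```

Unlike Magika, though, it gives no confidence score, and it raises an error outright when it can't decide.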


But even with determinism, it's not always right. It's not too rare to find a text file with a byte order mark indicating UTF-16 (0xFE 0xFF) but then actually containing UTF-8. But what "format" does it have then? Is it UTF-8 or UTF-16? Same with e.g. a jar file missing a manifest. That's just a zip, even though I'm sure some runtime might eat it.

But when do you actually face the problem of having to guess a file's format? When reverse engineering? The last time I did something like this was in the 90s, trying to pick apart some texture from a directory of files called asset0001.k, and it turned out to be a bitmap or whatever. Fun times.


Consider, it's perfectly possible for a file to fit two or more file formats - polyglot files are a hobby for some people.

And there are also a billion formats that are not uniquely determined by magic bytes. You don't have to go further than text files.


This tool doesn't work this way.


This also works for formats like Python, HTML, and JSON.


I still don't see how this is useful. The only time I want to answer the question "what type of file is this" is if it is an opaque blob of binary data. If it's a plain text file like Python, HTML, or JSON, I can figure that out by just catting the file.


file (https://www.darwinsys.com/file/) already detects all these formats.


Indeed, but as pointed out in the blog post, file is significantly less accurate than Magika. There are also some file types that we support and file doesn't, as reported in the table.


I can't immediately find the dataset used for benchmarking. Is file actually failing on common files or just particularly nasty examples? If it's the latter then how does it compare to Magika on files that an average person is likely to see?


> Is file actually failing on common files or just particularly nasty examples? If it's the latter then how does it compare to Magika on files that an average person is likely to see?

That's not the point of file type guessing, is it? Google employs it as an additional security measure for user-submitted content, which absolutely makes sense given what malware devs do with file types.


Yes, but shouldn't the file type be part of the file, or (better) of the metadata of the file?

Knowing is better than guessing.


So instead of spending some of their human resources to improve libmagic, they used some of their computing power to create an "open source" neural net, which is technically more accurate than the "error-prone" hand-written rules (ignoring that it supports far fewer filetypes), and which is much less effective in an adversarial context, and they want it to "help other software improve their file identification accuracy," which of course it can't since neural nets aren't introspectable. Thanks guys.


Come on, can't you help but be impressed by this amazing AI tech? That gives us sci-fi tools like ... a less-accurate, incomplete, stochastic, un-debuggable, slower, electricity-guzzling version of `file`.


>So instead of spending some of their human resources to improve libmagic

A large megacorp can work on multiple things at once.

>an "open source" neural net, which is technically more accurate than the "error-prone" hand-written rules (ignoring that it supports far fewer filetypes)

You say that like it's a contradiction but it's not.

>and which is much less effective in an adversarial context,

Is it? This seems like an assumption.

>and they want it to "help other software improve their file identification accuracy," which of course it can't since neural nets aren't introspectable.

Being introspectable or not has no bearing on the accuracy of a system.


> > an "open source" neural net, which is technically more accurate than the "error-prone" hand-written rules (ignoring that it supports far fewer filetypes)

> You say that like it's a contradiction but it's not.

> > and they want it to "help other software improve their file identification accuracy," which of course it can't since neural nets aren't introspectable.

> Being introspectable or not has no bearing on the accuracy of a system.

"Open source" and "neural net" is the contradiction, as I went on to write. Even if magika were a more accurate version of file, the implication that it could "help [libmagic] improve" isn't really true, because how do you distill the knowledge from it into a patch for libmagic?

My point re: their "error-prone" claim is that their comparison was disingenuous due to the functionality difference between the tools. (Also with their implication that AIs work perfectly, though this one sounds pretty good by the numbers. I of course accept that there's likely to be some bugs in code written by humans.)

> > and which is much less effective in an adversarial context,

> Is it? This seems like an assumption.

It is, one based on what I've heard about AI classifiers over the years. Other commenters here are interested in this point, but while I don't see anyone experimenting on magika (it's new after all), the fact it's not mentioned in the article leads me to believe they didn't try attacking it themselves. (Or did, but with bad results, and so decided not to include that. Funnily enough they did mention adversarial attacks on manually-written classifiers...)



It's surprising that there are so many file types that seem relatively common which are missing from this list. There are no raw image file formats. There's nothing for CAD - either source files or neutral files. There's no MIDI files, or any other music creation types. There's no APL, Pascal, COBOL, assembly source file formats etc.


Well, what they used this for at Google was apparently scanning their user's files for things they shouldn't store in the cloud. Probably they don't care much about MIDI.


Worse it seems that for unknown formats it confidently claims that it is one of the known formats. Rather than saying "unknown" or "binary data".


No tracker / .mod files either, just use file.


Thanks for the list, we will probably try to extend the list of format supported in future revision.


Yeah this quickly went from 'additional helpful tool in the kit' to 'probably should use something else first'


As somebody who's dealt with the ambiguity of attempting to use file signatures in order to identify file type, this seems like a pretty useful library. Especially since it seems to be able to distinguish between different types of text files based on their format/content e.g. CSV, markdown, etc.


A somewhat surprising and genuinely useful application of the family of techniques.

I wonder how susceptible it is to adversarial binaries or, hah, prompt-injected binaries.


Elsewhere in the thread kevincox[1] points out that it's extremely susceptible to adversarial binaries:

> Worse it seems that for unknown formats it confidently claims that it is one of the known formats. Rather than saying "unknown" or "binary data".

Seems like this is genuinely useless for anybody but AI researchers.

[1] https://news.ycombinator.com/item?id=39395677


For the extremely limited number of file types supported, I question the utility of this compared to `magic`


It gets a lot of binary file formats wrong for me out-of-the-box. I think it needs to be a bit more effective before we can truly determine the effectiveness of such exploits.


But they reported >99% accuracy on their cherry-picked dataset! /s


“These aren’t the binaries you are looking for…”


This feels like old school Google. I like that it's just a static webpage that basically can't be shut down or sunsetted. It reminds me of when Google just made useful stuff and gave it away for free on a webpage, like Translate and Google Books. Obviously less life-changing than the above, but still a great option to have when I need this.


Today web browsers, code editors, and countless other software rely on file-type detection to decide how to properly render a file.

"web browsers"? Odd to see this coming from Google itself. https://en.wikipedia.org/wiki/Content_sniffing was widely criticised for being problematic for security.


Content sniffing can be disabled by the server (X-Content-Type-Options: nosniff), but it’s still used by default. Web browsers have to assume that servers are stupid, and that for relatively harmless cases, it’s fine to e.g. render a PNG loaded by an <img> even if it’s served as text/plain.


To me the obvious use case is to first use the file command but then, when file returns "DATA" (meaning it couldn't guess the file type), call magika.

I guess I'll be writing a wrapper (only for when using my shell in interactive mode) around file doing just that when I come back from vacation. I hate it when file cannot do its thing.

Put it this way: I use file a lot and I know at times it cannot detect a filetype. But is file often wrong when it does have a match? I don't think so...

So in most cases I'd have file correctly give me the filetype very quickly, but in those rare cases where file cannot find anything, I'd then use the slower but apparently more capable magika.
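A minimal sketch of such a wrapper, assuming a `magika` binary on PATH (the magika invocation and its flags are illustrative, not verified):

```shell
# Fall back to the slower classifier only when `file` punts with "data".
identify() {
    desc=$(file -b "$1")
    if [ "$desc" = "data" ]; then
        magika "$1"   # hypothetical invocation; adjust flags to taste
    else
        printf '%s\n' "$desc"
    fi
}
```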


I have seen 'file' misclassify many things when running it at large scale (millions of files) from a hodgepodge of sources. Unrelated types getting called 'GPG Private Keys', for example.

For textual data types, 'file' gets confused often, or doesn't give a precise type. GitHub's 'linguist' [1] tool does much better here, but is structured in such a way that it is difficult to call it on an arbitrary file or bytestring that doesn't reside in a git repo.

I'd love to have a classification tool that can more granularly classify textual files! It may not be Magika _today_ since it only supports 116-something types. For this use case, an ML-based approach will be more successful than an approach based solely on handwritten heuristic rules. I'm excited to see where this goes.


What are use-cases for this? I mean, obviously detecting the filetype is useful, but we kinda already have plenty of tools to do that, and I cannot imagine, why we need some "smart" way of doing this. If you are not a human, and you are not sure what is it (like, an unknown file being uploaded to a server) you would be better off just rejecting it completely, right? After all, there's absolutely no way an "AI powered" tool can be more reliable than some dumb, err-on-safer-side heuristic, and you wouldn't want to trust that thing to protect you from malicious payloads.


> no way an "AI powered" tool can be more reliable

The article provides accuracy benchmarks.

> you would be better off just rejecting it completely

They mention using it in gmail and Drive, neither of which have the luxury of rejecting files willy-nilly.


I have not tried it recently, but IIRC, Gmail does reject attachments which are zip files, for security reasons.


Gmail nukes zips if they contain an executable or some other 'prohibited' file type. Most email providers block executable attachments.


Virus detection is mentioned in the article. Code editors need to find the programming language for syntax highlighting of code before you give it a name. Your desktop OS needs to know which program to open files with. Or, recovering files from a corrupted drive. Etc

It's easy to distinguish, say, a PNG from a JPG file (or anything else that has well-defined magic bytes). But some files look virtually identical (eg. .jar files are really just .zip files). Also see polyglot files [1].

If you allow an `unknown` label or human intervention, then yes, magic bytes might be enough, but sometimes you'd rather have a 99% chance to be right about 95% of files vs. a 100% chance to be right about 50% of files.

[1] https://en.wikipedia.org/wiki/Polyglot_(computing)


Reminds me of when someone asked (on StackOverflow) how to recognize binaries for different architectures, like x86 or ARM-something or Apple M1 and so on.

I gave the idea to use the technique of NCD (normalized compression distance), based on Kolmogorov complexity. R. Cilibrasi was one of the great researchers in this area, and I think he worked at Google at some point.

Using AI seems to follow the same path: "learn" what represents some specific file and then compare the unknown file to those references (AI:all the parameters, NCD:compression against a known type).
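For the curious, NCD is tiny to sketch, with zlib standing in for the (uncomputable) Kolmogorov complexity; the sample inputs below are made up for illustration:

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance (Cilibrasi & Vitanyi): smaller means
    the compressor finds more shared structure between x and y."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

python_a = b"import os, sys\nfor p in sys.argv[1:]:\n    print(os.path.getsize(p))\n" * 4
python_b = b"import io, json\nwith io.open('cfg') as f:\n    print(json.load(f))\n" * 4
binary_c = bytes(range(256)) * 4

# Two Python snippets typically land closer together than Python vs. binary:
print(round(ncd(python_a, python_b), 2), round(ncd(python_a, binary_c), 2))
```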


I wrote an implementation of libmagic in Racket a few years ago (https://github.com/jjsimpso/magic). File type identification is a pretty interesting topic.

As others have noted, libmagic detects many more file types than Magika, but I can see Magika being useful for text files in particular, because anything written by humans doesn't have a rigid format.


What does it do with an Actually Portable Executable compiled by Cosmopolitan libc compiler?


It’s reported as a PE executable, `file` on the other hand reports it as a “DOS/MBR boot sector.”


I just want to say thank you for the release. There are quite a lot of complaints in the comments, but I think this is a useful and worthwhile contribution, and I appreciate the authors for going through the effort to get it approved for open source release. It would be great if the model training data was included (or at least documentation about how to reproduce it), but that doesn't preclude this being useful. Thanks!



MIME type detection is a very interesting thing. I wrote the media type detection for McAfee Web Gateway 7.x, and because it was a high-performance proxy, detection speed was a major focus, but so was precision, especially for "container" types like MS Office OLE-based files, etc. The base of it was a simple Lisp-like language that allowed us to write signatures very fast, and everything was combined with very aggressive caching, so we avoided reading data again and again. In tests, the detection was ~10x faster than file, and with the more flexible language we got more file types recognized precisely. There were still challenges with some formats: OLE-based files have a FAT directory structure at the end of the file, and you needed to walk the tree to find the top-level structure to distinguish an Excel file from an Excel file embedded into a Word document.

Stream detection was also quite a fun task...


Ah, I remember: the self-extracting .msi file was one of the more challenging cases - it's an executable, a .cab file, and an OLE2 container all at once.


At $job we have been using Apache Tika for years.

It works, but occasionally has bugs and weird collisions when working with billions of files.

Happy to see new contributions in the space.


The results of which you'll never be 100% sure are correct...


They missed such an opportunity to name it "fail". It's like "file" but with "ai" in it.


What about faile?


But file(1) is already like that - my data files without headers are reported randomly as disk images, compressed archives, or even executables for never-heard-of machines.


Other methods use heuristics to guess many filetypes and in the benchmark they show worse performance (in terms of precision). Assuming benchmarks are not biased, the fact that this approach uses AI heuristics instead of hard-coded heuristics shouldn't make it strictly worse.


I wonder how it performs with detecting C vs C++ vs ObjC vs ObjC++ and for bonus points: the common C/C++ subset (which is an incompatible C fork), also extra bonus points for detecting language version compatibility (e.g. C89 vs C99 vs C11...).

Separating C from C++ and ObjC is where the file type detection on GitHub traditionally had problems (though it has been getting dramatically better over time); from an "AI-powered" solution trained on the entire internet, I would expect better right from the start.

The list here doesn't even mention any of those languages except C though:

https://github.com/google/magika/blob/main/docs/supported-co...


But will it let you print on Tuesday[1]?

1: https://bugs.launchpad.net/ubuntu/+source/cupsys/+bug/255161...


For a subscription fee.


My FOSS desktop text editor performs a subset of file type identification using the first 12 bytes, detecting the type quite quickly:

* https://gitlab.com/DaveJarvis/KeenWrite/-/blob/main/src/main...

There's a much larger list of file signatures at:

* https://github.com/veniware/Space-Maker/blob/master/FileSign...


Nice, perfect timing. I just restored "some" files (40GB) with PhotoRec [1], but its file type detection set some wrong types.

Edit: It would be super helpful if the "suffix" could be added as output so I can move the files to the right directory [2] ;)

[1] https://www.cgsecurity.org/wiki/PhotoRec [2] https://github.com/google/magika/issues/63


Assuming that I've not misunderstood, how does this compare to things like TrID [0], apart from being open source?

[0] https://mark0.net/soft-trid-e.html


The bulk of the short article is a set of performance benchmarks comparing Magika to TrID and others.


Argh, the risks of browsing the web without JavaScript and/or third party scripts enabled, you miss content, because rendering text and images on the modern web can't be done without them, apparently. (Sarcasm).

You are of course correct. I can see the images showing the comparison. Apologies.


I have a question: Is something like Magika enough to check if a file is malicious or not?

Example: users can upload PNG files (and only PNG is accepted). If Magika detects that the file is a PNG, does this mean the file is clean?


This comment from kevincox[1] says the answer is a hard "no":

> Worse it seems that for unknown formats it confidently claims that it is one of the known formats. Rather than saying "unknown" or "binary data".

There are other comments in this thread that make me think Google contaminated their test data with training data and the 99% results should not be taken at face value. OTOH I am not particularly surprised that Magika would be better than the other tools at distinguishing semi-unstructured plain text e.g. Java source vs. C++ source or YAMLs versus INIs. But that's a very different use case than many security applications. The comments here suggest Magika is especially susceptible to binary obfuscation.

[1] https://news.ycombinator.com/item?id=39395677


If that PNG of yours is not just an example, note that you can easily detect if the PNG has any extra data (which may or may not indicate an attempt at mischief) and reject the (rare) PNGs with extra data. I ran a script checking the thousands of PNGs on my system and found three with extra data, all three probably due to the "PNG acropalypse" bug (but mischief cannot be ruled out).

P.S.: To be clear, I'm not implying that extra data which shouldn't be there is the only way to have a malicious PNG.
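The extra-data check described above can be sketched in a few lines of stdlib Python: a well-formed PNG ends with its 12-byte IEND chunk, so anything after that terminator is trailing data (the function name here is mine, not from any library):

```python
# Flag PNGs that carry bytes after the IEND chunk -- a sketch of the
# "extra data" check described above.
PNG_SIG = b'\x89PNG\r\n\x1a\n'
IEND = b'\x00\x00\x00\x00IEND\xaeB`\x82'  # zero-length IEND chunk + its CRC

def png_trailing_bytes(data: bytes) -> int:
    """Return the number of bytes found after the first IEND chunk (0 = clean)."""
    if not data.startswith(PNG_SIG):
        raise ValueError("not a PNG")
    end = data.find(IEND)  # find, not rfind: extra data may itself contain IEND
    if end == -1:
        raise ValueError("no IEND chunk")
    return len(data) - (end + len(IEND))
```

A real scan would also walk the chunk list properly, but this is enough to catch acropalypse-style leftovers.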


The only way to do this reliably is to render the PNG to pixels, then render it back to a PNG with a trusted encoder. Of course, now you are taking on the risk of vulnerabilities in the "render to pixels" step. But the result will be clean.

AKA parse, don't validate.


> does this mean the file is clean?

No.


> Magika: AI powered fast and efficient file type identification

of 116 file types, with a tiny proprietary model, no training code, and no dataset.

> We are releasing a paper later this year detailing how the Magika model was trained and its performance on large datasets.

And? How does this googleblog post, plus source code that is useless without the closed-source model, advance the industry? All I see here is a loud marketing name and loud promises, but barely anything actually useful.


It seems like it defeats the purpose of such a tool that this initial version doesn't handle polyglot files. I hope they're quick to work on that.


Took a .dxf file and fed it to Magika. It says with 97% confidence that it must be a PowerShell file. A classic .dwg could be "mscompress" (whatever that is), 81%, or a GIF. Both couldn't be further from the truth.

Common files are categorized successfully – but well, yeah that's not really an achievement. Pretty much nothing more than a toy right now.


The real problem with deep learning approaches is hallucination and edge case failures. When someone finally fixes this, I hope it makes the HN front page.


It seems to detect my Android build.gradle.kts as Scala, which I suppose is a kind of hilarious confusion but not exactly useful.


This is useful for detecting the file types of unknown blobs with custom file extensions, where the file command just returns "data". Though it doesn't correctly identify Lua code for some reason; it guesses with low confidence that it's either Ruby or JavaScript, or anything but Lua.


If their “Exif Tool” is https://exiftool.org/ (what else could it be?), I don’t understand why they included it in their tests. Also, how does ExifTool recognize Python and html files?


I wonder what the output will be on polyglot files like run-anywhere binaries produced by cosmopolitan [1]

[1]: https://justine.lol/cosmopolitan/


Why is this piece of code being sold as open source, when in reality it just calls into a tiny proprietary ML blob, the actual training code for the model is closed, and a properly useful large model doesn't exist?


Not into a proprietary blob: the weights are in an Apache-licensed repo. There's no training code, but the repo contains enough information to recreate it, basically JSON-based configs describing the graph architecture. Even without those, the repo contains an ONNX model, from which one can work out the architecture.


I wonder how big of a deal it is that you'd have to retrain the model to support a new or changed file type? It doesn't seem like the repo contains training code, but I could be missing it...


After reading thru all the comments, honestly I still don't get the point of this system. What is potential practical value or applications of this model?


Is it really common enough for files not to be annotated with a useful/correct file type extension (e.g. .mp3, .txt) that a library like this is needed?


Yes!

Sometimes a file has no extension. Other times the extension is a lie. Still other times, you may be dealing with an unnamed bytestring and wish to know what kind of content it is.

This last case happens quite a lot in Nosey Parker [1], a detector of secrets in textual data. There, it is possible to come across unnamed files in Git history, and it would be useful to the user to still indicate what type of file it seems to be.

I added file type detection based on libmagic to Nosey Parker a while back, but it's not compiled in by default because libmagic is slow and complicates the build process. Also, libmagic is implemented as a large C library whose primary job is parsing, which makes the security side of me jittery.

I will likely add enabled-by-default filetype detection to Nosey Parker using Magika's ONNX model.

[1] https://github.com/praetorian-inc/noseyparker


Nothing is ever simple. Even for the most basic .txt files, it's still useful to know what the character encoding is (UTF-8/16? Latin-whatever? etc.) and what the line ending is (\n, \r\n, \r), as well as determining whether some maniac removed all the indentation characters and replaced them with a mystery number of spaces.

Then there are all the container formats that have different kinds of formats embedded in them (mov,mkv,pdf etc.)
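To illustrate the first point, here is a rough stdlib-Python sketch that guesses a text blob's encoding from its BOM (falling back to UTF-8) and its dominant line ending; real-world detection needs far more than this, and the function name is my own:

```python
import codecs
from collections import Counter

# BOM-to-codec table for the common Unicode encodings (order matters:
# the UTF-8 BOM must be checked before the UTF-16 ones would match).
BOMS = [(codecs.BOM_UTF8, 'utf-8-sig'),
        (codecs.BOM_UTF16_LE, 'utf-16-le'),
        (codecs.BOM_UTF16_BE, 'utf-16-be')]

def sniff_text(data: bytes):
    """Guess (encoding, line_ending) for a text blob. BOM-less input
    falls back to UTF-8; files with no newlines return None for the ending."""
    encoding = next((name for bom, name in BOMS if data.startswith(bom)), 'utf-8')
    text = data.decode(encoding, errors='replace')
    endings = Counter()
    endings['\r\n'] = text.count('\r\n')
    endings['\n'] = text.count('\n') - endings['\r\n']   # bare \n only
    endings['\r'] = text.count('\r') - endings['\r\n']   # bare \r only
    eol = max(endings, key=endings.get) if any(endings.values()) else None
    return encoding, eol
```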


A fun read in service of your first point: https://en.wikipedia.org/wiki/Bush_hid_the_facts


At multiple points in my career I've been responsible for APIs that accept PDFs. Many non-tech-savvy people, seeing this, will just change the extension of whatever file they're uploading to `.pdf`.

To make matters worse, there is some business software out there that will actually bastardize the PDF format and put garbage before the PDF file header. So for some things you end up writing custom validation and cleanup logic anyway.


malware can intentionally obfuscate itself


I guess I'm kind of a dummy on this, but why is it impressive to identify that a .js file is Javascript, a .md file is Markdown, etc?


Because it's done by inspecting the content, not the name of the file.


Very useful.

I wrote an editor that needed file type detection but the results of traditional approaches were flaky.


It can't correctly identify a DXF file in my testing. It categorizes it as plain text.


I use FFMPEG to detect if uploaded files are valid audio files. Would this be much faster?


Can we please god stop using AI like it's a meaningful word? This is really interesting technology; it's hamstrung by association with a predatory marketing term.


The name sounds like the Pokémon Magikarp or the anime series Madoka Magica.


I used an HTML file and added JPEG magic bytes to its header:

magika file.jpg

file.jpg: JPEG image data (image)
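For anyone wanting to reproduce the experiment above, a minimal sketch of the trick (the HTML content is made up; I'm using the standard JPEG SOI/APP0 marker bytes):

```python
# Reproduce the polyglot trick above: JPEG magic bytes in front of HTML.
JPEG_MAGIC = b'\xff\xd8\xff\xe0\x00\x10JFIF\x00'  # SOI + APP0 segment + "JFIF"

html = b'<!DOCTYPE html><html><body>hello</body></html>'
fake_jpeg = JPEG_MAGIC + html

# Any detector that only checks leading magic bytes now sees a JPEG,
# even though the payload is HTML.
assert fake_jpeg.startswith(b'\xff\xd8\xff')
```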


Why not detect it by checking the magic number of the buffer?


For starters, not every file has one, and many magic numbers can be incorrect.

Especially in the context of virus scanning, you don't trust what the file says it is.
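To make the discussion concrete, here is what a naive magic-number check looks like, and why it falls short: several formats share the same magic (anything zip-based), and plain text has no magic at all. A stdlib sketch, with a table of my own choosing:

```python
# A toy magic-number table -- real tools like file(1) carry thousands of rules.
MAGIC = {
    b'\x89PNG\r\n\x1a\n': 'png',
    b'%PDF-': 'pdf',
    b'PK\x03\x04': 'zip',   # also docx, xlsx, jar, apk, epub ...
    b'\x7fELF': 'elf',
}

def by_magic(data: bytes):
    """Return a type name from leading magic bytes, or None if nothing matches."""
    for magic, name in MAGIC.items():
        if data.startswith(magic):
            return name
    return None  # e.g. plain text, CSV, most source code
```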


> So far, libmagic and most other file-type-identification software have been relying on a handcrafted collection of heuristics and custom rules to detect each file format.

> This manual approach is both time consuming and error prone as it is hard for humans to create generalized rules by hand.

Pure nonsense. The rules are accurate, based on the actual formats, and not "heuristics".


The rules aren't based on the formats but on a small portion of them (their magic numbers). This makes them inaccurate (think docx vs. zip) and heuristic.
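To illustrate the docx-vs-zip case: both start with the same PK\x03\x04 magic, so telling them apart means opening the archive. A stdlib sketch relying on the OOXML convention that documents always contain a [Content_Types].xml entry:

```python
import io
import zipfile

def zip_flavor(data: bytes) -> str:
    """Classify a PK archive by its contents: OOXML documents (docx/xlsx/pptx)
    carry [Content_Types].xml; anything else is treated as a plain zip."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = zf.namelist()
    return 'ooxml' if '[Content_Types].xml' in names else 'zip'
```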


1. Not all file formats are well specified.
2. Not all files precisely follow the specification.
3. Not all file formats are mutually exclusive.

Those facts are clearly reflected in the table.


Besides compound file types, not all formats are well specified either. CSV is an example.
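CSV is a good example precisely because it has no magic bytes and RFC 4180 is widely ignored, so detection is necessarily heuristic. Even Python's standard library resorts to a guessing Sniffer, which can itself misfire (the sample data here is mine):

```python
import csv

# A semicolon-delimited "CSV" -- common in locales where ',' is the decimal mark.
sample = 'name;age;city\nada;36;london\nalan;41;manchester\n'

dialect = csv.Sniffer().sniff(sample)
print(dialect.delimiter)  # guessed delimiter: ';'
```

Sniffer works by looking at quote/delimiter co-occurrence and character frequency per line, so a sufficiently weird but valid file can still fool it.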


Voidtools - Everything.. looking at you to implement this


probably a lot of interesting work going on that looks like this for the virustotal db itself.


This couldn't have been released at a better time for me! I really needed a library like this.


Tell us why!


Thanks :)


Why? Just check the damn headers. Why do you need a power hungry and complicated AI model to do it? Why?


We have had file(1) for years


This is beyond what file is capable of. It’s also mentioned in the third paragraph.

RTFA.


Some HN readers may not know about file(1) even. It's fine to mention that $subj enhances that, but the rtfa part seems pretty unnecessary.


Yes, it's slower than file(1), uses more energy, recognizes fewer file types, and is less accurate.


FWICT, file is more capable and predictable, and also faster and more energy-efficient.


That's not what the performance table in the article implies: Magika's precision and recall hover around 99%, while magic sits at 92% precision and 72% recall.

One can doubt the representativeness of their dataset, but if what's in the article is correct, Magika is clearly the more capable and predictable tool.


Nearly 20 years back, this group of Linux users used to brag that Linux would identify files even if you changed the extension, while Windoze had to police you about changing extensions.


[flagged]


This is a 1 MB Keras ML model that's open source.

I passionately dislike Surveillance Capitalism but bringing it up when it’s completely irrelevant only weakens the argument.

RTFA.


> Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that".

From the HN guidelines


> can be shortened to "The article mentions that".

I shortened it to RTFA.


How do I pronounce this? Myajika or MaGika? Anyhow, it's super cool.


Can someone please help me understand why this is useful? The article mentions malware scanning applications, but if I'm sending you a malicious PDF, won't I want to clearly mark it with a .pdf extension so that you open it in your PDF app? Their examples are all very obvious based on file extensions.



