Magika: AI powered fast and efficient file type identification (googleblog.com)
695 points by alphabetting on Feb 16, 2024 | 251 comments



This looks cool. I ran this on some web crawl data I have locally, so: all files you'd find on regular websites; HTML, CSS, JavaScript, fonts etc.

It identified some simple HTML files (html, head, title, body, p tags and not much else) as "MS Visual Basic source (VBA)", "ASP source (code)", and "Generic text document" where the `file` utility correctly identified all such examples as "HTML document text".

Some woff and woff2 files it identified as "TrueType Font Data", others are "Unknown binary data (unknown)" with low confidence guesses ranging from FLAC audio to ISO 9660. Again, the `file` utility correctly identifies these files as "Web Open Font Format".

I like the idea, but the current implementation can't be relied on IMO; especially not for automation.

A minor pet peeve also: it doesn't seem to detect when its output is a pipe and strip the shell colour escapes, resulting in `^[[1;37` and `^[[0;39m` wrapping every line if you pipe the output into a vim buffer or similar.
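For reference, the conventional fix is to check `isatty()` before emitting escapes. A minimal sketch in Python (the `colourise` helper and the specific escape codes are illustrative, not Magika's actual code):

```python
import os
import sys

def colourise(text: str, stream=sys.stdout) -> str:
    # Emit ANSI colour codes only when writing to an interactive terminal,
    # and respect the NO_COLOR convention for opting out.
    if stream.isatty() and "NO_COLOR" not in os.environ:
        return f"\x1b[1;37m{text}\x1b[0;39m"
    return text

print(colourise("text/html"))  # plain when piped, bold white on a TTY
```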


Thanks for the feedback -- we will look into it. If you can share the list of URLs with us, that would be very helpful so we can reproduce the issue; send us an email at magika-dev@google.com if that is possible.

For crawling we have planned a head-only model to avoid fetching the whole file, but it is not ready yet -- we weren't sure what use-cases would emerge, so it is good to know that such a model might be useful.

We mostly use Magika internally to route files for AV scanning, as we wrote in the blog post, so it is possible that despite our best effort to test Magika extensively on various file types it is not as good on font formats as it should be. We will look into it.

Thanks again for sharing your experience with Magika; this is very useful.


Sure thing :)

Here's[0] a .tgz file with 3 files in it that are misidentified by magika but correctly identified by the `file` utility: asp.html, vba.html, unknown.woff

These are files that were in one of my crawl datasets.

[0]: https://poc.lol/files/magika-test.tgz


Thank you - we are adding them to our test suite for the next version.


Super, thank you! I look forward to it :)

I've worked on similar problems recently so I'm well aware of how difficult this is. An example I've given people is in automatically detecting base64-encoded data. It seems easy at first, but any four, eight, or twelve (etc) letter word is technically valid base64, so you need to decide if and how those things should be excluded.
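A sketch of why this is ambiguous (the `plausibly_base64` helper is a hypothetical name for illustration):

```python
import base64
import re

def plausibly_base64(s: str) -> bool:
    # Syntactic check only: base64 alphabet, optional padding,
    # and a length that is a multiple of 4.
    if len(s) % 4 != 0 or not re.fullmatch(r"[A-Za-z0-9+/]+={0,2}", s):
        return False
    try:
        base64.b64decode(s, validate=True)
        return True
    except Exception:
        return False

# Ordinary English words of the right length pass the syntactic test:
plausibly_base64("Test")      # True, yet it's just a word
plausibly_base64("Testing1")  # True
plausibly_base64("foo")       # False: length not a multiple of 4
```

So a detector needs additional signals (entropy, decoded-byte distribution, context) to decide whether a syntactically valid string is actually encoded data.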


Do you have permission to redistribute these files?


LOL nice b8 m8. For the rest of you who are curious, the files look like this:

    <HTML><HEAD>
    <TITLE>Access Denied</TITLE>
    </HEAD><BODY>
    <H1>Access Denied</H1>
     
    You don't have permission to access "http&#58;&#47;&#47;placement&#46;api&#46;test4&#46;example&#46;com&#47;" on this server.<P>
    Reference&#32;&#35;18&#46;9cb0f748&#46;1695037739&#46;283e2e00
    </BODY>
    </HTML>
Legend. "Do you have permission" hahaha.


You are asking what if this guy has "web crawl data" that google does not have?

And what if he says no, he does not have permission.


> You are asking what if this guy has "web crawl data" that google does not have?

No, I'm asking if he has permission to redistribute these files.


Are you attempting to assert that use of these files solely for the purpose of improving a software system meant to classify file types does not fall under fair use?

https://en.wikipedia.org/wiki/Fair_use


I'm asking a question.

Here's another one for you: Do you believe that all pictures you have ever taken, all emails you have ever written, all code you have ever written could be posted here on this forum to improve someone else's software system?

If so, could you go ahead and post that zip? I'd like to ingest it in my model.


Your question seems orthogonal to the situation. The three files posted seem to be the minimum amount of information required to reproduce the bug. Fair use encompasses a LOT of uses of otherwise copyrighted work, and this seems clearly to be one.


I don't see how publicly posting them on a forum is

> the minimum amount of information required to reproduce the bug

MAYBE if they had communicated privately that'd be an argument that made sense.


So you don't think that software development which happens in public web forums deserves fair use protection?


That's an interesting way to frame "publicly posted someone else's data without their consent for anyone to see and download"


I notice you're so invested that you haven't noticed that the files have been renamed and zipped such that they're not even indexable. How you'd expect anyone not participating in software development to find them is yet to be explained.


[flagged]


Have fun, buddy!


It's three files that were scraped from (and so publicly available on) the web. That's not at all similar to your strawful analogy.


I'm over here trying to fathom the lack of control over one's own life it would take to cause someone to turn into an online copyright cop, when the data in question isn't even their own, is clearly divorced from any context which would make it useful for anything other than fixing the bug, and about which the original copyright holder hasn't complained.

Some people just want to argue.

If the copyright holder has a problem with the use, they are perfectly entitled to spend some of their dollar bills to file a lawsuit, as part of which the contents of the files can be entered into the public record for all to legally access, as was done with Scientology.

I don't expect anyone would be so daft.


Literally just asked a question and that seems to have set you off, bud. Are you alright? Do you need to feed your LLM more data to keep it happy?


I'm always happy to stand up for folks who make things over people who want to police them. Especially when nothing wrong has happened. Maybe take a walk and get some fresh air?


I share your distaste for people whose only contribution is subtraction but suggest you lay off the sarcasm though. Trolls; don't feed. (Well done on your project BTW)


I don't see any sarcasm from me in the thread. I had serious questions. Perhaps you could point out what you see? Thanks for the supportive words about the project.


Perhaps I misread "Maybe take a walk and get some fresh air?" - no worries though.


I've certainly seen people say similar things facetiously, but I was being genuine. I'm not sure if beeboobaa was trolling or not, I try to take what folks say at face value. They seemed to be pretty attached to a particular point of view, though. Happens to all of us. The thing for attachment is time and space and new experiences. Walks are great for those things, and also the best for organizing thoughts. Einstein loved taking walks for these reasons, and me too. It feels better to suggest something helpful when discussion derails, than to hurl insults as happens all too frequently.


Literally all you did is bitch and moan about someone asking a simple question, lol. Go touch grass.


I already had my walk this morning, thanks! If you'd like to learn more about copyright law, including about all the ways it's fuzzy around the edges for legitimate uses like this one, I highly recommend groklaw.net. PJ did wonderful work writing about such boring topics in personable and readable ways. I hope you have a great day!


no thanks, not interested in your american nonsense laws. lecturing people who are asking SOMEONE ELSE a question is a terrible personality trait btw


181 out of 195 countries and counting!

https://en.wikipedia.org/wiki/Berne_Convention

Look at that map!

https://upload.wikimedia.org/wikipedia/commons/7/76/Berne_Co...

P.S. Berne doesn't sound like a very American name.

You would really learn a lot from reading Groklaw. Of course, I can't make you. Good luck in the world though!


man, you really are putting a lot of effort into justifying stealing other people's content


Thanks for such great opportunities to post educational content to Hacker News! I genuinely hope some things go your way, man. Rooting for you. Go get 'em.


If you can’t undermine someone’s argument, undermine their nationality. American tech culture doesn’t do this as much as it should, perhaps because we know eventually those folks wake up.


Not sure what your point is, but why would i care to learn about the laws of some other dude's country that he's using to support his bizarro arguments?


> why would i care to learn about the laws of some other dude's country

The website you're attempting to police other people's behavior on is hosted in the country you're complaining about. Lol.

Maybe there is a website local to your country where your ideas would be better received?


You're so brave


Thanks!


What is the MIME type of a .tar file; and what are the MIME types of the constituent concatenated files within an archive format like e.g. tar?
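One common answer: the container itself is `application/x-tar`, and each member is typed separately. A rough sketch using only Python's stdlib (the `member_types` helper is an illustrative name, and it naively guesses by extension; real tools would sniff member contents):

```python
import mimetypes
import tarfile

def member_types(path: str) -> dict:
    # The archive as a whole is application/x-tar; each regular-file
    # member gets its own guess via the extension-based mimetypes table.
    with tarfile.open(path) as tf:
        return {
            m.name: mimetypes.guess_type(m.name)[0] or "application/octet-stream"
            for m in tf.getmembers()
            if m.isfile()
        }
```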

hachoir/subfile/main.py: https://github.com/vstinner/hachoir/blob/main/hachoir/subfil...

File signature: https://en.wikipedia.org/wiki/File_signature

PhotoRec: https://en.wikipedia.org/wiki/PhotoRec

"File Format Gallery for Kaitai Struct"; 185+ binary file format specifications: https://formats.kaitai.io/

Cross-reference table: https://formats.kaitai.io/xref.html

AntiVirus software > Identification methods > Signature-based detection, Heuristics, and ML/AI data mining: https://en.wikipedia.org/wiki/Antivirus_software#Identificat...

Executable compression; packer/loader: https://en.wikipedia.org/wiki/Executable_compression

Shellcode database > MSF: https://en.wikipedia.org/wiki/Shellcode_database

sigtool.c: https://github.com/Cisco-Talos/clamav/blob/main/sigtool/sigt...

clamav sigtool: https://www.google.com/search?q=clamav+sigtool

https://blog.didierstevens.com/2017/07/14/clamav-sigtool-dec... :

    sigtool --find-sigs "$name" | sigtool --decode-sigs
List of file signatures: https://en.wikipedia.org/wiki/List_of_file_signatures

And then also: clusterfuzz/oss-fuzz scans .txt source files with (sandboxed) static and dynamic analysis tools; `debsums`/`rpm -Va` verify that files on disk have the same (GPG-signed) checksums as the package they were supposedly installed from; a file-based HIDS builds a database of file hashes and compares what's on disk in a later scan with what was presumed good; ~gdesktop LLM tools scan every file; there are extended filesystem attributes for label-based MAC systems like SELinux; oh, and NTFS ADS.

A sufficient cryptographic hash function yields random bits with uniform probability. DRBG Deterministic Random Bit Generators need high entropy random bits in order to continuously re-seed the RNG random number generator. Is it safe to assume that hashing (1) every file on disk, or (2) any given file on disk at random, will yield random bits with uniform probability; and (3) why Argon2 instead of e.g. only two rounds of SHA256?

https://github.com/google/osv.dev/blob/master/README.md#usin... :

> We provide a Go based tool that will scan your dependencies, and check them against the OSV database for known vulnerabilities via the OSV API.

...with package metadata, not a (file hash, package) database that could be generated from OSV and the actual package files instead of their manifest of already-calculated checksums.

Might as well be heating a pool on the roof with all of this waste heat from hashing binaries built from code of unknown static and dynamic quality.

Add'l useful formats:

> Currently it is able to scan various lockfiles, debian docker containers, SPDX and CycloneDB SBOMs, and git repositories

Things like bittorrent magnet URIs, Named Data Networking, and IPFS are (file-hash based) "Content addressable storage": https://en.wikipedia.org/wiki/Content-addressable_storage


I’m not sure what this comment is trying to say


File-based hashing is done in so many places; there's so much heat.

Sub-file hashing with feature engineering is necessary for AV, which must take packing, obfuscation, loading, and dynamic analysis into account in addition to zip archives and magic file numbers.

AV (AntiVirus) applications with LLMs: what do you train them on, and what are the existing signature databases?

https://SigStore.dev/ (The Linux Foundation) also has a hash-file inverted index for released artifacts.

Also otoh with a time limit,

1. What file is this? Dirname, basename, hash(es)

2. Is it supposed to be installed at such path?

3. Per its header, is the file an archive or an image or a document?

4. What file(s) and records and fields are packed into a file, and which transforms were applied to the data?


> the current implementation can't be relied on IMO

What's your reasoning for not relying on this? (It seems to me that this would be application-dependent at the very least.)


I'm not the person you asked, but I'm not sure I understand your question and I'd like to. It whiffed multiple common softballs, to the point it brings into question the claims made about its performance. What reasoning is there to trust it?


It had 3 failures. How is that a sign it's untrustworthy? I'm sure all alternatives have more than 3 failures. You might be making assumptions about the distribution of successes and failures (GP didn't say how many files they tested to find those 3) or how "soft" they were. In an extreme case, they might even have been crafted adversarial examples. But even if not, they might have features that really do look more like some other file type from the point of view of the classifier even if it's not easily apparent to a human. Being strictly superior to a competent human is a pretty high bar to set.


> or how "soft" they were.

From the comment: It identified some simple HTML files (html, head, title, body, p tags and not much else) as "MS Visual Basic source (VBA)", "ASP source (code)", and "Generic text document" where the `file` utility correctly identified all such examples as "HTML document text".

That's pretty soft. Nothing "adversarial" claimed either.

> Being strictly superior to a competent human is a pretty high bar to set.

The bar is the file utility.


Those are only soft to a human. I looked at a couple and I picked them correctly but I don't know what details the classifier was seeing which I was blind to. Not to say it was correct, just that we can't call them soft just because they're short and easy for a human.

> The bar is the file utility.

It has higher accuracy than that. You would reject it just because the failures are different even though they're less?


Yes. Unpredictable failures are significantly worse than predictable ones. If file messes up, it's because it decided a ZIP-based document was a generic ZIP file. If Magika messes up, it's entirely random. I can work around file's failure modes, especially if it's one I work with often. Magika's failure modes strike at random and are not possible to anticipate. File also bails out when it doesn't know, a very common failure mode in Magika is that it confidently returns a random answer when it wasn't trained on a file type.


Your original statement was that having a couple of failures brings into question its claims about performance. It doesn't because it doesn't claim such high performance. 99.31% is lower than perhaps 997 out of 1000 or whatever the GP tested. Of course having unpredictable failures is a worry but it's a different worry.


They uploaded 3 sample files for the authors, there were more failures than that, and the failures that GP and others have experienced are of a less tolerable nature. This is the point I was making, that the value added by classifying files with no rigid structure is offset heavily by its unpredictable shortcomings and difficult-to-detect failure modes.

If you have a point of your own to make I'd prefer you jump to it. Nitpicking baseless assumptions like how many files the evil GP had to sift through in order to breathlessly bring us 3 bad eggs is not something I find worthwhile.


The point I'm making is that you drew a conclusion based on insufficient information, apparently by making assumptions about the distribution of failures or the definition of "easy".


> It whiffed multiple common softballs

I must have missed this in the article. Where was this?


...It's in the comment you were responding to. Directly above the section you quoted.


I understand that, but it wasn't clear to me where those examples came from.


It's pretty obvious from the whole comment that they're his own experience. Are you going anywhere with this or are you just saying things?


It provided the wrong file-types for some files, so I cannot rely on its output to be correct.

If you wanted to, for example, use this tool to route different files to different format-specific handlers it would sometimes send files to the wrong handlers.


Except a 100% correct implementation doesn't exist AFAIK. So if I want to do anything that makes a decision based on the type of a file, I have to pick some algorithm to do that. If I can do that correctly 99% of the time, that's better than not being able to make that decision at all, which is where I'm left if a perfect implementation doesn't exist.


Nobody's asking for perfection. But the AI is offering inexplicable and obvious nondeterministic mistakes that the traditional algorithms don't suffer from.

Magika goes wrong and your fonts become audio files and nobody knows why. Magic goes wrong and your ZIP-based documents get mistaken for generic ZIP files. If you work with that edge case a lot, you can anticipate it with traditional algorithms. You can't anticipate nondeterministic hallucination.


Seconding this.

Something like Magika is potentially useful as a second pass if conventional methods of detecting a file type fail or yield a low-confidence result. But, for the majority of binary files, those conventional methods are perfectly adequate. If the first few bytes of a file are "GIF89a", you don't need an AI to tell you that it's probably a GIF image.
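To make the first-pass idea concrete, here is a minimal sketch of conventional signature sniffing (the table is abbreviated and illustrative; real databases like libmagic's cover over a thousand types):

```python
MAGIC_SIGNATURES = {
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"%PDF-": "application/pdf",
    b"PK\x03\x04": "application/zip",  # also docx/xlsx/jar, hence second passes
}

def sniff(path):
    # Read a small prefix and match it against known signatures.
    with open(path, "rb") as f:
        head = f.read(16)
    for magic, mime in MAGIC_SIGNATURES.items():
        if head.startswith(magic):
            return mime
    return None  # unknown: a fallback classifier could take over here
```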


Doesn't seem all that non-deterministic. I tested the vba.html example multiple times and it always said it was VBA. I added a space between </HEAD> and <BODY> and it correctly picked HTML as most likely but with a low confidence.

So I think we can say it's sensitive to mysterious features, not that it's non-deterministic. Still leads to your same conclusion that you can't anticipate the failures. But I don't think you can with traditional tools either. Some magic numbers are just plain text (like MZ) which could legitimately accidentally appear at the beginning of a plain text file, for example.


Where are you getting the non-determinism part from? It would seem surprising for there to be anything non-deterministic about an ML model like this, and nothing in the original reports seems to suggest that either.


Large ML models tend to be uncorrectably non-deterministic simply from doing lots of floating point math in parallel. Addition and multiplication of floats is neither commutative nor associative - you may get different results depending on the order in which you add/multiply numbers.


Addition and multiplication of floats are commutative.


> It would seem surprising for there to be anything non-deterministic about an ML model like this

I think there may be some confusion of ideas going in here. Machine learning is fundamentally stochastic, so it is non-deterministic almost by definition.


Oh man, this brings me back! Almost 10 years ago I was working on a rails app trying to detect the file type of uploaded spreadsheets (xlsx files were being detected as application/zip, which is technically true but useless).

I found "magic" that could detect these and submitted a patch at https://bugs.freedesktop.org/show_bug.cgi?id=78797. My patch got rejected for needing to look at the first 3KB of the file to figure out the type. They had a hard limit that they wouldn't see past the first 256 bytes. Now in 2024 we're doing this with deep learning! It'd be cool if Google released some speed performance benchmarks here against the old-fashioned implementations. Obviously it'd be slower, but is it 1000x or 10^6x?


Co-author of Magika here (Elie). We didn't include the measurements in the blog post to avoid making it too long, but we did take those measurements.

Overall, file takes about 6ms for a single file and 2.26ms per file when scanning multiples. Magika is at 65ms for a single file and 5.3ms per file when scanning multiples.

So in the worst-case scenario Magika is about 10x slower, due to the time it takes to load the model, and 2x slower on repeated detection. This is why we said it is not that much slower.

We will have more performance measurements in the upcoming research paper. Hope that answers the question.


Is that single-threaded libmagic vs Magika using every core on the system? What are the numbers like if you run multiple libmagic instances in parallel for multiple files, or limit both libmagic and magika to a single core?

Testing it on my own system, magika seems to use a lot more CPU-time:

    file /usr/lib/*  0,34s user 0,54s system 43% cpu 2,010 total
    ./file-parallel.sh  0,85s user 1,91s system 580% cpu 0,477 total
    bin/magika /usr/lib/*  92,73s user 1,11s system 393% cpu 23,869 total
Looks about 50x slower to me. There's 5k files in my lib folder. It's definitely still impressively fast given how the identification is done, but the difference is far from negligible.


Do you have a sense of performance in terms of energy use? 2x slower is fine, but is that at the same wattage, or more?


That sounds like a nit / premature optimization.

Electricity is cheap. If this is sufficiently or actually important for your org, you should measure it yourself. There are too many variables and factors subject to your org’s hardware.


The hardware requirements of a massively parallel algorithm can't possibly be "a nit" in any universe inhabited by rational beings.


Totally disagree. Most end users are on laptops and mobile devices these days, not desktop towers. Thus power efficiency is important for battery life. Performance per watt would be an interesting comparison.


What end users are working with arbitrary files that they don’t know the identification of?

This entire use case seems to be one suited for servers handling user media.


File managers that render preview images. Even detecting which software to open the file with when you click it.

Of course on Windows the convention is to use the file extension, but on other platforms the convention is to look at the file contents


> on other platforms the convention is to look at the file contents

MacOS (that is, Finder) also looks at the extension. That has also been the case with any file manager I've used on Linux distros that I can recall.


You might be surprised. Rename your Photo.JPG as Photo.PNG and you'll still get a perfectly fine thumbnail. The extension is a hint, but it isn't definitive, especially when you start downloading from the web.


Browsers often need to guess a file type


Theoretically? Anyone running a virus scanner.

Of course, it's arguably unlikely a virus scanner would opt for an ML-based approach, as they specifically need to be robust against adversarial inputs.


> it's arguably unlikely a virus scanner would opt for an ML-based approach

Several major players such as Norton, McAfee, and Symantec all at least claim to use AI/ML in their antivirus products.


You'd be surprised what an AV scanner would do.

https://twitter.com/taviso/status/732365178872856577


I mean if you care about that you shouldn't be running anything that isn't highly optimized. Don't open webpages that might be CPU or GPU intensive. Don't run Electron apps, or really anything that isn't built in a compiled language.

Certainly you should do an audit of all the Android and iOS apps as well, to make sure they've been made in an efficient manner.

Block ads as well, they waste power.

This file identification is SUCH a small aspect of everything that is burning power in your laptop or phone as to be laughable.


Whilst energy usage is indeed a small aspect this early on when using bespoke models, we do have to consider that this is a model for simply identifying a file type.

What happens when we introduce more bespoke models for manipulating the data in that file?

This feels like it could slowly boil to the point of programs using magnitudes higher power, at which point it'll be hard to claw it back.


That's a slippery slope argument, which is a common logical fallacy[0]. This model being inefficient compared to the best possible implementation does not mean that future additions will also be inefficient.

It's equivalent to saying that many people programming in Ruby is causing all future programs to be less efficient. Which is not true. In fact, many people programming in Ruby has caused Ruby to become more efficient, because it gets optimised as it gets used more (or Python, for that matter).

It's not as energy efficient as C, but it hasn't caused it to get worse and worse, and spiral out of control.

Likewise smart contracts are incredibly inefficient mechanisms of computation. The result is mostly that people don't use them for any meaningful amounts of computation, that all gets done "Off Chain".

Generative AI is definitely less efficient, but it's likely to improve over time, and indeed things like quantization have allowed models that would normally require much more substantial hardware resources (and therefore be more energy intensive) to run on smaller systems.

[0]: https://en.wikipedia.org/wiki/Slippery_slope


That is a fallacy fallacy. Just because some slopes are not slippery that does not mean none of them are.


The slippery slope fallacy is: "this is a slope. you will slip down it." and is always fallacious. Always. The valid form of such an argument is: "this is a slope, and it is a slippery one, therefore, you will slip down it."


No, it isn't.


Yeah. Yeah, it is.


>This feels like it could slowly boil to the point of programs using magnitudes higher power, at which point it'll be hard to claw it back.

We're already there. Modern software is, by and large, profoundly inefficient.


In general you're right, but I can't think of a single local use for identifying file types by a human on a laptop - at least, one with scale where this matters. It's all going to be SaaS services where people upload stuff.


We are building a data analysis tool with great UX, where users select data files, which are then parsed and uploaded to S3 directly, on their client machines. The server only takes over after this step.

Since the data files can be large, this approach avoids having to transfer the file twice: first to the server, and then to S3 after parsing.


This doesn't sound like a very common scenario.


I've ended up implementing a layer on top of "magic" which, if magic detects application/zip, reads the zip file manifest and checks for telltale file names to reliably detect Office files.

The "magic" library does not seem to be equipped with the capabilities needed to be robust against the zip manifest being ordered in a different way than expected.

But this deep learning approach... I don't know. It might be hard to shoehorn in to many applications where the traditional methods have negligible memory and compute costs and the accuracy is basically 100% for cases that matter (detecting particular file types of interest). But when looking at a large random collection of unknown blobs, yeah, I can see how this could be great.
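A sketch of that kind of second-pass layer (marker paths are the standard OOXML part names; the `refine_zip_mime` helper name is illustrative):

```python
import zipfile

OOXML_MARKERS = {
    "word/document.xml": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "xl/workbook.xml": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "ppt/presentation.xml": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}

def refine_zip_mime(path: str) -> str:
    # Look for telltale member names anywhere in the manifest,
    # so the ordering of entries in the zip doesn't matter.
    with zipfile.ZipFile(path) as zf:
        names = set(zf.namelist())
    for marker, mime in OOXML_MARKERS.items():
        if marker in names:
            return mime
    return "application/zip"
```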


If you're curious, here's how I solved it for ruby back in the day. Still used magic bytes, but added an overlay on top of the freedesktop.org DB: https://github.com/mimemagicrb/mimemagic/pull/20


Many commenters seem to be using magic instead of file, any reasons?


magic is the core detection logic of file that was extracted out to be available as a library, so these days file is just a higher-level wrapper around magic.


thanks


From the first paragraph:

> enabling precise file identification within milliseconds, even when running on a CPU.

Maybe your old-fashioned implementations were detecting in microseconds?


Yeah I saw that, but that could cover a pretty wide range and it's not clear to me whether that relies on preloading a model.


> At inference time Magika uses Onnx as an inference engine to ensure files are identified in a matter of milliseconds, almost as fast as a non-AI tool even on CPU.


> They had a hard limit that they wouldn't see past the first 256 bytes.

Then they could never detect zip files with certainty, given that to do that you need to read up to 65KB (+ 22 bytes) at the END of the file. The reason is that the zip archive format allows "garbage" bytes both at the beginning of the file and in between local file headers... and it's actually not uncommon to prepend a program that self-extracts the archive, for example. The only way to know if a file is a valid zip archive is to look for the End of Central Directory record, which is always at the end of the file AND allows for a comment of unknown length at the end (and as the comment-length field takes 2 bytes, the comment can be up to 65K long).
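A sketch of that end-of-file scan (the `has_valid_eocd` helper name is illustrative; real tools would go on to validate the central directory it points at):

```python
import struct

EOCD_SIG = b"PK\x05\x06"  # End of Central Directory signature
MAX_TAIL = 22 + 65535     # fixed EOCD size + maximum comment length

def has_valid_eocd(path: str) -> bool:
    # Read the last 65557 bytes and search backwards for the EOCD record.
    with open(path, "rb") as f:
        f.seek(0, 2)
        size = f.tell()
        f.seek(max(0, size - MAX_TAIL))
        tail = f.read()
    pos = tail.rfind(EOCD_SIG)
    if pos == -1 or pos + 22 > len(tail):
        return False
    # The 2-byte comment-length field (at offset 20 in the record) must
    # account for exactly the bytes remaining after the record.
    comment_len = struct.unpack_from("<H", tail, pos + 20)[0]
    return pos + 22 + comment_len == len(tail)
```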


That's why the whole question is ill-formed. A file does not have exactly one type. It may be a valid input in various contexts. A zip archive may also very well be something else.


FWIW, file can now distinguish many types of zip containers, including OOXML files.


As someone that has worked in a space that has to deal with uploaded files for the last few years, and someone who maintains a WASM libmagic Node package ( https://github.com/moshen/wasmagic ) , I have to say I really love seeing new entries into the file type detection space.

Though I have to say when looking at the Node module, I don't understand why they released it.

Their docs say it's slow:

https://github.com/google/magika/blob/120205323e260dad4e5877...

It loads the model at runtime:

https://github.com/google/magika/blob/120205323e260dad4e5877...

They mark it as Experimental in the documentation, but it seems like it was just made for the web demo.

Also as others have mentioned. The model appears to only detect 116 file types:

https://github.com/google/magika/blob/120205323e260dad4e5877...

Where libmagic detects... a lot. Over 1600 last time I checked:

https://github.com/file/file/tree/4cbd5c8f0851201d203755b76c...

I guess I'm confused by this release. Sure it detected most of my list of sample files, but in a sample set of 4 zip files, it misidentified one.


Made a small test to try it out: https://gist.github.com/moshen/784ee4a38439f00b17855233617e9...

    hyperfine ./magika.bash ./file.bash
    Benchmark 1: ./magika.bash
      Time (mean ± σ):     706.2 ms ±  21.1 ms    [User: 10520.3 ms, System: 1604.6 ms]
      Range (min … max):   684.0 ms … 738.9 ms    10 runs
    
    Benchmark 2: ./file.bash
      Time (mean ± σ):      23.6 ms ±   1.1 ms    [User: 15.7 ms, System: 7.9 ms]
      Range (min … max):    22.4 ms …  29.0 ms    111 runs
    
    Summary
      './file.bash' ran
       29.88 ± 1.65 times faster than './magika.bash'


Realistically, either you're identifying one file interactively and you don't care about latency differences in the 10s of ms, or you're identifying in bulk (batch command line or online in response to requests), in which case you should measure the marginal cost and exclude Python startup and model loading times.


Going by those numbers it's taking almost a second to run, not 10s of ms, and it's doing something massively parallel in that time. So basically all your cores will spike to 100% for almost a second during those one-shot identifications. It looks like GP has a CPU with 12-16 threads, and it is using those while still being 30 times slower than single-threaded libmagic.

That tool needs 100x more CPU time just to figure out some filetypes than vim needs to open a file from a cold start (which presumably includes using libmagic to check the type).

If I had to wait a second just to open something during which that thing uses every resource available on my computer to the fullest, I'd probably break my keyboard. Try using that thing as a drop-in file replacement, open some folder in your favorite file manager, and watch your computer slow to a crawl as your file manager tries to figure out what thumbnails to render.

It's utterly unsuitable for "interactive" identifications.


My little script is trying to identify in bulk, at least by passing 165 file paths to `magika` and `file`.

Though, I absolutely agree with you. I think realistically it's better to do this kind of thing in a library rather than shell out to it at all. I was just trying to get an idea on how it generally compares.

Another note: I was trying to be generous to `magika` here, because for single-file identification it's about 160-180ms on my machine vs <1ms for `file`. I realize quite a bit of that number is going to be Python startup, which is why I didn't go with it when pushing that benchmark up earlier. I'll probably push an update to that gist to include the single-file benchmark as well.


I've updated this script with some single-file cli numbers, which are (as expected) not good. Mostly just comparing python startup time for that.

    make
    sqlite3 < analyze.sql
    file_avg              python_avg         python_x_times_slower_single_cli
    --------------------  -----------------  --------------------------------
    0.000874874856301821  0.179884610224334  205.611818568799
    file_avg            python_avg     python_x_times_slower_bulk_cli
    ------------------  -------------  ------------------------------
    0.0231715865881818  0.69613745142  30.0427184289163


We released the npm package because we created a web demo and thought people might want to use it too. We know it is not as fast as the Python version or a C++ version would be -- which is why we marked it as experimental.

The release includes the Python package and the CLI, which are quite fast and are the main way we expected people to use Magika -- sorry if that wasn't clear in the post.

The goal of the release is to offer a tool that is far more accurate than other tools and works on the major file types, as we hope it will be useful to the community.

Glad to hear it worked on your files.


Thank you for the release! I understand you're just getting it out the door. I just hope to see it delivered as a native library or something more reusable.

I did try the python cli, but it seems to be about 30x slower than `file` for the random bag of files I checked.

I'll probably take some time this weekend to make a couple of issues around misidentified files.

I'll definitely be adding this to my toolset!


Hello! We wrote the Node library as a first functional version. Its API is already stable, but it's a bit slower than the Python library for two reasons: it loads the model at runtime, and it doesn't do batch lookups, meaning it calls the model for each file. Other than that, it's just as fast for single file lookups, which is the most common use case.


Good to know! Thank you. I'll definitely be trying it out. Though, I might download and hardcode the model ;)

I also appreciate the use of ONNX here, as I'm already thinking about using another version of the runtime.

Do you think you'll open source your F1 benchmark?


Can we do the 1600 if known, if not, let the AI take a guess?


Absolutely, and honestly in a non-interactive ingestion workflow you're probably doing multiple checks anyway. I've worked with systems that call multiple libraries and hand-coded validation for each incoming file.

Maybe it's my general malaise, or disillusionment with the software industry, but when I wrote that I was really just expecting more.


> The model appears to only detect 116 file types [...] Where libmagic detects... a lot. Over 1600 last time I checked

As I'm sure you know, in a lot of applications, you're preparing things for a downstream process which supports far fewer than 1600 file types.

For example, a printer driver might call on file to check if an input is postscript or PDF, to choose the appropriate converter - and for any other format, just reject the input.

Or someone training an ML model to generate Python code might have a load of files they've scraped from the web, but might want to discard anything that isn't Python.


Okay, but your one file type is more likely to be included in the 1600 that libmagic supports than in Magika's 116?

For that matter, the file types I care about are unfortunately misdetected by Magika (which is also an important point - the `file` command at least gives up and says "data" when it doesn't know, whereas the Magika demo gives a confidently wrong answer).

I don't want to criticize the release because it's not meant to be a production-ready piece of software, and I'm sure the current 116 types isn't a hard limit, but I do understand the parent comment's contention.


Surely identifying just one file type (or two, as in your example) is a much simpler task that shouldn’t rely on horribly inefficient and imprecise “AI” tools?


It's for researchers, probably.


Yeah, there is this line:

    By open-sourcing Magika, we aim to help other software improve their file identification accuracy and offer researchers a reliable method for identifying file types at scale.
Which implies a production-ready release for general usage, as well as usage by security researchers.


I ran a quick test on 100 semi-random files I had laying around. Of those, 81 were detected correctly, 6 were detected as the wrong file type, and 12 were detected with an unspecific file type (unknown binary/generic text) when a more specific type existed. In 4 of the unspecific cases, a low-confidence guess was provided, which was wrong in each case. However, almost all of the files which were detected wrong/unspecific are of types not supported by Magika, with one exception of a JSON file containing a lot of JS code as text, which was detected as JS code.

For comparison, file 5.45 (the version I happened to have installed) got 83 correct, 6 wrong, and 10 not specific. It detected the weird JSON correctly, but also had its own strange issues, such as detecting a CSV as just "data". The "wrong" here was somewhat skewed by the 4 GLSL shader code files that were in the dataset for some reason, all of which it detected as C code (Magika called them unknown). The other two "wrong" detections were also code formats that it seems it doesn't support. It was also able to output a lot more information about the media files.

Not sure what to make of these tests but perhaps they're useful to somebody.


> The "wrong" here was somewhat skewed by the 4 GLSL shader code files that were in the dataset for some reason, all of which it detected as C code

To be fair though, a snippet of GLSL shader code can be perfectly valid C.


Indeed, which is why I felt the need to call it out here. I'm not certain if the files in question actually happened to be valid C, but whether that's a meaningful mistake regardless is left to the reader to decide.


I'm extremely confused by the claim that other tools have worse precision or recall for APK or JAR files, which have a very regular structure. They should be a valid ZIP file with `META-INF/MANIFEST.MF` present (at least), and an APK would need `classes.dex` as well; at that point there is no other format that can be confused with APK or JAR, I believe. I'd like to see which file was causing the unexpected drop in precision or recall.


People do create JAR files without a META-INF/MANIFEST.MF entry.

The tooling even supports it. https://docs.oracle.com/en/java/javase/21/docs/specs/man/jar...:

  -M or --no-manifest
     Doesn't create a manifest file for the entries


Minecraft mods 14 years ago used to tell you to open the JAR and delete the META-INF when installing them so can’t rely on that one…


The `file` command checks only the first few bytes, and doesn’t parse the structure of the file. APK files are indeed reported as Zip archives by the latest version of `file`.
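Going beyond the leading bytes is cheap with a real ZIP parser; here's a stdlib-only sketch of structure-aware classification (the `classes.dex` / `META-INF/MANIFEST.MF` markers are the common convention discussed above, though as noted elsewhere in the thread, JARs built with `-M` omit the manifest):

```python
import io
import zipfile

def classify_zip(blob: bytes) -> str:
    """Look inside the archive rather than at the first few bytes,
    where APK, JAR, and plain ZIP all start with PK\\x03\\x04."""
    try:
        names = set(zipfile.ZipFile(io.BytesIO(blob)).namelist())
    except zipfile.BadZipFile:
        return "not a zip"
    if "classes.dex" in names or "AndroidManifest.xml" in names:
        return "apk"
    if "META-INF/MANIFEST.MF" in names:
        return "jar"  # JARs built with -M/--no-manifest will fall through
    return "zip"

# build a fake "jar" in memory
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("META-INF/MANIFEST.MF", "Manifest-Version: 1.0\n")
print(classify_zip(buf.getvalue()))  # jar
```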


This is false in every sense for https://www.darwinsys.com/file/ (probably the most widely used version of file). What it checks depends on the magic rules for a specific type, but it can check any part of your file. Many Linux distros are years out of date, so you might be using a very old version.

FILE_45:

    ./src/file -m magic/magic.mgc ../../OpenCalc.v2.3.1.apk
    ../../OpenCalc.v2.3.1.apk: Android package (APK), with zipflinger virtual entry, with APK Signing Block


Interesting! I checked with file 5.44 from Ubuntu 23.10 and 5.45 on macOS using homebrew, and in both cases, I got “Zip archive data, at least v2.0 to extract” for the file here[1]. I don’t have an Android phone to check and I’m also not familiar with Android tooling, so is this a corrupt APK?

[1] https://download.apkpure.net/custom/com.apkpure.aegon-319781...


That doesn't appear to be a valid link. Try building `file` from source and using the provided default magic database.


I also tried this with the sources of file from the homepage you linked above, and I still get the same results.

You could try this for yourself using the same APKPure file which I uploaded at the following alternative link[1]. Further, while this could be a corrupt APK, I can’t see any signs of that from a cursory inspection as both the `classes.dex` and `META-INF` directory are present, and this is APKPure’s own APK, instead of an APK contributed for an app contributed by a third-party.

[1] https://wormhole.app/Mebmy#CDv86juV9H4aRCL2DSJeDw


apks are also zipaligned so it's not like random users are going to be making them either


Wonder how this would handle a polyglot[0][1] that is valid as a PDF document, a ZIP archive, and a Bash script that runs a Python webserver hosting Kaitai Struct's WebIDE, allowing you to view the file's own annotated bytes.

[0]: https://www.alchemistowl.org/pocorgtfo/

[1]: https://www.alchemistowl.org/pocorgtfo/pocorgtfo16.pdf

Edit: just tested, and it does only identify the zip layer
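For a feel of why the zip layer is always recoverable, here is a minimal polyglot sketch: ZIP readers locate the central directory from the end of the file, so arbitrary prepended data is tolerated (this is the same trick self-extracting archives use), and Python's `zipfile` handles it transparently:

```python
import io
import zipfile

# Build a ZIP in memory, then prepend a shell script. The result is both a
# runnable script and a readable archive, because ZIP metadata lives at the
# END of the file and tolerates prepended data.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("hello.txt", "hi from the zip layer\n")
polyglot = b"#!/bin/sh\necho 'hi from the shell layer'\n" + buf.getvalue()

# zipfile skips the prepended bytes when locating the archive:
with zipfile.ZipFile(io.BytesIO(polyglot)) as z:
    print(z.namelist())  # ['hello.txt']
```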


You can try it here: https://google.github.io/magika/

It's relatively limited compared to `file` (~10% coverage); it's more like a specialized classifier for basic file formats, so such cases are really out of scope.

I guess it's more for detecting common file formats with high recall.

However, where is the actual source of the model? Say I want to add a new file format myself.

Apparently only the source of the interpreter is here, not the source of the model nor the training set, which is the most important thing.


Is there anything about the performance on unknown files?

I've tried a few that aren't "basic" but are widely used enough to be well supported in libmagic and it thinks they're zip files. I know enough about the underlying formats to know they're not using zip as a container under-the-hood.


Apparently the Super Mario Bros. 3 ROM is 100% a SWF file.

Cool that you can use it online though. Might end up using it like that. Although it seems like it may focus on common formats.


Yes, I totally agree; it's not what I would qualify as open source.

Do you plan to release the training code along the research paper? What about the dataset?

In any case, it's very neat to have ML-based technique and lightweight model for such tasks!


I don't understand why this needs to exist. Isn't file type detection inherently deterministic by nature? A valid tar archive will always have the same first few magic bytes. An ELF binary has a universal ELF magic and header. If the magic is bad, then the file is corrupted and not a valid XYZ file. What's the value in throwing "heuristics" and probabilistic inference into a process that is black and white by design?
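For binary container formats this really is close to a table lookup; the fuzziness arrives with text, which matches nothing. A toy sketch (signatures and offsets are the well-known public magic values):

```python
# A toy magic-byte table: deterministic for these binary formats, but note
# that plain-text formats (Python, CSV, HTML fragments...) match nothing,
# which is exactly where rule-based tools like `file` get fuzzy.
MAGICS = [
    (0, b"\x7fELF", "ELF binary"),
    (0, b"\x89PNG\r\n\x1a\n", "PNG image"),
    (0, b"PK\x03\x04", "ZIP container (also JAR/APK/DOCX/...)"),
    (257, b"ustar", "tar archive"),  # POSIX ustar magic sits at offset 257
]

def sniff(data: bytes):
    for offset, magic, name in MAGICS:
        if data[offset:offset + len(magic)] == magic:
            return name
    return None  # no magic matched: could be text, corrupt, or anything

print(sniff(b"\x7fELF" + b"\x00" * 12))         # ELF binary
print(sniff(b"import sys\nprint(sys.argv)\n"))  # None
```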


> What's the value in throwing "heuristics" and probabilistic inference into a process that is black and white by design?

I use the file command all the time. The value is when you get this:

    ... $  file somefile.xyz
    somefile.xyz: data
AIUI from reading TFA, magika can determine more filetypes than what the file command can detect.

It'd actually be very easy to determine if there's any value in magika: run file on every file on your filesystem and then for every file where the file command returns "data", run magika and see if magika is right.

If it's right, there's your value.

P.S: it may also be easier to run on Windows than the file command? But then I can't do much to help people who are on Windows.


From elsewhere in this thread, it appears that Magika detects far fewer file types than file (116 vs ~1600), which makes sense. For file, you just need to drop in a few rules to add a new, somewhat obscure type. An AI approach like Magika will need lots of training and test data for each new file type. Where Magika might have a leg up is with distinguishing different textual data files (i.e., source code), but I don't see that as a particularly big use case honestly.


It's not always deterministic, sometimes it's fuzzy depending on the file type. Example of this is a one-line CSV file. I tested one case of that, libmagic detects it as a text file while magika correctly detects it as a CSV (and gives a confidence score, which is killer).
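Python's stdlib ships a hand-written heuristic for exactly this fuzzy case, for comparison:

```python
import csv
import io

# csv.Sniffer answers the same fuzzy question: "is this text actually
# tabular, and with which delimiter?" No magic bytes involved.
sample = "name,age,role\nalice,42,engineer\n"
dialect = csv.Sniffer().sniff(sample)
print(dialect.delimiter)  # ,
rows = list(csv.reader(io.StringIO(sample), dialect))
print(rows[1])            # ['alice', '42', 'engineer']
```

Unlike Magika, though, it gives no confidence score, and it raises an error outright when it can't decide.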


But even with determinism, it's not always right. It's not too rare to find a text file with a byte order mark indicating UTF-16 (0xFE 0xFF) but then actually containing UTF-8. But what "format" does it have then? Is it UTF-8 or UTF-16? Same with e.g. a jar file missing a manifest. That's just a zip, even though I'm sure some runtime might eat it.

But when do you actually face the problem of having to guess a file's format? When reverse engineering? The last time I did something like this was in the 90s, trying to pick apart some texture from a directory of files called asset0001.k, and it turned out to be a bitmap or whatever. Fun times.


Consider, it's perfectly possible for a file to fit two or more file formats - polyglot files are a hobby for some people.

And there are also a billion formats that are not uniquely determined by magic bytes. You don't have to go further than text files.


This tool doesn't work this way.


This also works for formats like Python, HTML, and JSON.


I still don't see how this is useful. The only time I want to answer the question "what type of file is this" is if it is an opaque blob of binary data. If it's a plain text file like Python, HTML, or JSON, I can figure that out by just catting the file.


file (https://www.darwinsys.com/file/) already detects all these formats.


Indeed, but as pointed out in the blog post, file is significantly less accurate than Magika. There are also some file types that we support and file doesn't, as reported in the table.


I can't immediately find the dataset used for benchmarking. Is file actually failing on common files or just particularly nasty examples? If it's the latter then how does it compare to Magika on files that an average person is likely to see?


> Is file actually failing on common files or just particularly nasty examples? If it's the latter then how does it compare to Magika on files that an average person is likely to see?

That's not the point of file type guessing, is it? Google employs it as an additional security measure for user-submitted content, which absolutely makes sense given what malware devs do with file types.


Yes, but shouldn't the file type be part of the file, or (better) of the metadata of the file?

Knowing is better than guessing.


So instead of spending some of their human resources to improve libmagic, they used some of their computing power to create an "open source" neural net, which is technically more accurate than the "error-prone" hand-written rules (ignoring that it supports far fewer filetypes), and which is much less effective in an adversarial context, and they want it to "help other software improve their file identification accuracy," which of course it can't since neural nets aren't introspectable. Thanks guys.


Come on, can't you help but be impressed by this amazing AI tech? That gives us sci-fi tools like ... a less-accurate, incomplete, stochastic, un-debuggable, slower, electricity-guzzling version of `file`.


>So instead of spending some of their human resources to improve libmagic

A large megacorp can work on multiple things at once.

>an "open source" neural net, which is technically more accurate than the "error-prone" hand-written rules (ignoring that it supports far fewer filetypes)

You say that like it's a contradiction but it's not.

>and which is much less effective in an adversarial context,

Is it? This seems like an assumption.

>and they want it to "help other software improve their file identification accuracy," which of course it can't since neural nets aren't introspectable.

Being introspectable or not has no bearing on the accuracy of a system.


> > an "open source" neural net, which is technically more accurate than the "error-prone" hand-written rules (ignoring that it supports far fewer filetypes)

> You say that like it's a contradiction but it's not.

> > and they want it to "help other software improve their file identification accuracy," which of course it can't since neural nets aren't introspectable.

> Being introspectable or not has no bearing on the accuracy of a system.

"Open source" and "neural net" is the contradiction, as I went on to write. Even if magika were a more accurate version of file, the implication that it could "help [libmagic] improve" isn't really true, because how do you distill the knowledge from it into a patch for libmagic?

My point re: their "error-prone" claim is that their comparison was disingenuous due to the functionality difference between the tools. (Also with their implication that AIs work perfectly, though this one sounds pretty good by the numbers. I of course accept that there's likely to be some bugs in code written by humans.)

> > and which is much less effective in an adversarial context,

> Is it? This seems like an assumption.

It is, one based on what I've heard about AI classifiers over the years. Other commenters here are interested in this point, but while I don't see anyone experimenting on magika (it's new after all), the fact it's not mentioned in the article leads me to believe they didn't try attacking it themselves. (Or did, but with bad results, and so decided not to include that. Funnily enough they did mention adversarial attacks on manually-written classifiers...)



It's surprising that there are so many file types that seem relatively common which are missing from this list. There are no raw image file formats. There's nothing for CAD - either source files or neutral files. There's no MIDI files, or any other music creation types. There's no APL, Pascal, COBOL, assembly source file formats etc.


Well, what they used this for at Google was apparently scanning their user's files for things they shouldn't store in the cloud. Probably they don't care much about MIDI.


Worse it seems that for unknown formats it confidently claims that it is one of the known formats. Rather than saying "unknown" or "binary data".


No tracker / .mod files either, just use file.


Thanks for the list, we will probably try to extend the list of format supported in future revision.


Yeah this quickly went from 'additional helpful tool in the kit' to 'probably should use something else first'


As somebody who's dealt with the ambiguity of attempting to use file signatures in order to identify file type, this seems like a pretty useful library. Especially since it seems to be able to distinguish between different types of text files based on their format/content e.g. CSV, markdown, etc.


A somewhat surprising and genuinely useful application of the family of techniques.

I wonder how susceptible it is to adversarial binaries or, hah, prompt-injected binaries.


Elsewhere in the thread kevincox[1] points out that it's extremely susceptible to adversarial binaries:

> Worse it seems that for unknown formats it confidently claims that it is one of the known formats. Rather than saying "unknown" or "binary data".

Seems like this is genuinely useless for anybody but AI researchers.

[1] https://news.ycombinator.com/item?id=39395677


For the extremely limited number of file types supported, I question the utility of this compared to `magic`


It gets a lot of binary file formats wrong for me out-of-the-box. I think it needs to be a bit more effective before we can truly determine the effectiveness of such exploits.


But they reported >99% accuracy on their cherry-picked dataset! /s


“These aren’t the binaries you are looking for…”


This feels like old school Google. I like that it's just a static webpage that basically can't be shut down or sunsetted. It reminds me of when Google just made useful stuff and gave it away for free on a webpage, like Translate and Google Books. Obviously less life-changing than the above, but still a great option to have when I need this.


Today web browsers, code editors, and countless other software rely on file-type detection to decide how to properly render a file.

"web browsers"? Odd to see this coming from Google itself. https://en.wikipedia.org/wiki/Content_sniffing was widely criticised for being problematic for security.


Content sniffing can be disabled by the server (X-Content-Type-Options: nosniff), but it’s still used by default. Web browsers have to assume that servers are stupid, and that for relatively harmless cases, it’s fine to e.g. render a PNG loaded by an <img> even if it’s served as text/plain.


To me the obvious use case is to first use the file command but then, when file returns "DATA" (meaning it couldn't guess the file type), call magika.

I guess I'll be writing a wrapper (only for when using my shell in interactive mode) around file doing just that when I come back from vacation. I hate it when file cannot do its thing.

Put it this way: I use file a lot and I know at times it cannot detect a filetype. But is file often wrong when it does have a match? I don't think so...

So in most cases I'd have file correctly give me the filetype very quickly, but in those rare cases where file cannot find anything, I'd then use the slower but apparently more capable magika.
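A minimal sketch of such a wrapper, assuming a `magika` binary on PATH (the magika invocation and its flags are illustrative, not verified):

```shell
# Fall back to the slower classifier only when `file` punts with "data".
identify() {
    desc=$(file -b "$1")
    if [ "$desc" = "data" ]; then
        magika "$1"   # hypothetical invocation; adjust flags to taste
    else
        printf '%s\n' "$desc"
    fi
}
```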


I have seen 'file' misclassify many things when running it at large scale (millions of files) from a hodgepodge of sources. Unrelated types getting called 'GPG Private Keys', for example.

For textual data types, 'file' gets confused often, or doesn't give a precise type. GitHub's 'linguist' [1] tool does much better here, but is structured in such a way that it is difficult to call it on an arbitrary file or bytestring that doesn't reside in a git repo.

I'd love to have a classification tool that can more granularly classify textual files! It may not be Magika _today_ since it only supports 116-something types. For this use case, an ML-based approach will be more successful than an approach based solely on handwritten heuristic rules. I'm excited to see where this goes.


What are use-cases for this? I mean, obviously detecting the filetype is useful, but we kinda already have plenty of tools to do that, and I cannot imagine, why we need some "smart" way of doing this. If you are not a human, and you are not sure what is it (like, an unknown file being uploaded to a server) you would be better off just rejecting it completely, right? After all, there's absolutely no way an "AI powered" tool can be more reliable than some dumb, err-on-safer-side heuristic, and you wouldn't want to trust that thing to protect you from malicious payloads.


> no way an "AI powered" tool can be more reliable

The article provides accuracy benchmarks.

> you would be better off just rejecting it completely

They mention using it in gmail and Drive, neither of which have the luxury of rejecting files willy-nilly.


I have not tried it recently, but IIRC, Gmail does reject attachments which are zip files, for security reasons.


Gmail nukes zips if they contain an executable or some other 'prohibited' file type. Most email providers block executable attachments.


Virus detection is mentioned in the article. Code editors need to find the programming language for syntax highlighting of code before you give it a name. Your desktop OS needs to know which program to open files with. Or, recovering files from a corrupted drive. Etc

It's easy to distinguish, say, a PNG from a JPG file (or anything else that has well-defined magic bytes). But some files look virtually identical (eg. .jar files are really just .zip files). Also see polyglot files [1].

If you allow an `unknown` label or human intervention, then yes, magic bytes might be enough, but sometimes you'd rather have a 99% chance to be right about 95% of files vs. a 100% chance to be right about 50% of files.

[1] https://en.wikipedia.org/wiki/Polyglot_(computing)


Reminds me of when someone asked (on StackOverflow) how to recognize binaries for different architectures, like x86 or ARM-something or Apple M1 and so on.

I gave the idea to use the technique of NCD (normalized compression distance), based on Kolmogorov complexity. R. Cilibrasi was one of the great researchers in this area, and I think he worked at Google at some point.

Using AI seems to follow the same path: "learn" what represents some specific file and then compare the unknown file to those references (AI:all the parameters, NCD:compression against a known type).
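For the curious, NCD is tiny to sketch, with zlib standing in for the (uncomputable) Kolmogorov complexity; the sample inputs below are made up for illustration:

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance (Cilibrasi & Vitanyi): smaller means
    the compressor finds more shared structure between x and y."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

python_a = b"import os, sys\nfor p in sys.argv[1:]:\n    print(os.path.getsize(p))\n" * 4
python_b = b"import io, json\nwith io.open('cfg') as f:\n    print(json.load(f))\n" * 4
binary_c = bytes(range(256)) * 4

# Two Python snippets typically land closer together than Python vs. binary:
print(round(ncd(python_a, python_b), 2), round(ncd(python_a, binary_c), 2))
```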


I wrote an implementation of libmagic in Racket a few years ago (https://github.com/jjsimpso/magic). File type identification is a pretty interesting topic.

As others have noted, libmagic detects many more file types than Magika, but I can see Magika being useful for text files in particular, because anything written by humans doesn't have a rigid format.


What does it do with an Actually Portable Executable compiled by Cosmopolitan libc compiler?


It’s reported as a PE executable, `file` on the other hand reports it as a “DOS/MBR boot sector.”


I just want to say thank you for the release. There are quite a lot of complaints in the comments, but I think this is a useful and worthwhile contribution, and I appreciate the authors for going through the effort to get it approved for open source release. It would be great if the model training data was included (or at least documentation about how to reproduce it), but that doesn't preclude this being useful. Thanks!



MIME type detection is a very interesting thing. I wrote the media type detection for McAfee Web Gateway 7.x, and because it was a high-performance proxy, detection speed was a major focus, but so was precision, especially for "container" types like MS Office OLE-based files, etc. The base of it was a simple Lisp-like language that allowed us to write signatures very fast, and everything was combined with very aggressive caching, so we avoided reading data again and again. In tests, the detection was ~10x faster than file, and with the more flexible language we got more file types recognized precisely. There were still challenges with some formats: OLE-based files have a FAT directory structure at the end of the file, and you needed to walk the tree to find the top-level structure to distinguish an Excel file from an Excel file embedded into a Word document.

Stream detection was also quite a fun task...


Ah, I remember: the self-extracting .msi file was one of the more challenging cases - it's an executable, a .cab file, and an OLE2 container all at once.


At $job we have been using Apache Tika for years.

It works, but occasionally has bugs and weird collisions when working with billions of files.

Happy to see new contributions in the space.


The results of which you'll never be 100% sure are correct...


They missed such an opportunity to name it "fail". It's like "file" but with "ai" in it.


What about faile?


But file(1) is already like that - my data files without headers are reported randomly as disk images, compressed archives, or even executables for never-heard-of machines.


Other methods use heuristics to guess many filetypes and in the benchmark they show worse performance (in terms of precision). Assuming benchmarks are not biased, the fact that this approach uses AI heuristics instead of hard-coded heuristics shouldn't make it strictly worse.


I wonder how it performs with detecting C vs C++ vs ObjC vs ObjC++ and for bonus points: the common C/C++ subset (which is an incompatible C fork), also extra bonus points for detecting language version compatibility (e.g. C89 vs C99 vs C11...).

Separating C from C++ and ObjC is where the file type detection on GitHub traditionally had problems (though it has been getting dramatically better over time); from an "AI-powered" solution trained on the entire internet, I would expect better right from the start.

The list here doesn't even mention any of those languages except C though:

https://github.com/google/magika/blob/main/docs/supported-co...


But will it let you print on Tuesday[1]?

1: https://bugs.launchpad.net/ubuntu/+source/cupsys/+bug/255161...


For a subscription fee.


My FOSS desktop text editor performs a subset of file type identification using the first 12 bytes, detecting the type quite quickly:

* https://gitlab.com/DaveJarvis/KeenWrite/-/blob/main/src/main...

There's a much larger list of file signatures at:

* https://github.com/veniware/Space-Maker/blob/master/FileSign...


Nice, perfect timing. I just restored "some" files (40GB) with PhotoRec [1], but its file type detection set some wrong types.

Edit: It would be super helpful if the "suffix" could be added as output so I can move the files to the right directory [2] ;)

[1] https://www.cgsecurity.org/wiki/PhotoRec [2] https://github.com/google/magika/issues/63


Assuming that I've not misunderstood, how does this compare to things like TrID [0], apart from being open source?

[0] https://mark0.net/soft-trid-e.html


The bulk of the short article is a set of performance benchmarks comparing Magika to TrID and others.


Argh, the risks of browsing the web without JavaScript and/or third party scripts enabled, you miss content, because rendering text and images on the modern web can't be done without them, apparently. (Sarcasm).

You are of course correct. I can see the images showing the comparison. Apologies.


I have a question: Is something like Magika enough to check if a file is malicious or not?

Example: users can upload PNG files (and only PNG is accepted). If Magika detects that the file is a PNG, does this mean the file is clean?


This comment from kevincox[1] says the answer is a hard "no":

> Worse it seems that for unknown formats it confidently claims that it is one of the known formats. Rather than saying "unknown" or "binary data".

There are other comments in this thread that make me think Google contaminated their test data with training data and the 99% results should not be taken at face value. OTOH I am not particularly surprised that Magika would be better than the other tools at distinguishing semi-unstructured plain text e.g. Java source vs. C++ source or YAMLs versus INIs. But that's a very different use case than many security applications. The comments here suggest Magika is especially susceptible to binary obfuscation.

[1] https://news.ycombinator.com/item?id=39395677


If that PNG of yours is not just an example, note that you can easily detect if the PNG has any extra data (which may or may not indicate an attempt at mischief) and reject the (rare) PNGs with extra data. I ran a script checking the thousands of PNGs on my system and found three with extra data, all three probably due to the "PNG acropalypse" bug (but mischief cannot be ruled out).

P.S.: To be clear, I'm not implying that extra data which shouldn't be there is the only way to have a malicious PNG.
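The extra-data check described above can be sketched in a few lines of stdlib Python: a well-formed PNG ends with its 12-byte IEND chunk, so anything after that terminator is trailing data (the function name here is mine, not from any library):

```python
# Flag PNGs that carry bytes after the IEND chunk -- a sketch of the
# "extra data" check described above.
PNG_SIG = b'\x89PNG\r\n\x1a\n'
IEND = b'\x00\x00\x00\x00IEND\xaeB`\x82'  # zero-length IEND chunk + its CRC

def png_trailing_bytes(data: bytes) -> int:
    """Return the number of bytes found after the first IEND chunk (0 = clean)."""
    if not data.startswith(PNG_SIG):
        raise ValueError("not a PNG")
    end = data.find(IEND)  # find, not rfind: extra data may itself contain IEND
    if end == -1:
        raise ValueError("no IEND chunk")
    return len(data) - (end + len(IEND))
```

A real scan would also walk the chunk list properly, but this is enough to catch acropalypse-style leftovers.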


The only way to do this reliably is to render the PNG to pixels, then render it back to a PNG with a trusted encoder. Of course, now you are taking on the risk of vulnerabilities in the "render to pixels" step. But the result will be clean.

AKA parse, don't validate.


> does this mean the file is clean?

No.


> Magika: AI powered fast and efficient file type identification

of 116 file types, with a tiny proprietary model, no training code, and no dataset.

> We are releasing a paper later this year detailing how the Magika model was trained and its performance on large datasets.

And? How does this googleblog post, plus source code that is useless without the closed-source model, advance the industry? All I see here is a loud marketing name and loud promises, but barely anything actually useful.


It seems like it defeats the purpose of such a tool that this initial version doesn't handle polyglot files. I hope they're quick to work on that.


Took a .dxf file and fed it to Magika. It says with 97% confidence that it must be a PowerShell file. A classic .dwg could be "mscompress" (whatever that is), 81%, or a GIF. Both couldn't be further from the truth.

Common files are categorized successfully – but well, yeah that's not really an achievement. Pretty much nothing more than a toy right now.


The real problem with deep learning approaches is hallucination and edge case failures. When someone finally fixes this, I hope it makes the HN front page.


It seems to detect my Android build.gradle.kts as Scala, which I suppose is a kind of hilarious confusion but not exactly useful.


This is useful for detecting the file types of unknown blobs with custom file extensions, where the file command just returns "data". Though it doesn't correctly identify Lua code for some reason; it guesses with low confidence that it's either Ruby or JavaScript, or anything but Lua.


If their “Exif Tool” is https://exiftool.org/ (what else could it be?), I don’t understand why they included it in their tests. Also, how does ExifTool recognize Python and html files?


I wonder what the output will be on polyglot files like run-anywhere binaries produced by cosmopolitan [1]

[1]: https://justine.lol/cosmopolitan/


Why is this piece of code being sold as open source, when in reality it just calls into a tiny proprietary ML blob, the actual training code for the model is closed, and a properly useful large model doesn't exist?


Not into a proprietary blob: the weights are in an Apache-licensed repo. There's no training code, but the repo contains enough information to recreate it, basically JSON-based configs describing the graph architecture. Even without those, the repo contains an ONNX model, from which one can work out the architecture.


I wonder how big of a deal it is that you'd have to retrain the model to support a new or changed file type? It doesn't seem like the repo contains training code, but I could be missing it...


After reading thru all the comments, honestly I still don't get the point of this system. What is potential practical value or applications of this model?


Is it really common enough for files not to be annotated with a useful/correct file type extension (e.g. .mp3, .txt) that a library like this is needed?


Yes!

Sometimes a file has no extension. Other times the extension is a lie. Still other times, you may be dealing with an unnamed bytestring and wish to know what kind of content it is.

This last case happens quite a lot in Nosey Parker [1], a detector of secrets in textual data. There, it is possible to come across unnamed files in Git history, and it would be useful to the user to still indicate what type of file it seems to be.

I added file type detection based on libmagic to Nosey Parker a while back, but it's not compiled in by default because libmagic is slow and complicates the build process. Also, libmagic is implemented as a large C library whose primary job is parsing, which makes the security side of me jittery.

I will likely add enabled-by-default filetype detection to Nosey Parker using Magika's ONNX model.

[1] https://github.com/praetorian-inc/noseyparker


Nothing is ever simple. Even for the most basic .txt files, it's still useful to know what the character encoding is (UTF-8/16? Latin-whatever? etc.) and what the line ending is (\n, \r\n, \r), as well as determining whether some maniac removed all the indentation characters and replaced them with a mystery number of spaces.

Then there are all the container formats that have different kinds of formats embedded in them (mov,mkv,pdf etc.)
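To illustrate the first point, here is a rough stdlib-Python sketch that guesses a text blob's encoding from its BOM (falling back to UTF-8) and its dominant line ending; real-world detection needs far more than this, and the function name is my own:

```python
import codecs
from collections import Counter

# BOM-to-codec table for the common Unicode encodings (order matters:
# the UTF-8 BOM must be checked before the UTF-16 ones would match).
BOMS = [(codecs.BOM_UTF8, 'utf-8-sig'),
        (codecs.BOM_UTF16_LE, 'utf-16-le'),
        (codecs.BOM_UTF16_BE, 'utf-16-be')]

def sniff_text(data: bytes):
    """Guess (encoding, line_ending) for a text blob. BOM-less input
    falls back to UTF-8; files with no newlines return None for the ending."""
    encoding = next((name for bom, name in BOMS if data.startswith(bom)), 'utf-8')
    text = data.decode(encoding, errors='replace')
    endings = Counter()
    endings['\r\n'] = text.count('\r\n')
    endings['\n'] = text.count('\n') - endings['\r\n']   # bare \n only
    endings['\r'] = text.count('\r') - endings['\r\n']   # bare \r only
    eol = max(endings, key=endings.get) if any(endings.values()) else None
    return encoding, eol
```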


A fun read in service of your first point: https://en.wikipedia.org/wiki/Bush_hid_the_facts


At multiple points in my career I've been responsible for APIs that accept PDFs. Many non-tech-savvy people, seeing this, will just change the extension of whatever file they're uploading to `.pdf`.

To make matters worse, there is some business software out there that will actually bastardize the PDF format and put garbage before the PDF file header. So for some things you end up writing custom validation and cleanup logic anyway.


malware can intentionally obfuscate itself


I guess I'm kind of a dummy on this, but why is it impressive to identify that a .js file is Javascript, a .md file is Markdown, etc?


Because it's done by inspecting the content, not the name of the file.


Very useful.

I wrote an editor that needed file type detection but the results of traditional approaches were flaky.


It can't correctly identify a DXF file in my testing. It categorizes it as plain text.


I use FFMPEG to detect if uploaded files are valid audio files. Would this be much faster?


Can we please god stop using AI like it's a meaningful word? This is really interesting technology; it's hamstrung by association with a predatory marketing term.


The name sounds like the Pokémon Magikarp or the anime series Madoka Magica.


I used an HTML file and added JPEG magic bytes to its header:

magika file.jpg

file.jpg: JPEG image data (image)
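For anyone wanting to reproduce the experiment above, a minimal sketch of the trick (the HTML content is made up; I'm using the standard JPEG SOI/APP0 marker bytes):

```python
# Reproduce the polyglot trick above: JPEG magic bytes in front of HTML.
JPEG_MAGIC = b'\xff\xd8\xff\xe0\x00\x10JFIF\x00'  # SOI + APP0 segment + "JFIF"

html = b'<!DOCTYPE html><html><body>hello</body></html>'
fake_jpeg = JPEG_MAGIC + html

# Any detector that only checks leading magic bytes now sees a JPEG,
# even though the payload is HTML.
assert fake_jpeg.startswith(b'\xff\xd8\xff')
```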


Why not detect it by checking the magic number of the buffer?


For starters, not every file has one, and many magic numbers can be incorrect.

Especially in the context of virus scanning, you don't trust what the file says it is.
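To make the discussion concrete, here is what a naive magic-number check looks like, and why it falls short: several formats share the same magic (anything zip-based), and plain text has no magic at all. A stdlib sketch, with a table of my own choosing:

```python
# A toy magic-number table -- real tools like file(1) carry thousands of rules.
MAGIC = {
    b'\x89PNG\r\n\x1a\n': 'png',
    b'%PDF-': 'pdf',
    b'PK\x03\x04': 'zip',   # also docx, xlsx, jar, apk, epub ...
    b'\x7fELF': 'elf',
}

def by_magic(data: bytes):
    """Return a type name from leading magic bytes, or None if nothing matches."""
    for magic, name in MAGIC.items():
        if data.startswith(magic):
            return name
    return None  # e.g. plain text, CSV, most source code
```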


> So far, libmagic and most other file-type-identification software have been relying on a handcrafted collection of heuristics and custom rules to detect each file format.

> This manual approach is both time consuming and error prone as it is hard for humans to create generalized rules by hand.

Pure nonsense. The rules are accurate, based on the actual formats, and not "heuristics".


The rules aren't based on the formats but on a small portion of them (their magic numbers). This makes them inaccurate (think docx vs. zip) and heuristic.
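To illustrate the docx-vs-zip case: both start with the same PK\x03\x04 magic, so telling them apart means opening the archive. A stdlib sketch relying on the OOXML convention that documents always contain a [Content_Types].xml entry:

```python
import io
import zipfile

def zip_flavor(data: bytes) -> str:
    """Classify a PK archive by its contents: OOXML documents (docx/xlsx/pptx)
    carry [Content_Types].xml; anything else is treated as a plain zip."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = zf.namelist()
    return 'ooxml' if '[Content_Types].xml' in names else 'zip'
```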


1. Not all file formats are well specified.
2. Not all files precisely follow the specification.
3. Not all file formats are mutually exclusive.

Those facts are clearly reflected in the table.


Besides compound file types, not all formats are well specified either. CSV is an example.
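CSV is a good example precisely because it has no magic bytes and RFC 4180 is widely ignored, so detection is necessarily heuristic. Even Python's standard library resorts to a guessing Sniffer, which can itself misfire (the sample data here is mine):

```python
import csv

# A semicolon-delimited "CSV" -- common in locales where ',' is the decimal mark.
sample = 'name;age;city\nada;36;london\nalan;41;manchester\n'

dialect = csv.Sniffer().sniff(sample)
print(dialect.delimiter)  # guessed delimiter: ';'
```

Sniffer works by looking at quote/delimiter co-occurrence and character frequency per line, so a sufficiently weird but valid file can still fool it.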


Voidtools - Everything.. looking at you to implement this


probably a lot of interesting work going on that looks like this for the virustotal db itself.


This couldn't have been released at a better time for me! I really needed a library like this.


Tell us why!


Thanks :)


Why? Just check the damn headers. Why do you need a power hungry and complicated AI model to do it? Why?


We have had file(1) for years


This is beyond what file is capable of. It’s also mentioned in the third paragraph.

RTFA.


Some HN readers may not know about file(1) even. It's fine to mention that $subj enhances that, but the rtfa part seems pretty unnecessary.


Yes, it's slower than file(1), uses more energy, recognizes fewer file types, and is less accurate.


FWICT, file is more capable and predictable, and also faster and more energy-efficient.


That's not what the performance table in the article implies: Magika's precision and recall hover around 99%, while magic sits at 92% precision and 72% recall.

One can doubt the representativeness of their dataset, but if what's in the article is correct, Magika is clearly the more capable and predictable tool.


Nearly 20 years back, this group of Linux users used to brag that Linux would identify files even if you changed the extension, while Windoze had to police you about changing extensions.


[flagged]


This is a 1 MB Keras ML model that's open source.

I passionately dislike Surveillance Capitalism but bringing it up when it’s completely irrelevant only weakens the argument.

RTFA.


> Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that".

From the HN guidelines


> can be shortened to "The article mentions that".

I shortened it to RTFA.


How do I pronounce this? Myajika or MaGika? Anyhow, it's super cool.


Can someone please help me understand why this is useful? The article mentions malware scanning applications, but if I'm sending you a malicious PDF, won't I want to clearly mark it with a .pdf extension so that you open it in your PDF app? Their examples are all very obvious based on file extensions.



