Hacker News
Let's Solve the File Format Problem (2019) (archiveteam.org)
126 points by pabs3 on Nov 28, 2020 | 72 comments




oh, so google cache still exists? It is not visible in search anymore, so I assumed it was dead.


On desktop, the small arrow next to the search result url still shows the cached link when available.


Professional UI people are great! I'm so glad modern UIs are so obvious, intuitive, and discoverable.


They say file format problem, but one of the reasons why there are so many different file formats is that current file system abstractions are fundamentally insufficient for a wide variety of use cases, particularly ones where you need to ensure that a user can't accidentally separate or delete a file. How do you keep a cache directory that backs a file with the file if the file system doesn't let you put folders "inside" of files? This is only one of many issues that have resulted in the proliferation of file formats to meet use cases with different performance characteristics that all have the common requirement for data to be bundled together and not be separable. File systems that can decouple the notion of access and data locality from inseparability simply do not exist right now, so we are left with the horrible abstraction that is the archive or zip file.


> particularly ones where you need to ensure that a user can't accidentally separate or delete a file. How do you keep a cache directory that backs a file with the file if the file system doesn't let you put folders "inside" of files?

You are worried that if an application has a directory "cache" with files in it, a user will go into that directory and randomly delete the file "cache/3F3819A1.dat", and that might confuse the application.

But, suppose instead you had a single file "cache.sqlite", containing a table "cache" with an "id" column. What's to stop a user from opening that file with the sqlite command line tool and running "delete from cache where id='3F3819A1'"?

If a user really wants to muck with your app's data by hand, you can't stop them. (Unless you go with some DRM-like solution in which your app runs in a secure enclave such as Intel SGX)

Whether it is a single file or a directory of files doesn't really make a difference.
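
To make it concrete, the equivalent of deleting that cache file is a few lines of Python against the hypothetical cache.sqlite above; nothing about the single-file layout protects the data from a curious user:

    import sqlite3

    # Poking at the app's single-file cache is no harder than deleting
    # cache/3F3819A1.dat from a cache directory would be.
    con = sqlite3.connect("cache.sqlite")
    con.execute("DELETE FROM cache WHERE id = ?", ("3F3819A1",))
    con.commit()
    con.close()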


More like you want users to be able to move their data around, send it, email it, etc., without thinking about 50 files in 10 directories that need to be placed in the right way. Rather, just have one file with all the relevant data, and any structure you need for that data is embedded in the file itself.


So, a zip file?


I think the problem is the user needs to know where to look, and that gets difficult when you bring in sprawling, deeply nested directory trees.


Even with current abstractions this is the case. I'm not worried about users that have sufficient knowledge to intentionally modify a file. Users who know enough to find, much less modify an sqlite database can make their own mess if they want to. I'm more worried about cases where something like a sidecar file gets deleted or overwritten, or fails to be copied along with the main file. Some programs and file formats come up with their own way of embedding sidecar metadata in the main file, but that is a complete hack that has to be used to work around the fact that the sidecar file cannot be made inseparable from the main file while still being just another file. Alternate data streams on windows can sort of do this (xattrs on nixen aren't really up to the task) but they are completely non-portable.


> Alternate data streams on windows can sort of do this (xattrs on nixen aren't really up to the task) but they are completely non-portable.

On Solaris, xattrs really are up to the task; they are basically Windows alternate data streams by another name. Unlike Linux xattrs, their maximum size is the same as that of the main stream of the file. (Likewise, on macOS the maximum xattr size is the same as that of a file's main stream.)

I think if Linux would lift its limit on maximum xattr size then you would have the same basic function across most major operating systems, just with some differences in API. Then if someone could devise a standard API (even as just a portability layer), your problem would be mostly solved. Just also making sure that tools like cp, scp, sftp, zip, tar, etc, know about the xattrs and transport them. (Which is actually already true for some of those tools on some platforms.)
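
As a rough sketch of what such a portability layer could look like (purely illustrative: the function names are made up, Python only exposes os.setxattr/os.getxattr on Linux, and the "path:stream" trick relies on NTFS alternate data streams):

    import os
    import sys

    def set_sidecar(path: str, name: str, data: bytes) -> None:
        # Hypothetical shim: NTFS alternate data stream on Windows,
        # a user.* extended attribute elsewhere (on Linux the size is
        # limited, which is the restriction discussed above).
        if sys.platform == "win32":
            with open(f"{path}:{name}", "wb") as f:
                f.write(data)
        else:
            os.setxattr(path, f"user.{name}", data)

    def get_sidecar(path: str, name: str) -> bytes:
        if sys.platform == "win32":
            with open(f"{path}:{name}", "rb") as f:
                return f.read()
        return os.getxattr(path, f"user.{name}")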


I have used python's pathlib to create a common abstraction across ADS and linux xattrs, but have had to stop at the lowest common denominator that is the linux implementation. It is good to know that some of the other nixen don't have the same restrictions because it means there might be a shorter route to interoperability (especially in light of some of the other threads on the linux file system that have popped up over the last couple of days). The second piece of the puzzle is also exactly as you point out, which is xattr support across all the common tools. Many of them have it already; however, it is usually not enabled by default, so it is extremely easy for a user to accidentally copy only the main file data. To give an example of the issue, vim's default save implementation copies files in a way that removes xattrs. Thus, even if xattrs can get closer, the fact that they still require tools to implement correct behaviour makes it a bit like the situation with cooperative multitasking.
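
A small illustration of that opt-in problem on Linux (assuming an xattr-capable filesystem and a throwaway file; a plain byte copy drops the attribute, and shutil.copy2 only carries it because it explicitly tries to):

    import os
    import shutil

    os.setxattr("report.pdf", "user.source", b"https://example.org/report")
    shutil.copy("report.pdf", "copy_plain.pdf")   # bytes only, xattr lost
    shutil.copy2("report.pdf", "copy_meta.pdf")   # attempts to copy xattrs too
    print(os.listxattr("copy_plain.pdf"))         # []
    print(os.listxattr("copy_meta.pdf"))          # ['user.source']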


This infuriates me. There are so many problems I run into, day-to-day even, which could be readily solved if files/folders just had a stringly-typed key-value metadata store of arbitrary size which was portable-ish across file systems. Heck, even just one bytes field I can dump json or some serde into, that doesn't impinge on the actual file bytestream. xattr is close but no cigar.

Instead you either need sidecar files, archives, or custom binary formats.

It's as if the posix filesystem abstraction crystallized a bit too soon (much like tcp/ip)
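
Something like the following is the kind of thing I mean; it works on Linux today, but the size limits and the tool support issues discussed above are exactly why it's close but no cigar (the attribute name and metadata are invented):

    import json
    import os

    meta = {"source": "scanner-3", "tags": ["inbox", "2020"], "reviewed": False}

    # Stash a JSON blob in a single extended attribute, leaving the file's
    # byte stream untouched. On ext4 this has to fit in roughly one
    # filesystem block, so it only works for small metadata.
    os.setxattr("scan-0042.png", "user.metadata", json.dumps(meta).encode())
    loaded = json.loads(os.getxattr("scan-0042.png", "user.metadata"))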


I'm curious about the specifics of the problems this would solve, having been doing some thinking about what an ideal file system would look like. In particular, is this something you would need primarily for files you create, or existing files created by other people? And, to be fully useful, would it need to allow the attaching of such metadata to arbitrary files/folders regardless of type?


As a sysadmin, this sounds like a mess. Some tools support them, and some don't? Does the alternate stream/fork data show up in `df` and `du`? Can I cat the alternate streams? How many alternate streams are there?

Why not just leave files as a sequence of bytes and let the applications handle this in their file formats? It's not like we don't have standards and libraries there already.


They already aren't a stream of bytes. There's time and permission metadata, which tar, cp, rsync and friends may or may not handle in various ways based on option flags.

I guess it boils down to what is "application data" and what isn't.


> work around the fact that the sidecar file cannot be made inseparable from the main file while still being just another file.

Hmmm... have you ever heard about the concept of archive? It's a single file that you can copy around, but it contains literally several files inside!


Snark aside, this is obviously not a silver bullet. First reason that comes to mind: if you delete a file on the file system, the disk space can be reclaimed instantly. But if you delete a file in an archive, you must either rewrite the archive (since you can't "move" bytes in a file), or accept that disk space will be wasted.


> But if you delete a file in an archive, you must either rewrite the archive (since you can't "move" bytes in a file), or accept that disk space will be wasted.

Another option is to tell the operating system to deallocate part of the file, turning it into a sparse file. This can be done with fallocate(FALLOC_FL_PUNCH_HOLE) on Linux, FSCTL_SET_ZERO_DATA on Windows.

Linux also has fallocate(FALLOC_FL_COLLAPSE_RANGE) which lets you actually remove a byte range from the middle of a file, shifting up everything that comes after that byte range. So, in fact, you can "move" bytes in a file, on certain filesystems on Linux (ext4, xfs, among others).

Unfortunately, support for sparse files is quite variable across platforms. The macOS equivalent to fallocate(FALLOC_FL_PUNCH_HOLE) is fcntl(F_PUNCHHOLE), but the issue is that HFS+ has never supported sparse files. UFS did, but UFS support was dropped from 10.7 onwards. The support is supposed to be back in APFS, but I don't know if it actually works.
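
For reference, a Linux-only sketch of both calls via ctypes (the file name and offsets are placeholders, a 64-bit off_t is assumed, and FALLOC_FL_COLLAPSE_RANGE additionally needs block-aligned offsets on a supporting filesystem such as ext4 or xfs):

    import ctypes
    import os

    libc = ctypes.CDLL("libc.so.6", use_errno=True)
    libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                               ctypes.c_longlong, ctypes.c_longlong]

    FALLOC_FL_KEEP_SIZE = 0x01
    FALLOC_FL_PUNCH_HOLE = 0x02      # must be combined with KEEP_SIZE
    FALLOC_FL_COLLAPSE_RANGE = 0x08

    def check(ret):
        if ret != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))

    fd = os.open("archive.bin", os.O_RDWR)
    # Deallocate 8 KiB at offset 4 KiB; the file keeps its size but the
    # range becomes a hole that no longer occupies disk space.
    check(libc.fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 4096, 8192))
    # Remove the same range entirely, shifting the rest of the file up.
    check(libc.fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 4096, 8192))
    os.close(fd)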


>You are worried that if an application has a directory "cache" with files in it, a user will go into that directory and randomly delete the file "cache/3F3819A1.dat", and that might confuse the application.

I have done exactly this for a few applications. I likely wouldn't have done it for the database.

The reason I've done this is disk space. Some software is bad at getting rid of things that they've cached (eg Chrome). They end up taking more and more and more and more disk space. I don't understand why my browser needs to take up 10-20 GB of disk space, if it can't even retain 7 months of history.


>The reason I've done this is disk space. Some software is bad at getting rid of things that they've cached (eg Chrome). They end up taking more and more and more and more disk space. I don't understand why my browser needs to take up 10-20 GB of disk space, if it can't even retain 7 months of history.

But then you just hide your bad caching design from users by only "showing" what you're doing wrong to those that understand SQL. And those users can still change it.

You should test your app for random cache permutations (user changes .dat file in the wrong way) and partial cache loss (user deletes random file). You have to assume that anything that can be broken will be broken.


Oh, I'm sorry. I meant that I've deleted these "random" files as the user. As a user I have no faith that application developers will properly manage disk space usage for what they cache. If an application like Chrome has no problem taking up 10 GB of space mostly due to caching, then I have no expectation that most other software will do better. Perhaps this information gets communicated to the user somewhere, but it tends to not be done in an easy to understand manner.


There should be an operating system solution for caches, so the storage used can be discounted, the space can be freed as required, the operating system can delete caches belonging to deleted apps (or that apps have failed to clean up), and the system can show the user which app is hogging space.


For a good lot of cases, sqlite can serve as a perfectly good file format. I actually find it hard to think of a case where sqlite would fall short... perhaps nested markup requirements are the only thing that can necessitate additional structure in the data. If your application needs a file format, choose sqlite and move on.


Using a database is a partial answer at best. File formats don't just dictate the structure of data, they carry semantic info. Using sqlite gives you structure without semantics.


Unless you also store contextual information alongside the raw bytes.

Curious if you can clarify / specify / exemplify what "semantic" means in this case?


Understanding a file format means knowing the significance of a given bit. Not just that X has value 13, but that 13 means something and might influence the significance of Y or Z.

Having a table with columns A, B, and C doesn't say a ton about what the values in them mean. Or how they relate to one another. If it's a bunch of decimal numbers, it could be financial data, with the values expected to sum differently depending on how another column that indicates a type is set. Or it could be temperature readings. Or distances, with the unit implied or tracked elsewhere.

Generally this is the kind of data and rules for reasoning about data that isn't stored in a database. Certainly not a sqlite database. Column and table names are often vague or unclear, and in my experience it's an exceptionally rare developer who leaves detailed commentary on the relationships between columns and tables and their possible values in their create statements.


It is entirely possible to capture complex structures in a set of tables. For example, the fossil SCM uses sqlite to store file contents, commit history, deltas and metadata, issue tickets, user access info, custom ticket formats if any, and wiki documentation all in a single file, driven by a single executable. Most applications don't have a more complex structure than that. Furthermore, parts of the application can pull the exact bits they need at a given time instead of having to parse all the data, since sqlite can run full fledged queries. Total win IMO.

The argument for this has already been made very well by the sqlite creator.

https://sqlite.org/appfileformat.html
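
To give a flavour of that approach, a hedged sketch of a single-file application format; the schema, file name and application_id value are all invented here, but PRAGMA application_id and PRAGMA user_version are the standard sqlite hooks for tagging and versioning such a file:

    import sqlite3

    con = sqlite3.connect("notes.myapp")
    con.executescript("""
        PRAGMA application_id = 0x4D594150;  -- arbitrary 'MYAP' tag for this format
        PRAGMA user_version = 1;             -- bump when the schema changes

        CREATE TABLE IF NOT EXISTS document (
            id    INTEGER PRIMARY KEY,
            title TEXT NOT NULL,
            body  TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS attachment (
            doc_id INTEGER NOT NULL REFERENCES document(id),
            name   TEXT NOT NULL,
            data   BLOB NOT NULL
        );
    """)
    con.commit()
    con.close()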


> If it's a bunch of decimal numbers, it could be financial data, with the values expected to sum differently depending on how another column that indicates a type is set.

Ideally, this type of invariant would be in the DB, and not as documentation but enforced in code (i.e., by triggers, since it's too complex to be a simple referential integrity constraint.)

> Generally this is the kind of data and rules for reasoning about data that isn't stored in a database.

The ones that are invariants should be enforced in the DB, and the ones that aren't should generally be in in-db documentation. Sqlite in particular lacks the ability to attach the metadata usually used for that purpose, but does store code comments in the schema, which provides roughly-equivalent functionality.
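
A minimal sketch of both ideas in sqlite (invented table and rule, just to show the shape): the trigger enforces the invariant, and the comments live in the schema text that sqlite preserves verbatim:

    import sqlite3

    con = sqlite3.connect("ledger.sqlite")
    con.executescript("""
        CREATE TABLE entry (
            id     INTEGER PRIMARY KEY,
            kind   TEXT NOT NULL CHECK (kind IN ('debit', 'credit')),
            amount NUMERIC NOT NULL   -- debits negative, credits positive
        );
        -- Invariant too awkward for a referential constraint: the sign of
        -- amount must agree with kind.
        CREATE TRIGGER entry_sign BEFORE INSERT ON entry
        WHEN (NEW.kind = 'debit'  AND NEW.amount >= 0)
          OR (NEW.kind = 'credit' AND NEW.amount <= 0)
        BEGIN
            SELECT RAISE(ABORT, 'amount sign does not match entry kind');
        END;
    """)
    # The schema text, comments included, survives in sqlite_master.
    print(con.execute("SELECT sql FROM sqlite_master").fetchall())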


> The ones that are invariants should be enforced in the DB, and the ones that aren't should generally be in in-db documentation. Sqlite in particular lacks the ability to attach the metadata usually used for that purpose, but does store code comments in the schema, which provides roughly-equivalent functionality.

What level of in-db documentation are you in the habit of writing? I can honestly say I've never worked on or encountered a codebase that explained fully the significance of any of its database columns with in-db comments. I would count myself exceptionally lucky to find natural-language prose documenting such details! Generally I find myself puzzling things out by a combination of reading application code, tinkering with tests, and interrogating developers.

I think I've only ever once seen any real use of comments in a real-life codebase. It was a postgres database, where columns that contained PII were commented according to whether it was plaintext or encrypted.


> I think I've only ever once seen any real use of comments in a real-life codebase. It was a postgres database, where columns that contained PII were commented according to whether it was plaintext or encrypted.

~10 years ago I worked on a database frontend that stored schema metadata (version, additional type info) in database comments ;)

I've maybe seen database comments used a handful of times. But I also worked a couple years as a consultant, so saw a lot of databases...


Can you give an example where the semantic data you refer to is in the file format in a way that is better?


The point is that having a way of accessing a column rather than needing to parse to find a byte is only a partial answer to the problem of file formats. The consuming application still needs custom code to comprehend the format, and the physical layout of data within is often only a small part of that.

IMO Sqlite would be a terrible format for e.g. a word processor. Database style formats are good when partial updates are the norm, but if your application typically needs the entire data model in memory, a relational mapping is unpleasant overhead.


> File systems that can decouple the notion of access and data locality from inseparability simply do not exist right now

For the record, it does. The Boomla OS has exactly that, the ability to store files in files. As in, files and directories are the same concept. It is not a posix filesystem of course and the OS is built specifically for web development. (Disclosure: I’m working on the project.)

https://boomla.com/docs/how-it-works/anatomy-of-the-boomla-f...


Here's to hoping https://docs.ipld.io/ helps for acyclic data. Not sure what's needed for cyclic (including relational) data.


What's wrong with zip? It's fine. Maybe more than fine if it managed to be this popular for so many applications. From my understanding it's very good on floppy disk too


Consider a use case where you need high concurrent write performance and you don't want the things you are writing to be accidentally dissociated from the file. If you have to write to a zip file there will be overhead every time you append a new "file" because you have to rewrite the central directory at the end of the file.

Basically there is no way to say to your operating system "hey, can I get a no-nonsense way (one where I don't want to have to come up with some other convention that will be broken) to have some additional storage and file system paths that are just for this file, so I don't have to worry about name collisions?" I already have a perfectly good unique name under which I could stash information, but the posix standard basically prevents me from using that name as a way to safely namespace things that I need per file.

However, to your original point I have partially implemented this [0] using zip files on the read side. I treat the internal paths inside the zip file as if they are rooted at the zip file itself. I haven't implemented rewriting though.

0. https://github.com/tgbugs/augpathlib/blob/master/augpathlib/...
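
(For reference, the stdlib's zipfile.Path offers a roughly similar "rooted at the zip file" view on the read side; the file names below are placeholders:)

    import zipfile

    root = zipfile.Path("bundle.zip")           # the zip acts as a directory
    meta = root / "metadata" / "sidecar.json"   # "bundle.zip/metadata/sidecar.json"
    if meta.exists():
        print(meta.read_text())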


From what I can gather they are trying to compose a database of information on what application generates/can read a specific file, primarily by file extension.

This already exists in libmagic (https://github.com/file/file) and can be used on any BSD/Linux system by typing `file [filename]`
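
The same lookup is easy to script, for example (using the standard --brief/--mime-type flags of file(1); the file name is a placeholder):

    import subprocess

    out = subprocess.run(
        ["file", "--brief", "--mime-type", "mystery.bin"],
        capture_output=True, text=True, check=True,
    )
    print(out.stdout.strip())   # e.g. "application/zip"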


The goal is a bit more expansive than libmagic: it is a wiki of file format resources, including extensions, software, format specifications, related formats and more.


Yeah, it includes mechanical specifications of physical media so, say, alien archaeologists can get data off a VHS.

People are reading the title thinking it's about filesystems. It's about cataloging all human information storage methods, from phonographs to zip drives to WordPerfect files.


libmagic is quite primitive and can't reliably handle complex data types, like MS Office documents with embedded objects (especially the versions before 2007), files with common container types (jar/war files are detected just as zip), etc. - the signature language is quite primitive from my point of view. Plus it's quite slow...

I've spent a long time (> 10 years) developing code for detection of file types (without relying on the file extension), and extraction of the data... And everything needed to be done very fast & reliably. The last project was the content detection for a filtering web proxy, where every millisecond counts. And for signatures, there was a lisp-like language that allowed describing very complex detection rules, although it was sometimes still necessary to go to C++ for things like OLE2 parsing, XML parsing, or listing files in Zip files, etc.

From the open source alternatives, it makes sense to look at Apache Tika, which has both content detection & extraction capabilities, although it's written in Java...


And, in case of need, on Windows there is TrID:

https://mark0.net/soft-trid-e.html

which also has a couple of online options and a database of signatures/magic numbers:

https://mark0.net/soft-trid-deflist.html


There is a Linux version too! But sadly, there is no library, which would be incredibly useful for analysing, for example, blobs in a database, or categorising random files uploaded by a user of a website.


libmagic focuses on identifying file type using magic numbers and other properties. The wiki has links to resources for programmatically extracting image size or color palette, for example.


libmagic can also extract metadata for images.


The file format problem is that the file defines the format. Obsession with file name extensions defining format soon crumbles; the .001, .002, ... being a clear case of “You know not the f- of which you speak.”

To “solve” file format, you may need a lot of data beyond the file provided. Does the file come from a floppy disk labeled VisiCalc? Did you read the magnetic media correctly? Let’s say you did. You look at a file and then try, the best you can, to determine what it is, maybe given a lot of information about how the computer that read the file interpreted it.

Sure, if you’re talking about more modern files, maybe the filename’s extension is a good hint.

Some files contain metadata headers or footers. They maybe have multiple layers of them. That metadata may have evolved. Perhaps it wasn’t versioned.

Files were stored on cassette which could be listened to as audio or interpreted as containing file definitions.

What if the cassette were warped in the heat? You could still recover the file partially or fully if you knew that you could speed up portions of the recovered audio, perhaps with additional pitch-bending and error-correction. Could you do that without any background information? Maybe, but it may take much more effort.


Open or at least documented file formats pretty much solved the file format problem. It's incredibly rare for me to encounter any file format I can't use natively or easily convert into something else.


Seems to have had the kiss of death:

    Warning: Unknown: Unable to allocate memory for pool. in Unknown on line 0
    
    Warning: require(): Unable to allocate memory for pool. in /usr/local/www/mediawiki/index.php on line 54
    
    Warning: Cannot modify header information - headers already sent in /usr/local/www/mediawiki/includes/WebStart.php on line 63
    
    [..]
Seems kinda weird this stuff isn't cached?


Fwiw the content is below the errors, and is still browseable etc.


It depends on when more RAM becomes available. I was getting different errors and different amounts of content every refresh.


I like this idea of trying to solve the problem. I thought it was just about disk file formats.

http://fileformats.archiveteam.org/wiki/Electronic_File_Form...

At work we have one that creates a “.snapshot” directory with folders with backups. It works pretty well. (NetApp?) I would like that idea at home (Apple has Time Machine which I guess is similar)

One thing about Linux: I thought it would eventually surpass the proprietary OSes in a lot of areas, including file systems. I think it has in a lot of ways, but file systems seem to be a bit of a mess.

I’ve heard good things about Btrfs and ZFS but like all things Linux I’ve heard bad things too (you will lose all your data). There are so many options and changing off the default seems hard.


I'm supposed to comment here and leave this link below as a clue. Two thumbs up.

http://fileformats.archiveteam.org/wiki/Category:File_format...


For my project, I chose txt as the base format. Ascii or Unicode? There's a config for that.

From that, I can go to SQLite, HTML, DAG, whatever. It's easy to work with, easy to back up and synchronize, and as human readable as it gets.


Somewhat confusing is that the cataloguing here has markdown and org-mode in separate sections, the former in "markup" and the latter in "text-based data".


That makes some sense: org-mode and its ecosystem is vastly more ambitious than Markdown.


For anyone into file formats I can strongly recommend the work of Ange Albertini - polyglots, visual documentation of file layouts, and the legendary PoC||GTFO.


Just solve the php problem.

`Warning: Unknown: Unable to allocate memory for pool. in Unknown on line 0`


They could change the mediawiki config to use file cache ( https://www.mediawiki.org/wiki/Manual:$wgUseFileCache ) which is basically a poor-man's version of varnish (it saves all the rendered pages on disk, and outputs that instead of rendering the page if you're not logged in or otherwise can't have a cached page). Would probably help a lot with the hn hug of death. Not sure if it would be enough, but it might be, especially if combined with different max connection settings in the webserver to prevent memory starvation.


This is a larger ArchiveTeam problem, not a language problem. They use MediaWiki (Wikipedia) to effectively host static informational pages without a cache in front.


As the main sysadmin for the archive team's web frontend (this one happens to be hosted outside of my control): yes, it's mediawiki; no, we cannot use a fancy cache. JScott has a special deal with the web hoster that makes the hosting damn near bullet proof. On top of that, the overall group is anti-cloudflare, so we are kinda stuck with what I've been able to eke out of a cpanel shared hosting plan.


There are a number of open source friendly hosters that could probably host it for free. These ones come to mind:

https://osuosl.org/ https://www.fosshost.org/ http://www.gplhost.com/ https://www.digitalocean.com/open-source/ https://www.gandi.net/en/gandi-supports


Oh, it doesn't work like that. I direct you to the time he got sued for 1 billion dollars. These are "crackpots" that love to threaten our services. He gets crazy-as-shit takedown notices about once a week, which he promptly ignores.


Send me an email (in my profile), I'll give you guys a server and capacity on my CDN.


Indeed. I hate PHP as much as the next dev, but Varnish and the like have been around for ages, and are easy to set up and operate. Dead simple, affordable solution for an essentially static site like this.


Tbf, the expected audience is small; it's not like they are trying to sell something where it's critical to stay up through some spike.

The architecture is totally fine for their use case. Architecting it to stand up to the hn hug of death would be over-engineering and not really make sense given their goals.


From the cached version:

> Hosting is provided gratis thanks to the kind folks at Tranquil Hosting


I'm not sure what point you are trying to make? If AWS gives $10,000 in cloud credits to a startup and their product goes down, it is AWS' fault?


Is that the file format problem? Are they being meta?


The solution is to not log warnings to the browser. It’s information leakage for no benefit, the users can’t fix this issue.


Problem solved if they used a static page here?

Guess the 'hug of death' has exhausted available memory for PHP to display the page.


Oh dear...



