There's certainly a lot you can do to constrain I/O, but removing side-effects when possible will always be better than merely restricting them.
I do like the idea of immutable files and repeatable builds, but I don't think this negates the benefits of maximising the time you spend working with data.
For instance, there might be a task that automatically cleans up some files, but you want to keep those files around. If the tasks produce data structures that describe the build process, it's much easier to intercept and prevent or reorder the cleanup process. Functions, especially ones that are Turing-complete, are notoriously opaque.
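To make that concrete, here's a rough sketch of what I mean, with task maps I'm inventing on the spot (none of this comes from a real build tool):

```clojure
;; A hypothetical build plan expressed as data rather than as opaque functions.
(def build-plan
  [{:task :compile  :inputs ["src"]            :outputs ["target/classes"]}
   {:task :jar      :inputs ["target/classes"] :outputs ["target/app.jar"]}
   {:task :clean-up :deletes ["target/classes"]}])

;; Because the plan is plain data, intercepting the cleanup step is just a filter...
(defn keep-intermediates [plan]
  (remove #(= :clean-up (:task %)) plan))

;; ...whereas a plan expressed as a chain of functions gives us nothing to
;; inspect or rewrite before running it.
(keep-intermediates build-plan)
;; => the same plan, minus the :clean-up step
```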
> removing side-effects when possible will always be better than merely restricting them
I disagree. Effects are a very natural mental model for a great many problems, and constraining yourself to purity is both impractical and quick to hit diminishing returns.
Furthermore, if you can intercept effects, you can impose purity upon them. For an extreme example, consider application virtualization and containers such as Docker: by intercepting the system call table, you can create a "pure" filesystem as seen from outside the container. At the other extreme, take a look at "extensible effects" and the Eff language, which let you stub out any subset of the available effects, right down to the individual expression!
> If the tasks produce data structures that describe the build process, it's much easier to intercept and prevent or reorder the cleanup process.
If you intercept all file IO, you can recover the same data. The only difference is whether or not you know that data upfront.
> Functions, especially ones that are Turing-complete, are notoriously opaque.
This is true! However, there are a great many build processes that do not know what they depend on or what they will produce until they do some Turing-equivalent work. For example, scanning a C header to find #include statements.
Rather than trying to shoehorn all data into a declarative model, we need both 1) a fully declarative model and 2) the ability to recover a declaration from the trace of an imperative one.
An example of this trick, employed manually, is the notorious .d Makefiles. The C compiler finds all the dependencies and produces a sub-makefile with the .d extension, then make restarts recursively, using the new .d file as part of the dependency graph. However, it's a very unnatural way to think about the problem, and it leads to complex multi-pass build processes that are necessarily slower. Instead, the dependency graph could be produced as a side-effect of simply doing the compilation, and that graph could be used as part of a higher-level declarative framework.
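A rough sketch of what I mean by recovering a declaration from a trace, with invented names (`compile-c-file` is just a stand-in for whatever does the Turing-equivalent work):

```clojure
;; Wrap the file reads a compile step performs, record what it touched,
;; and hand that set back to a declarative layer as the dependency graph.
(defn tracing-slurp [trace path]
  (swap! trace conj path)
  (slurp path))

(defn compile-with-trace
  "Run `compile-c-file`, a function of (reader-fn, main-file), and return
   both its output and the set of files it actually read."
  [compile-c-file main-file]
  (let [trace (atom #{})]
    {:output       (compile-c-file (partial tracing-slurp trace) main-file)
     :dependencies @trace}))
```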
> Effects are a very natural mental model for a great many problems, and constraining yourself to purity is both impractical and quick to hit diminishing returns.
I can't personally recall a problem where purity was feasible but impractical, though I can think of a few examples of the opposite.
> If you intercept all file IO, you can recover the same data.
Yes, if you record all I/O, you could restore files that have been deleted by a previous task.
However, it seems rather more elegant, and more efficient, to prevent the files from being deleted in the first place.
> Rather than trying to shoehorn all data into a declarative model
That isn't what I'm trying to say. There will inevitably be some cases where you need side-effectful I/O.
It's more that I'd rather see a solution start simple and pull in complexity as necessary, rather than start complex and attempt to work back to simplicity with constraints.
The problem with the read-everything-into-memory approach is that this is not how the JVM ecosystem works. Things written for the JVM use the classpath primarily, and things in memory secondarily if at all.
We can't control the fact that the CLJS compiler, for example, looks for source files on the classpath instead of in some FileSet object proxy. If we admit the use of tools written by the Java community at large, we suffer by adding another leaky half-abstraction to the mix.
We actually did some experiments with FUSE filesystems, but the performance is just not there yet. When FUSE performance becomes comparable to Java NIO it may become a viable option, and would solve all of these problems. You could then have a "membrane" approach, where the JVM is only manipulating a filesystem proxy, and you have complete control over when and how to reify that and write to the filesystem.
Reading everything into memory wasn't supposed to be a complete solution. Not everything can be that simple. However, it seems to me better to start from a simple base and add complexity as necessary, than to start from a complex base and try to achieve simplicity.
But let's run with the idea of loading everything into some immutable in-memory data structure, just to see where it goes. So long as we write everything in Clojure we're fine, but the moment we start hitting things adapted for the JVM, such as the CLJS compiler, we run into problems as you point out.
However, it's not too hard to conceive of possible solutions. Let's start with a simple, but naive way around it. We'll take the files in memory, write them to a temporary directory, and then run the CLJS compiler with a classpath pointing to that directory. When the compiler is done, we take the result and load it back into memory.
Again, this is a solution that aims for simplicity rather than performance, but optimisations immediately suggest themselves. If the files already exist on disk, we symlink them or point the classpath directly at them. If we don't need the CLJS output file's content, we can defer loading it into memory.
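Sketched out, the naive version might look something like this; `temp-dir` and `spill!` are helpers I'm inventing here, the in-memory state is assumed to be a map of relative path to contents, and I'm glossing over the classpath setup by handing cljs.build.api/build the temp directory directly:

```clojure
(require '[clojure.java.io :as io]
         '[cljs.build.api :as cljs])

(defn temp-dir []
  ;; A fresh temporary directory for each roundtrip.
  (.toFile (java.nio.file.Files/createTempDirectory
            "build" (make-array java.nio.file.attribute.FileAttribute 0))))

(defn spill! [dir files]
  ;; Write the in-memory {path contents} map out to disk.
  (doseq [[path content] files]
    (let [f (io/file dir path)]
      (io/make-parents f)
      (spit f content)))
  dir)

(defn compile-cljs [files]
  (let [src (spill! (temp-dir) files)        ; materialise the sources
        out (io/file (temp-dir) "main.js")]  ; somewhere for the compiler output
    (cljs/build (.getPath src) {:output-to     (.getPath out)
                                :optimizations :whitespace})
    ;; pull the result back into the immutable in-memory structure
    (assoc files "main.js" (slurp out))))
```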
Haha, yes! Now we're cookin'! The "simple, but naive way" you describe above is pretty much the way boot does things. I'd say you could look at the boot cljs task to see this, but setting up the environment for the CLJS compiler is pretty tricky, so the code there isn't as clear and elegant as I'd like.
In boot, tasks don't fish around in the filesystem to find things, neither for input nor for output. Tasks obtain the list of files they can access via functions: boot.core/src-files, boot.core/tgt-files, et al. These functions return immutable sets of java.io.File objects. However, these Files are usually temp files managed by boot.
Boot does things like symlinking (actually we use hard links to get structural sharing, and the files are owned by boot so we don't have cross-filesystem issues to worry about), and we shuffle around with classpath directories and pods.
So stay tuned for the write-up of the filesystem stuff, I think it might be right up your alley!
It sounds like there's a lot in Boot I'd like, particularly in how it deals with the filesystem. I'm still not convinced about the design, but it's clear I don't know enough about it to make a decision on it.
If nothing else, I'm sure there will be parts in it I'll want to steal ;)
I realised I might not have been very clear in my previous comment. Let me see if I can improve it with an example.
Let's forget about all other considerations and instead consider the simplest possible build system we can conceive. This build system should take a directory structure of source files, and produce a directory structure of output files.
If our sole consideration is simplicity, we might construct a build system like:
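Something along these lines, perhaps; this is only a sketch, and `read-files`, `write-files!` and `transform` are names I'm making up for illustration:

```clojure
(require '[clojure.java.io :as io])

(defn read-files [dir]
  ;; Slurp every (text) file under dir into a map of path -> contents.
  (->> (file-seq (io/file dir))
       (filter #(.isFile %))
       (map (juxt #(.getPath %) slurp))
       (into {})))

(defn write-files! [dir files]
  ;; Write a map of path -> contents back out under dir.
  (doseq [[path content] files]
    (let [f (io/file dir path)]
      (io/make-parents f)
      (spit f content))))

(defn build! [transform]
  ;; Read, apply a pure transformation, write. All I/O lives here.
  (->> (read-files ".")
       transform
       (write-files! "target")))
```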
So we take every file in the current working directory, read everything into memory, perform some functional transformation that produces a data structure of output files, then write that to disk. This minimises I/O, and gives us a functional data structure to play around with.
It's a naive approach, and one made without regard for memory or efficiency, but given that the amount of memory on a modern machine is far larger than the source directory is likely to be, it actually seems feasible.
However, we can also consider optimisations that don't alter the behaviour. For instance, we could only read in files when their contents are accessed. In order to protect against changes, we could check the modification date, and abort if it changes. It's a compromise, but a small one.
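A deferred read along those lines might look something like this (again just a sketch with invented names):

```clojure
(require '[clojure.java.io :as io])

(defn lazy-file [path]
  ;; Record the modification time we saw, but don't read the contents yet.
  (let [f (io/file path)]
    {:path     path
     :mtime    (.lastModified f)
     :contents (delay (slurp f))}))

(defn contents [{:keys [path mtime contents]}]
  ;; Abort if the file changed after the data structure was built.
  (when (not= mtime (.lastModified (io/file path)))
    (throw (ex-info "File changed during build" {:path path})))
  @contents)
```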
We might also conceive of a system where the contents of the file are memory mapped, or held in some temporary file, or any number of clever ways to avoid keeping the file in memory while not breaking the integrity of the data structure.
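For what it's worth, the memory-mapped variant is only a few lines of NIO interop; another sketch:

```clojure
(import '(java.nio.channels FileChannel FileChannel$MapMode)
        '(java.nio.file Paths OpenOption StandardOpenOption))

(defn mmap-contents
  "Memory-map a file read-only; returns a MappedByteBuffer, so the bytes
   back the data structure without being copied onto the JVM heap."
  [path]
  (with-open [ch (FileChannel/open (Paths/get path (make-array String 0))
                                   (into-array OpenOption [StandardOpenOption/READ]))]
    (.map ch FileChannel$MapMode/READ_ONLY 0 (.size ch))))
```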
This is just a toy example, and lacking in many areas like network I/O, but it's easier to start simple and add complexity when necessary, than it is to start from an assumption of complexity and try to work backward to simplicity. This is why I think it's incorrect to start with side-effectful functions, because that means starting from complexity.