I feel that many introductions to scanning and parsing are based on old techniques from when memory was scarce. In those times, storing your files in memory was simply impossible. But we live in times where we have gigabytes of RAM, and there are not many projects that could not be read into memory. (Tell me about a project that has a gigabyte of source files, please.) So, why not assume your input is stored in a string?
A complete unoptimized debug build of LLVM/Clang requires about 50GB of disk space. And LLVM/Clang is on the smaller end of large projects; the Linux kernel, OpenJDK, and Mozilla are larger open source projects, and the private repositories of companies like Microsoft, Facebook, and Google are an order of magnitude, or two, above those.
Keeping the entire program string in memory doesn't really buy you anything. You still need your AST and other intermediate representations, and location information stored as a compressed 32-bit integer is going to be more memory-efficient than a 64-bit pointer anyway. Furthermore, scanning and parsing are inherently processes that proceed through the input linearly, so building them on an iterator that implements next() and peek() (or unget()) doesn't limit you.
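To make the iterator idea concrete, here is a minimal Python sketch of a scanner built on next() and peek() over a stream, so the whole input never has to be in memory. The names `CharStream` and `tokens` and the toy token set (integers and single-character operators) are assumptions for illustration only.

```python
import io


class CharStream:
    """Character source with one character of lookahead over a text stream."""

    def __init__(self, stream):
        self._stream = stream
        self._lookahead = None  # buffered character, or None if nothing is buffered

    def peek(self):
        """Return the next character without consuming it ('' at end of input)."""
        if self._lookahead is None:
            self._lookahead = self._stream.read(1)
        return self._lookahead

    def next(self):
        """Consume and return the next character ('' at end of input)."""
        ch = self.peek()
        self._lookahead = None
        return ch


def tokens(chars):
    """Yield (kind, value) pairs for a toy language of integers and operators."""
    while (ch := chars.peek()) != "":
        if ch.isspace():
            chars.next()  # skip whitespace
        elif ch.isdigit():
            digits = ""
            while chars.peek().isdigit():
                digits += chars.next()
            yield ("INT", int(digits))
        else:
            yield ("OP", chars.next())
```

Usage: `list(tokens(CharStream(io.StringIO("12 + 3*4"))))` walks the input strictly left to right; an open file object can be passed in place of the `StringIO`.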
Yes, I agree with you. For production compilers with hand-coded parsers, it is a waste of space to store all files in memory. Please note that I was talking about introductions to scanning and parsing. I just feel there is no need to make things more complicated than they need to be when you are introducing somebody to parsing and scanning. Most people who study computer science will likely, when they need to do some scanning and parsing, not have to deal with files, but have strings at hand. For an introduction to parsing and scanning, I would even suggest beginning with a back-tracking recursive descent parser, to talk about the essence of parsing. Please note that if you add a little caching, such a parser can have decent performance for many applications that an average software engineer will encounter. For all other applications, standard libraries and/or compilers exist. See https://github.com/FransFaase/IParse for an example of this approach.
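As a rough illustration of back-tracking plus caching (not the IParse implementation), here is a minimal Python sketch over an in-memory string. The toy grammar, the function names, and the use of `functools.lru_cache` as the cache are all assumptions; whitespace is not handled.

```python
from functools import lru_cache


def parse(text):
    """Evaluate `text` under a toy grammar, or return None if it does not match.

    expr := term '+' expr | term
    term := number '*' term | number
    Each rule maps a start position to (value, end_position) or None.
    """

    @lru_cache(maxsize=None)
    def number(pos):
        end = pos
        while end < len(text) and text[end].isdigit():
            end += 1
        return (int(text[pos:end]), end) if end > pos else None

    @lru_cache(maxsize=None)
    def term(pos):
        # Alternative 1: number '*' term
        left = number(pos)
        if left and text.startswith("*", left[1]):
            right = term(left[1] + 1)
            if right:
                return (left[0] * right[0], right[1])
        # Alternative 2: back-track to pos and try a plain number (a cache hit).
        return number(pos)

    @lru_cache(maxsize=None)
    def expr(pos):
        # Alternative 1: term '+' expr
        left = term(pos)
        if left and text.startswith("+", left[1]):
            right = expr(left[1] + 1)
            if right:
                return (left[0] + right[0], right[1])
        # Alternative 2: back-track to pos and try a plain term (a cache hit).
        return term(pos)

    result = expr(0)
    # Accept only if the whole input was consumed.
    return result[0] if result and result[1] == len(text) else None
```

Because every rule is keyed only on its start position, retrying an alternative from the same position costs a dictionary lookup instead of a re-parse, which is the point of the caching remark.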
Anyway, please do not compare the disk space needed for a build with source code size. (I understand that the debug version uses a lot of static linking with debug information.) I understand that the Linux kernel is 28 million lines of code. Even at 80 characters per line that is about 2.2 GB, and with an average of 40, which I think is far more realistic, it is about 1.1 GB. So, yes, you can store all source files in RAM on any decent computer. (I did not find any recent number of lines of code for LLVM/Clang, but extrapolating, I guess it is of the same order as the Linux kernel.)
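The back-of-the-envelope estimate can be checked with a few lines of Python. This is only a sketch: it assumes one byte per character and decimal gigabytes, and `source_size_gb` is a hypothetical helper name.

```python
def source_size_gb(lines, chars_per_line):
    """Estimated source size in decimal gigabytes, assuming one byte per character."""
    return lines * chars_per_line / 1e9


# 28 million lines, at the pessimistic and the more realistic line length.
for chars_per_line in (80, 40):
    size = source_size_gb(28_000_000, chars_per_line)
    print(f"{chars_per_line} chars/line: {size:.2f} GB")
```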
The parser stores the entire AST in memory; it is not acting as if memory were scarce.
As for the idea of not starting with a string but instead reading the file on the fly, I think it is actually simpler. The point of storing the entire file in memory is to enable going back and forth, but why would you want to do that? State-machine-based parsers are fast, robust, and based on solid theoretical grounds, at least for "simple" (context-free) languages.
I do not know if you took the time to look at the next function in https://github.com/DoctorWkt/acwj/tree/master/01_Scanner , but there is a Putback variable, which seems to imply that the scanner goes back (by at most one character). Also, having to pass around the result of next, instead of having a current pointer, or at least a current character, makes things more complicated (I feel). Being able to look ahead without having to consume a character is (I feel) easier, also for keeping track of the correct line and column number. Note that if Putback is equal to '\n', Line has already been incremented. It seems that this could lead to errors being reported on the wrong line for terminal symbols at the end of a line.
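Here is a minimal Python sketch of that concern, modelled on the description above rather than copied from the repository. `PutbackScanner`, `scan_int`, and `scan_int_at` are hypothetical names, and the first class only assumes the behaviour described: the line counter is bumped when a newline is read, even if that newline is then put back.

```python
class PutbackScanner:
    """Stream-style reader with a one-character putback slot."""

    def __init__(self, text):
        self._chars = iter(text)
        self.putback = ""
        self.line = 1

    def next(self):
        if self.putback:
            ch, self.putback = self.putback, ""
            return ch
        ch = next(self._chars, "")  # '' signals end of input
        if ch == "\n":
            self.line += 1  # counted on read, even if the character is put back
        return ch

    def scan_int(self):
        digits = ""
        ch = self.next()
        while ch.isdigit():
            digits += ch
            ch = self.next()
        self.putback = ch  # read one character too far: hand it back
        return int(digits), self.line  # line as seen after the lookahead


def scan_int_at(text, pos):
    """String-based variant: look ahead by indexing, so nothing is consumed early."""
    line = text.count("\n", 0, pos) + 1  # line of the token's first character
    end = pos
    while end < len(text) and text[end].isdigit():
        end += 1
    return int(text[pos:end]), line, end
```

With the input `"42\n+"`, `PutbackScanner("42\n+").scan_int()` reports the integer on line 2, while `scan_int_at("42\n+", 0)` reports it on line 1, because the string version derives the line from the token's own position instead of from how far the reader has advanced.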
At my first job they used a programming language that somehow worked on top of C, developed by their sister company (it used to be the same company, but they split), all that to make a shitty ERP, and, eh, it was many, many gigabytes, partially written by people who have since retired. I don't think the few images and other such assets were part of that count (I also encountered a few random COBOL files). It had everything from UI rendering and handling to a web server that I had to communicate with, but nobody knew how it worked because the guy who made it was enjoying his retirement.
To be totally fair, it was the biggest, most inefficient coding mess I've seen. I still regret not scanning and keeping the customer list, because if I had, I'd be deep in that business right now making bank.
This is actually not about a source repository being larger than a gigabyte, but about a problem that arises with the C preprocessor: for a certain use, a certain header file is included many, many times, and the position of every include is kept. Typically a case of severe C preprocessor abuse, I would say, which could probably be solved in a much better way. This shows how the C preprocessor is both a blessing and a curse: it fixes many problems, but often makes compilation slow, because it produces huge intermediate representations that the compiler then has to parse.