Shutting off redisplay was not an uncommon technique on the Alto (the first time I saw it I was somewhat disconcerted, and though the machine had crashed as that wasn't all that uncommon in those days).
The key, if you read Thacker's original design paper, is that screen refresh rate was 2/3 of the bus bandwidth. This was considered absurd: why so much resource devoted to I/O (many machines in those days still had channel controllers, a design which has implicitly returned).
These old tidbits are fun because they make you think about different design tradeoffs. I love the line in K&R (IIRC -- is it still in there?) where they explain that the standard library is required (and say, including the quotes "you mean I have to call a function to do I/O?"). And the implicit defense in the 801 paper (the original RISC paper) that yes, the compiler really could do a good job.
> One way to speed up the Alto is to shut off the display. I tried this and improved the time from 24 minutes to 9 minutes, a remarkable improvement [...]: to display pixels on the screen, the CPU must move the pixels for each scan line from RAM to the display board, 30 times a second.
Ah, same as the Sinclair ZX81 fast mode for the same reasons. The ZX80 was always running in fast mode and displayed the screen only if no program was running. Old times.
Yes, I had the TS1000, a slightly modified version for the US market (with NTSC video output). I remember its built-in BASIC actually had "FAST" and "SLOW" commands. I love that they felt "SLOW" was an appropriate name for a command.
I guess I had always assumed this was about memory bandwidth, but this Alto article raises the possibility that the CPU may have been doing double duty (probably for cost reasons and not flexibility).
The way this actually worked is a giant hack, explained here: http://zx81.us/zx81vid.txt But basically yes it really was about 4 times faster in FAST mode.
The C64 would also speed up for similar reasons (the VIC (graphics) chip could control the bus, so number of cycles "stolen" depended on your graphics - e.g. every scanline with sprites present it would steal more cycles).
C64, too. It was common to blank the screen during intensive processing--many games with audio samples would blank the screen while playing their tiny collection of dodgy voice lines, though a few didn't.
An even larger speedup could be realized by performing "solid guessing" like Fractint: Render at 1/2 resolution (that's 1/4 the number of pixels), then take a second pass only in areas where there's a color boundary. This cuts render times down to less than half.
When I was in grad school I (briefly) worked on a rendering engine based on this concept. It was trying to do ray tracing for an entire scene in real-time (original work would've been circa 2003, mine around 2006). With dynamic content.
The algorithm was, roughly, to render 16x16 pixels at random in the scene. If the variance in colors was low, you're done. If it's high, you cut it in half (kd-tree). Pick a region at random and a random 16x16 pixels and render them. If it's low variance you're good, otherwise split. Eventually things like a wall would have maybe 4 regions describing it, though taking up a large portion of the final image. You repeat this random selection and rendering continuously. There was also a way to merge regions should the scene change and they have a similar color though I've forgotten the details.
It was quite fast on commodity hardware from 2004/2005.
Coolness. My buddy Roy created AccuRender, a plugin for AutoCAD and others. He did something similiar, back in the day; I'd have to ask him about the details. Strategy went from a scanline to a progressive rendering. While slower overall, the quick results enabled quick sanity checks, a huge UI productivity boon. Now I have to ask him, relive the glory years. :)
IIRC, there's an even more powerful trick available: render the edges of a rectangle. If there are no color boundaries on the edges, you can usually assume it'll be solid-colored and skip rendering any of the contents. Some exceptions apply.
An enhancement to this method which I've seen is to check the edges of the rectangle for colour boundaries, and also check single random point inside. Presumably you can extend to n x n rectangles and k points inside as a function of n.
Could this be improved even further by starting with 1/512 resolution (or whatever divides well) and making more passes only where there's a boundary - basically as a quadtree?
Very nice. I think the whole art of dealing with limited resources are lost on new programmers who have gigabytes of rams and gigacycles of compute. I love the old 1K demos for that reason, it really puts the puzzle back into programming.
In my experience, if you are familiar with RTL level synthesis of hardware then microprogramming isn't a huge leap (to me they feel very similar).
You just need to solve more computationally intensive problems. There are plenty of simulations and mathematical graphics you can make which take hours of rendering time on a fast modern machine, where optimizing the inner loops can have a big impact.
Constraints make a lot of creative endeavors interesting. I have no idea why Lego modeling is so compelling when you can just print miniatures, but for me looking at people's work is like crack.
You're just solving the wrong problems then. Try rendering the whole Earth at sub-cm resolution or something. There are so many things we can't do right now because we're too constrained...
Always a controversial position :-) Back in the day hosting the full development environment would often level out the differences in performance, not as much today. With mobile development you don't have much of a choice, your stuff has to run on the phone.
It still holds though, when developers get to only test on Google Pixie and Samsung S8, while customers are on ZTE Blade, the stuff might even install, but run it won't.
I like to have a decent machine for development, especially when running Visual Studio (which I do not use often, but Visual Studio is not exactly lightweight).
But for testing I absolutely prefer to use a low-end machine. I used to have a SparcStation 20 at home (it died a few years ago, though), and I loved that machine for that precise reason. (Also: Big-endian vs. Little-endian issues) Damn, that beast was SLOW. Not just slow: It only had 64 MB of RAM, and I had removed the hard drive because it was way too loud, so the entire thing ran over the network. Swapping over a half-duplex 10MBit line makes you really think about memory usage. ;-)
I've made java services respond with 500us on average for in-memory queries. Yeah, average, including GC stops. This is where things grow really fun, because every lock, every array access, every method call becomes a problem.
It's an interesting world. Closed and not entirely relevant to most developers.
What are good, up to date resources for performance tuning to that degree in Java? I've looked into this a few times and keep hitting walls where the resources I have found say "let the JVM do it's thing".
Mechanical sympathy is a good source, probably the source for java specific topics. In general, you'll want to really, really learn the internals of the JIT so you can structure your application to cooperate with the JIT and the GC.
Beyond that, I've found the biggest throughput gains and/or latency decreases by implementing the right threading or lack of threading, locking or lock-free implementations, and then tuning, profiling and optimizing things just right for the target hardware.
That's just a little sentence, but it's an amazingly hard topic. There are not many resources about this out there beyond just looking at high performance systems, regardless of language. The innodb storage engine is a fun project with some interesting locking, and I like reading about the things the kernel does.
"However, Alto microcode is pretty crazy, so I'm not going to try a microcode Mandelbrot."
I don't understand that logic, coming from somebody who has a restored Xerox Alto and wrote a Mandelbrot generator in BCPL for it "to learn how to use the Alto's bitmapped display, not make the fastest Mandelbrot set".
I also expect this guy will get source code for the 'simple' things such as an OR instruction or a fixed-width multiplication instruction within a few weeks.
There apparently are only 16 basic micro-instructions (”The ALU function field controls the SN74181 ALU. This device can do a total of 48 arithmetic and logical operations, most of which are relatively useless. The 4-bit field is mapped by a PROM into the 16 most useful functions”), and OR is one of them:
ALUF FIELD FUNCTION
0 BUS (A)
1 T (B)
2 BUS OR T* (A+B)
3 BUS AND T (AB)
4 BUS XOR T (A XOR B)
5 BUS + I* (A PLUS I)
6 BUS - I* (A MINUS I)
7 BUS + T (A PLUS B)
1OB BUS - T (A MINUS B)
11B BUS - T - 1 (A MINUS B MINUS I)
12B BUS + T + 1* (A PLUS B PLUS I)
13B BUS + SKIP* (A PLUS I)
14B BUS.T* (AND) (AB)
15B BUS AND NOT T (A & NOT B)
16B-17B UNDEFINED
So, following the best source I have, (that PDF) which states:
”For the most part, since the Alto is such a simple machine, writing Alto microcode is a straightforward exercise in rule-following”
that shouldn’t be too hard, provided that there is a free spot for writing the microcode instruction, or that you can find an instruction to give up in exchange for a simple OR.
Edit 2: that PDF also describes the way you do an OR, so that apparently was the way to go:
”To "or" together the contents of ACO and ACI; results ACO:
The microcode isn't quite as simple as you make it sound. Trust me, I've needed to understand a lot of the Alto's microcode to debug disk and other issues, and it's brain-twisting.
There are the 16 basic ALU micro-instructions, but these are combined with 8 different bus sources and two sets of "special functions" to make fairly complex micro-instructions. The special functions are a bit crazy and also totally change meaning depending on which hardware task is running.
Control flow in microcode is pretty wacky. Each micro-instruction contains the address of the next micro-instruction, so it's like a goto in every instruction with instructions randomly scattered through memory. Additionally, branches are done by the hardware setting bits in the address based on conditions, so you end up with basically a computed goto. Finally, branches happen the instruction after the condition. So even just following the control flow of microcode is hard.
Another fun thing with microcode is that to read memory you write the address to the memory address register. Then you need to wait two microinstructions before you can read or write the memory value.
Yes, I saw a bit of that, and did realize what you meant by pretty crazy. Rereading my last edit, I also notice that the smiley I had there got lost in editing. Apologies for that.
If you do have the microcode for AND, and a RAM board to store custom microcode, creating an OR instruction shouldn't be insanely hard, though, even without an assembler (finding a spot to put it without losing another instruction may be impossible, though, and that could mean a fairly large rewrite of your OS and applications)
Well, the microcode for AND is simple; it's this line
G17: L_ ACDEST AND T, TASK, :SHIFT;
...the problem is that that's in the middle of the decode (there's a computed jump into this lot from the line labelled DIS1:). So you could change the AND insn to do OR instead, trivially (use "OR" here where the line says "AND"), but you probably wanted to keep AND. So you'd need to at least add enough of your own decode to be able to put the insn somewhere else. And as you say that looks like the real problem: http://users.rcn.com/crfriend/museum/doco/DG/Nova/base-instr... suggests that given 16 bit instructions and the format they've used there isn't much space left. The best you could do would be to steal some of the 'io instruction' space and have it operate on fixed accumulator registers the way MUL and DIV do.
(The decode is remarkably confusing even knowing what it's supposed to be doing; there seems to be a lot of implicitly jumping to entries in decode tables based on bits of the instruction...)
Looks like you've figured a lot of it out, but if you want more details... The way instruction decoding works is the hardware slams some of the instruction bits onto the microaddress bus, causing a 16-way dispatch based on the instruction, selecting which instruction to execute. In detail, looking at the microcode:
"IR" is the special instruction register. Loading it from memory (MD=Memory Data) causes a dispatch after the next microinstruction. ":GETAD" is a microcode branch, so the next microinstruction after DIS1 would normally be at label GETAD. However, IR causes bits 0,5,6,7 of the instruction to get OR'd into the microaddress, resulting in the dispatch to G1, G2 etc. But how does the microassembler know that G1, G2, etc need to be aligned properly for this to work? That's what the !17,20 directive does: 17 and 20 are the necessary alignment in octal, followed by the list of labels that need to be aligned.
Further instruction decoding is done with the IDISP and ACSOURCE special microcode functions, which do more complex decoding (based on multiple bit combinations) resulting in similar dispatches. So the branch to :DOINS or :SHIFT will do a second dispatch, based on the earlier ACSOURCE.(See page 30 of the hardware manual for details.)
Actually, the PARC guys stole all of the io instruction space for other things on the Alto. Only the math and load/store instructions are compatible with the Nova.
Even if I were crazy enough to write microcode, there's a problem that we don't have the microcode assembler MU.RUN. Maybe it will turn up on one of the PARC disks we're archiving...
Thanks - I searched that archive for MU.RUN but somehow missed it. We got a gateway from the Living Computer Museum that lets us transfer files to the real Alto. So I guess I need another excuse for why I'm not writing microcode :-)
Interesting, I hope that the archives end up available to the public. It'd be really neat to build some kind of functional emulator like the visual 6502 to play around in. Do it on a block level for the CPU so you can examine the state as things happen.
Funny you should mention that, because the Living Computer Museum wrote a high-quality emulator (ContrAlto) for the Alto. It's super-accurate, even emulating the microcode and hardware, and runs at the speed of the live machine. It includes a debugging mode that lets you examine machine state and step through machine instructions and microcode. Source: https://github.com/livingcomputermuseum/ContrAlto
Awesome I'll check that out. I think I remember looking back when this started and found SALTO but it appears to be incomplete and buggy. ContrAlto looks interesting.
Pardon my ignorance, but does the Alto multi-task between user programs? If the answer to that question is affirmative, how does the machine cope with simultaneously running two programs that load different custom microcode? Does it load and unload and reload at every context switch? Or does it simply not multi-task?
The Alto is a "personal computer" and only runs one program at a time. The operating system provides primitive task management (sort of like threads) so the operating system can for instance process network packets while a program is running, or a program can contain multiple tasks. Thus, a program can load the microcode it needs when it starts up and doesn't need to worry about conflicts.
This task management is entirely orthogonal to the 16 tasks that are running at the microcode level to refresh the display and handle I/O devices.
Fascinating article. One question, you write that: "Using what I learned with the Mandelbrot, I wrote a program to display images; an example is below." - I couldn't find source code for that on github - is it published anywhere?
It would be interesting to see some code that exercise more parts of the system, beyond the Mandelbrot generator.
About coding style - I guess this is tabs-vs-spaces territory - but is there a particular reason why you only indent blocks in the loops, not in the procedure/functions (like under main)?
I also wonder a bit about some of the magic constants - like 30705 - is there an overhead to use variables in BCPL (eg: indirect de-reference, no automatic in-lining by the compiler)?
Finally, how about procedure call overhead? Granted, the Mandelbrot generator is rather simple (shifts and other tricks notwithstanding) - so I can see why it makes sense to keep it all in a single procedure - but what does a call/ret look like on the Alto? (eg: in simplified assembler, in terms of stack-push, registers etc)?
Magic constants: I'm lazy so it's easier to leave them inline when I'm hacking on code. I don't know if there's runtime overhead.
BCPL procedure call overhead: kind of nasty. There's no stack support in the instruction set, just jump-and-link. So a called procedure first calls a subroutine "getframe" that sets up the stack frame (very similar to a C stack frame). Then a second subroutine "moveargs" moves the call arguments into the stack frame. At the end of the procedure call, a third subroutine call cleans up and does the return.
Thank you! It's interesting to see how similar this code is to assembly code calling OS procedures, after declaring them external - I've recently been playing (again) with assembly for the win64 arch, and one sample program (a mish-mash of example code available for nasm, fasm and "go" assembler - is rather similar IMNHO - note this just displays a window, no loading of (image) files):
; Assemble and link with:
; nasm -f win64 .\hello-nasm.asm
; golink .\hello-nasm.obj \
; Kernel32.dll User32.dll /entry:WinMain
global WinMain
extern ExitProcess
extern MessageBoxA
section '.text'
WinMain:
sub rsp,8*5 ; reserve stack for API use and make
; stack dqword aligned
xor rcx, rcx ; uType = MB_OK
lea rdx, [szCaption] ; LPCSTR lpCaption
lea r8, [szTitle] ; LPCSTR lpText
xor r9d, r9d ; hWnd = HWND_DESKTOP
call MessageBoxA
mov ecx,eax
call ExitProcess
section '.rdata'
szTitle: db 'Hello, Title!', 0
szCaption: db 'Hello, World!', 0
Just like FAST on a commodore 128, it would run at 2 MHz instead of 1, and the screen went black. To go back to 1MHz, you would use the SLOW command IIRC
That was a bit different in that the CPU clock frequency was actually changed. The screen blanking in that instance was because the VIC-II could only run at 1MHz, so the screen would get messed up in 40 column mode if the CPU (and by extension the bus speed) was switched to 2MHz.
You do get a speedup by just disabling the screen too on both the C64 and C128, though, as with the display on the VIC hogs a lot of memory bus cycles. That's a bit more similar to the situation on the Alto, except instead of different "hardware threads" they are separate chips and the VIC just has the ability to take control of the bus whenever it needs memory access.
My Computer Org&Architecture course in college had us programming on a PDP-8, which has an even more ridiculously restricted instruction set. I always intended to build a Mandelbrot renderer for it, but never got around to it.
The actual physical PDP-8 we had never had a printer hooked up, so actually drawing the thing would have been impossible, other than using the blinkenlights to render a line at a time.
It is time to coin a new phrase: the Hacker News inverse comment-interest ratio. The fewer comments on a story, the more likely it is to be extremely interesting and relevant to a hardcore geek.
Anytime I see a slightly-curious headline with only 0-3 comments, I know it's going to be good.
There's some kind of velocity factor in the position ranking, so a story that quickly gets a bunch of votes will have a high rank and few comments. But the comments are likely to come as the link sits there for a while.
I think this is fairly accurate. It's really easy to upvote something you find interesting, but takes much more time to come up with good comments. Usually these types of articles will generate quite a few comments even though they start out on the top with basically none.
The key, if you read Thacker's original design paper, is that screen refresh rate was 2/3 of the bus bandwidth. This was considered absurd: why so much resource devoted to I/O (many machines in those days still had channel controllers, a design which has implicitly returned).
These old tidbits are fun because they make you think about different design tradeoffs. I love the line in K&R (IIRC -- is it still in there?) where they explain that the standard library is required (and say, including the quotes "you mean I have to call a function to do I/O?"). And the implicit defense in the 801 paper (the original RISC paper) that yes, the compiler really could do a good job.