Ah, Boids! One of my favorite examples of systems that can be "embarassingly parallel". I managed to write (with very little optimisation - not even using quadtrees to optimise neighbour lookup) a GPU accelerated version about a year ago that managed to stay above 30fps even up to about 5000 individual agents.
It also hammered home the problem of device/host memory reigions which you've got to deal with when doing GPU programming. I found that other, simpler, systems didn't manage to get as much speedup as Boids did, as transferring data, not computation, was where they bottlenecked. With Boids, by contrast, I could keep everything (computation, and rendering) on the GPU, and never even have to touch the CPU, aside from calling various CUDA and OpenCL functions.
Can you really call them agents if they are calculated on GPU? I always assumed that using GPU would reduce them to particles (colour + position). They were probably stored in a huge array as values rather than a linked list of objects?
They were stored as values in a huge array. I fail, however, to see how that causes them to be distinguished from "agents"? Surely they still fulfil the properties required to be "agents" in an abstract sense? They gather information about their environment (the positions of other agents), deliberate (work out where to move to next), and then act (change their position). That process is entirely abstracted from the storage method of the agent data itself!
I was inspired by this to create my own a few months ago. It uses a quadtree to optimize nearest neighbor lookup to ideally reduce complexity from n^2 to nlogn on each tick.
http://jrhdoty.github.io/SwarmJS/
The best data structure to use depends a lot on what you're actually doing with your neighbours.
If what you actually want to do is to have each boid be affected by each boid in a fixed-size neighbourhood, the best acceleration structure is actually just fixed-size bins.
I mean if your world is 1024 by 768, and a boid is affected by all other boids within 64 pixels of it, you should just break the world up into 16 by 12 bins, each representing a 64 by 64 pixel block. Each of these blocks is just a growable array.
All of a boid's "close enough" neighbours will be found in the 8-neighbourhood of its block.
One important thing to note while implementing it is that you have to be a little smart about how you move boids from one bin to another. You don't want to get quadratic behaviour because you're removing individual boids from the middle of lists again and again, you don't want to accidentally move a boid twice in an iteration and so on. The best way to do this is to make a list of "boids to add" and "boids to delete" for each bin, which you build as you call "move", and which you apply after you've done your movement calculations on all of your boids.
There are a few ways to be clever on going about it. One as you mentioned is to do a two phase scan one to move and one to correct the bins.
Also instead of a square shaped range you can get a hexagonal range by dividing corner boxes diagonally to two and then checking which half of square the boid is. :)
I think almost every AI-guy have tried making a boids simulation. They are such a good way to show emergent behavior from very simple rules.
I'm a TA for an AI-course at my uni, and this semester my students had to program a boids simulation for the first project. In addition to what's shown here, the boids also had to avoid obstacles and flee from predators. I think the students liked it, it's very rewarding to code and actually see the result.
The students also had to implement sliders, for how much cohesion, separation and alignment is weighted. It's pretty fun to change them and watch the behavior of the boids change.
It uses processing.js when I worked through it I just ended up implementing everything with canvas and making my own versions of the processing specific classes and functions needed.
It also hammered home the problem of device/host memory reigions which you've got to deal with when doing GPU programming. I found that other, simpler, systems didn't manage to get as much speedup as Boids did, as transferring data, not computation, was where they bottlenecked. With Boids, by contrast, I could keep everything (computation, and rendering) on the GPU, and never even have to touch the CPU, aside from calling various CUDA and OpenCL functions.