
Possibly! We didn't consider databases because this logic is implemented in agents injected into our users' applications. On top of limiting what we had access to, it also put a lot of pressure on us to minimise our CPU & memory overhead.

Because of that, we believed using a database (even in-memory) would have resulted in an unacceptable overhead.


Note that "database" does not necessarily mean "networked database server" to many people.

For example, the SQLite library counts as a database for most people, and it runs in your process, in your memory. It is essentially a collection of functions for fast data structure manipulation.
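
For the skeptical, here is a minimal sketch of what "in-process database" looks like in practice, using Python's built-in sqlite3 module (the table and addresses are made up for illustration):

    import sqlite3

    # In-memory SQLite: no server, no network; just library calls in your process.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE blocked (ip TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO blocked VALUES (?)",
                     [("203.0.113.7",), ("198.51.100.9",)])

    def is_blocked(ip):
        row = conn.execute("SELECT 1 FROM blocked WHERE ip = ?", (ip,)).fetchone()
        return row is not None

    print(is_blocked("203.0.113.7"))  # True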


Absolutely! This would have been a problem when an IP range was blocked, however: you would have to test all 32 possible prefix lengths for any given IPv4 address (or 128 for IPv6!). We knew it was not ideal, but it had the least downsides at the time.
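
To make the cost concrete, here is a minimal sketch of that prefix scan over a hash set (Python; the blocked networks are made up, and a real agent would likely store packed integers rather than build them per call):

    import ipaddress

    # Hypothetical hash set of blocked networks, keyed by (network_int, prefix_len).
    blocked = {
        (int(ipaddress.ip_network("203.0.113.0/24").network_address), 24),
        (int(ipaddress.ip_address("198.51.100.7")), 32),
    }

    def is_blocked(ip_str):
        ip = int(ipaddress.ip_address(ip_str))
        # A hash set can't answer "is any prefix of this IP blocked?" directly,
        # so we mask the address at each of the 32 possible prefix lengths.
        for prefix_len in range(32, 0, -1):
            mask = ((1 << prefix_len) - 1) << (32 - prefix_len)
            if (ip & mask, prefix_len) in blocked:
                return True
        return False

    print(is_blocked("203.0.113.42"))  # True: caught by the /24
    print(is_blocked("192.0.2.1"))     # False: 32 misses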


This is the question I had the entire time I was reading your post too. Doing set membership testing in constant time is already a solved problem, so why go through the trouble of radix trees when you could use any number of pre-built solutions?

I think it would have been a much more interesting and useful read if you had included the reasons that a radix tree is a better solution for your use case than hashing and how you implement range blocking.

I also wasn't able to understand the advantage of path compression or why being fixed-length is important.


We didn't spend too much time on the specifics of our choice, because we chose to focus on something more widely useful (how radix trees work) rather than on the particulars of our product's evolution. That being said, we were probably a bit light on details.

Going with a hash map was an option, but we feared a ballooning memory footprint when a large number of IPs is blocked, as well as performance outliers (if you're unlucky enough for a lot of blocked IPs/ranges to end up in the same bucket), and we wanted the solution to work just as well with IPv6 (where checking each bit takes a bit longer). As for existing optimal constant-time membership tests, we couldn't go with overly complex algorithms, because we would have to reimplement and maintain them in each of our agents' languages. Radix trees were attractive in this regard because they are a widely available data structure and, where they're not, fairly easy to implement from scratch.

As for range blocking, it's actually very simple: we block the network IP and tag the node as "range". This way, when we hit a range node while looking up an IP, we know we don't have to look any further. This has the downside of preventing us from treating specific parts of the range differently, but that's not a problem in our architecture.
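
A minimal sketch of that lookup, as an uncompressed binary trie for clarity (Python; the names are illustrative, and the real structure adds path compression on top):

    class Node:
        def __init__(self):
            self.children = [None, None]  # one child per bit
            self.is_range = False         # True: everything below this prefix is blocked

    def block(root, ip, prefix_len):
        # Walk the first prefix_len bits of the network IP, then tag the node.
        node = root
        for i in range(prefix_len):
            bit = (ip >> (31 - i)) & 1
            if node.children[bit] is None:
                node.children[bit] = Node()
            node = node.children[bit]
        node.is_range = True

    def is_blocked(root, ip):
        # Hitting a range node on the way down means blocked: stop immediately.
        node = root
        for i in range(32):
            if node.is_range:
                return True
            node = node.children[(ip >> (31 - i)) & 1]
            if node is None:
                return False
        return node.is_range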

Path compression lets you save on nodes, which is good for memory and makes looking up an IP faster (because you go through fewer nodes before hitting a leaf or a divergence). Fixed-length keys make it simpler to use bit masks when comparing a value to a node and take care of alignment issues, but we could do without them if we had to.
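
To illustrate the fixed-length point, a sketch of the mask-and-compare step on a compressed node (hypothetical Python; the field names are made up, and depth is the number of bits already consumed higher in the tree):

    class CompressedNode:
        def __init__(self, prefix, skip_len):
            self.prefix = prefix      # the run of bits this node absorbs, in key position
            self.skip_len = skip_len  # how many bits the node absorbs
            self.children = [None, None]
            self.is_range = False

    def segment_matches(node, ip, depth):
        # One mask-and-compare replaces skip_len single-bit hops. Because the
        # key is always exactly 32 bits, the shift arithmetic stays trivial.
        mask = ((1 << node.skip_len) - 1) << (32 - depth - node.skip_len)
        return (ip & mask) == (node.prefix & mask)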


Thanks for the explanations, that makes a lot of sense. Is the memory footprint a concern for individual customers, or is this more about reducing the overall footprint across your combined customers?


I'm not sure what you're referring to. The radix tree is stored within the agent context, which runs inside our users' applications. Because of this architecture, making the radix tree larger makes the memory footprint of our users' applications larger. Keeping it small was thus advantageous for all our customers, although only one ever had a problem because of it.


I didn't think memory would be a major concern for storing what sounds like less than 100 KB of data in a server-side application.


If I understand the post correctly, the advantage for e.g. inserting a subnet is that most of the overall IP address value is shared between each IP in the subnet. In a radix tree, this shared data is only represented once, since the entire path from root to leaf makes up the entry value; a /16, for instance, can be a single tagged node rather than 65,536 separate entries. With a hash-based solution you get constant-time insertion and constant-time access, but you need a separate entry containing the full IP for each address, which increases overall memory requirements.

If query time became a bottleneck, would you consider putting a Bloom filter in front of the radix tree? It would have low memory usage and a low false-positive rate, and any positive hit would then be confirmed with an exact check on the radix tree. So, for say 99% of non-blacklisted IPs, you would get constant-time blacklist checks.
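
A minimal sketch of that fast path (Python; the filter size and hash count are illustrative, not tuned, and this glosses over ranges, whose masked prefixes would need inserting too):

    import hashlib

    class BloomFilter:
        def __init__(self, m_bits=1 << 17, k=4):
            self.m, self.k = m_bits, k
            self.bits = bytearray(m_bits // 8)

        def _positions(self, item):
            # Carve k indexes out of one SHA-256 digest; fine for a sketch.
            digest = hashlib.sha256(item).digest()
            return [int.from_bytes(digest[4*i:4*i+4], "big") % self.m
                    for i in range(self.k)]

        def add(self, item):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, item):
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(item))

    def is_blacklisted(ip_bytes, bloom, radix_lookup):
        # A negative Bloom answer is definitive, so most clean IPs never touch
        # the radix tree; a positive answer may be false, so confirm it.
        if not bloom.might_contain(ip_bytes):
            return False
        return radix_lookup(ip_bytes)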


The problem they were solving had to do with blocking 10,000 IP addresses. That would require maybe 100 KB in a hash table, so memory requirements probably aren't the issue.
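
Back of the envelope: 10,000 IPv4 addresses at 4 bytes each is 40 KB of raw key data, and typical hash-table overhead of 2-3x per entry lands right around 100 KB.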


Wouldn't the benefit of path compression mean reducing the memory footprint of the data structure?

And regarding fixed-length: "This lets us optimize the lookup code." More detail here would be cool but it wasn't really the point of the article.


My guess is that the path compression speeds up lookups by increasing data locality and reducing the number of fetches needed. Memory footprint probably isn't a big concern for storing 10,000 entries.


Path compression just means fewer pointer dereferences, and therefore fewer cache misses and branch mispredictions.

