This paper is needlessly hard to read (4 pages in before there's even an explanation!) so first a refresher: A traditional bloom filter sets k bits to 1, using k hash functions to determine which bits. So for input 'i', we might get something like:
h1(i) = 0
h2(i) = 5
h3(i) = 7
and upon insertion, the bloom filter would resemble:
1 | 0 | 0 | 0 | 0 | 1 | 0 | 1
To check if an item is in the bloom filter, you check the corresponding bits. After you add items to a bloom filter, there's some probability that the corresponding bits for items that haven't been added are also set. So it can sometimes lie to you and say something is included, even when it's not.
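Something like this minimal Python sketch (the sizes and the sha256-derived hashing here are just for illustration):

    import hashlib

    class BloomFilter:
        def __init__(self, m=8, k=3):
            self.m = m                # number of bits
            self.k = k                # number of hash functions
            self.bits = [0] * m

        def _positions(self, item):
            # derive k positions from k seeded hashes
            for seed in range(self.k):
                h = hashlib.sha256(f"{seed}:{item}".encode()).digest()
                yield int.from_bytes(h, "big") % self.m

        def add(self, item):
            for p in self._positions(item):
                self.bits[p] = 1

        def might_contain(self, item):
            # a True answer may be a false positive; a False answer is always correct
            return all(self.bits[p] for p in self._positions(item))

    bf = BloomFilter()
    bf.add("i")
    print(bf.might_contain("i"))      # always True once added
    print(bf.might_contain("j"))      # usually False, but can be a false positive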
IBLTs replace the bits with counts of items added, and a sum of the key and value. So where Cs is the count sum, Ks is the key sum, and Vs is the value sum, the filter might look like:
Cs=1 Ks=10 Vs=20 | Cs=0 Ks=0 Vs=0 | ...
The positions in the table are computed just as before; the count in each cell increments for every item added, and the key and value "sums" are a union of the keys and values that have been added (this might be a simple rolling sum of a small checksum, or an xor).
Items can be removed from the table individually, by subtracting them from the relevant counts (just like a counting bloom filter) and rolling sums. A "dump" of the table can be derived by iterating through all possible values (you need to know what they might be) and deleting them one by one, and it takes trickery to support duplicate entries.
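Roughly, in Python, with xor as the rolling sum and assuming integer keys and values (the field names are mine, not the paper's):

    import hashlib

    class Cell:
        def __init__(self):
            self.count = 0        # Cs
            self.key_sum = 0      # Ks
            self.val_sum = 0      # Vs

    class IBLT:
        def __init__(self, m=16, k=3):
            self.k = k
            self.cells = [Cell() for _ in range(m)]

        def _positions(self, key):
            for seed in range(self.k):
                h = hashlib.sha256(f"{seed}:{key}".encode()).digest()
                yield int.from_bytes(h, "big") % len(self.cells)

        def insert(self, key, value):
            for p in self._positions(key):
                c = self.cells[p]
                c.count += 1
                c.key_sum ^= key      # xor plays the role of the rolling sum
                c.val_sum ^= value

        def delete(self, key, value):
            # same as insert except the count goes down; xor is its own
            # inverse, so the key/value sums "subtract" as well
            for p in self._positions(key):
                c = self.cells[p]
                c.count -= 1
                c.key_sum ^= key
                c.val_sum ^= value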
That dump procedure seems like a big downside; if the input set is constrained to a finite set small enough that you can just iterate over it, then I think it is easier to use composite numbers as your invertible lookup table.
Here's how: for each potential input element, assign it a unique prime number and keep this assignment in a static lookup table. Start the table with the value "1". To store an element, multiply the current table value by the element's prime. The result is a composite number that is the product of the primes of all elements inserted into the table.
To check for element presence, just use mod: if the table value is divisible by the element's prime, it's present. Duplicates work too; as long as the value still divides evenly, the element is still there. To remove an element, just divide by its prime. A surprisingly large number of elements can be represented this way in a 64k-sized composite number.
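A sketch of what I mean, using sympy's prime() only as a convenient way to build the static key-to-prime table:

    from sympy import prime

    # static lookup table: each possible input element gets its own prime
    universe = ["a", "b", "c", "d"]
    prime_of = {e: prime(i + 1) for i, e in enumerate(universe)}   # 2, 3, 5, 7

    def insert(table, element):
        return table * prime_of[element]

    def contains(table, element):
        return table % prime_of[element] == 0

    def remove(table, element):
        p = prime_of[element]
        return table // p if table % p == 0 else table

    table = 1                          # the empty table
    table = insert(table, "a")
    table = insert(table, "c")
    table = insert(table, "c")         # duplicates just multiply in again
    print(contains(table, "c"))        # True
    table = remove(table, "c")
    print(contains(table, "c"))        # True: one copy of "c" is still there
    print(contains(table, "b"))        # False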
A good example of this technique is modeling Internet AS paths: give every AS number a unique prime (there are only about 100k of them) and model an AS path as the composite of its component ASes; using vectorized DIV/MOD instructions it is incredibly fast to perform large graph operations.
Minor correction - the method the paper describes to 'dump' the table is a bit different.
It sounds like you're proposing that they iterate through all possible keys, check if the key is in the table, and if it is, dump that key and delete it. This would take O(2^b) time, where b is the number of bits in the key. Obviously, for reasonably-sized b, this can get a bit out of hand.
The paper takes a different approach. It looks for a cell with Cs = 1. Cs = 1 means Ks and Vs represent exactly one key-value pair. Dump that (Ks, Vs) pair, then delete (Ks, Vs) from the table. In the process of deleting, there will probably be some other cell that goes from Cs = 2 to Cs = 1. Continue from there. Eventually, you should get all Cs = 0. If you ever get stuck, you're unable to dump the entire list.
That method takes O(n), where n is the number of elements in the table, which is the fastest you could expect to dump any kind of list.
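A rough Python sketch of that peel, with xor sums and the k positions spread over disjoint subtables (the sizes here are arbitrary; a real implementation would size the table per the paper's analysis):

    import hashlib

    K, M = 3, 15                      # hash functions and total cells (M divisible by K)

    def positions(key):
        # one position in each of K disjoint subtables
        sub = M // K
        for seed in range(K):
            h = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield seed * sub + int.from_bytes(h, "big") % sub

    cells = [[0, 0, 0] for _ in range(M)]     # each cell is [Cs, Ks, Vs]

    def insert(key, value):
        for p in positions(key):
            cells[p][0] += 1
            cells[p][1] ^= key
            cells[p][2] ^= value

    def delete(key, value):
        for p in positions(key):
            cells[p][0] -= 1
            cells[p][1] ^= key
            cells[p][2] ^= value

    def dump():
        # peel: repeatedly find a cell with Cs == 1, emit its (Ks, Vs) pair,
        # delete that pair, and keep going until no cell has Cs == 1
        out = []
        progress = True
        while progress:
            progress = False
            for c in cells:
                if c[0] == 1:
                    key, value = c[1], c[2]
                    out.append((key, value))
                    delete(key, value)
                    progress = True
        if any(c[0] != 0 for c in cells):
            print("got stuck: couldn't dump everything")
        return out

    for kv in [(1, 10), (2, 20), (3, 30)]:
        insert(*kv)
    print(sorted(dump()))   # should list all three pairs (peeling can fail if the table is too loaded)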
What you describe with composite numbers would not work: data structures like IBLTs make sense when you want to store a small set S from a large universe U, and you want to use a number of bits that is linear in |S| and sublinear in |U|.
In your example, just storing the static lookup table with one prime number for each key in the universe costs about |U| log |U| bits. Furthermore, each cell needs to store a product of these numbers, and even if you can bound with k the number of keys colliding in the same cell, you need k log |U| bits for each cell.
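To put rough numbers on that, here's a back-of-the-envelope calculation assuming 32-bit keys, i.e. |U| = 2^32 (my own illustration, not from the thread above):

    import math

    # the n-th prime is roughly n * ln(n), so the largest prime handed out
    # is about 2^32 * 22, a ~37-bit number; the static key->prime table
    # alone then needs on the order of 2^32 * 37 bits
    U = 2 ** 32
    largest_prime = U * math.log(U)                 # ~ 9.5e10, a ~37-bit value
    table_bits = U * math.log2(largest_prime)
    print(table_bits / 8 / 2 ** 30, "GiB")          # ~ 18 GiB, before storing any products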
This is very inefficient, and if you think about it you can achieve the same functionality with a standard hash table, which takes less space. In general, every time you think "I could use prime factorization to store sets", there is usually a more efficient solution, probably involving bitvectors.
The usage scenario of IBLTs is quite different from the usual scenario of storing key-value mappings: it is useful when you have an algorithm that adds and deletes keys, and at any time during the execution of the algorithm the number of distinct keys can be arbitrarily large, but you know that at the end of the algorithm the number of distinct keys is "small". With a standard hash table, you would need space proportional to the maximum number of distinct elements reached during the execution of the algorithm; with IBLTs you only need space proportional to the final set size.
A toy example is the following: someone tells you a set of n numbers, and then tells you again all the numbers in the set except for k of them; you want to return the k missing numbers. With a hash table you would need O(n) space; with an IBLT you just need O(k).
To achieve this, they use a standard trick in Bloom-like data structures (and perfect hashes, etc.): the property that there is always a cell with exactly one element. This property is preserved when you remove that one element, so there will be another cell with exactly one element, and so on, until you can retrieve all of them. This property is guaranteed by random hash functions with the right table size, and can be proved by showing that the hypergraph induced by the hash mappings is "peelable", which is a generalization of a standard graph with no cycles, i.e. a tree: in a tree there is always a node with degree 1, and when you remove it you leave at least one other node with degree 1, and so on.
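Here's a toy Python sketch of that missing-numbers example, with xor-sum cells and the peeling described above (table size and hashing are arbitrary choices for the illustration):

    import hashlib

    K, M = 3, 30
    cells = [[0, 0] for _ in range(M)]        # each cell is [count, key_xor]

    def positions(key):
        sub = M // K
        for seed in range(K):
            h = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield seed * sub + int.from_bytes(h, "big") % sub

    def toggle(key, delta):
        for p in positions(key):
            cells[p][0] += delta
            cells[p][1] ^= key

    numbers = [3, 7, 11, 19, 23, 31]
    for n in numbers:
        toggle(n, +1)                         # the full set is announced
    for n in [3, 11, 19, 31]:
        toggle(n, -1)                         # everything except 7 and 23 is announced again

    # peel out whatever is left; note the table only had to be sized for
    # the k missing numbers, not for all n of them
    missing = set()
    progress = True
    while progress:
        progress = False
        for c in cells:
            if c[0] == 1:
                missing.add(c[1])
                toggle(c[1], -1)
                progress = True
    print(sorted(missing))                    # should print [7, 23] (peeling can fail if unlucky)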
> A good example of this technique is modeling Internet AS paths: give every AS number a unique prime (there are only about 100k of them) and model an AS path as the composite of its component ASes; using vectorized DIV/MOD instructions it is incredibly fast to perform large graph operations.
AS paths are ordered; what interesting graph algorithms could be performed on paths in which order has been lost?
Unfortunately, the business of academics is writing papers, not writing code or blog posts. Usually they also have to write project proposals, teach, go to conferences, etc. If you write a paper with simple words, it will probably be rejected since "it's trivial" or "it's not well formalized" ;)
If you just want insertion, deletion, resizing, and merging with better data locality and roughly the same performance as a Bloom filter, also consider the Quotient Filter: http://arxiv.org/pdf/1208.0290.pdf
Why use this instead of putting a bloom filter in front of a fixed hash table? In the unlucky case that two pairs have a total hash collision, with the IBLT you can't recover anything. Iteration cost is the same. If you need delete, just use a count in the filter instead of a hash. If the table fills up, drop entries. I guess the IBLT would be able to recover info if the number of insertions crossed the threshold and deletions brought the count back down.
I was recently looking at bloom filters as a possible piece of a permissions system in a typical database object model. Imagine:
- users have various permissions
- other objects in the system require one or more permissions to be viewable by those users
You can accomplish this easily with joins alone, but it becomes a performance issue rather quickly. I have a large-scale system and my intuition was that supplemental indexes based on bloom filters could help me solve this problem. I haven't figured out an exact way to apply them, however.
No, not even if you were willing to put up with arbitrary loss of some entries. For example, one of the required behaviors of a Python dictionary is that one can iterate through it, but this data structure can only be iterated if the size is small enough.
When would there be a good reason to use such invertible bloom lookup tables? To my knowledge, isn't a bloom filter's purpose to store a large amount of data and give very quick replies as to whether a piece of data is in the basket, but not with 100% accuracy?
I'm sure jbyers didn't mean that it shouldn't have been posted. Older posts are more than welcome here. It's just the convention to add the year in parens (and for commenters to mention when we missed one).
Hacker News is "news" in the sense that a used clothing store near my house is called "New To You" :)