Commit fc654aaf authored by Vladimir Davydov's avatar Vladimir Davydov Committed by Konstantin Osipov

vinyl: introduce bloom filters for partial key lookups

Currently, we store and use a bloom filter only for full-key lookups.
However, there are use cases where we could also benefit from
maintaining bloom filters for partial keys - see #3177 for an example.
So this patch replaces the current full-key bloom filter with a
multipart one, which is basically a set of bloom filters, one per
partial key. Old bloom filters stored on disk are recovered as is, so
users will see the benefit of this patch only after a major compaction
takes place.

When a key or tuple is checked against a multipart bloom filter, we
check all its partial keys to reduce the false positive rate. There is
no size optimization as of now, though. E.g. even if the cardinality of
a partial key is the same as that of the full key, we still store two
full-sized bloom filters, although we could probably save some space in
this case: checking against the bloom filter corresponding to a partial
key already reduces the false positive rate of full-key lookups. This
is addressed later in the series.

Before this patch we used a bloom spectrum object to construct a bloom
filter. A bloom spectrum is basically a set of bloom filters ranging in
size. The point of using a spectrum is that we don't know what the run
size will be while we are writing it, so we create 10 bloom filters and
choose the best-fitting one after we are done. With the default bloom
fpr of 0.05 this is a 10-byte overhead per record, which seems to be
OK. However, if we try to optimize other parameters as well, e.g. the
number of hash functions, the cost of a spectrum becomes prohibitive.
The funny thing is that a tuple hash is only 4 bytes long, which means
that if we stored all hashes in an array and built a bloom filter after
we had written the run, we would reduce the memory footprint by more
than half! And that would only slightly increase the run write time, as
scanning a memory map of hashes and constructing a bloom filter is
cheap compared to merging runs. Putting it all together, this patch
stops using the bloom spectrum; instead we stash all hashes in a new
bloom builder object and use them to build a perfect bloom filter after
the run has been written and the cardinality of each partial key is
known.
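The builder approach can be sketched like this. Again an illustration
under stated assumptions, not vinyl's actual API: the names
(bloom_builder, bloom_probe, bloom_builder_make), the bits-per-element
parameter, and the fixed hash count are hypothetical. It shows the two
key ideas: stash one 4-byte hash per tuple while the run is written,
then size the filter exactly from the known count and replay the
stashed hashes (multiple probe positions are derived from the single
stored hash via double hashing).

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Stash of 4-byte tuple hashes, ~4 bytes per record vs ~10 for a
 * 10-filter spectrum at the default 0.05 fpr. */
struct bloom_builder {
	uint32_t *hashes;
	size_t count;
	size_t capacity;
};

static void
bloom_builder_add(struct bloom_builder *bb, uint32_t hash)
{
	if (bb->count == bb->capacity) {
		/* Error handling elided in this sketch. */
		bb->capacity = bb->capacity ? 2 * bb->capacity : 64;
		bb->hashes = realloc(bb->hashes,
				     bb->capacity * sizeof(*bb->hashes));
	}
	bb->hashes[bb->count++] = hash;
}

struct bloom {
	size_t nbits;
	uint8_t *bits;
};

#define BLOOM_HASH_COUNT 4

/* Derive the i-th probe position from a single 32-bit hash using
 * double hashing, so storing 4 bytes per tuple is enough to rebuild
 * the whole filter later. */
static size_t
bloom_probe(uint32_t hash, uint32_t i, size_t nbits)
{
	uint32_t h2 = hash * 2654435761u | 1;	/* odd second hash */
	return (hash + i * h2) % nbits;
}

/* Once the run is fully written and the exact key count is known,
 * size the filter and replay the stashed hashes into it. */
static struct bloom
bloom_builder_make(const struct bloom_builder *bb, size_t bits_per_elem)
{
	struct bloom b;
	b.nbits = bb->count * bits_per_elem + 1;
	b.bits = calloc((b.nbits + 7) / 8, 1);
	for (size_t n = 0; n < bb->count; n++) {
		for (uint32_t i = 0; i < BLOOM_HASH_COUNT; i++) {
			size_t bit = bloom_probe(bb->hashes[n], i, b.nbits);
			b.bits[bit / 8] |= 1u << (bit % 8);
		}
	}
	return b;
}

static bool
bloom_maybe_has(const struct bloom *b, uint32_t hash)
{
	for (uint32_t i = 0; i < BLOOM_HASH_COUNT; i++) {
		size_t bit = bloom_probe(hash, i, b->nbits);
		if (!(b->bits[bit / 8] & (1u << (bit % 8))))
			return false;
	}
	return true;
}
```

Scanning the hash array once at the end is a single cheap pass, which
is why deferring filter construction barely affects run write time.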

Closes #3177
parent f03fd4db