vinyl: introduce bloom filters for partial key lookups

Currently, we store and use bloom filters only for full-key lookups. However, some use cases also benefit from bloom filters for partial keys; see #3177 for an example. So this patch replaces the current full-key bloom filter with a multipart one, which is basically a set of bloom filters, one per partial key. Old bloom filters stored on disk are recovered as is, so users will see the benefit of this patch only after a major compaction takes place.

When a key or tuple is checked against a multipart bloom filter, we check all its partial keys to reduce the false positive rate. Nevertheless, there is no size optimization for now. E.g. even if the cardinality of a partial key is the same as that of the full key, we still store two full-sized bloom filters, although we could probably save some space in this case by taking into account that checking against the bloom filter of a partial key already reduces the false positive rate of full-key lookups. This is addressed later in the series.

Before this patch we used a bloom spectrum object to construct a bloom filter. A bloom spectrum is basically a set of bloom filters ranging in size. The point of using a spectrum is that we don't know what the run size will be while we are writing it, so we create 10 bloom filters and choose the best of them after we are done. With the default bloom fpr of 0.05 this costs 10 bytes of overhead per record, which seems to be OK. However, if we try to optimize other parameters as well, e.g. the number of hash functions, the cost of a spectrum becomes prohibitive.

Funny thing is, a tuple hash is only 4 bytes long, which means that if we stored all hashes in an array and built a bloom filter after we had written a run, we would reduce the memory footprint by more than half! And that would only slightly increase the run write time, as scanning an in-memory array of hashes and constructing a bloom filter is cheap in comparison to merging runs.

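
The multipart check described above can be sketched roughly as follows. This is an illustration only, not the code added in src/box/tuple_bloom.c: the names `multipart_bloom`, `multipart_bloom_add` and `multipart_bloom_maybe_has`, the fixed filter size, the 32-bit integer key parts and the FNV-1a hash are all assumptions made to keep the example self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum { MAX_PARTS = 4, BLOOM_BITS = 1024, HASH_COUNT = 3 };
#define HASH_SEED 2166136261u

/* One plain bloom filter: a fixed-size bit array (size is an assumption). */
struct bloom {
	uint8_t bits[BLOOM_BITS / 8];
};

/* Hypothetical multipart filter: parts[i] covers the first i + 1 key parts. */
struct multipart_bloom {
	int part_count;
	struct bloom parts[MAX_PARTS];
};

/* FNV-1a over one 32-bit key part; a stand-in for the real tuple hash. */
static uint32_t hash_step(uint32_t h, uint32_t part)
{
	for (int i = 0; i < 4; i++) {
		h ^= (part >> (8 * i)) & 0xff;
		h *= 16777619u;
	}
	return h;
}

/* Double hashing: derive the i-th bit position from a single 32-bit hash. */
static uint32_t bloom_pos(uint32_t h, int i)
{
	uint32_t h2 = ((h >> 16) | (h << 16)) | 1;
	return (h + (uint32_t)i * h2) % BLOOM_BITS;
}

void multipart_bloom_create(struct multipart_bloom *mb, int part_count)
{
	assert(part_count > 0 && part_count <= MAX_PARTS);
	memset(mb, 0, sizeof(*mb));
	mb->part_count = part_count;
}

/* Add every prefix of a full key to the filter of the matching length. */
void multipart_bloom_add(struct multipart_bloom *mb, const uint32_t *key)
{
	uint32_t h = HASH_SEED;
	for (int p = 0; p < mb->part_count; p++) {
		h = hash_step(h, key[p]);
		for (int i = 0; i < HASH_COUNT; i++) {
			uint32_t pos = bloom_pos(h, i);
			mb->parts[p].bits[pos / 8] |= (uint8_t)(1u << (pos % 8));
		}
	}
}

/*
 * Check a (possibly partial) key: every prefix must pass its own filter.
 * A miss in any filter proves the key is absent.
 */
bool multipart_bloom_maybe_has(const struct multipart_bloom *mb,
			       const uint32_t *key, int key_parts)
{
	assert(key_parts > 0 && key_parts <= mb->part_count);
	uint32_t h = HASH_SEED;
	for (int p = 0; p < key_parts; p++) {
		h = hash_step(h, key[p]);
		for (int i = 0; i < HASH_COUNT; i++) {
			uint32_t pos = bloom_pos(h, i);
			if (!(mb->parts[p].bits[pos / 8] & (1u << (pos % 8))))
				return false;
		}
	}
	return true;
}
```

In this sketch a lookup by the first key part alone consults only `parts[0]`, while a full-key lookup must pass every filter, which is why checking the partial keys can only lower the false positive rate.
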
Putting it all together, we stop using the bloom spectrum in this patch. Instead, we stash all hashes in a new bloom builder object and use them to build a perfectly sized bloom filter after the run has been written and we know the cardinality of each partial key.

Closes #3177
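
The builder idea can be illustrated with a minimal sketch. It is not the code from this patch: the names `bloom_builder`, `bloom_builder_add` and `bloom_build` are hypothetical, key parts are simplified to 32-bit integers, the hash is FNV-1a, and the filter size is passed as a caller-chosen `bits_per_key` instead of being derived from the fpr. The sketch also assumes that keys arrive in sorted order, as they do when a run is written, so a prefix seen before can only repeat the prefix of the previous key.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { MAX_PARTS = 4, HASH_COUNT = 3 };
#define HASH_SEED 2166136261u

/* Growable array of 4-byte hashes, one array per partial key. */
struct hash_array {
	uint32_t *values;
	size_t count, capacity;
};

/* Hypothetical builder: remembers the previous key to skip repeated prefixes. */
struct bloom_builder {
	int part_count;
	bool has_last;
	uint32_t last_key[MAX_PARTS];
	struct hash_array parts[MAX_PARTS];
};

struct bloom {
	uint8_t *bits;
	size_t bit_count;
};

/* FNV-1a over one 32-bit key part; a stand-in for the real tuple hash. */
static uint32_t hash_step(uint32_t h, uint32_t part)
{
	for (int i = 0; i < 4; i++) {
		h ^= (part >> (8 * i)) & 0xff;
		h *= 16777619u;
	}
	return h;
}

void bloom_builder_create(struct bloom_builder *b, int part_count)
{
	assert(part_count > 0 && part_count <= MAX_PARTS);
	memset(b, 0, sizeof(*b));
	b->part_count = part_count;
}

/* Stash the hash of each new prefix; keys must arrive in sorted order. */
int bloom_builder_add(struct bloom_builder *b, const uint32_t *key)
{
	int common = 0;
	while (b->has_last && common < b->part_count &&
	       key[common] == b->last_key[common])
		common++;
	uint32_t h = HASH_SEED;
	for (int i = 0; i < b->part_count; i++) {
		h = hash_step(h, key[i]);
		if (i < common)
			continue; /* same prefix as the previous key */
		struct hash_array *a = &b->parts[i];
		if (a->count == a->capacity) {
			size_t cap = a->capacity ? a->capacity * 2 : 16;
			uint32_t *v = realloc(a->values, cap * sizeof(*v));
			if (v == NULL)
				return -1;
			a->values = v;
			a->capacity = cap;
		}
		a->values[a->count++] = h;
	}
	memcpy(b->last_key, key, b->part_count * sizeof(*key));
	b->has_last = true;
	return 0;
}

/* Build a filter sized for the now-known number of distinct prefixes. */
int bloom_build(struct bloom *bloom, const struct hash_array *a,
		size_t bits_per_key)
{
	assert(bits_per_key > 0);
	bloom->bit_count = a->count > 0 ? a->count * bits_per_key : 8;
	bloom->bits = calloc((bloom->bit_count + 7) / 8, 1);
	if (bloom->bits == NULL)
		return -1;
	for (size_t n = 0; n < a->count; n++) {
		uint32_t h = a->values[n], h2 = ((h >> 16) | (h << 16)) | 1;
		for (int i = 0; i < HASH_COUNT; i++) {
			size_t pos = (size_t)(h + (uint32_t)i * h2) % bloom->bit_count;
			bloom->bits[pos / 8] |= (uint8_t)(1u << (pos % 8));
		}
	}
	return 0;
}
```

After the last key has been added, `parts[i].count` is the number of distinct prefixes of length i + 1, so each filter can be allocated once at its final size. The caller owns `bloom->bits` and each `parts[i].values` and releases them with `free()`.
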
Changed files:
- src/box/CMakeLists.txt: 1 addition, 0 deletions
- src/box/iproto_constants.c: 2 additions, 1 deletion
- src/box/iproto_constants.h: 3 additions, 1 deletion
- src/box/tuple_bloom.c: 330 additions, 0 deletions
- src/box/tuple_bloom.h: 210 additions, 0 deletions
- src/box/tuple_compare.cc: 24 additions, 0 deletions
- src/box/tuple_compare.h: 12 additions, 0 deletions
- src/box/tuple_hash.cc: 12 additions, 1 deletion
- src/box/tuple_hash.h: 30 additions, 0 deletions
- src/box/vy_run.c: 94 additions, 152 deletions
- src/box/vy_run.h: 10 additions, 13 deletions
- src/box/vy_scheduler.c: 2 additions, 16 deletions
- src/box/xlog.c: 0 additions, 27 deletions
- src/box/xlog.h: 0 additions, 9 deletions
- test/unit/vy_point_lookup.c: 1 addition, 1 deletion
- test/vinyl/bloom.result: 151 additions, 6 deletions
- test/vinyl/bloom.test.lua: 62 additions, 6 deletions
- test/vinyl/info.result: 8 additions, 8 deletions