Design planner statistics implementation
As the final outcome of the research on cost-based optimization (#259 (closed), #273 (closed)), we want a document that describes the full design of the planner statistics subsystem. It should answer the following questions:
- What approach do we use to collect statistics?
  - Update probabilistic structures (HLL, CMS, etc.) on each row modification. Can we still estimate the error if we update them only once per N rows (#275 (closed))?
  - Sample from the table or index (by manual command or in a fiber) and build histograms.
  - Something else?
- What information do we collect? For example: total row count, NULL count, most common values (MCV), number of distinct values (NDV), histogram boundaries.
- What type of histograms do we use (if any)?
- How do we transfer statistics from the storages to the router?
  - Fetch only the required part of the data during query optimization (first calculate the buckets to find the Tarantool instances we really need)?
  - Fetch and cache all statistics from all Tarantool instances (manually or in a fiber)?
- Describe the row selectivity estimation logic for all supported plan nodes in sbroad (with examples).
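
To make the last item concrete, here is a minimal sketch of the kind of estimation logic the document could describe for a filter node. It is not sbroad code. The `ColumnStats` layout, the function names, the equi-height histogram, and the uniformity assumption for non-MCV values are all assumptions made for illustration; the real design may choose differently.

```python
from bisect import bisect_right
from collections import Counter
from dataclasses import dataclass


@dataclass
class ColumnStats:
    """Hypothetical per-column statistics; all field names are illustrative."""
    total: int    # rows in the sample
    nulls: int    # NULL values among them
    ndv: int      # distinct non-NULL values
    mcv: dict     # most common value -> fraction of all rows
    bounds: list  # equi-height histogram boundaries over non-MCV values


def collect_stats(sample, mcv_size=1, buckets=4):
    """Build statistics from a list of sampled column values (None is NULL)."""
    non_null = [v for v in sample if v is not None]
    counts = Counter(non_null)
    total = len(sample)
    mcv = {v: c / total for v, c in counts.most_common(mcv_size)}
    rest = sorted(v for v in non_null if v not in mcv)
    bounds = []
    if rest:
        # Each bucket covers roughly the same number of sampled rows.
        bounds = [rest[i * len(rest) // buckets] for i in range(buckets)]
        bounds.append(rest[-1])
    return ColumnStats(total, total - len(non_null), len(counts), mcv, bounds)


def hist_fraction(stats):
    """Fraction of rows that are neither NULL nor covered by the MCV list."""
    return 1.0 - stats.nulls / stats.total - sum(stats.mcv.values())


def eq_selectivity(stats, value):
    """Estimate the selectivity of `col = value`."""
    if value in stats.mcv:
        return stats.mcv[value]
    other_distinct = stats.ndv - len(stats.mcv)
    if other_distinct <= 0:
        return 0.0
    # Assumption: non-MCV values are uniformly distributed.
    return hist_fraction(stats) / other_distinct


def lt_selectivity(stats, value):
    """Estimate the selectivity of `col < value`."""
    sel = sum(f for v, f in stats.mcv.items() if v < value)
    b = stats.bounds
    if not b or value <= b[0]:
        return sel
    if value >= b[-1]:
        return sel + hist_fraction(stats)
    n = len(b) - 1
    i = bisect_right(b, value) - 1  # bucket with b[i] <= value < b[i + 1]
    # Linear interpolation inside the bucket that contains `value`.
    covered = (i + (value - b[i]) / (b[i + 1] - b[i])) / n
    return sel + covered * hist_fraction(stats)


# Sample: 50 rows with value 7, 10 NULLs, and 40 distinct values 100..139.
sample = [7] * 50 + [None] * 10 + list(range(100, 140))
stats = collect_stats(sample)
estimated_rows = lt_selectivity(stats, 115) * stats.total
```

The estimated output cardinality of a filter is the input row count multiplied by the selectivity. The document should state how the same idea extends to the other plan nodes (joins, grouping, and so on).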
Edited by Denis Smirnov