Design planner statistics implementation

As the final result of the research on cost-based optimization (#259 (closed), #273 (closed)), we want a document describing the full design of the planner statistics subsystem. It should answer the following questions:

  1. What approach do we use to collect statistics?
    • Update probabilistic structures (HLL, CMS, etc.) on each row modification. Can we still estimate the error if we update them only once per N rows (#275 (closed))?
    • Sample from the table or index (by a manual command or in a background fiber) and build histograms.
    • Something else?
  2. What information do we collect? For example: total row count, NULL count, most common values (MCV), number of distinct values (NDV), histogram boundaries.
  3. What type of histograms do we use (if any)?
  4. How do we transfer statistics from the storages to the router?
    • Fetch only the required part of the data during query optimization (first calculate the buckets to find the Tarantool instances we really need)?
    • Fetch and cache all statistics from all Tarantool instances (manually or in a background fiber)?
  5. Describe, by example, the row selectivity estimation logic for all plan nodes supported in sbroad.
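
To make questions 2 and 5 more concrete, here is a minimal sketch of how the listed statistics could feed a selectivity estimate for a filter on a single column. It is not the sbroad design: it follows the common textbook scheme (MCV list plus an equi-depth histogram over the remaining values), and every name in it (`ColumnStats`, `eq_selectivity`, `lt_selectivity`, `histogram_fraction_below`) is hypothetical. It also assumes numeric column values.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class ColumnStats:
    """Hypothetical per-column statistics (not an sbroad structure)."""
    total: int   # number of rows in the table
    nulls: int   # number of NULL values in the column
    ndv: int     # number of distinct non-NULL values
    mcv: dict = field(default_factory=dict)     # most common value -> row count
    bounds: list = field(default_factory=list)  # equi-depth boundaries, non-MCV values


def eq_selectivity(s: ColumnStats, value) -> float:
    """Estimated fraction of rows satisfying `column = value`."""
    if s.total == 0:
        return 0.0
    if value in s.mcv:
        return s.mcv[value] / s.total
    # Assumption: non-MCV rows are spread uniformly over non-MCV distinct values.
    rest_rows = s.total - s.nulls - sum(s.mcv.values())
    rest_ndv = s.ndv - len(s.mcv)
    if rest_rows <= 0 or rest_ndv <= 0:
        return 0.0
    return rest_rows / rest_ndv / s.total


def histogram_fraction_below(bounds: list, value) -> float:
    """Fraction of histogram rows below `value`; each bucket holds an equal share."""
    buckets = len(bounds) - 1
    if buckets < 1 or value <= bounds[0]:
        return 0.0
    if value > bounds[-1]:
        return 1.0
    i = min(bisect_right(bounds, value) - 1, buckets - 1)
    lo, hi = bounds[i], bounds[i + 1]
    # Linear interpolation inside the bucket that contains `value`.
    within = (value - lo) / (hi - lo) if hi > lo else 0.0
    return (i + within) / buckets


def lt_selectivity(s: ColumnStats, value) -> float:
    """Estimated fraction of rows satisfying `column < value`."""
    if s.total == 0:
        return 0.0
    mcv_rows = sum(count for v, count in s.mcv.items() if v < value)
    rest_rows = max(s.total - s.nulls - sum(s.mcv.values()), 0)
    return (mcv_rows + rest_rows * histogram_fraction_below(s.bounds, value)) / s.total


# Example: 1000 rows, 100 NULLs, two frequent values, four equi-depth buckets.
stats = ColumnStats(total=1000, nulls=100, ndv=52,
                    mcv={1: 300, 2: 100}, bounds=[0, 25, 50, 75, 100])
print(eq_selectivity(stats, 1))    # value found in the MCV list
print(eq_selectivity(stats, 42))   # value outside the MCV list
print(lt_selectivity(stats, 50))   # range predicate: MCV part plus histogram part
```

In this sketch NULL rows never match either predicate, which is why `nulls` is subtracted before the remaining rows are distributed. The design document would need to define analogous rules for the other plan nodes (joins, groupings, and so on).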
Edited by Denis Smirnov