Research: optimization of distributed JOIN
@darthunix has proposed optimizing distributed JOIN queries on the network/logical side. A JOIN query currently requires moving data from one instance to another: the router gathers data from both instances, performs all calculations there, and then returns the new data to one of the instances. We may be able to optimize this data movement so that less data is transferred over the network.
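To make the potential saving concrete, here is a minimal sketch of a transfer-cost comparison. Everything in it is an assumption made for illustration: the `Relation` class, the table names and sizes, and the alternative plan itself (shipping only the smaller input to the instance that holds the larger one, and assuming the result is needed there). It is not how sbroad currently plans queries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """Hypothetical description of one JOIN input stored on a single instance."""
    name: str
    rows: int
    row_bytes: int

    @property
    def size_bytes(self) -> int:
        return self.rows * self.row_bytes


def gather_on_router_cost(left: Relation, right: Relation, result_bytes: int) -> int:
    # Both inputs travel to the router, then the result travels back to an instance.
    return left.size_bytes + right.size_bytes + result_bytes


def ship_smaller_side_cost(left: Relation, right: Relation) -> int:
    # Assumed alternative: only the smaller input travels, and the join runs
    # on the instance that already holds the larger input.
    return min(left.size_bytes, right.size_bytes)


# Illustrative numbers only, not measurements.
orders = Relation("orders", rows=1_000_000, row_bytes=100)
customers = Relation("customers", rows=10_000, row_bytes=200)
result_bytes = 5_000_000

baseline = gather_on_router_cost(orders, customers, result_bytes)
optimized = ship_smaller_side_cost(orders, customers)
print(f"gather on router: {baseline} bytes, ship smaller side: {optimized} bytes")
```

Choosing between such plans requires at least rough knowledge of input and result sizes, which is where selectivity estimates come in.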
As an outline of a solution, there are several options:
- For a long sequence of JOIN queries, analyze the sequence using combinatorics.
- Estimate the selectivity of JOIN queries in some way:
  - Implement sampling in the Tarantool core (there have already been some attempts: this, this, this and this).
  - Implement statistics gathering on the sbroad side, e.g. store information after every INSERT query.
- Study articles on how such optimization is implemented in other databases: PostgreSQL, Greenplum, Yugabyte, CockroachDB.
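Two of the options above can be sketched together: statistics updated on every insert, and a combinatorial search over join orders that uses them. This is a rough illustration under stated assumptions, not a design for sbroad. The `ColumnStats` class, the `on_insert` hook, and the table names are hypothetical. The size estimate is the common textbook formula `|L| * |R| / max(ndv(L), ndv(R))` for an equi-join on a shared key, and "combinatorics" is read here as enumerating left-deep join orders, which may not be exactly what was meant.

```python
from collections import Counter
from itertools import permutations


class ColumnStats:
    """Hypothetical join-key statistics, updated after every insert."""

    def __init__(self):
        self.rows = 0
        self.values = Counter()  # exact counts; a real system would use a sketch

    def on_insert(self, value):
        self.rows += 1
        self.values[value] += 1

    @property
    def distinct(self):
        return len(self.values)


def join_estimate(left, right):
    """Estimate (rows, distinct) for an equi-join of two (rows, distinct) inputs."""
    (l_rows, l_ndv), (r_rows, r_ndv) = left, right
    ndv = max(l_ndv, r_ndv)
    if ndv == 0:
        return 0, 0
    return l_rows * r_rows // ndv, min(l_ndv, r_ndv)


def plan_cost(order, stats):
    """Sum of estimated intermediate result sizes for a left-deep join order."""
    first = stats[order[0]]
    current = (first.rows, first.distinct)
    cost = 0
    for name in order[1:]:
        current = join_estimate(current, (stats[name].rows, stats[name].distinct))
        cost += current[0]
    return cost


def best_join_order(stats):
    # Brute force over all orders: n! candidates, so only viable for short chains.
    return min(permutations(stats), key=lambda order: plan_cost(order, stats))


# Illustrative data: three tables joined on the same key.
stats = {"a": ColumnStats(), "b": ColumnStats(), "c": ColumnStats()}
for i in range(1000):
    stats["a"].on_insert(i % 100)
for i in range(100):
    stats["b"].on_insert(i)
for i in range(10):
    stats["c"].on_insert(i)

best = best_join_order(stats)
cost = plan_cost(best, stats)
print(best, cost)
```

With these numbers, joining the two small tables first keeps the intermediate results small, while starting from the largest table produces an intermediate result roughly ten times larger. Exhaustive enumeration grows factorially, so longer JOIN sequences would need pruning or dynamic programming.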
Edited by Emir Vildanov