Research: optimization of distributed JOIN

There is a proposal from @darthunix to optimize distributed JOIN queries on the network/logical side. Since a JOIN query requires moving data from one instance to another (data from both instances is gathered on the router, where all calculations are made, after which the new data is returned to one of the instances), we may optimize the data movement so that less data has to be transferred over the network.
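
To make the cost of data movement concrete, here is a minimal sketch that compares the bytes moved by the gather-on-router flow described above with one possible alternative, shipping only the smaller input to the instance that holds the larger one. The alternative strategy, the `Side` model and all the numbers are assumptions for illustration; none of them come from sbroad.

```python
from dataclasses import dataclass


@dataclass
class Side:
    """One join input or result held on a single instance (hypothetical model)."""
    rows: int
    row_bytes: int

    @property
    def size(self) -> int:
        return self.rows * self.row_bytes


def gather_on_router(left: Side, right: Side, result: Side) -> int:
    # Both inputs travel to the router, then the joined result travels
    # back to one of the instances.
    return left.size + right.size + result.size


def ship_smaller_side(left: Side, right: Side) -> int:
    # Assumed alternative: only the smaller input moves, the join runs
    # where the larger input lives, so the result needs no extra transfer.
    return min(left.size, right.size)


# Made-up sizes: a large table joined with a small one.
left = Side(rows=1_000_000, row_bytes=100)
right = Side(rows=1_000, row_bytes=50)
result = Side(rows=1_000, row_bytes=150)

baseline = gather_on_router(left, right, result)
optimized = ship_smaller_side(left, right)
print(f"gather on router: {baseline} bytes, ship smaller side: {optimized} bytes")
```

The model ignores serialization overhead and assumes each input sits on exactly one instance, so it only shows the direction of the saving, not its real magnitude.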

As an outline of a solution, there are several options:

- For long sequences of JOIN queries, analyze the possible join orders using combinatorics
- Estimate the selectivity of JOIN queries somehow:
  - May look at UNIQUE indexes (a simple SELECT with an equality condition on a UNIQUE column returns at most one tuple)
  - May read articles about it. Examples: this, this
- May implement sampling in the Tarantool core (some attempts have already been made: this, this, this and this)
- May implement statistics gathering on the sbroad side, e.g. store information after every INSERT query
- May study articles about how such an optimization is implemented in other databases: PostgreSQL, Greenplum, Yugabyte, CockroachDB
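
A rough sketch of how several of these ideas could fit together: per-table statistics (row count and number of distinct join-key values) feed a textbook equi-join selectivity estimate, and every left-deep join order is enumerated to find the one with the smallest intermediate results. `TableStats`, `collect_stats`, `plan_cost` and `best_order` are hypothetical names, not part of sbroad or Tarantool, and the sketch assumes for simplicity that all tables are joined on the same key.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass
class TableStats:
    rows: int
    distinct_keys: int  # equals rows when the join key has a UNIQUE index


def collect_stats(keys) -> TableStats:
    """Build stats from the join-key values of inserted rows (hypothetical hook)."""
    keys = list(keys)
    return TableStats(rows=len(keys), distinct_keys=len(set(keys)))


def estimate_join_rows(left_rows: float, left_distinct: int, right: TableStats) -> float:
    # Textbook equi-join estimate: |L| * |R| / max(ndv(L), ndv(R)).
    return left_rows * right.rows / max(left_distinct, right.distinct_keys, 1)


def plan_cost(order, stats) -> float:
    """Sum of the estimated intermediate result sizes of a left-deep join order."""
    first = stats[order[0]]
    rows, distinct = float(first.rows), first.distinct_keys
    cost = 0.0
    for name in order[1:]:
        right = stats[name]
        rows = estimate_join_rows(rows, distinct, right)
        distinct = min(distinct, right.distinct_keys)
        cost += rows
    return cost


def best_order(stats):
    # Brute force over all n! orders; only feasible for short join sequences.
    return min(permutations(stats), key=lambda order: plan_cost(order, stats))


# Made-up tables: the keys of "users" and "vip" are unique, "orders" repeats them.
stats = {
    "orders": collect_stats(i % 100 for i in range(10_000)),
    "users": collect_stats(range(100)),
    "vip": collect_stats(range(5)),
}
order = best_order(stats)
print(order, plan_cost(order, stats))
```

With these numbers the cheapest orders join the two small tables first and bring in `orders` last. A real optimizer would replace the brute-force enumeration with dynamic programming or a heuristic, and would take its statistics from sampling or from counters updated on INSERT instead of scanning all keys.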

Edited Oct 04, 2022 by Emir Vildanov