Skip to content

Refactor join conflict resolution: use one algorithm for all cases

Currently our codebase contains two join conflict resolution algorithms:

  1. The legacy one: tries to compute motion policy for each bool node in condition subtree. To do that it first computes distribution of rows that are located under those bool nodes.
  2. Single distribution handler: because Single distribution required adding motions not only to inner child, but to outer as well, it was decided to implement a different approach for handling join when one of its children had Single distribution.

Maintaining both these algorithms is painful: we need tests for single and non-single joins, in addition it is very easy to make a mistake by adding some feature to one algo and forgetting to do it for another algo. Furthermore, right now we are approaching joins on global tables, and we will need more custom logic for handling expressions that refer to global tables.

I suggest we refactor the code and leave only one algorithm. Second algorithm could be adopted for all common cases in my opinion, as for the first algorithm: I don't like the idea of computing distribution for rows. Distribution is a property of relational operators, not expressions. It leads to some bugs like #539 (closed).