文章目录

1. join中的on条件和where条件的区别1.1. left join1.2. right join1.3. inner join

2. 多个left join执行顺序3. 同时有inner join和left join时的执行顺序和结果4. SQL中存在多个join时的join执行顺序和结果

1. join中的on条件和where条件的区别

--建表

create table t1(id int, value int) partitioned by (ds string);

create table t2(id int, value int) partitioned by (ds string);

create table t3(c1 int, c2 int, c3 int);

--数据装载,t1表

insert overwrite table t1 partition(ds='20220120') select '1','2022';

insert overwrite table t1 partition(ds='20220121') select '2','2022';

insert overwrite table t1 partition(ds='20220122') select '2','2022';

--数据装载,t2表

insert overwrite table t2 partition(ds='20220120') select '1','120';

insert overwrite table t2 partition(ds='20220121') select '1','120';

insert into table t2 partition(ds='20220121') select '3','120';

--数据装载,t3表

insert into table t3 select '1','33','33';

insert into table t3 select '1','34','33';

insert into table t3 select '3','33','33';

insert into table t3 select '4','33','33';

1.1. left join

SQL案例

SELECT *

FROM t1

LEFT JOIN t2

ON t1.id = t2.id

where t1.ds = '20220120';

-- 执行结果为

+--------+-----------+-----------+--------+-----------+-----------+

| t1.id | t1.value | t1.ds | t2.id | t2.value | t2.ds |

+--------+-----------+-----------+--------+-----------+-----------+

| 1 | 2022 | 20220120 | 1 | 120 | 20220120 |

| 1 | 2022 | 20220120 | 1 | 120 | 20220121 |

+--------+-----------+-----------+--------+-----------+-----------+

对于上述执行结果相信并没有任何的问题,实际执行的结果和我们的预期是完全一致的。

SELECT *

FROM t1

LEFT JOIN t2

ON t1.id = t2.id AND t1.ds = '20220120';

-- 执行结果为,t1.ds = '20220120'条件未生效

+--------+-----------+-----------+--------+-----------+-----------+

| t1.id | t1.value | t1.ds | t2.id | t2.value | t2.ds |

+--------+-----------+-----------+--------+-----------+-----------+

| 1 | 2022 | 20220120 | 1 | 120 | 20220120 |

| 1 | 2022 | 20220120 | 1 | 120 | 20220121 |

| 2 | 2022 | 20220121 | NULL | NULL | NULL |

| 2 | 2022 | 20220122 | NULL | NULL | NULL |

+--------+-----------+-----------+--------+-----------+-----------+

执行结果中返回了t1表中的全部数据(是不是和预期的结果并不一样),从结果看 join 条件中的 AND t1.ds = '20220120' 并没有“生效”,为什么这样?

而如下所示,在on条件中对t2表进行筛选 t2.ds = '20220120',则返回结果中过滤了指定的数据。

SELECT *

FROM t1

LEFT JOIN t2

ON t1.id = t2.id AND t2.ds = '20220120';

-- 执行结果为,t2.ds = '20220120'条件生效

+--------+-----------+-----------+--------+-----------+-----------+

| t1.id | t1.value | t1.ds | t2.id | t2.value | t2.ds |

+--------+-----------+-----------+--------+-----------+-----------+

| 1 | 2022 | 20220120 | 1 | 120 | 20220120 |

| 2 | 2022 | 20220121 | NULL | NULL | NULL |

| 2 | 2022 | 20220122 | NULL | NULL | NULL |

+--------+-----------+-----------+--------+-----------+-----------+

分别查看上述三个SQL的执行计划

SELECT *

FROM t1

LEFT JOIN t2

ON t1.id = t2.id

where t1.ds = '20220120';

== Physical Plan ==

*(2) BroadcastHashJoin [id#24], [id#27], LeftOuter, BuildRight

:- *(2) FileScan orc zhanglei.t1[id#24,value#25,ds#26] Batched: true, Format: ORC, Location: PrunedInMemoryFileIndex[hdfs://bdptest/warehouse/tablespace/managed/hive/zhanglei.db/t1/ds=20220120], PartitionCount: 1, PartitionFilters: [isnotnull(ds#26), (ds#26 = 20220120)], PushedFilters: [], ReadSchema: struct

+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)))

+- *(1) Project [id#27, value#28, ds#29]

+- *(1) Filter isnotnull(id#27)

+- *(1) FileScan orc zhanglei.t2[id#27,value#28,ds#29] Batched: true, Format: ORC, Location: CatalogFileIndex[hdfs://bdptest/warehouse/tablespace/managed/hive/zhanglei.db/t2], PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct

从上述执行计划中得知where条件中t1.ds = '20220120'生效,体现在对表t1的读取上只读取了ds=20220120分区。

SELECT *

FROM t1

LEFT JOIN t2

ON t1.id = t2.id AND t1.ds = '20220120';

== Physical Plan ==

*(2) BroadcastHashJoin [id#32], [id#35], LeftOuter, BuildRight, (ds#34 = 20220120)

:- *(2) FileScan orc zhanglei.t1[id#32,value#33,ds#34] Batched: true, Format: ORC, Location: CatalogFileIndex[hdfs://bdptest/warehouse/tablespace/managed/hive/zhanglei.db/t1], PartitionCount: 3, PartitionFilters: [], PushedFilters: [], ReadSchema: struct

+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)))

+- *(1) Project [id#35, value#36, ds#37]

+- *(1) Filter isnotnull(id#35)

+- *(1) FileScan orc zhanglei.t2[id#35,value#36,ds#37] Batched: true, Format: ORC, Location: CatalogFileIndex[hdfs://bdptest/warehouse/tablespace/managed/hive/zhanglei.db/t2], PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct

从上述执行计划中得知on条件中t1.ds = '20220120'并没有生效,因为并没有过滤条件,仅仅在BroadcastHashJoin中进行了体现。

SELECT *

FROM t1

LEFT JOIN t2

ON t1.id = t2.id AND t2.ds = '20220120';

== Physical Plan ==

*(2) BroadcastHashJoin [id#40], [id#43], LeftOuter, BuildRight

:- *(2) FileScan orc zhanglei.t1[id#40,value#41,ds#42] Batched: true, Format: ORC, Location: CatalogFileIndex[hdfs://bdptest/warehouse/tablespace/managed/hive/zhanglei.db/t1], PartitionCount: 3, PartitionFilters: [], PushedFilters: [], ReadSchema: struct

+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)))

+- *(1) Project [id#43, value#44, ds#45]

+- *(1) Filter isnotnull(id#43)

+- *(1) FileScan orc zhanglei.t2[id#43,value#44,ds#45] Batched: true, Format: ORC, Location: PrunedInMemoryFileIndex[hdfs://bdptest/warehouse/tablespace/managed/hive/zhanglei.db/t2/ds=20220120], PartitionCount: 1, PartitionFilters: [isnotnull(ds#45), (ds#45 = 20220120)], PushedFilters: [IsNotNull(id)], ReadSchema: struct

从上述执行计划中得知where条件中t2.ds = '20220120'生效,体现在对表t2的读取上只读取了ds=20220120分区。

1.2. right join

SELECT *

FROM t1

RIGHT JOIN t2

ON t1.id = t2.id AND t1.ds = '20220121';

-- 执行结果,t1.ds = '20220121'的条件生效

+--------+-----------+--------+--------+-----------+-----------+

| t1.id | t1.value | t1.ds | t2.id | t2.value | t2.ds |

+--------+-----------+--------+--------+-----------+-----------+

| NULL | NULL | NULL | 1 | 120 | 20220120 |

| NULL | NULL | NULL | 1 | 120 | 20220121 |

| NULL | NULL | NULL | 3 | 120 | 20220121 |

+--------+-----------+--------+--------+-----------+-----------+

SELECT *

FROM t1

RIGHT JOIN t2

ON t1.id = t2.id AND t2.ds = '20220120';

-- 执行结果为,t2.ds = '20220120' 的条件未生效

+--------+-----------+-----------+--------+-----------+-----------+

| t1.id | t1.value | t1.ds | t2.id | t2.value | t2.ds |

+--------+-----------+-----------+--------+-----------+-----------+

| 1 | 2022 | 20220120 | 1 | 120 | 20220120 |

| NULL | NULL | NULL | 1 | 120 | 20220121 |

| NULL | NULL | NULL | 3 | 120 | 20220121 |

+--------+-----------+-----------+--------+-----------+-----------+

!!! note “” 从上述执行结果中看,left join的on中t1(左表)的条件并不会生效,但是t2(右表)的条件会生效。right join的on中(右表)的条件并不会生效,但是左表的条件会生效。

问题:在left join的on条件中为什么左表的筛选条件不会生效,而右表的筛选条件会生效???

1.3. inner join

SELECT *

FROM t1

JOIN t2

ON t1.id = t2.id

where t1.ds = '20220120';

--

SELECT *

FROM t1

JOIN t2

ON t1.id = t2.id AND t1.ds = '20220120';

-- 以上两个SQL执行结果相同

+--------+-----------+-----------+--------+-----------+-----------+

| t1.id | t1.value | t1.ds | t2.id | t2.value | t2.ds |

+--------+-----------+-----------+--------+-----------+-----------+

| 1 | 2022 | 20220120 | 1 | 120 | 20220120 |

| 1 | 2022 | 20220120 | 1 | 120 | 20220121 |

+--------+-----------+-----------+--------+-----------+-----------+

在inner join中 on 和 where 条件中的条件都会正常生效。

SELECT *

FROM t1

JOIN t2

ON t1.id = t2.id AND t2.ds = '20220120';

-- 执行结果如下

+--------+-----------+-----------+--------+-----------+-----------+

| t1.id | t1.value | t1.ds | t2.id | t2.value | t2.ds |

+--------+-----------+-----------+--------+-----------+-----------+

| 1 | 2022 | 20220120 | 1 | 120 | 20220120 |

+--------+-----------+-----------+--------+-----------+-----------+

!!! note “” 在inner join中on中的条件无论左右表的条件都会生效。

2. 多个left join执行顺序

继续给t2和t3表中插入测试数据

insert into table t2 partition(ds='20220121') select '3','120';

insert into table t3 select '1','33','33';

insert into table t3 select '1','34','33';

insert into table t3 select '3','33','33';

insert into table t3 select '4','33','33';

案例SQL

SELECT *

FROM t1

LEFT JOIN t2

ON t1.id = t2.id

LEFT JOIN t3

ON t2.id = c1;

-- 执行结果为

+--------+-----------+-----------+--------+-----------+-----------+--------+--------+--------+

| t1.id | t1.value | t1.ds | t2.id | t2.value | t2.ds | t3.c1 | t3.c2 | t3.c3 |

+--------+-----------+-----------+--------+-----------+-----------+--------+--------+--------+

| 1 | 2022 | 20220120 | 1 | 120 | 20220120 | 1 | 33 | 33 |

| 1 | 2022 | 20220120 | 1 | 120 | 20220120 | 1 | 34 | 33 |

| 1 | 2022 | 20220120 | 1 | 120 | 20220121 | 1 | 33 | 33 |

| 1 | 2022 | 20220120 | 1 | 120 | 20220121 | 1 | 34 | 33 |

| 2 | 2022 | 20220122 | NULL | NULL | NULL | NULL | NULL | NULL |

| 2 | 2022 | 20220121 | NULL | NULL | NULL | NULL | NULL | NULL |

+--------+-----------+-----------+--------+-----------+-----------+--------+--------+--------+

从结果得知,多个表进行left join时,是按照join顺序进行的。即先由t1和t2表left join形成一个虚拟表后,再和t3表进行left join成最终结果。

3. 同时有inner join和left join时的执行顺序和结果

SELECT *

FROM t2

JOIN t3

ON t2.id = t3.c1;

-- 执行结果如下

+--------+-----------+-----------+--------+--------+--------+

| t2.id | t2.value | t2.ds | t3.c1 | t3.c2 | t3.c3 |

+--------+-----------+-----------+--------+--------+--------+

| 1 | 120 | 20220121 | 1 | 33 | 33 |

| 1 | 120 | 20220120 | 1 | 33 | 33 |

| 1 | 120 | 20220121 | 1 | 34 | 33 |

| 1 | 120 | 20220120 | 1 | 34 | 33 |

| 3 | 120 | 20220121 | 3 | 33 | 33 |

+--------+-----------+-----------+--------+--------+--------+

SELECT *

FROM t1

left join t2

on t1.id = t2.id

JOIN t3

ON t2.id = t3.c1;

-- 执行结果如下

+--------+-----------+-----------+--------+-----------+-----------+--------+--------+--------+

| t1.id | t1.value | t1.ds | t2.id | t2.value | t2.ds | t3.c1 | t3.c2 | t3.c3 |

+--------+-----------+-----------+--------+-----------+-----------+--------+--------+--------+

| 1 | 2022 | 20220120 | 1 | 120 | 20220120 | 1 | 33 | 33 |

| 1 | 2022 | 20220120 | 1 | 120 | 20220121 | 1 | 33 | 33 |

| 1 | 2022 | 20220120 | 1 | 120 | 20220120 | 1 | 34 | 33 |

| 1 | 2022 | 20220120 | 1 | 120 | 20220121 | 1 | 34 | 33 |

+--------+-----------+-----------+--------+-----------+-----------+--------+--------+--------+

SELECT *

FROM t2

JOIN t3

ON t2.id = t3.c1

left join t1

on t1.id = t2.id;

-- 执行结果如下

+--------+-----------+-----------+--------+--------+--------+--------+-----------+-----------+

| t2.id | t2.value | t2.ds | t3.c1 | t3.c2 | t3.c3 | t1.id | t1.value | t1.ds |

+--------+-----------+-----------+--------+--------+--------+--------+-----------+-----------+

| 1 | 120 | 20220120 | 1 | 33 | 33 | 1 | 2022 | 20220120 |

| 1 | 120 | 20220121 | 1 | 33 | 33 | 1 | 2022 | 20220120 |

| 1 | 120 | 20220120 | 1 | 34 | 33 | 1 | 2022 | 20220120 |

| 1 | 120 | 20220121 | 1 | 34 | 33 | 1 | 2022 | 20220120 |

| 3 | 120 | 20220121 | 3 | 33 | 33 | NULL | NULL | NULL |

+--------+-----------+-----------+--------+--------+--------+--------+-----------+-----------+

SELECT *

FROM t2

JOIN t3

ON t2.id = t3.c1

right join t1

on t1.id = t2.id;

-- 执行结果如下

+--------+-----------+-----------+--------+--------+--------+--------+-----------+-----------+

| t2.id | t2.value | t2.ds | t3.c1 | t3.c2 | t3.c3 | t1.id | t1.value | t1.ds |

+--------+-----------+-----------+--------+--------+--------+--------+-----------+-----------+

| 1 | 120 | 20220120 | 1 | 33 | 33 | 1 | 2022 | 20220120 |

| 1 | 120 | 20220120 | 1 | 34 | 33 | 1 | 2022 | 20220120 |

| 1 | 120 | 20220121 | 1 | 33 | 33 | 1 | 2022 | 20220120 |

| 1 | 120 | 20220121 | 1 | 34 | 33 | 1 | 2022 | 20220120 |

| NULL | NULL | NULL | NULL | NULL | NULL | 2 | 2022 | 20220121 |

| NULL | NULL | NULL | NULL | NULL | NULL | 2 | 2022 | 20220122 |

+--------+-----------+-----------+--------+--------+--------+--------+-----------+-----------+

4. SQL中存在多个join时的join执行顺序和结果

从上述的示例中可以得知,当SQL中存在多个join(无论join类型))时,执行时按照join的前后顺序,前两个表join出一个虚拟的表,再和第三个表进行join,依次往后执行。

精彩文章

评论可见,请评论后查看内容,谢谢!!!
 您阅读本篇文章共花了: