网站首页 > 数据库 > 正文

hadoop 数据仓库 Hive中Left Join基于or形式匹配连接的一种优雅实现方式

大数据工程师一个月多少钱数据库 2024-01-09 19 0

0 引言

1 需求

2 数据准备

3 问题分析

4 小结

0 引言

Hive中对于模糊匹配关联是不支持的，如or 连接，基于like的模糊匹配连接，对于此类问题往往需要找新的方案，对于or连接网上给出了解决方案如union的实现形式，本文借助于locate()+concat_ws()函数进行优雅的实现。

1 需求

t表 id dt 1 2022-06-03 2 2022-05-04 3 2022-04-01 4 2022-05-22

t1表： id dt1 dt2 1 2022-06-03 2022-05-03 2 2022-05-25 2022-05-04 3 2022-03-01 2022-05-04

找出t表中时间字段dt在t1表中dt1,dt2任意出现的id，及时间，保留t表中数据，如果能够匹配到取匹配的时间，未匹配到置为NULL

2 数据准备

create table t as

select 1 as id,'2022-06-03' as dt

union all

select 2 as id,'2022-05-04' as dt

union all

select 3 as id,'2022-04-01' as dt

union all

select 4 as id,'2022-05-22' as dt

------------------

create table t1 as

select 1 as id,'2022-06-03' as dt1,'2022-05-03' as dt2

union all

select 2 as id,'2022-05-25' as dt, '2022-05-04' as dt2

union all

select 3 as id,'2022-03-01' as dt1, ' 2022-05-04' as dt2

3 问题分析

针对本问题如果在oracle或mysql中则比较好实现，具体如下：

select t.id,case when t1.id is not null then t.dt else null end as dt

from t

join t1

on t.id = t1.id

and (t.dt=t1.dt1 or t.dt=t1.dt2)

但由于Hive中不支持不等连接，使本问题增加了难度，传统的解法，采用union+去重的实现方式，具体SQL如下：

select id1 as id

,max(case when id2 is not null then dt else null end) as dt

from

(select cast(t.id as string) as id1,t.dt as dt,cast(t1.id as string) as id2

from t

left join t1

on t.id = t1.id

and t.dt=t1.dt1

union

select cast(t.id as string) as id1,t.dt as dt,cast(t1.id as string) as id2

from t

left join t1

on t.id = t1.id

and t.dt=t1.dt2

) t

group by id1

结果如下：

id dt

1 2022-06-03

2 2022-05-04

3 NULL

4 NULL

Time taken: 3.343 seconds, Fetched: 4 row(s)

但是上述的方式显示显得很啰嗦，如果后面需要匹配的or比较多，比如有5个的时候，那么同样的逻辑就要union 5次代码看起来相当繁琐，且性能较低。

针对以上问题，本文采用一种优雅的实现方式：我们知道采用or连接的时候，无非就是t表中的字段在t1表中匹配到了就成功，对于这种需要匹配就成功的连接方式，我们自然想到hive中高效的实现方式locate()函数，对于该函数的理解，可以具体参考我如下文章：

SQL之一天一个小技巧：如何使用HQL从不固定位置提取字符串元素【详解Hive字符串位置查找函数】_莫叫石榴姐的博客-CSDN博客_hive字符第二次出现位置

由于t表中的某个字段需要在t1表中的多个字段中去匹配，为了满足locate()函数使用条件，我们利用concat_ws()函数进行列转行拼成一个串，然后让该字段在串中去匹配如果能匹配成功则表明该条数据可以匹配到，然后在where语句或case when中进行过滤。注意locate()这种模糊匹配也是不能放在on条件中的，这种本质上也是一种不等连接，只能放在where或case when中过滤，至于为什么要用locate()而不用like函数，原因是locate()的性能要优于like。因此我们具有如下实现方式：

首先用left join做等值连接，然后select语句通过case when去判断匹配，具体SQL如下：

select t.id

,case when locate(t.dt,concat_ws(',',t1.dt1,t1.dt2))>0 then t.dt else null end as dt

from t

left join t1

on t.id = t1.id

具体结果如下：

t.id dt

2 2022-05-04

3 NULL

1 2022-06-03

4 NULL

Time taken: 2.275 seconds, Fetched: 4 row(s)

从执行性能上来看此种的实现方式也是要优于union的实现方式。

4 小结

通过本文我们总结了hive中or不等连接的一种优雅的实现方式，该方式借助于locate()模糊匹配的方法代码简洁优雅，从整体上看这种实现方式优于union的实现方式，性能较优，本文所采用的loacte()模糊匹配函数，在hive中用途较广，读者也需要务必掌握。

文章来源

评论可见，请评论后查看内容，谢谢！！！

您阅读本篇文章共花了：

Hive hadoop 数据仓库

本文由用户于 2024-01-09 发布在金钥匙，如有疑问，请联系我们。
本文链接：https://www.51969.com/post/18001731.html

金钥匙

hadoop 数据仓库 Hive中Left Join基于or形式匹配连接的一种优雅实现方式

大数据数据仓库 Apache Hudi 数据湖之Hudi基础：入门介绍和编译部署

大数据技术之Hadoop学习(七)——Hive数据仓库

发表评论取消回复

金钥匙

hadoop 数据仓库 Hive中Left Join基于or形式匹配连接的一种优雅实现方式

大数据 数据仓库 Apache Hudi 数据湖之Hudi基础：入门介绍和编译部署

大数据技术之Hadoop学习(七)——Hive数据仓库

相关文章

发表评论取消回复

大数据数据仓库 Apache Hudi 数据湖之Hudi基础：入门介绍和编译部署