Preface

GROUPING SETS, GROUPING__ID, CUBE, and ROLLUP are analysis features typically used in OLAP, where metrics cannot simply be accumulated and instead have to be computed when rolling up or drilling down across different dimensions, for example UV counts by hour, by day, and by month.

grouping sets aggregates by different combinations of dimensions. It is equivalent to taking the group by results for the different dimensions and combining them with union all; put simply, it lets you write several group by queries over different dimensions in a single SQL statement.

Data preparation:

vim /root/test.txt

2015-03,2015-03-10,cookie1
2015-03,2015-03-10,cookie5
2015-03,2015-03-12,cookie7
2015-04,2015-04-12,cookie3
2015-04,2015-04-13,cookie2
2015-04,2015-04-13,cookie4
2015-04,2015-04-16,cookie4
2015-03,2015-03-10,cookie2
2015-03,2015-03-10,cookie3
2015-04,2015-04-12,cookie5
2015-04,2015-04-13,cookie6
2015-04,2015-04-15,cookie3
2015-04,2015-04-15,cookie2
2015-04,2015-04-16,cookie1

Upload the data to an HDFS directory:

hdfs dfs -put /root/test.txt /tmp

Create the table:

use test;

create table cookie5(month string, day string, cookieid string)
row format delimited fields terminated by ',';

load data inpath "/tmp/test.txt" into table cookie5;

select * from cookie5;

Getting started with grouping sets

Here is a SQL statement:

select month, day, count(cookieid)
from cookie5
group by month, day
grouping sets (month, day);

The query results are as follows:
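Based on the sample data loaded above, the output should contain roughly these rows (row order may vary; the dimension that is not part of the current grouping set is returned as NULL):

month    day         count(cookieid)
2015-03  NULL        5
2015-04  NULL        9
NULL     2015-03-10  4
NULL     2015-03-12  1
NULL     2015-04-12  2
NULL     2015-04-13  3
NULL     2015-04-15  2
NULL     2015-04-16  2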

The SQL above is equivalent to multiple group by queries combined with union all:

select month, NULL as day, count(cookieid) as nums
from cookie5
group by month
union all
select NULL as month, day, count(cookieid) as nums
from cookie5
group by day;

Note: when using union or union all, every select in the set must return the same number of columns, and the corresponding columns must have the same types. The column lists of the union all branches therefore have to line up, which is why the missing dimension above is padded, e.g. with NULL as month.

The results are the same, but grouping sets is faster than group by + union all!

This raises a question: why is grouping sets faster than group by + union all?

First, use explain to look at the Hive execution plans:

explain select month, day, count(cookieid)
from cookie5
group by month, day
grouping sets (month, day);

The execution plan is as follows:

+----------------------------------------------------+--+

| Explain |

+----------------------------------------------------+--+

| STAGE DEPENDENCIES: |

| Stage-1 is a root stage |

| Stage-0 depends on stages: Stage-1 |

| |

| STAGE PLANS: |

| Stage: Stage-1 |

| Map Reduce |

| Map Operator Tree: |

| TableScan |

| alias: cookie5 |

| Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |

| Select Operator |

| expressions: month (type: string), day (type: string), cookieid (type: string) |

| outputColumnNames: month, day, cookieid |

| Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |

| Group By Operator |

| aggregations: count(cookieid) |

| keys: month (type: string), day (type: string), '0' (type: string) |

| mode: hash |

| outputColumnNames: _col0, _col1, _col2, _col3 |

| Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: NONE |

| Reduce Output Operator |

| key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string) |

| sort order: +++ |

| Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string) |

| Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: NONE |

| value expressions: _col3 (type: bigint) |

| Reduce Operator Tree: |

| Group By Operator |

| aggregations: count(VALUE._col0) |

| keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string) |

| mode: mergepartial |

| outputColumnNames: _col0, _col1, _col3 |

| Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |

| pruneGroupingSetId: true |

| Select Operator |

| expressions: _col0 (type: string), _col1 (type: string), _col3 (type: bigint) |

| outputColumnNames: _col0, _col1, _col2 |

| Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |

| File Output Operator |

| compressed: false |

| Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |

| table: |

| input format: org.apache.hadoop.mapred.TextInputFormat |

| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |

| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |

| |

| Stage: Stage-0 |

| Fetch Operator |

| limit: -1 |

| Processor Tree: |

| ListSink

Now look at the other query:

explain select month,NULL as day,count(cookieid) as nums from cookie5 group by month

union all

select NULL as month,day,count(cookieid) as nums from cookie5 group by day;

+----------------------------------------------------+--+

| Explain |

+----------------------------------------------------+--+

| STAGE DEPENDENCIES: |

| Stage-1 is a root stage |

| Stage-2 depends on stages: Stage-1, Stage-3 |

| Stage-3 is a root stage |

| Stage-0 depends on stages: Stage-2 |

| |

| STAGE PLANS: |

| Stage: Stage-1 |

| Map Reduce |

| Map Operator Tree: |

| TableScan |

| alias: cookie5 |

| Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |

| Select Operator |

| expressions: month (type: string), cookieid (type: string) |

| outputColumnNames: month, cookieid |

| Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |

| Group By Operator |

| aggregations: count(cookieid) |

| keys: month (type: string) |

| mode: hash |

| outputColumnNames: _col0, _col1 |

| Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |

| Reduce Output Operator |

| key expressions: _col0 (type: string) |

| sort order: + |

| Map-reduce partition columns: _col0 (type: string) |

| Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |

| value expressions: _col1 (type: bigint) |

| Reduce Operator Tree: |

| Group By Operator |

| aggregations: count(VALUE._col0) |

| keys: KEY._col0 (type: string) |

| mode: mergepartial |

| outputColumnNames: _col0, _col1 |

| Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |

| Select Operator |

| expressions: _col0 (type: string), UDFToString(null) (type: string), _col1 (type: bigint) |

| outputColumnNames: _col0, _col1, _col2 |

| Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |

| File Output Operator |

| compressed: false |

| table: |

| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |

| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |

| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |

| |

| Stage: Stage-2 |

| Map Reduce |

| Map Operator Tree: |

| TableScan |

| Union |

| Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: NONE |

| File Output Operator |

| compressed: false |

| Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: NONE |

| table: |

| input format: org.apache.hadoop.mapred.TextInputFormat |

| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |

| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |

| TableScan |

| Union |

| Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: NONE |

| File Output Operator |

| compressed: false |

| Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: NONE |

| table: |

| input format: org.apache.hadoop.mapred.TextInputFormat |

| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |

| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |

| |

| Stage: Stage-3 |

| Map Reduce |

| Map Operator Tree: |

| TableScan |

| alias: cookie5 |

| Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |

| Select Operator |

| expressions: day (type: string), cookieid (type: string) |

| outputColumnNames: day, cookieid |

| Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |

| Group By Operator |

| aggregations: count(cookieid) |

| keys: day (type: string) |

| mode: hash |

| outputColumnNames: _col0, _col1 |

| Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |

| Reduce Output Operator |

| key expressions: _col0 (type: string) |

| sort order: + |

| Map-reduce partition columns: _col0 (type: string) |

| Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |

| value expressions: _col1 (type: bigint) |

| Reduce Operator Tree: |

| Group By Operator |

| aggregations: count(VALUE._col0) |

| keys: KEY._col0 (type: string) |

| mode: mergepartial |

| outputColumnNames: _col0, _col1 |

| Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |


| Select Operator |

| expressions: UDFToString(null) (type: string), _col0 (type: string), _col1 (type: bigint) |

| outputColumnNames: _col0, _col1, _col2 |

| Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |

| File Output Operator |

| compressed: false |

| table: |

| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |

| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |

| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |

| |

| Stage: Stage-0 |

| Fetch Operator |

| limit: -1 |

| Processor Tree: |

| ListSink |

| |

+----------------------------------------------------+--+

Comparison: the explain output shows that the grouping sets query has 2 stages and a single reduce phase, while the group by + union all query has 4 stages and two reduce phases (the table is scanned twice and the partial results are merged in an extra union stage). So the grouping sets version is bound to be faster.
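As a side note that is not in the original article: each grouping set causes every input row to be emitted once more by the map-side group by, so map output grows with the number of sets. Hive exposes the setting hive.new.job.grouping.set.cardinality for this; when a query uses more grouping sets than this threshold, Hive adds an extra MR job for the expansion instead of doing it all map-side. A hypothetical tuning sketch (the value 32 is arbitrary, not a recommendation):

set hive.new.job.grouping.set.cardinality = 32;  -- hypothetical threshold

select month, day, count(cookieid)
from cookie5
group by month, day
grouping sets (month, day, (month, day), ());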

Using GROUPING__ID:

Here is a SQL statement:

select
  month,
  day,
  count(distinct cookieid) as uv,
  GROUPING__ID
from cookie5
group by month, day
grouping sets (month, day)
order by GROUPING__ID;

This has the same effect as:

SELECT month,NULL as day,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM cookie5 GROUP BY month

UNION ALL

SELECT NULL as month,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM cookie5 GROUP BY day;

Explanation of the results:

In the result, the first column is the month being grouped on, the second column is the day being grouped on, the third column is the number of distinct cookieids in that group (grouped by month or by day), and the fourth column, GROUPING__ID, indicates which grouping set the row belongs to, according to the grouping conditions month and day in grouping sets.
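As a small illustration (a sketch that is not from the original article), GROUPING__ID can also be used in the select list to label each row with the grouping set it came from, instead of inferring it from which column is NULL. This assumes the GROUPING__ID values used above (1 for the month set, 2 for the day set); the exact values depend on the Hive version:

select
  case when GROUPING__ID = 1 then 'by month'   -- label derived from the grouping set
       when GROUPING__ID = 2 then 'by day'
  end as grain,
  month,
  day,
  count(distinct cookieid) as uv
from cookie5
group by month, day
grouping sets (month, day);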

Another example:

SELECT month, day,
  COUNT(DISTINCT cookieid) AS uv,
  GROUPING__ID
FROM cookie5
GROUP BY month, day
GROUPING SETS (month, day, (month, day))
ORDER BY GROUPING__ID;

Equivalent to:

SELECT month,NULL as day,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM cookie5 GROUP BY month

UNION ALL

SELECT NULL as month,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM cookie5 GROUP BY day

UNION ALL

SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM cookie5 GROUP BY month,day;

How grouping__id is calculated

Each granularity in grouping sets corresponds to a unique grouping__id value. How it is computed depends on the order of the columns in group by and on which columns take part in the current granularity. The calculation works as follows:

Treat each group by column as one bit of a binary number: 2 columns give 2 bits, 5 columns give 5 bits. The column closest to group by is the low-order bit, and columns further away occupy higher-order bits. In Hive, a column that takes part in the current combination is 1 and one that does not is 0. This forms a binary number, and converting it to decimal gives the grouping__id for that granularity.

Example

Take the statistics on cookie5 at the three granularities above as an example:

The SQL above groups by month, day, so following the rule above, month occupies the low-order bit and day the high-order bit.

| No. | grouping set | binary value | decimal (grouping__id value) |
| --- | --- | --- | --- |
| 0 | () | 00 | 0 |
| 1 | (month) | 01 | 1 |
| 2 | (day) | 10 | 2 |
| 3 | (month, day) | 11 | 3 |

This is where the grouping__id values in the grouping sets results above come from.
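For completeness, the CUBE and ROLLUP mentioned in the preface can be seen as shorthand for particular grouping sets. A minimal sketch of Hive's with cube / with rollup syntax (not covered elsewhere in this article):

-- with cube aggregates over every combination of the group by columns,
-- roughly: grouping sets ((month, day), (month), (day), ())
select month, day, count(cookieid)
from cookie5
group by month, day
with cube;

-- with rollup aggregates hierarchically from left to right,
-- roughly: grouping sets ((month, day), (month), ())
select month, day, count(cookieid)
from cookie5
group by month, day
with rollup;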
