文章目录

1. 创建表结构1.1 视频表结构1.2 用户表结构

2. 准备工作2.1 创建临时表2.2 创建最终使用表2.3 对创建表进行解读

3. 业务分析

1. 创建表结构

1.1 视频表结构

1.2 用户表结构

2. 准备工作

2.1 创建临时表

由于使用的是orc方式进行存储,所以我们需要建立一个临时表,通过查询插入的方式将数据插入到最终表中。

创建临时视频表

create table gulivideo_ori(

videoId string,

uploader string,

age int,

category array,

length int,

views int,

rate float,

ratings int,

comments int,

relatedId array)

row format delimited fields terminated by "\t"

collection items terminated by "&"

stored as textfile;

创建临时用户表

create table gulivideo_user_ori(

uploader string,

videos int,

friends int)

row format delimited

fields terminated by "\t"

stored as textfile;

加载原数据到临时表

load data local inpath "/opt/module/data/video" into table gulivideo_ori;

load data local inpath "/opt/module/user" into table gulivideo_user_ori;

2.2 创建最终使用表

创建视频表

create table gulivideo_orc(

videoId string,

uploader string,

age int,

category array,

length int,

views int,

rate float,

ratings int,

comments int,

relatedId array)

stored as orc

tblproperties("orc.compress"="SNAPPY");

创建用户表

create table gulivideo_user_orc(

uploader string,

videos int,

friends int)

row format delimited

fields terminated by "\t"

stored as orc

tblproperties("orc.compress"="SNAPPY");

2.3 对创建表进行解读

由于初始表和最终表他们两者中的结构都是一样的,所以对其中一种进行解读

对于用户表

三个字段都是基本数据类型,行格式字段分割以‘\t’结束,以orc的方式存储,并且设置压缩属性为snappy 视频表

视频表中视频的类别和视频的相关视频两个字段是数组的方式进行存储的 ,字段之间分隔符为‘\t’,数组中元素的分隔符为‘&’

3. 业务分析

需求Ⅰ:统计视频观看数 Top10

分析

第一步: 本需求比较简单,直接对视频的观看数进行排序,使用LIMIT关键字,限定前十条数据

代码:

select videoId

from gulivideo_orc

order by views

limit 10;

需求Ⅱ:统计视频类别热度 Top10

分析

第一步:获取视频类别,使用炸裂函数explode,由于不需要表中的其他字段,所以可以不考虑使用侧偏移第二部:按照视频的类型进行分组,并计算每一个类别的视频总数,按照总数获取前十的视频类别

步骤代码:

--步骤一使用下列任意一种都可以,第一种使用侧偏移,第二种直接炸裂

select

videoId,

category_name

from gulivideo_orc

lateral view explode(category) tmp_category as category_name;t1

select

explode(category) categoru_name

from gulivideo_orc;t1

--第二步代码:

select

category_name,

count(*) ct

from t1

group by category_name

order by ct desc

limit 10;

最终代码:

select

category_name,

count(*) ct

from

(select

explode(category) category_name

from gulivideo_orc)t1

group by category_name

order by ct desc

limit 10;

需求Ⅲ:统计出视频观看数最高的 20 个视频的所属类别以及每一个类别包含 Top20 视频的个数

分析

第一步:统计观看数最高的二十个视频第二步:获取视频的类别第三步:按照类别进行分组,获取每一组的视频数(即为每一个类别中包含Top20视频的个数)

select

category

from gulivideo_orc

order by views desc

limit 20;t1

select

explode(category) category_name

from t1;t2

select

category_name,

count(*) video_sum

from t2

group by category_name;

最终代码:

select

category_name,

count(*) video_sum

from

(select

explode(category) category_name

from

(select

category

from gulivideo_orc

order by views desc

limit 20)t1)t2

group by category_name;

需求Ⅳ: 统计视频观看数 Top50 所关联视频的所属类别排序

分析

第一步:获取观看数Top50的视频第二步:获取视频的关联的视频Id第三步:和原表做内连接,查询到该id的类别第四步:将类别进行炸裂第五步:按照类别进行分组,并且统计该分组下的视频总数,按照总数进行排序

分步代码:

select

relatedId

from gulivideo_orc

order by views desc

limit 50;t1

select

explode(relatedId) relatedId_id

from t1;t2

select

category

from t2

join gulivideo_orc t3

on t2.relatedId_id = t3.videoId;t4

select

explode(category) category_name

from t4;t5

select

category_name,

count(*) ct

from t5

group by category_name

order by ct desc;

代码:

select

category_name,

count(*) ct

from

(select

explode(category) category_name

from

(select

category

from

(select

explode(relatedId) relatedId_id

from

(select

relatedId

from gulivideo_orc

order by views desc

limit 50)t1)t2

join gulivideo_orc t3

on t2.relatedId_id = t3.videoId)t4)t5

group by category_name

order by ct desc;

需求Ⅴ:统计每个类别中的视频热度 Top10,以 Music 为例

分析

第一步:将视频的类别炸裂开第二步:获取类别是Music的数据,并且按照观看数进行排序

分布代码:

select

videoId,

category_name

from gulivideo_orc

lateral view explode(category) tmp_category as category_name;t1

select

videoId,

views,

category_name

from t1

where category_name = "Music"

order by views desc

limit 10;

代码:

select

videoId,

views,

category_name

from

(select

videoId,

views,

category_name

from gulivideo_orc

lateral view explode(category) tmp_category as category_name)t1

where category_name = "Music"

order by views desc

limit 10;

需求Ⅵ:统计每个类别视频观看数 Top10

分析

由于要求组内进行排序,所以使用开窗函数over()第一步:使用explode()函数,将类别分开第二步:使用开窗函数,按照类别分区,并且使用观看数进行排序第三步:获取前十数据

分步代码:

select

videoId,

views,

category_name

from gulivideo_orc

lateral view explode(category) tmp_category as category_name;t1

select

videoId,

views,

category_name

rank() over(partition by t1.category_name order by t1.views desc) rk

from t1;t2

select

videoId,

views,

category_name,

rk

from t2

where t2.rk <= 10;

代码:

select

videoId,

views,

category_name

rk

from (

select

videoId,

views,

category_name,

rank() over(partition by t1.category_name order by t1.views desc) rk

from

(select

videoId,

views,

category_name

from gulivideo_orc

lateral view explode(category) tmp_category as category_name)t1)t2

where t2.rk <= 10;

需求Ⅶ:统计上传视频最多的用户 Top10 以及他们上传的视频观看次数在前 20 的视频

分析

第一步:统计上传视频最多的用户 Top10第二步:用户表和视频表做内连接,获取Top10用户上传的所有视频Id第三步:按照用户进行分区,并且获取每一个用户排名前20的视频(按照视频观看书排序)

分布代码:

select

uploader

from

gulivideo_user_orc

order by videos desc

limit 10;t1

select

videoId,

views,

t2.uploader

from t1

join gulivideo_orc g

on t1.uploader = g.uploader;t2

select

videoId,

views,

uploader,

rank()over(partition by uploader order by views desc) rk

from t2;t3

select

videoId,

views,

uploader

from t3

where rk<= 20;

代码:

select

videoId,

views,

uploader,

rank()over(partition by uploader order by views desc) rk

from

(select

videoId,

views,

t1.uploader

from

(select

uploader

from

gulivideo_user_orc

order by videos desc

limit 10)t1

join gulivideo_orc g

on t1.uploader = g.uploader) t2;

由于数据不全,所有没有查询到数据

精彩内容

评论可见,请评论后查看内容,谢谢!!!
 您阅读本篇文章共花了: