Hadoop Data Warehouse: Hive Summary

Hive Summary

I. Creating a Database

CREATE DATABASE IF NOT EXISTS TEST;

The default HDFS location of the database is /user/hive/warehouse/test.db

CREATE DATABASE IF NOT EXISTS TEST LOCATION '/test1.db';

The statement above creates the database at a specified HDFS location.

List databases

show databases;

show databases like 't*'; -- list databases whose names start with t

View database information

desc database test;

desc database extended test; -- show detailed information

Switch databases

use test;

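Alter a database

(The database name cannot be changed; ALTER DATABASE is mainly used to add properties. Newer Hive releases can also change the owner and the default location used for new tables, which does not move existing data.)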

alter database test set dbproperties('author'='tianxuanzhizi');

Drop a database

drop database if exists test;

drop database if exists test cascade; -- cascading drop (needed when the database is not empty)

II. Creating Tables

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name

[(col_name data_type [COMMENT col_comment], ...)]

[COMMENT table_comment]

[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]

[CLUSTERED BY (col_name, col_name, ...)

[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]

[ROW FORMAT row_format]

[STORED AS file_format]

[LOCATION hdfs_path]

[TBLPROPERTIES (property_name=property_value, ...)]

[AS select_statement]

create table if not exists tianxuanzhizi(

id int, name string

)

clustered by(id)

into 3 buckets

row format delimited fields terminated by '\t'

stored as textfile

location '/user/hive/warehouse/tianxuanzhizi';

Create a table from the structure of an existing table

create table if not exists test1 like tianxuanzhizi;

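Keyword | Description
EXTERNAL | Creates an external table. Dropping a managed table also deletes its data files in HDFS; dropping an external table removes only the metadata, and the data files remain in HDFS.
PARTITIONED BY | Creates a partitioned table.
CLUSTERED BY | Creates a bucketed table.
SORTED BY | Sorts rows within each bucket by the given columns.
ROW FORMAT DELIMITED FIELDS TERMINATED BY | Sets the column delimiter (rows are separated by newlines).
STORED AS | Storage format: SEQUENCEFILE (binary sequence file), TEXTFILE (plain text), RCFILE (columnar storage file), ORC.
LOCATION | Storage location of the table in HDFS.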
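
As a rough illustration of how the keywords above combine, the following sketch defines an external, partitioned table and then drops it. The table name access_log, its columns, and the path /data/access_log are hypothetical and do not come from the original text.

```sql
-- Hypothetical example: table name, columns and path are assumptions.
create external table if not exists access_log (
  id   int    comment 'record id',
  name string comment 'user name'
)
comment 'example external table'
partitioned by (year string)   -- each year value maps to its own subdirectory
row format delimited fields terminated by '\t'
stored as textfile
location '/data/access_log';

-- Dropping an external table removes only the metadata;
-- the files under /data/access_log stay in HDFS.
drop table if exists access_log;
```

Had the table been created without EXTERNAL, the same drop statement would also delete the data files.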

Check whether a table is external

desc formatted tianxuanzhizi;

Change the table type

alter table tianxuanzhizi set tblproperties('EXTERNAL'='TRUE');

-- TRUE makes it an external table, FALSE a managed table

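III. Importing and Exporting Data

1. Importing data

load data inpath 'HDFS path' into table tianxuanzhizi; -- load from HDFS

load data local inpath 'local path' into table tianxuanzhizi; -- load from the local file system

load data inpath 'HDFS path' overwrite into table tianxuanzhizi; -- overwrite the existing data

load data inpath 'HDFS path' overwrite into table tianxuanzhizi partition(year='2023');

-- load into a partition of a partitioned table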

Insert data into a partitioned table

insert into table test partition(year='2023')

values (1,'zhangsan'),(2,'lisi');

Insert into several partitions in one statement

from tianxuanzhizi

insert overwrite table test partition(year='2023')

select id,name where year='2023'

insert overwrite table test partition(year='2022')

select id,name where year='2022';

Import previously exported data (similar to Oracle's imp)

import table test partition(year='2023') from 'HDFS path of the export';

2. Exporting data

Export query results to HDFS

insert overwrite directory 'HDFS path'

row format delimited fields terminated by '\t'

select * from test;

Export query results to the local file system

insert overwrite local directory 'local path'

row format delimited fields terminated by '\t'

select * from test;

Export with the hive command

hive -e 'select * from test;' > local_file

Export with EXPORT

export table test to 'HDFS path';
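
EXPORT and IMPORT are usually used as a pair. The sketch below shows one possible round trip; the directory /tmp/hive_export/test and the target table name test_copy are hypothetical, and the export directory is assumed to be empty or not yet present.

```sql
-- Hypothetical path and target table name, chosen for illustration only.
-- Writes the table's data and metadata to a directory in HDFS.
export table test to '/tmp/hive_export/test';

-- Recreates the table under a new name from the exported files.
import table test_copy from '/tmp/hive_export/test';
```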

IV. Querying Data

1. Special functions

(1) Looking up how a function is used

-- show the usage of the upper function

desc function extended upper;

(2) Regular expression matching: RLIKE

-- find people whose name contains 赵

select * from test where name rlike '[赵]';
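
RLIKE takes a Java regular expression, so anchors and character classes can narrow the match. A few variations on the query above, as a sketch against the same test table:

```sql
-- name starts with 赵
select * from test where name rlike '^赵';

-- name ends with 赵
select * from test where name rlike '赵$';

-- name consists of digits only
select * from test where name rlike '^[0-9]+$';
```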

Sampling queries: tablesample

select * from test tablesample(bucket 1 out of 4 on id);

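Reads data starting from the first bucket of test, with a step of 4.

e.g. if test has 8 buckets, data from 2 buckets (8/4) is read: buckets 1 and 5.

If test has 2 buckets, half of one bucket (2/4) is read.

(In tablesample(bucket a out of b on XX), a must be less than or equal to b.)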
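
The same arithmetic applies to other values of a and b. The sketch below assumes that test was created with clustered by (id) into 8 buckets, which the original does not state.

```sql
-- Assumption: test is clustered by (id) into 8 buckets.

-- 8 / 4 = 2 buckets are read, starting at bucket 2 with a step of 4:
-- buckets 2 and 6.
select * from test tablesample(bucket 2 out of 4 on id);

-- 8 / 8 = 1 bucket is read: bucket 3 only.
select * from test tablesample(bucket 3 out of 8 on id);
```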

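2. Sorting

(1) Sorting inside each reducer: sort by

Rows are sorted within each reducer during the MapReduce job, so the output is ordered per reducer rather than globally.

-- set the number of reducers to 2

set mapreduce.job.reduces=2;

-- show the current number of reducers

set mapreduce.job.reduces;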

select * from test sort by id desc;

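(2) Partitioned sorting: distribute by

Rows are distributed by a column and sorted within each partition. (The partition is chosen by taking the hash of the distribute-by column modulo the number of reducers; rows with the same remainder go to the same reducer.)

select * from test distribute by deptno sort by id desc;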

(3) cluster by

Distributes rows by a column and sorts them by that same column.

select * from test cluster by id;

(Only ascending order is supported; the sort direction cannot be specified.)
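
Put differently, cluster by is shorthand for distribute by plus sort by on the same column. A short sketch against the same test table:

```sql
-- These two queries distribute and sort rows the same way.
select * from test cluster by id;
select * from test distribute by id sort by id;

-- cluster by has no desc option; for descending order, spell it out.
select * from test distribute by id sort by id desc;
```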

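3. Collection operations

(1) Collecting values into a set

Suppose table t contains the following data:

type | style | name
A | 1 | 狗妹
A | 1 | 狗子
A | 2 | 狗狗
A | 2 | 狗剩
B | 1 | 三狗
B | 2 | 狗娃

-- collect_set gathers the queried values into a deduplicated array

-- concat_ws joins the elements of the array with the given separator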

select a,concat_ws(',',collect_set(name)) b

from ( select concat(type,'-',style) as a -- 格式 A-1

,name

from t

) t1

group by a;

Result:

a | b
A-1 | 狗妹,狗子
A-2 | 狗狗,狗剩
B-1 | 三狗
B-2 | 狗娃

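(2) Expanding with a UDTF

Expand the result of (1) back into rows (assume that result is stored in a table t).

-- split breaks the string on ',' into an array

-- explode turns each array element into its own row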

select a,tb.name

from t

lateral view

explode(split(b,',')) tb as name;

Result:

a | tb.name
A-1 | 狗妹
A-1 | 狗子
A-2 | 狗狗
A-2 | 狗剩
B-1 | 三狗
B-2 | 狗娃
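
To try split and explode in isolation, the following sketch runs them on a literal string. The alias item is an arbitrary choice, and the query assumes a Hive version that accepts a select without a from clause.

```sql
-- split produces the array ['a', 'b', 'c'];
-- explode emits one row per element: a, b, c.
select explode(split('a,b,c', ',')) as item;
```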

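4. Partition operations

(1) Repairing partitions

Scans the directories that follow the partitioned table's layout and brings the metadata in line with the data that is actually present.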

msck repair table test;

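(2) Partition settings

Setting | Description
hive.exec.dynamic.partition=true | Enables dynamic partitioning, so that inserted rows are routed to their partitions automatically, much as in Oracle.
hive.exec.dynamic.partition.mode=nonstrict | Non-strict mode: with multi-level partitioning, every partition column may be dynamic.
hive.exec.max.dynamic.partitions=300 | Sets the maximum total number of dynamic partitions to 300.
hive.exec.max.created.files=100 | Maximum number of HDFS files a MapReduce job may create.
hive.exec.max.dynamic.partitions.pernode=50 | Sets the maximum number of dynamic partitions per MapReduce node to 50 (default 100).
hive.error.on.empty.partition=false | Whether to raise an exception when a dynamic partition insert produces an empty result (default false).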
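
With the first two settings in place, a dynamic partition insert can look like the sketch below. The source table src and its columns are hypothetical; the target is assumed to be the test table partitioned by year, as in the earlier insert examples.

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Assumption: a source table src(id int, name string, year string)
-- and a target table test partitioned by (year string).
-- The partition column comes last in the select list; each distinct
-- year value is written to its own partition.
insert overwrite table test partition(year)
select id, name, year from src;
```

In strict mode, at least one partition column would have to be given a static value, for example partition(year='2023').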

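5. Aggregation

(1) GROUPING SETS

Within a single GROUP BY query, aggregates by several combinations of dimensions; the result is equivalent to a UNION ALL of the GROUP BY results for each combination. GROUPING__ID indicates which grouping set a result row belongs to. (The GROUPING__ID values in the comments of this section follow older Hive releases; starting with Hive 2.3.0 the value is computed differently, so check the numbers against your version.)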

select province,city,count(1)

from person_info

group by province,city

grouping sets (province,city,(province,city));

-- GROUPING__ID = 1: grouped by province

-- GROUPING__ID = 2: grouped by city

-- GROUPING__ID = 3: grouped by province,city
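
The UNION ALL equivalence can be spelled out as follows. This is a sketch that assumes province and city are string columns in person_info; the casts keep the column types aligned across the branches.

```sql
-- One query with grouping sets ...
select province, city, count(1)
from person_info
group by province, city
grouping sets (province, city, (province, city));

-- ... returns the same rows as three separate aggregations combined.
select province, cast(null as string) as city, count(1)
from person_info group by province
union all
select cast(null as string) as province, city, count(1)
from person_info group by city
union all
select province, city, count(1)
from person_info group by province, city;
```

The grouping sets form reads the table once, whereas the union form aggregates it three times.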

(2) CUBE

The UNION ALL of every combination of the dimensions

select province,city,count(1)

from person_info

group by province,city with cube;

-- GROUPING__ID = 0: aggregated over all rows

-- GROUPING__ID = 1: grouped by province

-- GROUPING__ID = 2: grouped by city

-- GROUPING__ID = 3: grouped by province,city

(3) ROLLUP

Aggregates hierarchically, with the leftmost dimension as the root

-- there is no aggregation by city alone

select province,city,count(1)

from person_info

group by province,city with rollup;

-- GROUPING__ID = 0: aggregated over all rows

-- GROUPING__ID = 1: grouped by province

-- GROUPING__ID = 3: grouped by province,city
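
Both with cube and with rollup can be read as shorthand for an explicit list of grouping sets. The sketch below writes out those lists for the two dimensions used above; the empty set () stands for the aggregation over all rows.

```sql
-- Same grouping sets as: group by province, city with cube
select province, city, count(1)
from person_info
group by province, city
grouping sets ((province, city), (province), (city), ());

-- Same grouping sets as: group by province, city with rollup
-- (no set for city alone)
select province, city, count(1)
from person_info
group by province, city
grouping sets ((province, city), (province), ());
```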

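V. Parameters

Parameter | Description
set hive.fetch.task.conversion=none; | Every query runs as a MapReduce job.
set hive.fetch.task.conversion=more; | Simple queries (plain select, filter, limit) are answered without running a MapReduce job.
set hive.auto.convert.join = true; | Enables automatic conversion to map join.
set hive.mapjoin.smalltable.filesize=10000000; | Size threshold (in bytes) between large and small tables: a table above it is treated as large, otherwise as small.
set hive.map.aggr = true; | Whether to aggregate on the map side.
set hive.groupby.skewindata = true; | Balances the load when the data is skewed during MapReduce.
set mapreduce.input.fileinputformat.split.maxsize=100; | Sets the maximum split size to 100 bytes. With a block size of 128 MB, setting this value below 128 MB increases the number of map tasks.
set hive.input.format= org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; | Combines small files.
set hive.input.format= org.apache.hadoop.hive.ql.io.HiveInputFormat; | Does not combine small files.
set hive.merge.size.per.task = 1000000; | Sets the size of merged files.
set hive.exec.parallel=true; | Enables parallel execution of tasks.
set hive.exec.parallel.thread.number=16; | Maximum degree of parallelism allowed for a single SQL statement.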
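
These parameters are set per session before the query they should affect. As a small sketch using the test table from earlier sections, the fetch-task setting can be toggled around the same query to compare behaviour:

```sql
-- With "more", a simple select with a limit is expected to be served
-- directly, without launching a MapReduce job.
set hive.fetch.task.conversion=more;
select * from test limit 10;

-- With "none", the same query is expected to run as a MapReduce job.
set hive.fetch.task.conversion=none;
select * from test limit 10;
```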
