目录

一、业务需求二、初始化测试数据三、实现(一)最终语句(hive)(二)思路

一、业务需求

获取每个用户连续登录的最长天数

二、初始化测试数据

drop table if EXISTS login_log;

create table login_log ( times timestamp, user_id string);

truncate table login_log;

insert into login_log VALUES

(from_unixtime(unix_timestamp('2022-03-16 12:00:00','yyyy-MM-dd HH:mm:ss')), '100001'),

(from_unixtime(unix_timestamp('2022-03-16 12:01:00','yyyy-MM-dd HH:mm:ss')), '100001'),

(from_unixtime(unix_timestamp('2022-03-13 12:01:00','yyyy-MM-dd HH:mm:ss')), '100011',

(from_unixtime(unix_timestamp('2022-03-13 12:01:00','yyyy-MM-dd HH:mm:ss')), '100011'),

(from_unixtime(unix_timestamp('2022-03-13 02:01:00','yyyy-MM-dd HH:mm:ss')), '100009'),

(from_unixtime(unix_timestamp('2022-03-12 12:01:00','yyyy-MM-dd HH:mm:ss')), '100004'),

(from_unixtime(unix_timestamp('2022-03-12 02:01:00','yyyy-MM-dd HH:mm:ss')), '100009'),

(from_unixtime(unix_timestamp('2022-03-11 02:01:00','yyyy-MM-dd HH:mm:ss')), '100009'),

(from_unixtime(unix_timestamp('2022-03-11 11:01:00','yyyy-MM-dd HH:mm:ss')), '100012'),

(from_unixtime(unix_timestamp('2022-03-10 11:18:00','yyyy-MM-dd HH:mm:ss')), '100012'),

(from_unixtime(unix_timestamp('2022-03-09 11:01:00','yyyy-MM-dd HH:mm:ss')), '100009'),

(from_unixtime(unix_timestamp('2022-03-08 11:18:00','yyyy-MM-dd HH:mm:ss')), '100009');

三、实现

(一)最终语句(hive)

with tmp_user_serial as (

select b.user_id,

case when b.last_exists = 0 and b.next_exists = 0 then 1

when b.next_exists = 1 then datediff(lead(b.date_date,1) over(partition by b.user_id order by b.date_date), b.date_date)+1

else 0 end serial_day

from (select a.user_id, a.date_date,

case when lag(a.date_date,1) over(partition by a.user_id order by a.date_date) = last_date then 1 else 0 end last_exists,

case when lead(a.date_date,1) over(partition by a.user_id order by a.date_date) = next_date then 1 else 0 end next_exists

from (select user_id,to_date(times) date_date,date_sub(times,1) last_date, date_add(times,1) next_date

from login_log

group by user_id,to_date(times), date_sub(times,1),date_add(times,1)) a

) b

where last_exists = 0 or next_exists = 0)

select user_id, max(serial_day) max_seriral_day

from tmp_user_serial

group by user_id

(二)思路

实现考虑如下场景: 1、某用户就只登录过一次; 2、某用户登录过多次,但都不连续; 3、某用户连续登录过多次; 4、某用户一天内多登录;

实现思路: 1、对于一天多次登录的情况,用user_id ,登录日期 两个数据分组,得到某人某天的唯一数据; 并获取该日期的前一天及后一天数值,便于后续对比取用;

对应sql中的最内层

select user_id,to_date(times) date_date,date_sub(times,1) last_date, date_add(times,1) next_date

from login_log

group by user_id,to_date(times), date_sub(times,1),date_add(times,1)

2、通过对数据的lead lag处理,判断按照用户、日期分组排序后的数据前一条日期是否是当前登录日期的前一天 last_exists ,后一条是否是当前登录日期的后一天 next_exists

对应sql 中的第二层

select a.user_id, a.date_date,

case when lag(a.date_date,1) over(partition by a.user_id order by a.date_date) = last_date then 1 else 0 end last_exists,

case when lead(a.date_date,1) over(partition by a.user_id order by a.date_date) = next_date then 1 else 0 end next_exists

from (select user_id,to_date(times) date_date,date_sub(times,1) last_date, date_add(times,1) next_date

from login_log

group by user_id,to_date(times), date_sub(times,1),date_add(times,1)) a

3、根据登录日期的前后一天数据是否存在判断是否连续 (1)若都存在,则说明该数据是连续日期中的中间数据,last_exists = 1 and next_exists =1 (2)若仅前一天数据存在,则说明该数据是连续日期中的最后一天,last_exists = 1 and next_exists =0 (3)若仅后一天数据存在,则说明该数据是连续日期中的第一天, last_exists = 0 and next_exists =1 (4)若前后一天都不存在,则说明该数据前后不连续,连续天数为1, last_exists = 0 and next_exists =0

该处计算思路: 1、对于连续的日期数据(两条及以上),排除掉连续日期的中间数据后,按照时间顺序排列,前一天和最后一天相连,lead一下可以直接获取到其日期差,从而得到连续登录天数; 2、对于不连续的日期数据,直接给值1 即可;

对应sql中的第三层

select b.user_id,

case when b.last_exists = 0 and b.next_exists = 0 then 1

when b.next_exists = 1 then datediff(lead(b.date_date,1) over(partition by b.user_id order by b.date_date), b.date_date)+1

else 0 end serial_day

from (select a.user_id, a.date_date,

case when lag(a.date_date,1) over(partition by a.user_id order by a.date_date) = last_date then 1 else 0 end last_exists,

case when lead(a.date_date,1) over(partition by a.user_id order by a.date_date) = next_date then 1 else 0 end next_exists

from (select user_id,to_date(times) date_date,date_sub(times,1) last_date, date_add(times,1) next_date

from login_log

group by user_id,to_date(times), date_sub(times,1),date_add(times,1)) a

) b

where last_exists = 0 or next_exists = 0

4、根据用户获取其最大连续登录天数,以用户分组,获取连续天数的最大值即可

对应sql

select user_id, max(serial_day) max_seriral_day

from tmp_user_serial

group by user_id

参考阅读

评论可见,请评论后查看内容,谢谢!!!
 您阅读本篇文章共花了: