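Data Mining: Product Order Data Analysis and Demand Forecasting

The 11th "Teddy Cup" Data Mining Challenge, Problem B: Product Order Data Analysis and Demand Forecasting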


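Background

In recent years the external environment of enterprises has become increasingly uncertain, and this volatility creates many difficulties for supply chains. Demand forecasting is the supply chain's first line of defense, so its importance is self-evident. Because demand is affected by many factors, however, forecast accuracy is generally low, and better algorithms are needed. A demand forecast is a conclusion drawn from historical data and expectations about the future. It helps an enterprise draw up material procurement plans, control inventory, improve production efficiency, and manage production schedules, and it also helps the enterprise gauge latent market demand and analyze its current operating state and future trends. It gives management a reference for decisions on future sales and operations plans, targets, and budgets, and it supports procurement and production planning, reducing exposure to business fluctuations. Without a demand forecast, or with an inaccurate one, many internal decisions on sales, procurement, and budgeting can only rely on experience, which leads to poor anticipation of the market, to overstocked or insufficient inventory and capital, and to higher inventory costs.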

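Mining Goals

To give management a reference for decisions on future sales and operations plans, targets, and budgets, and to support procurement and production planning, this article builds a demand forecasting model from historical data and an LSTM recurrent neural network and uses it to forecast future demand on a reasoned basis.

Problem 1: an in-depth analysis of several feature factors in the training data of Attachment 1 (order_train1.csv).

The effect of different product prices on demand; the effect of the sales region on demand, and the characteristics of demand in different regions; the characteristics of demand under different sales channels (online and offline); the differences and similarities in demand across product categories; the characteristics of demand in different periods of the month (for example the beginning, middle, and end of the month); the effect of holidays on demand; the effect of promotions (such as 618 and Double 11) on demand; and the effect of seasonal factors on demand.

Problem 2: forecast the monthly demand for the next three months (January, February, and March 2019) for the products in Attachment 2 (predict_sku1.csv).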

Problem 1

Exploratory Data Analysis (EDA)

Import the data, preprocess it (missing values, duplicates, outliers), then consolidate it

# Import libraries

# -*- coding: utf-8 -*-

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = 'SimHei'

plt.rcParams['axes.unicode_minus'] = False

# Read the data

data=pd.read_csv('../data/order_train1.csv',encoding = 'gbk')

data.head(5)

   order_date  sales_region_code  item_code  first_cate_code  second_cate_code  sales_chan_name  item_price  ord_qty
0  2015-09-01                104      22069              307               403          offline      1114.0       19
1  2015-09-01                104      20028              301               405          offline      1012.0       12
2  2015-09-02                104      21183              307               403           online       428.0      109
3  2015-09-02                104      20448              308               404           online       962.0        3
4  2015-09-02                104      21565              307               403          offline      1400.0        3

print(data.info())

RangeIndex: 597694 entries, 0 to 597693

Data columns (total 8 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 order_date 597694 non-null object

1 sales_region_code 597694 non-null int64

2 item_code 597694 non-null int64

3 first_cate_code 597694 non-null int64

4 second_cate_code 597694 non-null int64

5 sales_chan_name 597694 non-null object

6 item_price 597694 non-null float64

7 ord_qty 597694 non-null int64

dtypes: float64(1), int64(5), object(2)

memory usage: 36.5+ MB

None

# Count missing values

print('每个特征缺失的数目：\n',data.isnull().sum())

每个特征缺失的数目：

order_date 0

sales_region_code 0

item_code 0

first_cate_code 0

second_cate_code 0

sales_chan_name 0

item_price 0

ord_qty 0

dtype: int64

# Duplicate rows

print('全部有重复：\n', data[data.duplicated()])

全部有重复：

.......

[312 rows x 8 columns]

print('前7列有重复：\n',data.iloc[:,:7][data.iloc[:,:7].duplicated()])

前7列有重复：

......

[11894 rows x 7 columns]

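There are 312 fully duplicated rows. We then checked the first seven columns for duplicates and examined the 11894 rows that repeat an earlier row on those columns. These rows share the same order date, product, sales channel, sales region, and price, with either the same or a different order quantity. We consider this consistent with normal market activity rather than duplicate data entry, so the duplicated rows are kept and not removed.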
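As a small illustration of the two duplicate checks described above, the sketch below uses a toy table with hypothetical values (not rows from order_train1.csv). It counts rows that repeat in all eight columns, counts rows that repeat only in the first seven, and then merges repeated keys by summing the quantity instead of dropping them.

```python
import pandas as pd

# Toy orders (hypothetical values) with the same eight columns as order_train1.csv.
cols = ['order_date', 'sales_region_code', 'item_code', 'first_cate_code',
        'second_cate_code', 'sales_chan_name', 'item_price', 'ord_qty']
rows = [
    ['2015-09-01', 104, 22069, 307, 403, 'offline', 1114.0, 19],
    ['2015-09-01', 104, 22069, 307, 403, 'offline', 1114.0, 19],  # full duplicate
    ['2015-09-01', 104, 22069, 307, 403, 'offline', 1114.0, 5],   # same key, new quantity
    ['2015-09-02', 104, 21183, 307, 403, 'online', 428.0, 109],
]
toy = pd.DataFrame(rows, columns=cols)

# Rows identical to an earlier row in all 8 columns.
full_dups = int(toy.duplicated().sum())
# Rows that repeat an earlier row in the first 7 columns (quantity may differ).
key_dups = int(toy.duplicated(subset=cols[:7]).sum())

# Keep every row and merge repeated keys by summing the quantity.
merged = toy.groupby(cols[:7], as_index=False)['ord_qty'].sum()
print(full_dups, key_dups)
print(merged)
```

Summing rather than de-duplicating keeps the total order quantity unchanged, which matches the decision to treat repeated rows as separate genuine orders.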

# Descriptive statistics

print(data.describe())

data.describe(include=['object'])

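Order quantities for the same product, sold on the same day to the same region through the same channel at the same price, are summed to form a new shipment data set, which the analysis below uses.

# Sum ord_qty over rows that share date, region, product, categories, channel, and price

data1=data.groupby(by=['order_date','sales_region_code','item_code','first_cate_code','second_cate_code',

'sales_chan_name','item_price'],as_index=False).agg({'ord_qty':'sum'})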

print(data1.head(5))

order_date sales_region_code item_code first_cate_code \

0 2015-09-01 104 20028 301

1 2015-09-01 104 22069 307

2 2015-09-02 101 20657 303

3 2015-09-02 102 20323 305

4 2015-09-02 102 20457 305

second_cate_code sales_chan_name item_price ord_qty

0 405 offline 1012.0 12

1 403 offline 1114.0 19

2 410 offline 2996.0 18

3 412 offline 99.0 502

4 412 offline 164.0 308

# Outliers

# Keep only rows where item_price and ord_qty are non-negative

data1 = data1[(data1['item_price'] >= 0) & (data1['ord_qty'] >= 0)]

# Scatter plots of item_price and ord_qty

fig, axs = plt.subplots(2, 1, figsize=(10, 8))

axs[0].scatter(range(len(data1)), data1['item_price'],s=1)

axs[0].set_title('item_price scatter plot')

axs[0].set_xlabel('Index')

axs[0].set_ylabel('item_price')

axs[1].scatter(range(len(data1)), data1['ord_qty'],s=1)

axs[1].set_title('ord_qty scatter plot')

axs[1].set_xlabel('Index')

axs[1].set_ylabel('ord_qty')

plt.savefig('../tmp/item_price和ord_qty散点图.png')

plt.show()

# Save the processed data to a new file; the analysis below uses it

pd.DataFrame(data1).to_csv('../data/order_train.csv',index=False)

# Read the consolidated data

df = pd.read_csv('../data/order_train.csv', encoding='gbk')

Effect of product price on demand

# Group by item_price and compute the mean demand at each price

one_data=df.groupby('item_price',as_index=False).agg({'ord_qty':np.mean})

print('item_price和ord_qty描述性统计:\n',one_data.describe())

item_price和ord_qty描述性统计:

item_price ord_qty

count 14365.000000 14365.000000

mean 2205.158618 40.786167

std 5326.057554 118.997335

min 1.000000 1.000000

25% 709.000000 6.750000

50% 1305.000000 11.000000

75% 2587.000000 39.571429

max 260014.000000 9874.000000

print(np.round(one_data.corr(method='spearman'),2))

# Scatter plot of mean demand against price

one_data.plot(x='item_price',y='ord_qty',kind='scatter',s=15,figsize=(11,8))

plt.ylabel('ord_qty')

from matplotlib.pyplot import MultipleLocator

plt.gca().xaxis.set_major_locator(MultipleLocator(20000))

plt.gca().yaxis.set_major_locator(MultipleLocator(1000))

plt.savefig('../tmp/1价格对应的需求平均值1.png')

plt.show()

# Compute sales revenue (price times mean demand)

one_data['sales_volume'] = one_data['item_price'] * one_data['ord_qty']

print(one_data)

one_data.to_excel('../data/sales_volume.xlsx')

plt.figure(figsize=(11,8))

plt.scatter(one_data['item_price'],one_data['sales_volume'],s=15)

plt.gca().xaxis.set_major_locator(MultipleLocator(20000))

plt.xlabel('价格')

plt.ylabel('销售额')

plt.savefig('../tmp/1销售额2.png')

plt.show()

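The analysis shows clear gaps in mean demand along the price axis, so we split the prices into five ranges, [1, 15963.38], [21009, 38334], [46006, 60007], [93532, 98016], and {260006, 260014}, and label them low, lower-mid, mid, upper-mid, and high.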
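The binning described above can be sketched with `pd.cut`, which can also attach the tier names directly. The bin edges are copied from the `pd.cut` call in this section; the English tier names and the sample prices are illustrative assumptions.

```python
import pandas as pd

# Bin edges copied from the pd.cut call in this section; the tier names are illustrative.
edges = [0, 20000, 46005, 60007, 98017, 260014]
tiers = ['low', 'lower-mid', 'mid', 'upper-mid', 'high']

# Hypothetical prices, one per tier.
prices = pd.Series([99.0, 21009.0, 46006.0, 93532.0, 260014.0])

# pd.cut uses right-closed intervals (a, b] by default, so 260014 falls in the last bin.
tier = pd.cut(prices, bins=edges, labels=tiers)
print(tier.tolist())
```

Because the intervals are right-closed, the top edge must be at least the maximum price, otherwise the most expensive products would be left unlabeled.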

# Bin prices into the five tiers [1,15963.38],[21009,38334],[46006,60007],[93532,98016],{260006,260014}

data1['price_range'] = pd.cut(data1['item_price'], bins=[0,20000,46005,60007,98017,260014])

data1['sales_volume'] = data1['item_price'] * data1['ord_qty']

item_price=['低','较低','中','较高','高']

print(data1.head(5))

# Mean demand in each price range

average_ord_qty = data1.groupby('price_range', as_index=False)['ord_qty'].mean()['ord_qty'].tolist()

print('总需求平均量:',average_ord_qty)

plt.figure(figsize=(10, 9))

plt.bar(item_price, average_ord_qty, width=0.5, color='b')

#plt.title('不同价格等级的总需求平均量')

#plt.savefig('../tmp/1不同价格等级的总需求平均量3.png')

plt.ylabel('总需求平均量')

plt.xlabel('价格等级')

plt.show()

总需求平均量: [93.52652903031938, 6.282608695652174, 6.485714285714286, 6.105263157894737, 12.0]

# Mean sales revenue

average_sales = data1.groupby('price_range', as_index=False)['sales_volume'].mean()['sales_volume'].tolist()

print('销售平均额:',average_sales)

plt.figure(figsize=(11,8))

plt.bar(item_price, average_sales, width=0.5, color='b')

#plt.title('不同价格等级的平均销售额')

plt.xlabel('价格等级')

plt.ylabel('平均销售额')

#plt.savefig('../tmp/1平均销售额4.png')

plt.show()

销售平均额: [71835.17936636858, 218169.91304347827, 312551.28571428574, 577424.9526315789, 3120108.0]

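Analysis: Mean demand for low-tier products is far greater than for any other tier, while mean sales revenue for high-tier products is far greater than for the other tiers. Treating the high tier as premium products, the premium market is small, attracts less consumer attention, and has low demand, but it yields high revenue and should not be ignored. Low-tier products have high mean demand and hold a large market share, which may reflect a low-margin, high-volume strategy. The enterprise could hold its position in the low-end market while moderately expanding in the high-end market and raising order volume there.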

Effect of sales region on demand, and the characteristics of demand in different regions

# Group by sales_region_code and compute total demand per region

two_data=df.groupby('sales_region_code',as_index=False)['ord_qty'].sum()

print(two_data)

…

# Pie chart

two_data.plot.pie(x='sales_region_code',y='ord_qty',labels=['101','102','103','104','105'],explode=(0.03,0.02,0.01,0,0.05),

pctdistance=0.7,autopct='%.2f%%',wedgeprops=dict(width=0.6,edgecolor="w"),shadow=True,figsize=(8,8))

plt.show()

# Mean demand per product within each region

region1=data.loc[data['sales_region_code']==101]

region2=data.loc[data['sales_region_code']==102]

region3=data.loc[data['sales_region_code']==103]

region4=data.loc[data['sales_region_code']==104]

region5=data.loc[data['sales_region_code']==105]

b1=region1[['item_code','ord_qty']].groupby(by = 'item_code',as_index=False).mean()

b2=region2[['item_code','ord_qty']].groupby(by = 'item_code',as_index=False).mean()

b3=region3[['item_code','ord_qty']].groupby(by = 'item_code',as_index=False).mean()

b4=region4[['item_code','ord_qty']].groupby(by = 'item_code',as_index=False).mean()

b5=region5[['item_code','ord_qty']].groupby(by = 'item_code',as_index=False).mean()

print('101',b1.describe())

print('102',b2.describe())

print('103',b3.describe())

print('104',b4.describe())

print('105',b5.describe())

101 item_code ord_qty

count 1294.000000 1294.000000

mean 21030.212519 63.507840

std 606.170322 95.017748

min 20001.000000 1.000000

25% 20509.500000 14.927885

50% 21016.500000 35.793026

75% 21548.750000 75.402151

max 22084.000000 1473.969697

102 item_code ord_qty

......

print('101平均需求量排名前五的产品为：',b1.sort_values('ord_qty',ascending=False).head(5))

print('102平均需求量排名前五的产品为：',b2.sort_values('ord_qty',ascending=False).head(5))

print('103平均需求量排名前五的产品为：',b3.sort_values('ord_qty',ascending=False).head(5))

print('104平均需求量排名前五的产品为：',b4.sort_values('ord_qty',ascending=False).head(5))

print('105平均需求量排名前五的产品为：',b5.sort_values('ord_qty',ascending=False).head(5))

101平均需求量排名前五的产品为： item_code ord_qty

1283 22066 1473.969697

717 21120 970.062500

375 20588 961.000000

917 21469 806.000000

1086 21758 794.000000

......

fig, axs = plt.subplots(3, 2, figsize=(15, 10))

axs[0][0].scatter(b1['item_code'], b1['ord_qty'], color='orangered',s=15)

axs[0][0].set_title('101')

axs[0][1].scatter(b2['item_code'], b2['ord_qty'], color='blueviolet',s=15,marker='*')

axs[0][1].set_title('102')

axs[1][0].scatter(b3['item_code'],b3['ord_qty'], color='green',s=15,marker='+')

axs[1][0].set_title('103')

axs[1][1].scatter(b4['item_code'], b4['ord_qty'], color='blue',s=15)

axs[1][1].set_title('104')

axs[2][0].scatter(b5['item_code'], b5['ord_qty'], color='red',s=15)

axs[2][0].set_title('105')

axs[2][1].remove()

plt.show()

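Analysis: Region 105 has the most products, 1354, and region 104 has the fewest, only 257. Among them, product 21367 has the highest demand, while most other products have comparatively low demand. In region 103, demand differs little between products and no single product stands out, whereas regions 101, 102, 104, and 105 each have products with notably high demand.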

Characteristics of demand under different sales channels (online and offline)

# Group by sales_chan_name and compute total demand online and offline

data1=df.groupby(by = 'sales_chan_name',as_index=False).agg({'ord_qty':np.sum})

data1.plot.barh(x='sales_chan_name',y='ord_qty',color=['#1E90FF','cyan'], figsize=(11,8)).legend_.remove()

plt.xlabel('ord_qty')

for y,x in enumerate(data1['ord_qty']):

    plt.text(x+0.1,y,"%s"%round(x,1),va='center')

#plt.title('不同销售方式和对应的总需求量条形图')

#plt.savefig('../tmp/3不同销售方式和对应的总需求量条形图1.png')

plt.show()

# groupby sorts the channels as offline, online, so take the labels from the data

data1.plot.pie(x='sales_chan_name',y='ord_qty',labels=data1['sales_chan_name'],explode=(0.05,0.02),

autopct='%.2f%%',wedgeprops=dict(width=0.6,edgecolor="w"),shadow=True,figsize=(8,8))

#plt.title('不同销售方式和对应的总需求量饼图')

#plt.savefig('../tmp/3不同销售方式和对应的总需求量饼图2.png')

plt.show()

# Top five products by total demand, online and offline

top=df.groupby(by = ['sales_chan_name','item_code'],as_index=False).agg({'ord_qty':np.sum})

online_top5 =top[top['sales_chan_name'] == 'online'].sort_values(by='ord_qty', ascending=False)

print('线上总需求量排名前五的产品为：\n',online_top5.head(5))

offline_top5 =top[top['sales_chan_name'] == 'offline'].sort_values(by='ord_qty', ascending=False)

print('线下总需求量排名前五的产品为：\n',offline_top5.head(5))

线上总需求量排名前五的产品为：

sales_chan_name item_code ord_qty

2532 online 21619 895494

2588 online 21715 663160

2264 online 21061 603813

2127 online 20820 408975

2728 online 21986 362735

线下总需求量排名前五的产品为：

sales_chan_name item_code ord_qty

1034 offline 21271 2310551

797 offline 20973 1617680

1309 offline 21619 1033017

817 offline 20996 960703

1126 offline 21394 684014

Effect of price and region on demand, split by online and offline

# Regions by channel

quyu=df.groupby(by =['sales_chan_name','sales_region_code']).agg({'ord_qty':np.sum}).unstack()

print(quyu)

quyu.plot(y='ord_qty',kind='bar',figsize=(12, 10))

#plt.savefig('tmp/3线上线下区域和总需求量条形图3.png')

plt.show()

ord_qty

sales_region_code 101 102 103 104 105

sales_chan_name

offline 11542949 13634154 10173394 131335 1492361

online 860019 335994 1348792 2256318 13003641

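Analysis: For the offline channel, region 102 has the highest order demand, followed by regions 101 and 103, which are close to each other and somewhat lower than region 102. Regions 105 and 104 are far below the other three, with region 104 the lowest. For the online channel, order demand is concentrated in region 105, far above the other four regions, and region 102 has the lowest demand.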

# Prices by channel

price=df.groupby(by =['item_price','sales_chan_name']).agg({'ord_qty':np.mean}).unstack()

fu= -1 * price['ord_qty']['offline']

ax = price['ord_qty']['online'].plot(kind='line',marker='*',figsize=(11,8))

fu.plot(kind='line',figsize=(11,8), color='blueviolet',ax=ax)

plt.xlabel('item_price')

plt.legend(['online','offline'])

plt.ylabel('ord_qty')

#plt.title('线上线下价格和平均需求量折线图')

#plt.savefig('tmp/3线上线下价格和平均需求量折线图4.png')

plt.show()

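Analysis: Overall, the online channel covers a narrower price range than the offline channel, and at the same price, online demand is higher than offline demand. (In the chart the offline curve is plotted with its sign flipped, so the two channels mirror each other.)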

Differences and similarities in demand across product categories

# Group by category and subcategory and compute total demand

data1=df.groupby(by=['first_cate_code','second_cate_code']).agg({'ord_qty':np.sum})

data3=df.groupby(by=['first_cate_code','second_cate_code']).agg({'ord_qty':np.sum}).unstack()

data3.plot(kind='barh',y='ord_qty',stacked=True,figsize=(11,8))

#plt.title('不同品类与对应的总需求量')

plt.show()

Sales channel, region, and price by category and subcategory

# Sales channel

name=df.groupby(by=['first_cate_code','second_cate_code','sales_chan_name'],as_index=False).agg({'ord_qty':np.sum})

print(name)

first_cate_code second_cate_code sales_chan_name ord_qty

0 301 405 offline 445434

1 301 405 online 1141300

2 302 408 offline 3988155

........

name1=df.groupby(by=['first_cate_code','second_cate_code','sales_chan_name']).agg({'ord_qty':np.sum}).unstack()

name1.plot(y='ord_qty',kind='bar',figsize=(12, 10))

#plt.title('不同品类下销售方式与和对应的总需求量')

plt.show()

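**Analysis:** The chart and table show that, under both sales channels, total demand for category 306, subcategory 407 is far higher than for any other category, while the three products of category 303 have comparatively low total demand, none exceeding 10000. Category 306, subcategory 407 therefore has a large market and high consumer attention. Overall, sales across the categories lean toward the offline channel.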

# Region

region=df.groupby(by=['first_cate_code','second_cate_code','sales_region_code'],as_index=False).agg({'ord_qty':np.sum})

print(region)

first_cate_code second_cate_code sales_region_code ord_qty

0 301 405 101 51961

1 301 405 102 223807

2 301 405 103 166150

3 301 405 104 489238

......

region.plot.scatter(x='second_cate_code',y='ord_qty', c='sales_region_code', cmap="viridis", s=20,figsize=(11,8))

#plt.title('不同品类下区域与和对应的总需求量')

plt.show()

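Subcategories 402, 403, 404, 405, 406, and 409 all reach their highest demand in region 105; subcategory 407 has the highest total demand of any subcategory in every region; and subcategories 406, 410, and 411 have low demand in every region.

# Price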

price=data.groupby(by=['item_price','second_cate_code'],as_index=True).agg({'ord_qty':np.mean}).unstack()

print(price.head(5))

price.plot(kind='line',y='ord_qty', figsize=(20,10))

#plt.title('不同品类下价格的总平均需求量')

plt.show()

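**Analysis:** The orange line fluctuates sharply over a narrow price range and reaches the highest peak, meaning that subcategory 402 has higher demand than other subcategories at similar prices. The light-blue line, subcategory 410, is split into layers and spans a wide price range yet still shows small order quantities, which suggests high price elasticity.

Characteristics of demand in different periods of the month (for example the beginning, middle, and end of the month)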

data=df.copy()

data['order_date']= pd.to_datetime(data['order_date'])  # convert to datetime

# Use the date column as the index

data.set_index('order_date', inplace=True)

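The first 7 days of each month are defined as the beginning of the month, days 12 to 18 as the middle of the month, and the last 7 days as the end of the month.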
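The definition above can be written as a small labeling function. This is a sketch using only the standard library; the function name `month_period` and the label strings are illustrative, not part of the original analysis. It also makes explicit that some days belong to none of the three periods.

```python
import calendar
from datetime import date

def month_period(d: date) -> str:
    """Label a date with the month-period definition used in this section."""
    days_in_month = calendar.monthrange(d.year, d.month)[1]
    if d.day <= 7:
        return 'beginning'
    if 12 <= d.day <= 18:
        return 'middle'
    if d.day >= days_in_month - 6:
        return 'end'
    # Days 8 to 11, and day 19 up to the start of the last-7-days window.
    return 'other'

for d in [date(2018, 1, 7), date(2018, 1, 12), date(2018, 1, 25), date(2018, 2, 21)]:
    print(d, month_period(d))
```

Since the windows do not cover the whole month, totals for the three periods will not add up to the monthly total.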

# First 7 days of the month: beginning

first= data[data.index.day <= 7]

# Days 12 to 18: middle

middle= data[(data.index.day >= 12) & (data.index.day <= 18)]

# Last 7 days of the month: end

last= data[data.index.day >= (data.index.days_in_month - 6)]

# Total demand per calendar month for each period (the mean_* variables hold sums)

mean_first= first.groupby(first.index.month)['ord_qty'].sum()

mean_middle = middle.groupby(middle.index.month)['ord_qty'].sum()

mean_last = last.groupby(last.index.month)['ord_qty'].sum()

# Plot total demand

plt.figure(figsize=(10,8))

plt.plot(mean_first.index, mean_first.values, label='月头')

plt.plot(mean_middle.index, mean_middle.values, label='月中')

plt.plot(mean_last.index, mean_last.values, label='月末')

plt.legend()

plt.xlabel('月份')

plt.ylabel('总需求量')

#plt.title('每个月月头、月中、月末总需求量')

#plt.savefig('../tmp/5每个月月头、月中、月末总需求量1.png')

plt.show()

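**Analysis:** Over these three years, the beginning-, middle-, and end-of-month totals do not differ much between adjacent months and fluctuate within a limited range. Peaks in all three periods generally fall in March and in October to November, and troughs generally fall in January to February and in July. Total demand in the three periods therefore follows a regular pattern: consumer demand for the products changes with the time of year.

Price, region, sales channel, and category by period of the month

# Price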

price_first=first.groupby('item_price').agg({'ord_qty':np.mean})

price_middle=middle.groupby('item_price').agg({'ord_qty':np.mean})

price_last=last.groupby('item_price').agg({'ord_qty':np.mean})

# Scatter plots of mean demand against price

fig, axs = plt.subplots(3, 1, figsize=(10, 15))

axs[0].scatter(price_first.index, price_first.values, s=5,label='月头')

axs[0].set_title('月头')

axs[1].scatter(price_middle.index, price_middle.values, s=5,label='月中')

axs[1].set_title('月中')

axs[2].scatter(price_last.index,price_last.values,s=5, label='月末')

axs[2].set_title('月末')

plt.show()

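**Analysis:** In all three periods, demand is concentrated in the range [0, 500] and prices are concentrated in the low tier, so low-end products are in high demand every month. In addition, the beginning and the end of the month both show demand in the high price tier, so the high-end market should not be overlooked.

# Region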

quyu_first=first.groupby('sales_region_code',as_index=False).agg({'ord_qty':np.sum})

print('月头:\n',quyu_first)

quyu_middle=middle.groupby('sales_region_code',as_index=False).agg({'ord_qty':np.sum})

print('月中:\n',quyu_middle)

quyu_last=last.groupby('sales_region_code',as_index=False).agg({'ord_qty':np.sum})

print('月末:\n',quyu_last)

plt.figure(figsize=(10,8))

plt.bar(quyu_first['sales_region_code'], quyu_first['ord_qty'], label='月头')

plt.bar(quyu_middle['sales_region_code'], quyu_middle['ord_qty'], label='月中', bottom=quyu_first['ord_qty'])

plt.bar(quyu_last['sales_region_code'], quyu_last['ord_qty'], label='月末', bottom=quyu_first['ord_qty']+quyu_middle['ord_qty'])

plt.legend()

plt.xlabel('区域')

plt.ylabel('总需求量')

plt.show()

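Region 105 has the highest total demand in all three periods, regions 101, 102, and 103 are relatively close to one another, and region 104 is far below the other regions in all three periods.

# Sales channel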

xiao_first=first.groupby('sales_chan_name').agg({'ord_qty':np.sum})

print('月头销售:\n',xiao_first)

xiao_middle=middle.groupby('sales_chan_name').agg({'ord_qty':np.sum})

print('月中销售:\n',xiao_middle)

xiao_last=last.groupby('sales_chan_name').agg({'ord_qty':np.sum})

print('月末销售:\n',xiao_last)

fig, axs = plt.subplots(1, 3, figsize=(15, 5))

axs[0].pie(xiao_first['ord_qty'], labels=xiao_first.index, autopct='%1.1f%%', wedgeprops=dict(width=0.5),shadow=True)

axs[0].set_title('月头')

axs[1].pie(xiao_middle['ord_qty'], labels=xiao_middle.index, autopct='%1.1f%%', wedgeprops=dict(width=0.5),shadow=True)

axs[1].set_title('月中')

axs[2].pie(xiao_last['ord_qty'], labels=xiao_last.index, autopct='%1.1f%%', wedgeprops=dict(width=0.5),shadow=True)

axs[2].set_title('月末')

plt.show()

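Comparing the offline and online shares of total demand in the three periods (beginning, middle, and end of the month) shows that offline demand is far greater than online demand in every period.

# Category and subcategory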

lei_first=first.groupby(by=['first_cate_code','second_cate_code']).agg({'ord_qty':np.sum})

print('月头品类:\n',lei_first)

lei_middle=middle.groupby(by=['first_cate_code','second_cate_code']).agg({'ord_qty':np.sum})

print('月中品类:\n',lei_middle)

lei_last=last.groupby(by=['first_cate_code','second_cate_code']).agg({'ord_qty':np.sum})

print('月末品类:\n',lei_last)

fig, axs = plt.subplots(3, 1, figsize=(10, 15), sharex=True)

lei_first.plot(kind='barh', ax=axs[0])

axs[0].set_title('月头')

lei_middle.plot(kind='barh', color='cyan', ax=axs[1])

axs[1].set_title('月中')

lei_last.plot(kind='barh', color='#1E90FF', ax=axs[2])

axs[2].set_title('月末')

plt.xlabel('总需求量')

plt.show()

…

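**Analysis:** Total demand for category 306, subcategory 407 is far greater than for any other category in all three periods, and the chart shows that each category's order demand differs little across the three periods.

Effect of holidays on demand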

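We identified all public holidays from 2 September 2015 to 20 December 2018, based on the State Council's holiday arrangements, and compiled them into a holiday CSV file (read in the code below as 法定节假日.csv).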
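One way to use such a holiday list is to flag each day and compare average demand on holidays and on other days. The sketch below uses made-up daily totals and a made-up two-day holiday list; the column name `is_holiday` is an assumption for illustration.

```python
import pandas as pd

# Hypothetical daily totals and a hypothetical holiday list (not the author's files).
daily = pd.DataFrame({
    'order_date': pd.to_datetime(['2018-09-29', '2018-09-30', '2018-10-01', '2018-10-02']),
    'ord_qty': [100, 120, 40, 60],
})
holidays = pd.to_datetime(['2018-10-01', '2018-10-02'])

# Flag holiday rows, then compare mean daily demand with and without the flag.
daily['is_holiday'] = daily['order_date'].isin(holidays)
mean_by_flag = daily.groupby('is_holiday')['ord_qty'].mean()
print(mean_by_flag)
```

A comparison of means complements the scatter overlay, because it gives a single number for how holiday demand differs from ordinary days.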


# Full data

data=df.copy()

data['order_date']= pd.to_datetime(data['order_date'])

self_data=data.groupby(by='order_date',as_index=False).agg({'ord_qty':np.sum})

# Extract the rows that fall on holidays

data1=pd.read_csv('../data/法定节假日.csv',encoding = 'gbk')

data1['date']= pd.to_datetime(data1['date'])

jieri=data.loc[data['order_date'].isin(data1['date'])]

jieri1=jieri.groupby(by='order_date',as_index=False).agg({'ord_qty':np.sum})

plt.figure(figsize=(11,8))

plt.plot(self_data['order_date'],self_data['ord_qty'].values)

plt.scatter(jieri1['order_date'],jieri1['ord_qty'].values,color='r',linewidths=0.001)

plt.xlabel('日期')

plt.ylabel('总需求量')

plt.legend(['蓝线-原数据','红点-节假日'])

#plt.title('节假日影响')

plt.show()

Price, region, sales channel, category, and period of the month on holidays

…

Effect of promotions (such as 618 and Double 11) on demand

# Full data

data=df.copy()

data['order_date']= pd.to_datetime(data['order_date'])

self_data=data.groupby(by='order_date',as_index=False).agg({'ord_qty':np.sum})

# Promotion days

cuxiao=pd.read_csv('../data/促销日.csv',encoding = 'gbk')

cuxiao['order_date']= pd.to_datetime(cuxiao['order_date'])

cuxiao1=cuxiao.groupby(by='order_date',as_index=False).agg({'ord_qty':np.sum})

# Compare the full data with the promotion-day data

plt.figure(figsize=(11,8))

plt.scatter(self_data['order_date'],self_data['ord_qty'].values)

plt.scatter(cuxiao1['order_date'],cuxiao1['ord_qty'].values,color='r',linewidths=0.001)

plt.xlabel('日期')

plt.ylabel('总需求量')

plt.legend(['蓝线-原数据','红点-促销日'])

#plt.title('节假日影响')

plt.show()

…

Price, region, sales channel, category, and period of the month on promotion days

# Category

# Subcategory totals by month

cuxiao.index=cuxiao['order_date']

pinlei=cuxiao.groupby(by=[cuxiao.index.month,'second_cate_code'],as_index=True).agg({'ord_qty':np.sum}).unstack()

pinlei1=cuxiao.groupby(by=[cuxiao.index.month,'second_cate_code'],as_index=True).agg({'ord_qty':np.sum})

print(pinlei1)

ord_qty

order_date second_cate_code

1 401 65980

402 44430

403 106740

404 88467

405 22335

406 503

407 394343

408 89263

409 26952

410 1109

411 189

412 117964

3 401 42322

402 42633

403 77491

404 71224

.....

# Group by subcategory

pinlei2=cuxiao.groupby('second_cate_code',as_index=True).agg({'ord_qty':np.sum})

print('品类:\n',pinlei2)

品类:

ord_qty

second_cate_code

401 478775

402 445061

403 787329

.......

pinlei.plot(y='ord_qty',kind='bar',figsize=(12, 10))

plt.show()

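**Analysis:** On promotion days from January to October, demand for subcategory 407 is far greater than for other subcategories, and the earlier analysis showed that subcategory 407 has the highest demand of all products overall. Subcategories with high demand in the full data set therefore also have high demand on promotion days.

# Region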

quyu=cuxiao.groupby(by='sales_region_code').agg({'ord_qty':np.sum})

print(quyu)

quyu.plot.bar(figsize=(12, 10))

plt.show()

# Sales channel

xiao=cuxiao.groupby(by='sales_chan_name').agg({'ord_qty':np.sum})

print(xiao)

xiao.plot(kind='pie',y='ord_qty',figsize=(8, 8), autopct='%1.1f%%',shadow=True)

plt.show()

# Periods of the month

copy_cuxiao=cuxiao.copy()

copy_cuxiao.set_index('order_date', inplace=True)

# First 7 days of the month: beginning

first= copy_cuxiao[copy_cuxiao.index.day <= 7]

# Days 12 to 18: middle

middle= copy_cuxiao[(copy_cuxiao.index.day >= 12) & (copy_cuxiao.index.day <= 18)]

# Last 7 days of the month: end

last= copy_cuxiao[copy_cuxiao.index.day >= (copy_cuxiao.index.days_in_month - 6)]

sum_first= first.groupby(first.index.month)['ord_qty'].sum()

sum_middle = middle.groupby(middle.index.month)['ord_qty'].sum()

sum_last = last.groupby(last.index.month)['ord_qty'].sum()

print('月头的促销日总需求\n',sum(first['ord_qty']))

print('月中的促销日总需求\n',sum(middle['ord_qty']))

print('月末的促销日总需求\n',sum(last['ord_qty']))

月头的促销日总需求

3556003

月中的促销日总需求

1148008

月末的促销日总需求

193211

plt.figure(figsize=(8,8))

plt.scatter(sum_first.index,sum_first.values,marker='*',s=50, label='月头')

plt.scatter(sum_middle.index, sum_middle.values,s=50, label='月中')

plt.scatter(sum_last.index, sum_last.values, label='月末')

plt.legend()

plt.xlabel('月份')

plt.ylabel('总需求量')

#plt.title('每个月月头、月中、月末总需求量')

plt.show()

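**Analysis:** The beginning-of-month periods of November and December correspond to the Double 11 and Double 12 promotions, and their demand is far greater than that of the other promotion days in any period of the month.

# Price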

jiage=cuxiao.groupby('item_price',as_index=False).agg({'ord_qty':np.sum}).sort_values(by='ord_qty', ascending=False)

print(jiage.head(5))

code=cuxiao.groupby('item_code',as_index=False).agg({'ord_qty':np.sum}).sort_values(by='ord_qty', ascending=False)

print(code.head(5))

fig,axs=plt.subplots(1,2,figsize=(10,5))

jiage.head(5).plot(kind='barh', x='item_price', y='ord_qty',ax=axs[0])

axs[0].set_title('促销日总需求前五的价格')

axs[0].invert_yaxis()

code.head(5).plot(kind='barh', x='item_code', y='ord_qty',ax=axs[1])

axs[1].set_title('促销日总需求前五的产品')

axs[1].invert_yaxis()

plt.show()

# Holidays

data1=pd.read_csv('../data/法定节假日.csv',encoding = 'gbk')

data1['date']= pd.to_datetime(data1['date'])

jieri=cuxiao .loc[cuxiao['order_date'].isin(data1['date'])]

Effect of seasonal factors on demand

data=df.copy()

data['order_date']= pd.to_datetime(data['order_date'])

data.index=data['order_date']

import seaborn as sns

# Total demand in each of the four quarters

season=data.groupby(data.index.quarter).agg({'ord_qty':np.sum})

a=pd.DataFrame({"season":['第一季度','第二季度','第三季度','第四季度'],"ord_qty":season['ord_qty'].T})

print(a)

sns.barplot(data=a, x="season", y="ord_qty")

#plt.savefig('tmp/8不同季度与对应总需求条形图1.png')

plt.show()

# Total demand per quarter, split by year

season1=data.groupby(by=[ data.index.quarter,data.index.year]).agg({'ord_qty':np.sum}).unstack()

print(season1)

season1.plot(figsize=(11,8))

plt.ylabel('ord_qty')

plt.xlabel('order_date')

plt.tick_params(labelsize=9)

plt.show()

ord_qty

order_date 2015 2016 2017 2018

order_date

1 NaN 3342190.0 4353418.0 5189048.0

2 NaN 3134103.0 4554570.0 3559035.0

3 832725.0 2791345.0 4568376.0 3861418.0

4 3315098.0 4404605.0 6006351.0 4866675.0

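Analysis: The trends in 2015 to 2017 are similar, with the 2015 data starting in the third quarter. In 2016 and 2017 the first three quarters are relatively flat (the third quarter is the low point in 2016) and the fourth quarter is the peak, that is, the high season. In 2018 the peak is in the first quarter and the trough in the second quarter, and demand rises slowly from the third quarter. The overall trend is upward.

Price, region, sales channel, category, period of the month, holidays, and promotions by quarter

# Region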

quyu=data.groupby(by=[data.index.quarter,'sales_region_code']).agg({'ord_qty':np.sum}).unstack()

quyu1=data.groupby(by=[data.index.quarter,'sales_region_code']).agg({'ord_qty':np.sum})

print(quyu1)

quyu.plot(y='ord_qty',kind='bar',figsize=(12, 10))

#plt.savefig('tmp/8季度区域总需求条形图3.png')

plt.show()

ord_qty

order_date sales_region_code

1 101 3086061

102 3493961

103 2854377

104 358229

105 3092028

2 101 2689544

102 2900771

......

# Sales channel

xiao=data.groupby(by=[data.index.quarter,'sales_chan_name']).agg({'ord_qty':np.sum}).unstack()

print(xiao)

xiao.plot(kind='barh',y='ord_qty',figsize=(8, 8),stacked=True)

#plt.savefig('tmp/8季度-销售总需求条形图4.png')

plt.show()

ord_qty

sales_chan_name offline online

order_date

1 9561373 3323283

2 7670729 3576979

3 7669057 4384807

4 12073034 6519695

# Category

pinlei=data.groupby(by=['first_cate_code','second_cate_code',data.index.quarter]).agg({'ord_qty':np.sum}).unstack()

pinlei1=data.groupby(by=[data.index.quarter,'first_cate_code','second_cate_code']).agg({'ord_qty':np.sum})

print(pinlei1)

pinlei.plot(y='ord_qty',kind='bar',figsize=(12, 10))

#plt.savefig('tmp/8品类总需求量5.png')

plt.show()

ord_qty

order_date first_cate_code second_cate_code

1 301 405 371630

302 408 1420866

303 401 827342

406 8188

410 19499

411 5249

304 409 146351

305 412 1467238

306 402 672730

407 5025566

307 403 1484381

308 404 1435616

2 301 405 324169

302 408 1319117

303 401 789257

......

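**Analysis:** In every quarter, category 306, subcategory 407 remains far larger than other categories, and the three products in category 303 have the lowest demand. Except for those three products, which show no clear change, every category grows to some degree in the fourth quarter, while in the other quarters demand differs little. The categories therefore broadly follow the quarterly trend.

# Price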

price = data.groupby(by=['item_price', data.index.quarter]).agg(

{'ord_qty': np.sum}).unstack()

print(price)

print('四个季度价格计数多少个：\n', price.count())

ord_qty

order_date 1 2 3 4

item_price

1.00 NaN 86.0 NaN 2.0

1.01 3.0 28.0 5.0 72.0

2.00 NaN 125.0 44.0 13.0

2.01 6.0 50.0 16.0 64.0

3.00 NaN 114.0 10.0 51.0

... ... ... ... ...

[14365 rows x 4 columns]

四个季度价格计数多少个：

order_date

ord_qty 1 5386

2 6451

3 7539

4 8532

dtype: int64

Problem 2

Data preprocessing and preparation

Further data preparation and processing

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = 'SimHei'

plt.rcParams['axes.unicode_minus'] = False

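Some product codes in the prediction set do not appear in the original data set and cannot be predicted directly, so only the product codes that do appear are modeled. For the remaining products, the prediction is later replaced by the mean of the predicted results grouped by 'sales_region_code', 'first_cate_code', and 'second_cate_code'.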
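The fallback described above can be sketched as a group-mean fill. The table and the `forecast` column below are hypothetical; the sketch only shows the mechanics of replacing a missing prediction with the mean prediction of products in the same region, category, and subcategory.

```python
import pandas as pd

keys = ['sales_region_code', 'first_cate_code', 'second_cate_code']

# Hypothetical predictions: item 20003 has no history, so its forecast is missing.
pred = pd.DataFrame({
    'sales_region_code': [101, 101, 101, 102],
    'item_code': [20001, 20002, 20003, 20004],
    'first_cate_code': [303, 303, 303, 305],
    'second_cate_code': [406, 406, 406, 412],
    'forecast': [10.0, 30.0, None, 7.0],
})

# Mean forecast of the products that share region and category (missing values are skipped).
group_mean = pred.groupby(keys)['forecast'].transform('mean')
pred['forecast'] = pred['forecast'].fillna(group_mean)
print(pred)
```

If a whole group has no predicted product, the group mean is itself missing, so such rows would still need a coarser fallback.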

df1 = pd.read_csv('../data/order_train1.csv', encoding='gbk')

df2 = pd.read_csv('../data/predict_sku1.csv', encoding='gbk')

# Keep the rows of df1 whose 'sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code' also appear in df2

# Some rows to be predicted have no match in the training set, so first extract the part that does

data = df1[df1[['sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code']].apply(tuple, axis=1).isin(

df2[['sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code']].apply(tuple, axis=1))]

# Data preparation

# print(data.dtypes)

# print(data['ord_qty'].describe())

# Remove 'ord_qty' outliers with the 1.5 * IQR rule

def remove_outliers(df, col_name):

q1 = df[col_name].quantile(0.25)

q3 = df[col_name].quantile(0.75)

iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr

upper_bound = q3 + 1.5 * iqr

df = df[(df[col_name] >= lower_bound) & (df[col_name] <= upper_bound)]

return df

data_without_outliers = remove_outliers(data, 'ord_qty')

# Box plots before and after removing outliers

fig, axs = plt.subplots(1, 2, figsize=(10, 5))

axs[0].boxplot(data['ord_qty'])

axs[0].set_title('ord_qty')

axs[1].boxplot(data_without_outliers['ord_qty'])

axs[1].set_title('ord_qty without outliers')

#plt.savefig('tmp/第二问箱线图1.png')

plt.show()

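# Demand plot

# Parse order_date itself; converting the integer row index would not produce order dates

data_without_outliers = data_without_outliers.assign(order_date=pd.to_datetime(data_without_outliers['order_date']))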

plt.figure(figsize=(8,4))

plt.plot(data_without_outliers['order_date'],data_without_outliers['ord_qty'])

plt.show()

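Clearly, this much data and such a dense plot do not help the analysis that follows. In line with Problem 2, we resample by day, week, and month.

# Total demand resampled by day, week, and month

# Note: this copies `data`, so the rows dropped by remove_outliers are included again below

data_without_outliers = data.copy()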

data_without_outliers['order_date'] = pd.to_datetime(

data_without_outliers['order_date'])

data_without_outliers = data_without_outliers.set_index('order_date')

fig = plt.figure(figsize=(18,16))

fig.subplots_adjust(hspace=.2)

ax1 = fig.add_subplot(3,1,1)

ax1.plot(data_without_outliers['ord_qty'].resample('D').sum(),linewidth=1)

ax1.set_title('按天的总需求量')

ax1.tick_params(axis='both', which='major')

ax2 = fig.add_subplot(3,1,2, sharex=ax1)

ax2.plot(data_without_outliers['ord_qty'].resample('W').sum(),linewidth=1)

ax2.set_title('按周的总需求量')

ax2.tick_params(axis='both', which='major')

ax3 = fig.add_subplot(3,1,3, sharex=ax1)

ax3.plot(data_without_outliers['ord_qty'].resample('M').sum(),linewidth=1)

ax3.set_title('按月的总需求量')

ax3.tick_params(axis='both', which='major')

#plt.savefig('tmp/第二问按天周月3.png')

plt.show()

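The preprocessed data is grouped by 'sales_region_code', 'item_code', 'first_cate_code', and 'second_cate_code' and resampled by day, week, and month to obtain total demand. Rows whose total demand is 0 are then removed, and the results are saved as 'day.csv', 'week.csv', and 'month.csv', which serve as the data sets for forecasting at daily, weekly, and monthly granularity.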
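As an alternative sketch of the monthly aggregation described above, the grouping can be done on a month period instead of with `resample`. The orders below are hypothetical. Unlike the original approach, the months here are labeled as periods such as 2018-03 rather than month-end dates.

```python
import pandas as pd

keys = ['sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code']

# Hypothetical orders for one product in one region.
orders = pd.DataFrame({
    'sales_region_code': [101, 101, 101],
    'item_code': [20002, 20002, 20002],
    'first_cate_code': [303, 303, 303],
    'second_cate_code': [406, 406, 406],
    'order_date': pd.to_datetime(['2018-03-14', '2018-03-16', '2018-05-02']),
    'ord_qty': [2, 3, 9],
})

# Monthly totals per product. Grouping on a month period only yields months that
# actually have orders, so no zero-demand rows need to be filtered out afterwards.
orders['month'] = orders['order_date'].dt.to_period('M')
monthly = orders.groupby(keys + ['month'], as_index=False)['ord_qty'].sum()
print(monthly)
```

In this toy example April 2018 has no orders and simply does not appear, which has the same effect as resampling and then dropping the rows with zero demand.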

# Daily, weekly, and monthly granularity

# Group by 'sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code', resample total demand by day, week, and month, and drop rows equal to 0

d = data_without_outliers.groupby(['sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code'])[

'ord_qty'].resample('D').sum().reset_index()

d = d.loc[d['ord_qty'] != 0]

print(d)

w = data_without_outliers.groupby(['sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code'])[

'ord_qty'].resample('W').sum().reset_index()

w = w.loc[w['ord_qty'] != 0]

m = data_without_outliers.groupby(['sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code'])[

'ord_qty'].resample('M').sum().reset_index()

m = m.loc[m['ord_qty'] != 0]

pd.DataFrame(d).to_csv('../data/day.csv', index=False)

pd.DataFrame(w).to_csv('../data/week.csv', index=False)

pd.DataFrame(m).to_csv('../data/month.csv', index=False)

Building the gradient boosting tree model


from sklearn.ensemble import GradientBoostingRegressor

from sklearn.model_selection import train_test_split

from sklearn.metrics import mean_squared_error, r2_score

import joblib

Daily granularity

df=pd.read_csv('../data/day.csv')

   sales_region_code  item_code  first_cate_code  second_cate_code  order_date  ord_qty
0                101      20002              303               406  2017-08-04        4
1                101      20002              303               406  2018-03-14        2
2                101      20002              303               406  2018-03-16        3
3                101      20002              303               406  2018-03-25        3
4                101      20002              303               406  2018-03-31        9
...

257038 rows × 6 columns

Splitting into training and test sets

# Data preprocessing

# Separate the features from the target, then standardize the target

X = df.drop('ord_qty', axis=1)

y = df['ord_qty']

c=y.mean()

d=y.std()

# Save c and d (target mean and standard deviation)

np.save('../tmp/c.npy', c)

np.save('../tmp/d.npy', d)

y= (y-c)/d

Feature processing

# One-hot encode the categorical columns

X= pd.get_dummies(X, columns=['sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code'])

# Convert the date column into numeric features

X['order_date'] = pd.to_datetime(X['order_date'])

X['year'] = X['order_date'].dt.year

X['month'] = X['order_date'].dt.month

X['quarter']=X['order_date'].dt.quarter

X.drop('order_date', axis=1, inplace=True)

a=X.mean()

b=X.std()

# Save a and b (feature means and standard deviations)

np.save('../tmp/a.npy', a)

np.save('../tmp/b.npy', b)

X= (X-a)/b

Build the gradient boosting tree model

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build the gradient boosting tree model

model = GradientBoostingRegressor(n_estimators=150, learning_rate=0.08, max_depth=6, random_state=42)

model.fit(X_train, y_train)

# Save the model

joblib.dump(model, '../tmp/model.pkl')
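The split above is random, so some test rows come before some training rows in time, and the standardization statistics are computed on all rows before the split. As a hedged alternative for a forecasting task, not the author's method, the sketch below holds out the most recent rows and computes the target statistics from the training rows only. The monthly demand values are hypothetical.

```python
import pandas as pd

# Hypothetical monthly demand for one product.
df = pd.DataFrame({
    'order_date': pd.to_datetime(['2018-01-31', '2018-02-28', '2018-03-31',
                                  '2018-04-30', '2018-05-31']),
    'ord_qty': [10, 20, 30, 40, 50],
})

# Chronological split: the first 80% of the timeline trains, the rest tests.
df = df.sort_values('order_date').reset_index(drop=True)
cut = int(len(df) * 0.8)
train, test = df.iloc[:cut], df.iloc[cut:]

# Standardization statistics come from the training rows only.
c, d = train['ord_qty'].mean(), train['ord_qty'].std()
y_train = (train['ord_qty'] - c) / d
y_test = (test['ord_qty'] - c) / d
print(len(train), len(test), c)
```

With this layout the test error estimates how the model performs on months it has not seen, which is closer to the task of predicting January to March 2019.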

Model evaluation

# Model evaluation

y_pred_train = model.predict(X_train)

mse_train = mean_squared_error(y_train, y_pred_train)

rmse_train = np.sqrt(mse_train)

r_train = r2_score(y_train, y_pred_train)

y_pred_test = model.predict(X_test)

mse_test = mean_squared_error(y_test, y_pred_test)

rmse_test = np.sqrt(mse_test)

r_test = r2_score(y_test, y_pred_test)

print('训练集均方误差MSE：', mse_train)

print('训练集均方根误差RMSE：', rmse_train)

print('训练集决定系数R2：', r_train)

print('测试集均方误差MSE：', mse_test)

print('测试集均方根误差RMSE：', rmse_test)

print('测试集决定系数R2：', r_test)

训练集均方误差MSE： 0.734888685818914

训练集均方根误差RMSE： 0.8572564877671758

训练集决定系数R2： 0.2635732896957512

测试集均方误差MSE： 0.7696230124579693

测试集均方根误差RMSE： 0.8772816038524741

测试集决定系数R2： 0.23672780093650492
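The MSE and RMSE above are measured on the standardized target y = (ord_qty - c) / d, so they are in units of standard deviations rather than order quantities. Multiplying the RMSE by d converts it back to the original units, while R² is unchanged by this rescaling. The sketch below checks the relationship on made-up numbers; the helper `rmse` and all values are illustrative.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical demand values and predictions in original units.
y_true = [4.0, 2.0, 3.0, 9.0]
y_pred = [5.0, 2.0, 4.0, 7.0]

# Mean and sample standard deviation (ddof=1, the pandas default for Series.std).
c = sum(y_true) / len(y_true)
d = math.sqrt(sum((t - c) ** 2 for t in y_true) / (len(y_true) - 1))

# Standardize both series with the same c and d.
z_true = [(t - c) / d for t in y_true]
z_pred = [(p - c) / d for p in y_pred]

rmse_std = rmse(z_true, z_pred)    # error on the standardized scale
rmse_orig = rmse(y_true, y_pred)   # error in units of ord_qty
print(rmse_std * d, rmse_orig)
```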

# Predicted versus true values on the test set (converted back to original units)

y_test = y_test * d + c

y_pred = y_pred_test *d + c

plt.plot(y_test.values, label='true')

plt.plot(y_pred, label='pred')

plt.legend()

plt.show()

The model above was trained at daily granularity. The weekly and monthly versions follow the same steps.
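The article does not show how the monthly file was produced. One way to go from daily to monthly rows, assuming the monthly quantity is the sum of the daily quantities and each month is labelled with its last calendar day, is sketched below. The four daily rows are the March 2018 rows of item 20002 from the daily table shown earlier.

```python
import pandas as pd

keys = ["sales_region_code", "item_code", "first_cate_code", "second_cate_code"]

# March 2018 rows of item 20002 taken from the daily table
daily = pd.DataFrame({
    "sales_region_code": [101, 101, 101, 101],
    "item_code": [20002, 20002, 20002, 20002],
    "first_cate_code": [303, 303, 303, 303],
    "second_cate_code": [406, 406, 406, 406],
    "order_date": pd.to_datetime(
        ["2018-03-14", "2018-03-16", "2018-03-25", "2018-03-31"]
    ),
    "ord_qty": [2, 3, 3, 9],
})

# Roll every date forward to the end of its month (month-end dates stay put)
daily["order_date"] = daily["order_date"] + pd.offsets.MonthEnd(0)

# Sum the quantities per product combination and month
monthly = daily.groupby(keys + ["order_date"], as_index=False)["ord_qty"].sum()
print(monthly)
```

The weekly version would only differ in how the dates are bucketed before the `groupby`.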

Monthly granularity

df=pd.read_csv('../data/month.csv')

   sales_region_code  item_code  first_cate_code  second_cate_code  order_date  ord_qty
0                101      20002              303               406  2017-08-31        4
1                101      20002              303               406  2018-03-31       17
2                101      20002              303               406  2018-04-30      124
3                101      20002              303               406  2018-05-31      110
4                101      20002              303               406  2018-06-30       77
...

30824 rows × 6 columns

# Data preprocessing

# Separate the features (X) from the target (y) and standardize y

X = df.drop('ord_qty', axis=1)

y = df['ord_qty']

c1=y.mean()

d1=y.std()

# Save c1 and d1 (mean and standard deviation of the monthly target)

np.save('../tmp/c1.npy', c1)

np.save('../tmp/d1.npy', d1)

y= (y-c1)/d1

# One-hot encode the categorical columns

X= pd.get_dummies(X, columns=['sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code'])

# Convert the date column into numeric features (year, month, quarter)

X['order_date'] = pd.to_datetime(X['order_date'])

X['year'] = X['order_date'].dt.year

X['month'] = X['order_date'].dt.month

X['quarter']=X['order_date'].dt.quarter

X.drop('order_date', axis=1, inplace=True)

a1=X.mean()

b1=X.std()

# Save a1 and b1 (per-column mean and standard deviation of the monthly features)

np.save('../tmp/a1.npy', a1)

np.save('../tmp/b1.npy', b1)

X= (X-a1)/b1

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build the gradient boosting tree model

model = GradientBoostingRegressor(n_estimators=150, learning_rate=0.08, max_depth=6, random_state=42)

model.fit(X_train, y_train)

# Save the model

joblib.dump(model, '../tmp/model_month.pkl')

['../tmp/model_month.pkl']

# Evaluate the model on the training and test sets

y_pred_train = model.predict(X_train)

mse_train = mean_squared_error(y_train, y_pred_train)

rmse_train = np.sqrt(mse_train)

r_train = r2_score(y_train, y_pred_train)

y_pred_test = model.predict(X_test)

mse_test = mean_squared_error(y_test, y_pred_test)

rmse_test = np.sqrt(mse_test)

r_test = r2_score(y_test, y_pred_test)

print('Training MSE:', mse_train)

print('Training RMSE:', rmse_train)

print('Training R2:', r_train)

print('Test MSE:', mse_test)

print('Test RMSE:', rmse_test)

print('Test R2:', r_test)

Training MSE: 0.3238078735863478

Training RMSE: 0.5690411879524608

Training R2: 0.6913676251507997

Test MSE: 0.40153638374400513

Test RMSE: 0.6336689859414023

Test R2: 0.49983854868617916

# Plot predicted vs. true values on the test set, rescaled with the monthly statistics c1 and d1

y_test = y_test * d1 + c1

y_pred = y_pred_test * d1 + c1

plt.plot(y_test.values, label='true')

plt.plot(y_pred, label='pred')

plt.legend()

plt.show()

Prediction

# Read the prediction data

pred_df = pd.read_csv('../data/predict_sku1.csv',encoding='gbk')

df = pd.read_csv('../data/month.csv',encoding='gbk')

# pred1: rows of pred_df whose ('sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code') combination also appears in df; no: the remaining rows, which have no history in df

pred1= pred_df[pred_df[['sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code']].apply(tuple, axis=1).isin(df[['sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code']].apply(tuple, axis=1))]

no = pred_df.loc[~pred_df[['sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code']].apply(tuple, axis=1).isin(pred1[['sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code']].apply(tuple, axis=1))]
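The same split into products with and without history can also be written as a left merge with `indicator=True`, which avoids building row tuples with `apply`. This is an alternative sketch on toy frames, not the article's code; `history`, `known` and `unknown` are names introduced here, and the category codes for item 21867 are made up.

```python
import pandas as pd

keys = ["sales_region_code", "item_code", "first_cate_code", "second_cate_code"]

# Toy prediction list
pred_df = pd.DataFrame({
    "sales_region_code": [101, 101, 105],
    "item_code": [20002, 20011, 21867],
    "first_cate_code": [303, 303, 306],
    "second_cate_code": [406, 401, 407],
})

# Toy history with repeated product combinations
history = pd.DataFrame({
    "sales_region_code": [101, 101, 105],
    "item_code": [20002, 20002, 21867],
    "first_cate_code": [303, 303, 306],
    "second_cate_code": [406, 406, 407],
})

# One row per combination that appears in the history
seen = history[keys].drop_duplicates()

# `_merge` is "both" when the combination has history, "left_only" otherwise
flagged = pred_df.merge(seen, on=keys, how="left", indicator=True)
known = flagged[flagged["_merge"] == "both"].drop(columns="_merge")
unknown = flagged[flagged["_merge"] == "left_only"].drop(columns="_merge")

print(known["item_code"].tolist(), unknown["item_code"].tolist())
```

Note that `merge` returns a fresh index, whereas the boolean filtering used in the article keeps the original row labels of `pred_df`.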

Data preprocessing

# One-hot encode the categorical columns

pred=pred1.copy()

pred = pd.get_dummies(pred, columns=['sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code'])

# Load a, b, c and d (the monthly statistics saved earlier as a1, b1, c1 and d1)

a=np.load('../tmp/a1.npy')

b=np.load('../tmp/b1.npy')

c = np.load('../tmp/c1.npy')

d = np.load('../tmp/d1.npy')

# April

pred4=pred.copy()

pred4['year'] = 2019

pred4['month'] = 4

pred4['quarter'] = 2

pred4=(pred4-a)/b

# May

pred5=pred.copy()

pred5['year'] = 2019

pred5['month'] = 5

pred5['quarter'] = 2

pred5=(pred5-a)/b

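# June

pred6=pred.copy()

# June 2019 is in quarter 2 (the outputs shown later were generated with year=2018 and quarter=4 on this step)

pred6['year'] = 2019

pred6['month'] = 6

pred6['quarter'] = 2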

pred6=(pred6-a)/b
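The subtraction `(pred - a) / b` only lines up if the one-hot columns of the prediction frame match the training columns in number and order, because `a` and `b` are loaded as plain arrays. A more defensive variant, sketched below on toy data, keeps the training column list and reindexes the prediction frame against it, and derives the quarter from the month instead of typing it by hand. `add_date_features` and `train_columns` are hypothetical helpers introduced for this sketch.

```python
import pandas as pd

def add_date_features(frame, year, month):
    """Attach year, month and the quarter derived from the month."""
    out = frame.copy()
    out["year"] = year
    out["month"] = month
    out["quarter"] = (month - 1) // 3 + 1
    return out

# Training side: three products, so three dummy columns
train_raw = pd.DataFrame({"item_code": [20002, 20003, 20006]})
train_X = pd.get_dummies(train_raw, columns=["item_code"], dtype=int)
train_X = add_date_features(train_X, 2018, 12)
train_columns = list(train_X.columns)   # would be saved alongside the model

# Prediction side: only one product, so only one dummy column is created
pred_raw = pd.DataFrame({"item_code": [20003]})
pred_X = pd.get_dummies(pred_raw, columns=["item_code"], dtype=int)
pred_june = add_date_features(pred_X, 2019, 6)

# Add the missing dummy columns as 0 and restore the training column order
pred_june = pred_june.reindex(columns=train_columns, fill_value=0)

print(pred_june.iloc[0].to_dict())
```

After the `reindex`, the frame can be standardized and passed to `model.predict` without relying on both files containing exactly the same set of codes.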

# Load the model

model = joblib.load('../tmp/model_month.pkl')

# Predict demand for the next 3 months and rescale to the original units

y_pred4 = model.predict(pred4)

y_pred4 = y_pred4 *d + c

print('April:\n',y_pred4)

y_pred5 = model.predict(pred5)

y_pred5 = y_pred5 *d + c

y_pred6 = model.predict(pred6)

y_pred6 = y_pred6 *d + c

April:

[399.07802768 646.041552 748.03332473 ... 682.8277281 779.00830241 700.16161931]

# Collect the predictions for saving (the column names mean "forecast demand for April / May / June 2019")

result_df =pred1.copy()

result_df['2019年4月预测需求量'] = y_pred4

result_df['2019年5月预测需求量'] = y_pred5

result_df['2019年6月预测需求量'] = y_pred6

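For product codes that have no history in the training data, the forecast is replaced by the mean of the predictions within the same 'sales_region_code', 'first_cate_code' and 'second_cate_code' group.

# Fill products without history with the group mean of the predictions

buchong=result_df.groupby(['sales_region_code','first_cate_code', 'second_cate_code']).agg({'2019年4月预测需求量':'mean','2019年5月预测需求量':'mean','2019年6月预测需求量':'mean'}).reset_index()

# Inner merge on the three group keys; rows of `no` whose group has no predictions are dropped

bu= pd.merge(no,buchong, on=['sales_region_code','first_cate_code', 'second_cate_code'])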

   sales_region_code  item_code  first_cate_code  second_cate_code  2019年4月预测需求量  2019年5月预测需求量  2019年6月预测需求量
0                101      20011              303               401      616.096452      616.096452      483.859698
1                101      20198              303               401      616.096452      616.096452      483.859698
2                101      20254              303               401      616.096452      616.096452      483.859698
3                101      20324              303               401      616.096452      616.096452      483.859698
...

432 rows × 7 columns
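Because `pd.merge` defaults to an inner join, a product whose region and category group has no predictions at all would silently disappear from the result. A possible variant, sketched below on toy data, uses a left join and falls back to the overall mean for such rows. The fallback rule and the single `forecast` column are assumptions made for this illustration.

```python
import pandas as pd

group_keys = ["sales_region_code", "first_cate_code", "second_cate_code"]

# Toy predictions for products with history
result_df = pd.DataFrame({
    "sales_region_code": [101, 101, 102],
    "first_cate_code": [303, 303, 305],
    "second_cate_code": [401, 401, 412],
    "forecast": [600.0, 632.0, 100.0],
})

# Toy products without history; the second one has no matching group
no = pd.DataFrame({
    "sales_region_code": [101, 104],
    "item_code": [20011, 29999],
    "first_cate_code": [303, 308],
    "second_cate_code": [401, 404],
})

# Mean forecast per (region, first category, second category) group
group_mean = result_df.groupby(group_keys, as_index=False)["forecast"].mean()

# Left join keeps every product; unmatched groups get NaN first
filled = no.merge(group_mean, on=group_keys, how="left")

# Hypothetical fallback: overall mean when the group is unknown
filled["forecast"] = filled["forecast"].fillna(result_df["forecast"].mean())

print(filled)
```

Comparing `len(no)` with the length of the merged frame is a quick way to see whether the inner join dropped any products.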

# Combine both parts, round to whole units and write the result file

he=pd.concat([result_df,bu]).drop(['first_cate_code', 'second_cate_code'], axis=1)

he = he.round({'2019年4月预测需求量': 0, '2019年5月预测需求量': 0, '2019年6月预测需求量': 0})

print(he)

he.to_excel('../result2.xlsx', index=False)

sales_region_code item_code 2019年4月预测需求量 2019年5月预测需求量 2019年6月预测需求量

0 101 20002 399.0 399.0 261.0

1 101 20003 646.0 646.0 508.0

2 101 20006 748.0 748.0 615.0

4 101 20014 1209.0 1209.0 906.0

5 101 20016 540.0 540.0 401.0

.. ... ... ... ... ...

431 105 21867 890.0 974.0 694.0

[2619 rows x 5 columns]

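Summary: overall, the model still predicts poorly. Among the time granularities tried, the monthly one gives the best results (test R2 of about 0.50, compared with about 0.24 for the daily model). The model could be improved by extracting more suitable time-series features, by tuning the hyperparameters with grid search, or by switching to a more complex model such as a deep learning model.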
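As one example of the time-series features mentioned in the summary, the sketch below adds the previous observed monthly quantity and a rolling mean of up to three earlier months for each product. The quantities for item 20002 are taken from the monthly table above; the rows for item 20003 and the feature names `lag_1` and `roll_mean_3` are made up for illustration.

```python
import pandas as pd

monthly = pd.DataFrame({
    "item_code": [20002, 20002, 20002, 20002, 20003, 20003, 20003],
    "order_date": pd.to_datetime([
        "2018-03-31", "2018-04-30", "2018-05-31", "2018-06-30",
        "2018-04-30", "2018-05-31", "2018-06-30",
    ]),
    "ord_qty": [17, 124, 110, 77, 5, 8, 6],
})

# Lags only make sense in chronological order within each product
monthly = monthly.sort_values(["item_code", "order_date"]).reset_index(drop=True)
grouped = monthly.groupby("item_code")["ord_qty"]

# Quantity of the previous observed month for the same product
monthly["lag_1"] = grouped.shift(1)

# Mean of up to three earlier months; shift(1) keeps the current month out
monthly["roll_mean_3"] = grouped.transform(
    lambda s: s.shift(1).rolling(3, min_periods=1).mean()
)

print(monthly)
```

In the real data some products skip months, so `shift(1)` refers to the previous month with orders, which is not always the previous calendar month. The first row of each product has no history and stays NaN.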

Below are some simple Power BI dashboards made afterwards.
