爬虫+Flask+Echarts搭建《深度学习》书评显示大屏

1、前言2、实现2.1 挑选想要采集的书籍2.2 构建爬虫2.2.1 采集书籍信息2.2.2 采集书评

2.3 数据清洗2.3.1 清洗书籍信息2.3.2 清洗书评信息

2.4 统计分析，结果持久化存储2.5 搭建flask框架2.6 数据传值2.7 完整代码&数据集获取

3、号外

1、前言

本项目是基于requests爬虫+flask+echarts搭建的《深度学习》书评的可视化大屏，主要涉及到的技术有爬虫、数据库操作，flask框架，echarts图表。

最终效果如下：

2、实现

2.1 挑选想要采集的书籍

为了爬取书评，一想到的是豆瓣读书，搜索发现大多数书评都没有超过两位数，故从中挑选书评比较多是五本书，分别是：

深度学习、

动手学深度学习、

深度学习 : 智能时代的核心驱动力量、

Python深度学习、

深度学习的数学

这五本书，对应的豆瓣书籍id是：

book_dict = {'深度学习': 27087503, '动手学深度学习': 33450010, '深度学习 : 智能时代的核心驱动力量': 30425822,

'Python深度学习': 30293801, '深度学习的数学': 33414479}

2.2 构建爬虫

2.2.1 采集书籍信息

def get_books():

for i in book_dict:

book_id = book_dict['{}'.format(i)]

book_url = 'https://book.douban.com/subject/{}/'.format(book_id)

a = requests.get(url=book_url, headers=header)

data = etree.HTML(a.text)

name = ''.join(data.xpath('//div[@id="wrapper"]/h1/span/text()'))

other = ''.join(data.xpath('//div[@id="wrapper"]//div[@id="info"]//text()')).replace('\n', '').strip()

detail = ''.join(data.xpath('//div[@class="indent"][1]//div[@class="intro"]/p//text()'))

star = ''.join(data.xpath('//div[@class="rating_self clearfix"]/strong/text()'))

comments = ''.join(data.xpath('//div[@id="comments-section"]//h2//a//text()'))

cursor = conn.cursor()

# 保存数据

sql = """insert into book(id,name,other,detail,star,comments) values('%s','%s','%s','%s','%s','%s')""" % (

book_id, name, other, detail, star, comments)

print(book_id, name, other, detail, star, comments)

cursor.execute(sql)

conn.commit()

cursor.close()

# 随机生成一个介于1和5之间的整数，作为休眠时间

sleep_time = random.randint(1, 3)

# 暂停执行程序指定的时间

time.sleep(sleep_time)

get_comments(name, book_id)

2.2.2 采集书评

def get_comments(name, book_id):

for num in range(0, 81, 20):

url = 'https://book.douban.com/subject/{}/comments/?start={}&limit=20&status=P&sort=score'.format(book_id, num)

a = requests.get(url=url, headers=header)

response = etree.HTML(a.text)

data = response.xpath('//li[@class="comment-item"]')

for per_data in data:

user_name = ''.join(per_data.xpath('.//span[@class="comment-info"]/a[1]//text()'))

comment = ''.join(per_data.xpath('.//p[@class="comment-content"]/span//text()'))

star = ''.join(per_data.xpath('.//span[@class="comment-info"]/span[1]/@class'))

comments_time = ''.join(per_data.xpath('.//span[@class="comment-info"]/a[2]//text()'))

ip = ''.join(per_data.xpath('.//span[@class="comment-location"]//text()'))

follow = ''.join(per_data.xpath('.//span[@class="vote-count"]//text()'))

cursor = conn.cursor()

# 保存数据

sql = """insert into comments(book_name,user_name,comment,star,comments_time,ip,follow) values('%s','%s','%s','%s','%s','%s','%s')""" % (

name, user_name, comment, star, comments_time, ip, follow)

print(name, user_name, comment, star, comments_time, ip, follow)

cursor.execute(sql)

conn.commit()

cursor.close()

sleep_time = random.randint(1, 3)

time.sleep(sleep_time)

2.3 数据清洗

采集的数据如下图所示，数据格式很乱，故我们需要对数据进行清洗：

2.3.1 清洗书籍信息

这里用的正则表达式对数据进行清洗：

def clean_book_data():

cursor = conn.cursor()

sql = """select * from book"""

cursor.execute(sql)

data = cursor.fetchall()

for i in data:

book_id = i[0]

book_name = i[1]

detail = i[2]

other = i[3]

star = i[4]

comments = i[5]

pattern = r"(?<=作者:).*?(?=出版社:)"

author = ''.join(re.findall(pattern, other)).replace(' ', '').strip().replace(' /', '/').replace('] ', ']')

pattern = r"(?<=出版社:).*?(?=出品方:)"

publishin_house = ''.join(re.findall(pattern, other)).strip()

pattern = r"(?<=页数:).*?(?=定价:)"

pages = ''.join(re.findall(pattern, other)).strip()

pattern = r"(?<=定价:).*?(?=装帧:)"

price = ''.join(re.findall(pattern, other)).strip().replace('元', '')

pattern = r"ISBN:\s*(\d{9,10})"

ISBN = re.search(pattern, other).group(1)

# 保存数据

sql = """insert into book_v2(id,name,detail,star,comments,author,publishin_house,pages,price,ISBN) values('%s','%s','%s','%s','%s','%s','%s','%s','%s','%s')""" % (

book_id, book_name, detail, star, comments, author, publishin_house, pages, price, ISBN)

cursor.execute(sql)

conn.commit()

cursor.close()

2.3.2 清洗书评信息

def clean_comments_data():

cursor = conn.cursor()

sql = """select * from comments"""

cursor.execute(sql)

data = cursor.fetchall()

for i in data:

book_name = i[0]

user_name = i[1]

comment = i[2]

star = i[3]

comments_time = i[4]

ip = i[5]

follow = i[6]

# 清洗评论

comment = comment.replace('\n', '')

# 清洗打星

pattern = r'\d+'

result = re.findall(pattern, star)

if result:

star = ''.join(result)

star = round(float(star) / 10)

else:

star = ''

# 保存数据

sql = """insert into comments_v2(book_name,user_name,comment,star,comments_time,ip,follow) values('%s','%s','%s','%s','%s','%s','%s')""" % (

book_name, user_name, comment, star, comments_time, ip, follow)

print(book_name, user_name, comment, star, comments_time, ip, follow)

cursor.execute(sql)

conn.commit()

cursor.close()

2.4 统计分析，结果持久化存储

# 按照采集的评论算出每本书平均星级

def clean_analyze_1():

cursor = conn.cursor()

sql = """DROP TABLE IF EXISTS `analyze_1`;"""

cursor.execute(sql)

conn.commit()

sql = """CREATE TABLE `analyze_1` AS

SELECT `book_name`, ROUND(AVG(IFNULL(`star`, 0)),2) as value

FROM `comments_v2`

GROUP BY `book_name`;"""

cursor.execute(sql)

conn.commit()

cursor.close()

# 按ip统计各地区评论数

def clean_analyze_2():

cursor = conn.cursor()

sql = """DROP TABLE IF EXISTS `analyze_2`;"""

cursor.execute(sql)

conn.commit()

sql = """CREATE TABLE `analyze_2` AS

SELECT IF(ip = '', '未知', ip) AS ip, COUNT(*) AS value

FROM comments_v2

GROUP BY ip;"""

cursor.execute(sql)

conn.commit()

cursor.close()

# 每本书评论的平均点赞数

def clean_analyze_3():

cursor = conn.cursor()

sql = """DROP TABLE IF EXISTS `analyze_3`;"""

cursor.execute(sql)

conn.commit()

sql = """CREATE TABLE `analyze_3` AS

SELECT `book_name`, ROUND(AVG(IFNULL(`follow`, 0)),2) as value

FROM `comments_v2`

GROUP BY `book_name`;"""

cursor.execute(sql)

conn.commit()

cursor.close()

# 词云

def clean_analyze_4():

cursor = conn.cursor()

sql = """DROP TABLE IF EXISTS `analyze_4`;

"""

cursor.execute(sql)

conn.commit()

sql = """

CREATE TABLE `analyze_4` (

`word` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,

`count` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL

) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;"""

cursor.execute(sql)

conn.commit()

sql = """SELECT comment

FROM `comments_v2`;"""

cursor.execute(sql)

comments = cursor.fetchall()

text = ''

for comment in comments:

text = text + comment[0]

# 正则取出中文

text = ''.join(re.findall('[\u4e00-\u9fa5]', text))

STOPWORDS_CH = open('停用词表.txt', encoding='utf8').read().split()

word_list = [

w for w in jieba.cut(text)

if w not in set(STOPWORDS_CH) and len(w) > 1

]

counter = Counter(word_list)

top_50_words = counter.most_common(50)

sql = "INSERT INTO analyze_4 (word, count) VALUES (%s, %s)"

for word, count in top_50_words:

cursor.execute(sql, (word.strip(), count))

# 提交事务

conn.commit()

cursor.close()

2.5 搭建flask框架

from flask import Flask,render_template

app = Flask(__name__)

app.config['JSON_AS_ASCII'] = False

@app.route('/')

def index():

return render_template('index.html')

if __name__ == '__main__':

app.run(debug=True)

2.6 数据传值

@app.route('/')

def index():

if not os.path.exists("./static/img/wordcloud.png"):

wc()

# 左上图

sql = 'SELECT name, comments FROM book_v2;'

cursor = conn.cursor()

cursor.execute(sql)

list_1 = cursor.fetchall()

list_1_data = []

for i in list_1:

book_name = i[0]

comments = int(i[1].replace('全部', '').replace('条', ''))

list_1_data.append([book_name, comments])

# 中上图

sql = 'SELECT book_name, value FROM analyze_1;'

cursor = conn.cursor()

cursor.execute(sql)

list_2 = cursor.fetchall()

list_2_name = []

list_2_value = []

for i in list_2:

list_2_name.append(i[0])

list_2_value.append(i[1])

# 右上图

sql = 'SELECT name, price FROM book_v2;'

cursor = conn.cursor()

cursor.execute(sql)

list_3 = cursor.fetchall()

list_3_name = []

list_3_value = []

for i in list_3:

list_3_name.append(i[0])

list_3_value.append(float(i[1]))

# 左下图

sql = 'SELECT ip, value FROM analyze_2 order by value DESC;'

cursor = conn.cursor()

cursor.execute(sql)

list_4 = cursor.fetchall()

list_4_name = []

list_4_value = []

for i in list_4:

list_4_name.append(i[0])

list_4_value.append(int(i[1]))

return render_template('index.html', list_1_data=list_1_data, list_2_name=list_2_name, list_2_value=list_2_value,

list_3_name=list_3_name, list_3_value=list_3_value, list_4_name=list_4_name,

list_4_value=list_4_value)

2.7 完整代码&数据集获取

完整代码&数据集可以私聊我获取，还有更多可视化大屏等着你：

001 服务大数据可视化监管平台

002 水质情况实时监测预警系统

003 联心菜市场数据中心

004 政务大数据共享交换平台

005 可视化监控管理

006 全国疫情实时监控

007 惠民服务平台

008 兰州智慧消防大数据平台

009 某公司大数据监控平台

010 双数智慧公卫-传染病督导平台

011 大数据可视化系统数据分析通用模版

012 某公司大数据展示模版

013 某公司大数据展示模版

014 时实客流量监控中心

015 广西矿产资源大数据监管平台

016 广西矿产资源大数据监管平台

017 大数据可视化通用素材

018 大数据可视化系统数据分析通用模版

019 大数据可视化系统数据分析通用模版

020 大数据通用模版大标题样式

021 大数据通用模版

022 全息档案平台中心

023 医院大数据展示

024 智慧社区内网比对平台

025 大数据可视化展板通用模板

026 设备环境监测平台

027 全国消费者情况看板

028 移动能耗管理平台

029 南方软件视频平台大屏中心

030 全国图书零售检测中心

031 数据可视化大屏展示系统

032 物流云数据看板平台

033 数据可视化页面设计

034 晋城高速综合管控大数据

035 视频监控前后台通用模板

036 门店销售监控平台

037 建筑智慧工地管控

038 无线网络大数据平台

039 湖南省大数据可视化平台

040 Echart图例使用

041 智慧物流服务中心

042 大数据分析系统

043 网络当天实时发稿量

044 车联网平台数据概览

045 信用风险定位系统(银行版)

046 作战指挥室

047 公司名称大数据可视化平台

048 大数据可视化展板通用模板

049 工厂信息监控台

050 大数据可视化展示平台通用模板

051 通用大数据可视化展示平台模板

052 智慧社区内网比对平台

053 通用大数据可视化展示平台模板

054 公安大数据监控平台2

055 物流大数据服务平台

056 大数据统计展示大屏

057 大屏数据统计

058 大屏数据智慧中心统计

059 物联网平台数据统计

060 广西电子商务公共服务平台大数据中心

061 智慧小区大数据分析

062 数据概览演示案例

063 商品运营大数据

064 设备环境监测平台

065 游戏可视化大数据用户案例

066 系统架构可视化监控

067 xx区智慧旅游综合服务平台

068 中国电信厅店营业效能分析

069 智能看板新中心

070 翼兴消防监控

071 市突发预警平台实时监控

072 大连市突发预警实时监控

073 观测站综合监控平台

074 酒机运行状态

075 数据可视化展示

076 交管大数据人脸识别系统

078 河长制大数据显示平台

079 保税区A仓监控中心

080 北斗车联网大数据平台

081 北京市执法信息平台

082 南方草木商品交易[超级大屏]

083 兴福公安综合监管大屏

084 压力容器大屏

085 车辆综合管控平台

086 物流大数据展示系统

087 农业产业大数据指挥仓系统

088 HTML大数据综合分析平台模板

089 警务综合监控管理平台HTML模板

090 企业营收大数据统计可视化大屏

091 ECharts扩展示例自定义视图

092 酷炫大屏数据可视化模板

093 办税渠道运行状态大数据监控平台

094 大数据统计展示大屏

095 交通大数据展示平台

096 智慧农业大数据展示

097 程序员数据可视化大屏展示

098 销售大数据分析

099 英雄联盟LPL比赛数据可视化

100 新型冠状肺炎实时监测大屏

3、号外

如果我的博客对你有帮助、如果你喜欢我的博客内容，请 “点赞” “✍️评论” “收藏” 一键三连哦！【关注我| 获取更多源码 | 定制源码】大学生毕设模板、期末大作业模板、Echarts大数据可视化等! 「一起探讨，互相学习」！（vx：python812146）以上内容技术相关问题欢迎一起交流学习

金钥匙

爬虫+Flask+Echarts搭建《深度学习》书评显示大屏

简单在线美食推荐系统开发教程 python语言、django开发框架、mysql数据库、基于用户的协同过滤推荐算法个性化美食推荐系统爬虫可视化数据分析机器学习

firefox chrome 爬虫状态码412状态

发表评论取消回复

金钥匙

爬虫+Flask+Echarts搭建《深度学习》书评显示大屏

简单在线美食推荐系统开发教程 python语言、django开发框架、mysql数据库、基于用户的协同过滤推荐算法 个性化美食推荐系统 爬虫 可视化数据分析 机器学习

firefox chrome 爬虫状态码412状态

相关文章

发表评论取消回复

简单在线美食推荐系统开发教程 python语言、django开发框架、mysql数据库、基于用户的协同过滤推荐算法个性化美食推荐系统爬虫可视化数据分析机器学习