Scraping love poems from a poetry website with Python's Scrapy framework, as a gift for your girlfriend

Network system construction and operations, backend | 2024-02-26

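Contents

Preface
Result preview
1. Install the Scrapy library
2. Create the Scrapy project
3. Create the spider file scmg_spider.py
4. Configure settings.py
5. Define the data container: edit items.py
6. Define the spider: edit scmg_spider.py
7. Define the pipeline: edit pipelines.py
8. Enable the ShicimingjuPipeline pipeline
9. Create main.py to run the project, and run it
Summary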

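Preface

We will use the Scrapy framework to scrape classical love poems from a poetry website and collect them as a gift for a girlfriend. Scrapy is an application framework, written in pure Python, for crawling websites and extracting structured data. It is widely used: by customizing a few modules you can build a crawler that fetches page content and images.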

Result preview:

Screenshots: the website page, and the file in which the scraped content is saved.

1. Install the Scrapy library

In the project terminal, run the following install command:

pip install scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple

2. Create the Scrapy project

In the terminal, from the directory where the project should live, run:

scrapy startproject shicimingju

3. Create the spider file scmg_spider.py

Open the project in PyCharm and create a new spider file, scmg_spider.py, in the spiders directory.

4. Configure settings.py

Open settings.py in the project and set ROBOTSTXT_OBEY to False, so that Scrapy does not follow the site's robots.txt rules; otherwise the pages may not be fetched.

# Do not obey robots.txt rules
ROBOTSTXT_OBEY = False

Set DEFAULT_REQUEST_HEADERS so that requests look like they come from a browser.

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Mobile Safari/537.36',
}
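To make it clearer what the ROBOTSTXT_OBEY switch controls, the sketch below evaluates a robots.txt rule set by hand with Python's standard urllib.robotparser module. The rules, the user agent name, and the URLs are made up for illustration and are not the real rules of the poetry site; the helper name is_allowed is also hypothetical, and Scrapy's own robots.txt handling is more involved than this.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(SAMPLE_ROBOTS_TXT, "my-bot", "https://example.com/poems/1.html"))
    print(is_allowed(SAMPLE_ROBOTS_TXT, "my-bot", "https://example.com/private/x.html"))
```

With ROBOTSTXT_OBEY = True, Scrapy skips requests that the site's robots.txt disallows; with False, as in this article, no such check is applied.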

5. Define the data container: edit items.py

import scrapy


class ShicimingjuItem(scrapy.Item):
    # define the fields for your item here like:
    # Poem title
    title = scrapy.Field()
    # Dynasty
    dynasty = scrapy.Field()
    # Author
    author = scrapy.Field()
    # Poem text
    content = scrapy.Field()
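As a rough picture of what this container holds, here is the same four-field record written as a standard-library dataclass. The class name PoemRecord and the sample values are hypothetical; this is only a sketch of the data shape, not a replacement for the ShicimingjuItem class used in the rest of the article.

```python
from dataclasses import dataclass, asdict


@dataclass
class PoemRecord:
    """Hypothetical stand-in mirroring the four fields of the item."""
    title: str = ""
    dynasty: str = ""
    author: str = ""
    content: str = ""


# Placeholder values, not real scraped data.
record = PoemRecord(title="sample title", content="line one\nline two")

if __name__ == "__main__":
    # Fields that were not found on the page simply stay empty strings.
    print(asdict(record))
```

Defaulting every field to an empty string keeps later string concatenation safe when a field is missing from the page.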

6. Define the spider: edit scmg_spider.py

Import the packages the spider depends on

import scrapy
from scrapy import Request
from scrapy.spiders import Spider

from ..items import ShicimingjuItem

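Define the spider class

class GssSpiderSpider(Spider):
    name = 'scmj_spider'  # spider name, used by "scrapy crawl"

    def start_requests(self):
        # Issue the request for the start page; the response goes to parse()
        url = 'https://www.shicimingju.com/shicimark/aiqingshi.html'
        yield Request(url)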

Implement the parse function to extract the page content

    # Parse the listing page
    def parse(self, response):
        poem_divs = response.xpath("//div[@class='card shici_card']/div")
        for poem_div in poem_divs:
            # Poem title (empty string if the node is missing)
            title = poem_div.xpath('div[2]/h3/a/text()').get(default="")
            print("Title:", title)

            # Dynasty and author
            source = poem_div.xpath("div/text()").getall()
            dynasty = ""
            author = ""
            if len(source) > 2:
                dynasty = source[1].strip()
                author = source[2].strip()
            print("Dynasty:", dynasty)
            print("Author:", author)

            # Poem text shown directly on the page
            content_list = poem_div.xpath("div[2]/div/text()").getall()
            content = ""
            if content_list:
                # Join the lines with newlines and strip surrounding whitespace
                content = "\n".join(content_list[0:2]).strip()

            # Lines that are only shown after clicking "more" on the page
            content_part = poem_div.xpath("div[2]/div/div/text()").getall()
            if content_part:
                content_part = "\n".join(content_part[0:-1]).strip()
                # Concatenate both parts into the complete poem
                content = content + "\n" + content_part
            print(content)

            # Put the parsed data into an item
            item = ShicimingjuItem(title=title, dynasty=dynasty, author=author, content=content)
            # Hand the item over to the pipelines
            yield item
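The text clean-up in the parse step (join the fragments with newlines, strip whitespace, then append the hidden part) can be isolated into small helpers and tried without running a crawl. The sketch below is a variation rather than the article's exact logic: it strips every fragment and drops empty ones instead of slicing by position. The function names and sample strings are hypothetical.

```python
def join_poem_lines(fragments):
    """Strip each text fragment, drop empty ones, and join with newlines."""
    cleaned = [part.strip() for part in fragments]
    return "\n".join(part for part in cleaned if part)


def build_content(visible, hidden):
    """Combine the directly visible lines with the lines hidden behind 'more'."""
    parts = [join_poem_lines(visible), join_poem_lines(hidden)]
    # Skip a part entirely when it is empty, so no stray newline is added.
    return "\n".join(part for part in parts if part)


if __name__ == "__main__":
    # Placeholder fragments standing in for what getall() might return.
    print(build_content([" first line \n", "second line"], [" third line ", "  "]))
```

Because empty parts are skipped, a poem with no hidden lines does not end with a trailing newline.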

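Crawling multiple pages (this code continues inside parse, after the for loop)

        # Crawl the following pages.
        # From the second page on, the "next page" link moves to a different
        # position, so check the link text before reading its href.
        next_text = response.xpath("//div[@id='list_nav_part']/a[8]/text()").get()
        if next_text == "下一页":
            next_href = response.xpath("//div[@id='list_nav_part']/a[8]/@href").get()
        else:
            next_href = response.xpath("//div[@id='list_nav_part']/a[9]/@href").get()
        if next_href:
            # Use urljoin to turn the next-page href into an absolute URL
            next_url = response.urljoin(next_href)
            print(next_url)
            # Request the next page; the response is handled by parse() again
            yield scrapy.Request(next_url)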
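The pagination step has two parts: find the link labelled "下一页" (next page), then resolve its href against the current page URL. The sketch below shows both with the standard library only. It selects the link by its text instead of by position, which is an alternative to the a[8]/a[9] check above, not the article's method. The helper name find_next_url and the sample hrefs are hypothetical; the real site's hrefs may look different.

```python
from urllib.parse import urljoin

NEXT_LABEL = "下一页"  # link text the site uses for "next page"


def find_next_url(page_url, links):
    """Return the absolute URL of the next page, or None if there is none.

    links is an iterable of (text, href) pairs taken from the pager.
    """
    for text, href in links:
        if text and text.strip() == NEXT_LABEL and href:
            # Resolve relative or root-relative hrefs against the current page.
            return urljoin(page_url, href)
    return None


if __name__ == "__main__":
    start = "https://www.shicimingju.com/shicimark/aiqingshi.html"
    # Hypothetical pager links, for illustration only.
    pager = [("1", "aiqingshi.html"), ("下一页", "/shicimark/aiqingshi_2.html")]
    print(find_next_url(start, pager))
```

Inside a Scrapy spider, a comparable text-based selection could be written as an XPath predicate such as a[text()='下一页']/@href, assuming the link text is a single text node.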

7. Define the pipeline: edit pipelines.py

class ShicimingjuPipeline(object):
    # Save the scraped data to a txt file
    def process_item(self, item, spider):
        with open("爱情诗.txt", "a", encoding="utf-8") as f:
            # Write each love poem in whatever format you prefer
            scmj = "\n" + item["title"] + item["dynasty"] + item["author"] + "\n" + item["content"] + "\n" + "=" * 40
            f.write(scmj)
        return item
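The pipeline above reopens the output file for every item. A possible variation, sketched below, opens the file once when the spider starts and closes it when the spider finishes, using the open_spider and close_spider hooks that Scrapy calls on item pipelines. The class name PoemFilePipeline, the path parameter, and the space-separated header format are assumptions made for this sketch, not part of the original project.

```python
class PoemFilePipeline:
    """Hypothetical variant that keeps one file handle open per crawl."""

    def __init__(self, path="爱情诗.txt"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "a", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        if self.file:
            self.file.close()
            self.file = None

    def process_item(self, item, spider):
        # Separate title, dynasty and author with spaces, skipping empty ones.
        header = " ".join(
            part for part in (item["title"], item["dynasty"], item["author"]) if part
        )
        self.file.write("\n" + header + "\n" + item["content"] + "\n" + "=" * 40)
        return item
```

To try it, the class would be registered in ITEM_PIPELINES in place of ShicimingjuPipeline; a plain dict with the four keys is enough to exercise it outside Scrapy.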

8. Enable the ShicimingjuPipeline pipeline

Add the following setting to settings.py:

ITEM_PIPELINES = {
    'shicimingju.pipelines.ShicimingjuPipeline': 300,
}

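9. Create main.py to run the project, and run it

In the root directory of the Scrapy project, create a main.py file with the following code:

from scrapy import cmdline

# Equivalent to running "scrapy crawl scmj_spider" in the terminal
cmdline.execute("scrapy crawl scmj_spider".split())

Run main.py, then open the "爱情诗.txt" file in the project directory to check the saved poems (screenshot).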

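Summary

That is all for today. This article walked through scraping a poetry website with the Scrapy framework. If you have questions or notice anything that is not right, feel free to leave a comment. The ready-made project source files can be downloaded here: https://download.csdn.net/download/baidu_37075689/85252724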


Published by a user on 金钥匙 on 2024-02-26. If you have any questions, please contact us.
Article link: https://www.51969.com/post/18581431.html
