Disclaimer

This article is for study and exchange only; it has no commercial purpose.

Some images come from 尚硅谷 (Atguigu).

Introduction to meta

In Scrapy, a request's meta attribute carries extra information along with the request. Because the request, and the response built from it, travel through several components, the data in meta is visible in spiders and in middlewares; pipelines only ever see items, so anything a pipeline needs must first be copied onto the item.

In a spider, meta is the standard way to pass data from one request to the next. For example:

yield scrapy.Request(url, callback=self.parse_details, meta={'item': item})

In the example above, setting meta hands the item object to the next request's callback, parse_details, which can read it back as response.meta['item'].
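Both ends of that hand-off, plus a newer alternative, fit in one minimal sketch. The spider name and URL here are placeholders, not from the original example; cb_kwargs is available in Scrapy 1.7 and later:

import scrapy


class DetailDemoSpider(scrapy.Spider):
    name = "detail_demo"  # hypothetical spider, for illustration only
    start_urls = ["https://example.com/list"]

    def parse(self, response):
        item = {"title": "placeholder"}
        # Hand the partially-filled item to the detail callback via meta ...
        yield scrapy.Request(response.url, callback=self.parse_details,
                             meta={"item": item}, dont_filter=True)
        # ... or via cb_kwargs, which arrives as a named callback argument
        yield scrapy.Request(response.url, callback=self.parse_with_kwargs,
                             cb_kwargs={"item": item}, dont_filter=True)

    def parse_details(self, response):
        item = response.meta["item"]  # read the value back from response.meta
        yield item

    def parse_with_kwargs(self, response, item):
        yield item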

In a downloader middleware, a request's meta can be read and modified through the request object. For example:

from datetime import datetime

class TimestampMiddleware:  # hypothetical class name; the original shows only the method
    def process_request(self, request, spider):
        item = request.meta.get('item')  # not every request carries an item, so guard the lookup
        if item is not None:
            item['timestamp'] = datetime.now()
            request.meta['item'] = item

In the example above, process_request pulls the item out of the request's meta, stamps it with a timestamp field, and stores it back (the guard avoids a KeyError on requests that carry no item).
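For the middleware to actually run, it has to be registered in settings.py. A sketch, assuming the class above is named TimestampMiddleware and lives in the middlewares.py of a hypothetical project my_project (both names are illustrative):

DOWNLOADER_MIDDLEWARES = {
    "my_project.middlewares.TimestampMiddleware": 543,  # hypothetical module path; 543 is a typical priority
}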

A pipeline has no access to requests, and therefore no meta; it only receives items. Continuing the example, the timestamp that the middleware stored on the item can be read like this:

def process_item(self, item, spider):
    timestamp = item['timestamp']
    # ... do something with timestamp ...
    return item

In the example above, the pipeline reads the previously set timestamp. Note that it comes from the item itself: Item objects have no meta attribute, which is exactly why the middleware copied the value onto the item.

In short, meta is a convenient and flexible way to move data along with a request as it passes between Scrapy's components.

Scraping the names and image links of all domestic movies from 电影天堂 (dygod.net)

import scrapy

from scrapy_movie_070.items import ScrapyMovie070Item


class MvSpider(scrapy.Spider):
    name = "mv"
    allowed_domains = ["www.dygod.net"]
    start_urls = ["https://www.dygod.net/html/gndy/china/index.html"]

    def parse(self, response):
        print("============== success ===============")
        # We want the name from the first (list) page and the image from the second (detail) page
        a_list = response.xpath('//div[@class="co_content8"]//td[2]//a[2]')
        for a in a_list:
            # Get the name and detail-page link from the list page
            name = a.xpath('./text()').extract_first()
            src = a.xpath('./@href').extract_first()
            url = 'https://www.dygod.net' + src
            print(name, url)
            # Carry the name along to the detail-page callback via meta
            yield scrapy.Request(url=url, callback=self.parse_second, meta={'name': name})

    def parse_second(self, response):
        print("============== detail page ===============")
        # If you get no data back, check whether the XPath is correct
        img_src = response.xpath('//div[@id="Zoom"]//img[1]/@src').extract_first()
        img_url = 'https://www.dygod.net' + img_src
        # Read the value that was passed in through the request's meta
        name = response.meta['name']
        movie = ScrapyMovie070Item(src=img_url, name=name)
        yield movie
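This spider imports ScrapyMovie070Item, so the project's items.py must define matching fields; presumably something like:

import scrapy


class ScrapyMovie070Item(scrapy.Item):
    name = scrapy.Field()  # movie title, taken from the link text on the list page
    src = scrapy.Field()   # poster image URL, taken from the detail page

The spider is then run from the project directory with: scrapy crawl mv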

CrawlSpider is a special spider class in Scrapy that provides fast, rule-based crawling. You declare a set of rules, and CrawlSpider automatically follows and scrapes the links on each page according to those rules.

With CrawlSpider you can extract data from a site with very little code. The basic steps are:

1. Create a subclass of CrawlSpider and set its name attribute (the spider's unique identifier) and allowed_domains attribute (which restricts the domains to crawl).
2. Define a rules attribute containing one or more Rule objects. Each Rule's link_extractor defines how links are extracted from a page, and its callback names the function that processes the pages those links lead to.
3. Write that callback: extract data with XPath or CSS selectors and use yield to return Item objects, or new Request objects for further crawling.

Here is a simple CrawlSpider example:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        Rule(LinkExtractor(allow=r'/page/\d+'), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # Extract data and yield it as an item
        yield {
            'title': response.css('h1::text').get(),
            'content': response.css('.content::text').getall(),
        }

In this example, allowed_domains restricts the crawl to pages under example.com, and start_urls defines the initial URL.

The rules attribute defines one rule: a LinkExtractor picks out the links matching the allow pattern and hands each resulting page to parse_page; follow=True means links found on those pages are followed in turn.

parse_page is the callback that processes each matched page. Inside it you can extract data with XPath or CSS selectors and yield Item objects.

With these steps in place you have a rule-based spider, and CrawlSpider takes care of following and scraping pages automatically. One caveat worth knowing: do not override the parse method in a CrawlSpider subclass, because CrawlSpider relies on it internally to implement the rule logic.

The shell session below is from 尚硅谷.

C:\Users\14059>scrapy shell https://www.dushu.com/book/1188.html
2024-03-08 17:00:29 [scrapy.utils.log] INFO: Scrapy 2.9.0 started (bot: scrapybot)
2024-03-08 17:00:29 [scrapy.utils.log] INFO: Versions: lxml 5.1.0.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.7.5 (tags/v3.7.5:5c02a39a0b, Oct 15 2019, 00:11:34) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 24.0.0 (OpenSSL 3.2.1 30 Jan 2024), cryptography 42.0.5, Platform Windows-10-10.0.22621-SP0
2024-03-08 17:00:29 [scrapy.crawler] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0}
2024-03-08 17:00:29 [py.warnings] WARNING: d:\python\python375\lib\site-packages\scrapy\utils\request.py:232: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)

2024-03-08 17:00:29 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2024-03-08 17:00:29 [scrapy.extensions.telnet] INFO: Telnet Password: 13c50912dfa84ac1
2024-03-08 17:00:29 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole']
2024-03-08 17:00:29 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-03-08 17:00:29 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-03-08 17:00:29 [scrapy.middleware] INFO: Enabled item pipelines: []
2024-03-08 17:00:29 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-03-08 17:00:29 [scrapy.core.engine] INFO: Spider opened
2024-03-08 17:00:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.dushu.com/book/1188.html> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler
[s]   item       {}
[s]   request
[s]   response   <200 https://www.dushu.com/book/1188.html>
[s]   settings
[s]   spider
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

In [1]: from scrapy.linkextractors import LinkExtractor

In [2]: link = LinkExtractor

In [3]: from scrapy.linkextractors import LinkExtractor

In [4]: link = LinkExtractor(allow=r'/book/1188_\d+\.html')

In [5]: link
Out[6]:

In [7]: link.extract_links(response)
Out[7]:
[Link(url='https://www.dushu.com/book/1188_2.html', text='2', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1188_3.html', text='3', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1188_4.html', text='4', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1188_5.html', text='5', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1188_6.html', text='6', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1188_7.html', text='7', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1188_8.html', text='8', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1188_9.html', text='9', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1188_10.html', text='10', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1188_11.html', text='11', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1188_12.html', text='12', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1188_13.html', text='13', fragment='', nofollow=False)]

In [8]: link1 = LinkExtractor

In [9]: link1 = LinkExtractor(restrict_xpaths=r'//div[@class="pages"]/a/@href')

In [10]: link.extract_links(response)
Out[10]: [the same twelve Link objects as Out[7]]

That was the whole wrestling match with the command line, [○・`Д´・ ○]. Note the missteps along the way: In [2] and In [8] are missing parentheses, so they bind the LinkExtractor class itself rather than an instance; the restrict_xpaths in In [9] should point at a page region rather than an @href attribute; and In [10] actually calls link again, not link1.
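Distilled from the session above, the working usage is simply this (run inside scrapy shell, where response is already defined; the restrict_xpaths variant points at the pager region, which is how that parameter is meant to be used):

from scrapy.linkextractors import LinkExtractor

# Extract pagination links by URL pattern ...
link = LinkExtractor(allow=r'/book/1188_\d+\.html')

# ... or by restricting extraction to a region of the page
link1 = LinkExtractor(restrict_xpaths='//div[@class="pages"]')

for l in link.extract_links(response):
    print(l.url, l.text)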

CrawlSpider example

Goal: scrape 读书网 (dushu.com) and store the book data.

(1) Create a project:

scrapy startproject <project_name>

(2) Change into the spiders directory:

cd <project_name>\<project_name>\spiders

(3) Create the spider from the crawl template:

scrapy genspider -t crawl <spider_name> <domain_to_crawl>
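For the project below, that command would presumably have been:

scrapy genspider -t crawl read www.dushu.com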

Note: always check whether the first page's URL has the same structure as the other pages; if it does not, your Rule pattern may miss it.
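On dushu.com, for example, the first listing page is /book/1188.html while the later pages are /book/1188_2.html, /book/1188_3.html, and so on. The spider below sidesteps the mismatch by starting from /book/1188_1.html; an alternative (my own variant, not from the original project) is to widen the allow pattern so both forms match:

rules = (
    Rule(LinkExtractor(allow=r"/book/1188(_\d+)?\.html"),
         callback="parse_item",
         follow=True),
)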

If you do not need to store the data in a database, the code is as follows.

read.py

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from scrapy_readbook_090.items import ScrapyReadbook090Item


class ReadSpider(CrawlSpider):
    name = "read"
    allowed_domains = ["www.dushu.com"]
    start_urls = ["https://www.dushu.com/book/1188_1.html"]

    rules = (
        Rule(LinkExtractor(allow=r"/book/1188_\d+\.html"),
             callback="parse_item",
             follow=True),
    )

    def parse_item(self, response):
        img_list = response.xpath('//div[@class="bookslist"]//img')
        for img in img_list:
            name = img.xpath('./@alt').extract_first()
            # The images are lazy-loaded, so the real URL is in @data-original, not @src
            img_src = img.xpath('./@data-original').extract_first()
            book = ScrapyReadbook090Item(name=name, src=img_src)
            yield book
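Assuming the pipeline below is enabled in settings.py (it is, see ITEM_PIPELINES further down), the spider is run from the project directory as usual:

scrapy crawl read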

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class ScrapyReadbook090Pipeline:
    def open_spider(self, spider):
        # Open the output file once when the spider starts
        self.fp = open('book.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(str(item))
        return item

    def close_spider(self, spider):
        self.fp.close()
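One caveat about the pipeline above: self.fp.write(str(item)) concatenates Python reprs, so book.json is not actually valid JSON. If you want genuine JSON Lines output, a small variant might look like this (a sketch using itemadapter, which the generated pipeline already imports; the book.jsonl filename is my own choice):

import json
from itemadapter import ItemAdapter


class ScrapyReadbook090Pipeline:
    def open_spider(self, spider):
        self.fp = open('book.jsonl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Serialize each item as one JSON object per line (JSON Lines)
        self.fp.write(json.dumps(ItemAdapter(item).asdict(), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()

Alternatively, Scrapy's built-in feed export does this without any pipeline: scrapy crawl read -o book.json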

items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ScrapyReadbook090Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()  # book title (from the img alt attribute)
    src = scrapy.Field()   # cover image URL (from the img data-original attribute)

settings.py

# Scrapy settings for scrapy_readbook_090 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = "scrapy_readbook_090"

SPIDER_MODULES = ["scrapy_readbook_090.spiders"]
NEWSPIDER_MODULE = "scrapy_readbook_090.spiders"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "scrapy_readbook_090 (+http://www.yourdomain.com)"

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
#    "Accept-Language": "en",
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    "scrapy_readbook_090.middlewares.ScrapyReadbook090SpiderMiddleware": 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    "scrapy_readbook_090.middlewares.ScrapyReadbook090DownloaderMiddleware": 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    "scrapy.extensions.telnet.TelnetConsole": None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    "scrapy_readbook_090.pipelines.ScrapyReadbook090Pipeline": 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
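The stated goal was to store the data in a database, but this excerpt only shows the JSON-file variant. As a rough sketch of what the database half might look like, assuming pymysql and connection parameters of your own (everything here, including the MysqlPipeline name and the book table, is illustrative, not part of the project above):

import pymysql  # assumption: pymysql is installed (pip install pymysql)


class MysqlPipeline:
    def open_spider(self, spider):
        # Hypothetical connection parameters; replace with your own
        self.conn = pymysql.connect(host='localhost', port=3306,
                                    user='root', password='your_password',
                                    db='spider01', charset='utf8')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Assumes a table like: create table book(name varchar(255), src varchar(255))
        sql = 'insert into book(name, src) values (%s, %s)'
        self.cursor.execute(sql, (item['name'], item['src']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

It would then be registered alongside the existing pipeline in ITEM_PIPELINES, e.g. "scrapy_readbook_090.pipelines.MysqlPipeline": 301.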

Summary

Honestly there is not much of a gap between the two approaches; once you get used to them, the code itself is fairly simple. The main difficulty lies in the details and in tracking down the right paths (XPath expressions and URLs).
