Experiment Objectives and Requirements:

【Experiment Objectives】

Through this experiment, gain an understanding of the Scrapy crawler framework and become proficient in its basic usage and common techniques.

【Experiment Requirements】

Use the Scrapy framework to scrape product information from an e-commerce site (choose one of JD.com, Taobao, or Dangdang) and save the results in multiple formats. (This report uses Dangdang.)

Experiment Principles and Content:

【Experiment Principles】(relevant knowledge points)

Scrapy framework: understand the basic principles and architecture of Scrapy.

HTTP requests and responses: understand the basics of HTTP requests and responses, and how Scrapy sends requests and handles responses.

Web page structure: know basic HTML and CSS and understand how web pages are structured.

CSS selectors and XPath: use CSS selectors or XPath to locate and extract information from a page (see the selector sketch after this list).

Scrapy Spider: create a Scrapy Spider that defines how pages are crawled and parsed, and how the Spider follows links and handles pagination.

Scrapy Items: define Scrapy Items to hold the scraped data.

Scrapy Pipelines: write Scrapy Pipelines to process scraped Items, and register them in settings.py.

Exception handling: handle errors that may occur, such as timeouts and connection failures.

Data storage: save scraped data to different storage backends, such as files or databases.

User agents and IP proxies: set and use user agents and IP proxies to avoid being banned.

Asynchrony and concurrency: configure asynchronous, concurrent requests to improve crawling efficiency.

Scrapy Settings: configure Scrapy's settings, including the user agent, download delay, and so on.

Crawling ethics: understand the ethical and legal issues around web crawling and avoid putting unnecessary load on websites.

Logging and debugging: use Scrapy's logging system for debugging.

Extension and customization: customize and extend Scrapy to meet specific needs.
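To make the selector knowledge point concrete, here is a minimal, self-contained sketch (the HTML fragment and the p.name class are invented for illustration) showing how Scrapy's Selector extracts the same attribute with XPath and with a CSS selector:

from scrapy.selector import Selector

# Tiny hypothetical HTML fragment standing in for a product listing page.
html = '<ul id="component_59"><li><p class="name"><a title="Python爬虫" href="//item.example.com/1">Python爬虫</a></p></li></ul>'

sel = Selector(text=html)
# XPath: locate each <li> under the listing <ul>, then read the title attribute.
titles_xpath = sel.xpath('//*[@id="component_59"]/li/p[1]/a/@title').getall()
# CSS: the same data via a CSS selector.
titles_css = sel.css('#component_59 li p.name a::attr(title)').getall()
print(titles_xpath, titles_css)  # both print ['Python爬虫']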

【Program Approach】

First, since we want to scrape each product's title, link, price, and shop from the page, we start with items.py and declare the fields to be scraped there. We then turn to the spider: running scrapy genspider dd dangdang.com creates the spider file dd.py. Next we fill in settings.py, and finally we write pipelines.py, where the scraped data is actually processed and stored.
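Besides running scrapy crawl dd from the command line, the whole flow can be launched from a small script. This is a sketch using Scrapy's documented CrawlerProcess API, assuming it is run from the project directory so settings.py is found:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load the project's settings.py so pipelines and middlewares are picked up.
process = CrawlerProcess(get_project_settings())
process.crawl('dd')  # spider name defined in dd.py
process.start()      # blocks until the crawl finishes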

Experiment Data and Results Analysis: (with screenshots of the run)

【Experiment Results】

【Results Analysis】

Using the Scrapy framework, we scraped product information from the Dangdang site and stored it in a MongoDB database through pipelines.py; by swapping the pipeline or using feed exports, the data can be saved in other storage forms as well.
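For instance, Scrapy's feed exports can write items straight to files without a custom pipeline. A minimal sketch for settings.py, assuming Scrapy 2.1 or newer (the file names are illustrative):

FEEDS = {
    'dangdang_items.json': {'format': 'json', 'encoding': 'utf8'},
    'dangdang_items.csv': {'format': 'csv'},
}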

Experiment Summary: (including problems and solutions, reflections, and suggestions)

While carrying out this experiment, the first challenge I faced was changes in the site's structure. Because Dangdang may be updated periodically, previously written crawler code can stop extracting information correctly. To solve this, I learned how to adjust selectors dynamically to match the changed page structure, inspecting the page source to update them quickly and keep the crawler accurate.

During the experiment I came to appreciate the power of the Scrapy framework. By defining Items, writing a Spider, and configuring Pipelines, I built a capable crawler that efficiently scrapes and processes the target site's information. Learning to use Scrapy's middlewares and settings also gave me a better grasp of concurrent, asynchronous request handling and improved crawling efficiency.

The experiment also showed me the importance of monitoring and analyzing the data as it arrives. Catching anomalies promptly, such as missing fields or malformed values, helps keep the crawler stable. In addition, summarizing and visualizing the data gave me a fuller picture of what was scraped and revealed some underlying trends. Through this experiment I not only mastered the use of Scrapy but also developed practical problem-solving skills, which I found very rewarding.

Source Code Listing: (with comments)

items.py:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class DangdangItem(scrapy.Item):
    # Fields scraped for every product in the search results
    title = scrapy.Field()  # product title
    link = scrapy.Field()   # product detail-page URL
    price = scrapy.Field()  # listed price
    shop = scrapy.Field()   # shop/seller name
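An Item behaves like a dict restricted to its declared fields; a quick usage sketch (the values are invented):

item = DangdangItem()
item['title'] = 'Python爬虫开发'  # sample value
item['price'] = '¥59.00'          # sample value
print(dict(item))                 # -> {'title': 'Python爬虫开发', 'price': '¥59.00'}
# Assigning an undeclared field, e.g. item['stock'] = 1, raises KeyError.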

settings.py:

# Scrapy settings for dangdang project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'dangdang'

SPIDER_MODULES = ['dangdang.spiders']
NEWSPIDER_MODULE = 'dangdang.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# (the USER_AGENT setting must be a plain string, not a dict)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Only show errors in the log
LOG_LEVEL = 'ERROR'

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'dangdang.middlewares.DangdangSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'dangdang.middlewares.DangdangDownloaderMiddleware': 543,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'dangdang.pipelines.DangdangPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
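settings.py enables dangdang.middlewares.DangdangDownloaderMiddleware, whose source is not listed in this report. As a hedged sketch of what such a middleware typically does (rotating the User-Agent per request; the agent strings are illustrative), it might look like:

import random


class DangdangDownloaderMiddleware:
    # Illustrative pool of user agents; extend as needed.
    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36',
    ]

    def process_request(self, request, spider):
        # Pick a random User-Agent for every outgoing request.
        request.headers['User-Agent'] = random.choice(self.user_agents)
        return None  # continue normal downloading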

pipelines.py:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

from pymongo import MongoClient


class DangdangPipeline:
    def __init__(self):
        # Connect to a local MongoDB instance and select database/collection
        self.client = MongoClient(host='localhost', port=27017)
        self.db = self.client['当当']
        self.collections = self.db['python爬虫']

    def process_item(self, item, spider):
        data = {
            'title': item['title'],
            'link': item['link'],
            'price': item['price'],
            'shop': item['shop'],
        }
        print(data)
        # insert_one replaces the deprecated Collection.insert
        self.collections.insert_one(data)
        return item
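To satisfy the requirement of saving results in multiple forms, another pipeline can be dropped in alongside the MongoDB one. A minimal sketch of a JSON-lines file pipeline, following the pattern in Scrapy's item-pipeline docs (the class and file names are illustrative):

import json


class JsonWriterPipeline:
    # Writes each item as one JSON line; the file lives for the spider's lifetime.
    def open_spider(self, spider):
        self.file = open('items.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

To activate it, register it in ITEM_PIPELINES next to DangdangPipeline, e.g. 'dangdang.pipelines.JsonWriterPipeline': 400.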

dd.py:

import scrapy
from dangdang.items import DangdangItem


class DdSpider(scrapy.Spider):
    name = 'dd'
    # allowed_domains = ['www.dangdang.com']
    # Search results for the GBK-encoded keyword "python爬虫", page 1
    start_urls = ['http://search.dangdang.com/?key=python%C5%C0%B3%E6&act=input&page_index=1']

    def parse(self, response):
        # Each product on the results page is an <li> under #component_59
        li_lists = response.xpath('//*[@id="component_59"]/li')
        for li_list in li_lists:
            # Create a fresh Item per product so earlier results are not overwritten
            item = DangdangItem()
            item['title'] = li_list.xpath('./p[1]/a/@title').extract()[0]
            item['link'] = 'http:' + li_list.xpath('./p[1]/a/@href').extract()[0]
            item['price'] = li_list.xpath('./p[3]/span/text()').extract()[0]
            item['shop'] = li_list.xpath('./p[4]/a/text()').extract()[0]
            yield item

        # Queue the remaining result pages (2-100); Scrapy's duplicate
        # filter drops the repeats yielded on later parse() calls
        for page in range(2, 101):
            url = f'http://search.dangdang.com/?key=python%C5%C0%B3%E6&act=input&page_index={page}'
            yield scrapy.Request(url, callback=self.parse)
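Because parse() runs once per page, the range loop above re-yields pages 2-100 on every call and relies on the duplicate filter. A tidier, incremental alternative is to follow only the "next page" link with the documented response.follow; the XPath for that link is an assumption about Dangdang's markup:

    def parse(self, response):
        # ... extract items as above ...
        # Hypothetical locator for the pager's "next" link
        next_page = response.xpath('//li[@class="next"]/a/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)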

The overall project structure is the standard layout produced by scrapy startproject and scrapy genspider:

dangdang/
    scrapy.cfg
    dangdang/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            dd.py
