Scrapy: two super-useful middlewares (proxy, UA) and common settings

Contents:
- Middlewares (proxy, UA)
  - Custom proxy middleware
  - settings.py parameters for the custom proxy middleware
  - Custom UA middleware
  - Enabling the proxy and UA middlewares
- Common settings.py parameters

Middlewares (proxy, UA)

Custom proxy middleware

I maintain my own IP pool: the proxy IPs live in Redis, and each request picks one at random. (I run two pool environments, one production and one testing; if you only have one, see "settings.py parameters for the custom proxy middleware" below.)

import random

import redis
from scrapy import signals


class ProxyMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        cls.connect_type = crawler.settings.get('CONNECT_TYPE')
        print('\033[3;31mIP pool connection mode: {}\033[0m\n\n'.format(cls.connect_type))
        if cls.connect_type == 'localhost':
            cls.REDIS_URL = crawler.settings.get('REDIS_HOST')
            cls.REDIS_PORT = crawler.settings.get('REDIS_PORT')
            cls.REDIS_DB = crawler.settings.get('REDIS_DATABASE')
            cls.REDIS_PASSWORD = crawler.settings.get('REDIS_PASSWORD')
            cls.REDIS_QUEUE_NAME = crawler.settings.get('REDIS_QUEUE_NAME')
        elif cls.connect_type == 'server':
            cls.REDIS_URL = crawler.settings.get('SERVER_REDIS_HOST')
            cls.REDIS_PORT = crawler.settings.get('SERVER_REDIS_PORT')
            cls.REDIS_DB = crawler.settings.get('SERVER_REDIS_DATABASE')
            cls.REDIS_PASSWORD = crawler.settings.get('SERVER_REDIS_PASSWORD')
            cls.REDIS_QUEUE_NAME = crawler.settings.get('SERVER_REDIS_QUEUE_NAME')
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        # connect spider_closed as well, otherwise close_spider is never invoked
        crawler.signals.connect(s.close_spider, signal=signals.spider_closed)
        return s

    def spider_opened(self, spider):
        self.pika = redis.Redis(host=self.REDIS_URL, port=self.REDIS_PORT,
                                password=self.REDIS_PASSWORD, db=self.REDIS_DB,
                                decode_responses=True)
        print('============ IP pool connected: {} ============'.format(self.connect_type))

    def close_spider(self, spider):
        self.pika.close()
        print("################################################")
        print("############## IPpool_spider #############")
        print("################################################")

    def getIP(self):
        # the pool lives in a Redis hash; hvals returns every proxy it holds
        proxies_list = self.pika.hvals(self.REDIS_QUEUE_NAME)
        if proxies_list:
            ip = random.choice(proxies_list)
            return 'http://{}'.format(ip)
        print('\033[3;31m<<< proxy pool is empty >>>\033[0m\n')
        return None

    def process_request(self, request, spider):
        spiderNames = spider.settings.get('SPIDERNAMES')
        if spiderNames and spider.name in spiderNames:
            proxy = self.getIP()
            if proxy:  # if the pool is empty, send the request without a proxy
                print('Proxy picked: {}'.format(proxy))
                request.meta['proxy'] = proxy

- CONNECT_TYPE: which pool to connect to; 'localhost' selects the test-environment proxy pool, 'server' the production pool.
- REDIS_URL, REDIS_PORT, REDIS_DB, REDIS_PASSWORD, REDIS_QUEUE_NAME: the Redis connection parameters (defined in settings.py).
- SPIDERNAMES: the names of the spiders that should use a proxy (defined in settings.py). Some pages can be fetched without a proxy, so there is no need to proxy everything. (spider.settings.get reads values from settings.py.)
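The getIP logic above assumes the pool is stored as a Redis hash whose values are "ip:port" strings. A minimal in-memory sketch of the same selection logic (the ippool dict and the get_proxy name are hypothetical stand-ins for the Redis hash and the middleware method):

```python
import random

# Stand-in for the Redis hash read with hvals(REDIS_QUEUE_NAME):
# keys are arbitrary, values are "ip:port" strings.
ippool = {
    'proxy:1': '1.2.3.4:8080',
    'proxy:2': '5.6.7.8:3128',
}

def get_proxy(pool):
    """Mirror ProxyMiddleware.getIP(): random pool value, 'http://' prefixed."""
    values = list(pool.values())  # what redis hvals() returns
    if not values:
        return None               # empty pool -> request goes out unproxied
    return 'http://{}'.format(random.choice(values))

print(get_proxy(ippool))  # e.g. http://1.2.3.4:8080
```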

settings.py parameters for the custom proxy middleware

To use the proxy middleware above, define a few custom settings in settings.py, as follows:

SPIDERNAMES = []  # names of the spiders that should use a proxy

# Redis configuration — fill in your own proxy-pool connection parameters

# production
SERVER_REDIS_HOST = 'xxx.xxx.xxx.xx'
SERVER_REDIS_PORT = xxx
SERVER_REDIS_DATABASE = x
SERVER_REDIS_PASSWORD = 'xxxx'
SERVER_REDIS_QUEUE_NAME = 'xxxx'

# local / test environment
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
REDIS_DATABASE = 0
REDIS_PASSWORD = ''
REDIS_QUEUE_NAME = 'ippool'

CONNECT_TYPE = 'server'  # which environment to run against

If you only have one environment, pick one side and set CONNECT_TYPE to match. For example, if your only pool's parameters are written into REDIS_HOST, REDIS_PORT, REDIS_DATABASE, REDIS_PASSWORD, and REDIS_QUEUE_NAME, set CONNECT_TYPE = 'localhost'. Alternatively, change the settings-reading logic in ProxyMiddleware.from_crawler.
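For the single-environment case, a minimal settings.py might look like this (values are illustrative only):

```python
# settings.py — single-environment setup
CONNECT_TYPE = 'localhost'   # matches the branch that reads the REDIS_* names
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
REDIS_DATABASE = 0
REDIS_PASSWORD = ''
REDIS_QUEUE_NAME = 'ippool'
```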

Custom UA middleware

This middleware simply checks whether a User-Agent was set manually: if one is already present it is left untouched, otherwise a random one is added. The UA middleware needs no extra settings.

import random


class UAMiddleware(object):

    user_agent_list = [
        "Mozilla/5.0 (Windows; U; Windows NT 5.2) Gecko/2008070208 Firefox/3.0.1",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
        "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
        "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
        "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
        "Opera/9.80 (Windows NT 5.1; U; zh-cn) Presto/2.9.168 Version/11.50",
        "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
        "Mozilla/5.0 (Windows NT 5.2) AppleWebKit/534.30 (KHTML, like Gecko) Chrome/12.0.742.122 Safari/534.30",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0",
        "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
        "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT)",
        "Mozilla/5.0 (Windows; U; Windows NT 5.2) Gecko/2008070208 Firefox/3.0.1",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070309 Firefox/2.0.0.3",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070803 Firefox/1.5.0.12",
    ]

    def process_request(self, request, spider):
        # Scrapy header lookups are case-insensitive, so one check suffices
        if request.headers.get('User-Agent'):
            print('UA already set, not replacing it')
        else:
            request.headers['User-Agent'] = random.choice(self.user_agent_list)

Enabling the proxy and UA middlewares

In settings.py, find DOWNLOADER_MIDDLEWARES and add both middlewares. Lower numbers run earlier in process_request, so here the proxy middleware (200) runs before the UA middleware (350):

DOWNLOADER_MIDDLEWARES = {
    'zmnProject.middlewares.ZmnprojectDownloaderMiddleware': 543,
    'your_project.middlewares.UAMiddleware': 350,
    'your_project.middlewares.ProxyMiddleware': 200,
}

Common settings.py parameters

- DOWNLOAD_DELAY: delay (in seconds) between consecutive downloads
- ROBOTSTXT_OBEY: whether to obey the site's robots.txt
- CONCURRENT_REQUESTS: maximum number of concurrent requests
- CONCURRENT_ITEMS: maximum number of items processed concurrently in the item pipelines
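Put together, a typical settings.py fragment might look like this (the values are illustrative only; tune them for your own project and target sites):

```python
# settings.py — common throttling parameters
DOWNLOAD_DELAY = 0.5       # seconds to wait between requests
ROBOTSTXT_OBEY = False     # whether to respect the target site's robots.txt
CONCURRENT_REQUESTS = 16   # max concurrent requests performed by the downloader
CONCURRENT_ITEMS = 100     # max items processed in parallel in the pipelines
```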
