Scrapy: passing arguments and starting spiders

start_requests

Scrapy lets you customize how a spider starts by defining a start_requests method, for example to begin crawling from particular links or to pass specific values in at startup.

If no customization is needed and you only want to start from a fixed set of links, you can omit start_requests and list the links in start_urls instead.

from typing import Any

import scrapy
from scrapy.http import Response


# Use the start_requests method to generate the initial requests
class BaiDuApi(scrapy.Spider):
    name = 'baiduapi'

    def start_requests(self):
        urls = [
            'https://httpbin.org/get?params=1',
            'https://httpbin.org/get?params=2'
        ]
        for url in urls:
            yield scrapy.Request(url)

    def parse(self, response: Response, **kwargs: Any) -> Any:
        pass
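The example above only yields bare requests. If you also want to pass specific values in at startup, as mentioned earlier, one common approach is to attach data to each request via cb_kwargs or meta. Below is a minimal sketch under that assumption; the page value, the 'source' key, and the spider name are illustrative and not part of the original example:

from typing import Any

import scrapy
from scrapy.http import Response


class BaiDuApiWithValues(scrapy.Spider):
    # Hypothetical variant of the spider above, shown only to illustrate
    # carrying values from start_requests into the callback.
    name = 'baiduapi_values'

    def start_requests(self):
        urls = [
            'https://httpbin.org/get?params=1',
            'https://httpbin.org/get?params=2'
        ]
        for page, url in enumerate(urls, start=1):
            # cb_kwargs entries arrive as keyword arguments of the callback;
            # meta travels with the request and is read via response.meta.
            yield scrapy.Request(
                url,
                callback=self.parse,
                cb_kwargs={'page': page},
                meta={'source': 'startup'}
            )

    def parse(self, response: Response, page: int = 0, **kwargs: Any) -> Any:
        self.log(f"page={page}, source={response.meta.get('source')}")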

from typing import Any

import scrapy
from scrapy.http import Response


# Use start_urls; this has the same effect as the first start_requests example above
class BaiDuApi(scrapy.Spider):
    name = 'baiduapi'
    start_urls = [
        'https://httpbin.org/get?params=1',
        'https://httpbin.org/get?params=2'
    ]

    def parse(self, response: Response, **kwargs: Any) -> Any:
        pass
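The two versions behave the same because, when start_requests is not overridden, the base scrapy.Spider builds the initial requests from start_urls itself. A simplified sketch of that default behavior (not the exact library source):

# Roughly what scrapy.Spider does when start_requests is not overridden
def start_requests(self):
    for url in self.start_urls:
        # dont_filter=True keeps the duplicate filter from dropping start URLs
        yield scrapy.Request(url, dont_filter=True)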

Passing arguments to a Scrapy spider

Scrapy accepts -a options at launch; each -a passes an argument to the spider being run, where it becomes an attribute of the spider instance.

import scrapy


class WangYiNew(scrapy.Spider):
    name = 'wangyinews'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36'
    }

    def start_requests(self):
        # Arguments passed with -a become attributes of the spider,
        # so read them with getattr and fall back to None if absent.
        article = getattr(self, "article", None)
        if article is not None:
            base_url = "https://www.163.com/news/article/" + article + '.html'
            yield scrapy.Request(
                base_url,
                headers=self.headers
            )

    def parse(self, response):
        items = {
            'title': response.xpath("//h1[@class='post_title']/text()").get(),
            'content': ''.join(response.xpath("//div[@class='post_body']//p//text()").getall()),
            'pubtime': ''.join(response.xpath("//div[@class='post_info']/text()").getall())
        }
        self.log(items)
        yield items

scrapy crawl wangyinews -a article=IFJ1RHSS000189FH
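Arguments passed with -a are also forwarded to the spider's __init__ as keyword arguments, so instead of getattr you can declare them explicitly in the constructor. A minimal sketch of that alternative; the class and spider name here are illustrative, not part of the spider above:

import scrapy


class WangYiNewWithInit(scrapy.Spider):
    # Hypothetical variant of wangyinews that accepts the argument in __init__.
    name = 'wangyinews_init'

    def __init__(self, article=None, *args, **kwargs):
        # -a article=... arrives here as the `article` keyword argument.
        super().__init__(*args, **kwargs)
        self.article = article

    def start_requests(self):
        if self.article is not None:
            yield scrapy.Request(
                "https://www.163.com/news/article/" + self.article + ".html"
            )

    def parse(self, response):
        pass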
