Scrapy: parameter passing and startup
start_requests
Scrapy lets you define a start_requests method to customize how a spider starts crawling, for example starting from particular links or passing specific values at startup.
If no customization is needed and the spider simply starts from a fixed set of links, you can omit start_requests and list those links in the start_urls class attribute instead.
```python
from typing import Any

import scrapy
from scrapy.http import Response


# Using a custom start_requests method
class BaiDuApi(scrapy.Spider):
    name = 'baiduapi'

    def start_requests(self):
        urls = [
            'https://httpbin.org/get?params=1',
            'https://httpbin.org/get?params=2'
        ]
        for url in urls:
            yield scrapy.Request(url)

    def parse(self, response: Response, **kwargs: Any) -> Any:
        pass
```
```python
from typing import Any

import scrapy
from scrapy.http import Response


# Using start_urls -- equivalent to the code above
class BaiDuApi(scrapy.Spider):
    name = 'baiduapi'
    start_urls = [
        'https://httpbin.org/get?params=1',
        'https://httpbin.org/get?params=2'
    ]

    def parse(self, response: Response, **kwargs: Any) -> Any:
        pass
```
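The two spiders behave the same because the base Spider class ships a default start_requests that simply iterates start_urls and yields one request per URL. A minimal plain-Python sketch of that mechanism, assuming a hypothetical `Request` stand-in so the example runs without Scrapy installed:

```python
# Sketch of Scrapy's default start_requests behavior.
# `Request` is a hypothetical stand-in for scrapy.Request.

class Request:
    def __init__(self, url, dont_filter=False):
        self.url = url
        self.dont_filter = dont_filter


class Spider:
    start_urls = []

    def start_requests(self):
        # Roughly what the default implementation does:
        # one Request per entry in start_urls.
        for url in self.start_urls:
            yield Request(url, dont_filter=True)


class BaiDuApi(Spider):
    start_urls = [
        'https://httpbin.org/get?params=1',
        'https://httpbin.org/get?params=2'
    ]


spider = BaiDuApi()
urls = [r.url for r in spider.start_requests()]
print(urls)
```

Overriding start_requests therefore just replaces this loop with your own request-generation logic.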
Scrapy parameter passing
When launching a crawl, you can append one or more -a options to the command line; each -a passes a key=value pair into the spider being run.
```python
import scrapy


class WangYiNew(scrapy.Spider):
    name = 'wangyinews'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36'
    }

    def start_requests(self):
        # The value passed via `-a article=...` becomes an attribute
        # on the spider instance.
        article = getattr(self, "article", None)
        if article is not None:
            base_url = "https://www.163.com/news/article/" + article + '.html'
            yield scrapy.Request(
                base_url,
                headers=self.headers
            )

    def parse(self, response):
        items = {
            'title': response.xpath("//h1[@class='post_title']/text()").get(),
            'content': ''.join(response.xpath("//div[@class='post_body']//p//text()").getall()),
            'pubtime': ''.join(response.xpath("//div[@class='post_info']/text()").getall())
        }
        self.log(items)
        yield items
```
Launch the spider as follows (note there must be no spaces around `=`, otherwise the shell splits the option into separate arguments):

```shell
scrapy crawl wangyinews -a article=IFJ1RHSS000189FH
```
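Scrapy forwards each -a pair to the spider's constructor as a keyword argument, and the base Spider's __init__ copies it onto the instance, which is why getattr(self, "article", None) works above. A minimal plain-Python sketch of that mechanism, assuming stand-in classes rather than Scrapy's actual code (build_start_url is a hypothetical helper for illustration):

```python
# Sketch of how `-a key=value` reaches the spider: the pairs arrive
# as keyword arguments and __init__ copies them onto self.

class Spider:
    def __init__(self, **kwargs):
        # Roughly what scrapy.Spider.__init__ does with -a arguments.
        self.__dict__.update(kwargs)


class WangYiNew(Spider):
    name = 'wangyinews'

    def build_start_url(self):
        # Same getattr pattern as in the real spider above.
        article = getattr(self, "article", None)
        if article is None:
            return None
        return "https://www.163.com/news/article/" + article + '.html'


# Equivalent of: scrapy crawl wangyinews -a article=IFJ1RHSS000189FH
spider = WangYiNew(article="IFJ1RHSS000189FH")
print(spider.build_start_url())
```

The getattr default of None is what lets the spider run safely even when no -a option is given.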