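I hadn't touched puppeteer in a long time. Recently I read an article that seemed to solve one annoyance: when puppeteer visits a system that requires login, you have to log in by hand on every run.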

So I took another look at it.

1. Specifying the browser

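My machine already has Chrome and I didn't want puppeteer to install a second copy, so I installed only puppeteer-core. In that case you have to pass executablePath to launch to tell it which Chrome to use. The usual install location on macOS is: /Applications/Google Chrome.app/Contents/MacOS/Google Chrome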
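If the script has to run on more than one machine, a hard-coded path is fragile. Below is a minimal sketch, assuming a hypothetical helper `pickExecutable` and a hypothetical candidate table. Only the macOS path comes from this post; the Windows and Linux entries are common default install locations and may not match your setup.

```javascript
// Candidate Chrome locations keyed by process.platform.
// Only the darwin entry is from this post; the others are assumed defaults.
const CHROME_CANDIDATES = {
  darwin: ['/Applications/Google Chrome.app/Contents/MacOS/Google Chrome'],
  win32: ['C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe'],
  linux: ['/usr/bin/google-chrome'],
};

// Hypothetical helper: return the first path that `exists` accepts, else null.
// `exists` is injected so the lookup can be checked without touching the disk.
function pickExecutable(candidates, exists) {
  return candidates.find((p) => exists(p)) ?? null;
}

// Usage in the launch script:
//   import { existsSync } from 'node:fs';
//   const executablePath =
//     pickExecutable(CHROME_CANDIDATES[process.platform] ?? [], existsSync);
//   if (!executablePath) throw new Error('Chrome not found, set the path by hand');
```

Passing the existence check in as a parameter keeps the helper free of file system access, so it behaves the same whether the project uses ES modules or CommonJS.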

2. Specifying userDataDir

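Set userDataDir to a directory that stores the user data. This step is the key one. Once userDataDir is set, puppeteer reads the data in that directory, so you only need to run the script once and log in by hand; the next time the script runs, no manual login is needed. Of course, if the target system expires sessions, the saved login may no longer be valid on a later run, and you will have to log in by hand again.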

Basic usage:

import puppeteer from 'puppeteer-core';
import os from 'node:os';
import path from 'node:path';

// Small helper for pausing between steps.
const sleep = (milliseconds) => new Promise(r => setTimeout(r, milliseconds));

const browser = await puppeteer.launch({
  headless: false,
  executablePath: '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
  // Width and height of 0, intended to let pages use the real window size.
  defaultViewport: {
    width: 0,
    height: 0
  },
  // Profile directory that keeps the login state between runs.
  userDataDir: path.join(os.homedir(), '.puppeteer-data')
});
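The "log in once by hand, reuse the session afterwards" flow can be sketched as a small helper. This is only an illustration: `ensureLoggedIn` and its return values are hypothetical, and `loggedInSelector` is an assumption, since you have to pick an element that exists on the target site only when you are logged in.

```javascript
// Hypothetical helper: open `url`; if the element that marks a logged-in
// session is missing, wait for the user to log in by hand in the open window.
async function ensureLoggedIn(page, url, loggedInSelector, timeoutMs = 120000) {
  await page.goto(url);
  // page.$ resolves to null when nothing matches the selector.
  const marker = await page.$(loggedInSelector);
  if (marker) return 'reused-session';
  // Waits until the selector appears; rejects if the timeout is reached first.
  await page.waitForSelector(loggedInSelector, { timeout: timeoutMs });
  return 'manual-login';
}

// Usage (URL and selector are placeholders):
//   const page = await browser.newPage();
//   await ensureLoggedIn(page, 'https://example.com/dashboard', '#user-menu');
```

On the first run the helper should block until you finish logging in. On later runs, as long as the session stored in userDataDir is still valid, it should return right away.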

3. Pages fail to load with headless: true

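I was writing an automatic check-in script. With headless: false it ran fully automatically with no problems, but as soon as I switched to headless: true it stopped working. Debugging showed that the HTML I got back had an empty head and body.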

To test this, I wrote a simple node script that prints the request headers:

import http from 'node:http';

const server = http.createServer((req, res) => {
  console.log(`Received ${req.method} request for ${req.url}`);
  console.log(`Headers: ${JSON.stringify(req.headers)}`);
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  res.write('Hello, World!');
  res.end();
});

server.listen(4000, () => {
  console.log('Server listening on port 4000');
});

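I visited it once with headless: false and once with headless: true, and found that the request headers were the cause. With headless: true the server logged: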

Received GET request for /
Headers: {"host":"127.0.0.1:4000","connection":"keep-alive","sec-ch-ua":"\"HeadlessChrome\";v=\"113\", \"Chromium\";v=\"113\", \"Not-A.Brand\";v=\"24\"","sec-ch-ua-mobile":"?0","sec-ch-ua-platform":"\"macOS\"","upgrade-insecure-requests":"1","user-agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/113.0.5672.126 Safari/537.36","accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7","sec-fetch-site":"none","sec-fetch-mode":"navigate","sec-fetch-user":"?1","sec-fetch-dest":"document","accept-encoding":"gzip, deflate, br"}
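Comparing two long header dumps by eye is error-prone. Here is a small sketch, assuming you have saved the JSON printed by the test server for each run. `diffHeaders` is a hypothetical helper, not part of puppeteer or node.

```javascript
// Hypothetical helper: list the header names whose values differ between two
// captured `req.headers` objects, including names present in only one of them.
function diffHeaders(a, b) {
  const names = new Set([...Object.keys(a), ...Object.keys(b)]);
  const changed = {};
  for (const name of names) {
    if (a[name] !== b[name]) changed[name] = { a: a[name], b: b[name] };
  }
  return changed;
}

// Usage: paste the two JSON objects printed after "Headers: " and compare.
//   const headed = JSON.parse('...');   // from the headless: false run
//   const headless = JSON.parse('...'); // from the headless: true run
//   console.log(diffHeaders(headed, headless));
```

In the capture above, "HeadlessChrome" shows up in both user-agent and sec-ch-ua, so those are the entries to look for in the result.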

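Visiting the home page of the same site still returned data, so my guess is that the target page handles headless browsers specially. Setting the header (here, the user agent) in the script is enough: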

await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36');
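Instead of hard-coding a user agent string from some other browser version, you could derive it from the one the browser already reports. This is a minimal sketch, assuming the only problem is the "HeadlessChrome" token seen in the captured user-agent. `toRegularUserAgent` is a hypothetical helper.

```javascript
// Hypothetical helper: turn a headless user agent into the one a regular
// Chrome of the same version would send, by renaming the product token.
function toRegularUserAgent(userAgent) {
  return userAgent.replace('HeadlessChrome/', 'Chrome/');
}

// Usage with puppeteer (browser.userAgent() resolves to the current UA string):
//   await page.setUserAgent(toRegularUserAgent(await browser.userAgent()));
```

This leaves the sec-ch-ua header untouched, and in the capture above that header also names HeadlessChrome. Whether the target site checks it is not known, so treat this as a starting point.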
