Contents

Newspaper

Installation

Hands-on

1. Scraping articles from CSDN

2. Reading NetEase News content

Summary

Newspaper

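Newspaper is a Python library built for extracting information from news sites and articles. It offers a simple, efficient way to fetch news pages, parse them, and pull out useful fields such as the article title, body text, authors, and publication date.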

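Newspaper has earned wide recognition from developers on GitHub, where its star count ranks fairly high, a sign of its popularity among Python scraping tools. It is particularly suited to news pages, because news sites tend to have fairly regular HTML structure and content layout.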

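Using Newspaper is simple, and even beginners with no scraping experience can get started quickly. You supply the URL of the target page, and Newspaper downloads and parses it automatically. For straightforward pages you also do not have to deal with request headers, IP proxies, HTML parsing, or the structure of the page source, which greatly simplifies writing a scraper.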

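Newspaper offers a rich set of extraction features. Beyond the article text, it can automatically identify and extract key metadata such as the authors and publication date. It supports many languages, including English, Chinese, German, and Arabic, so it can be used on news sites from different countries and regions. Another highlight is multi-threaded article downloading: it can fetch several pages concurrently, which greatly improves scraping throughput. It can also identify news links and extract text, images, and other media from HTML, giving a fuller picture of each story.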
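A minimal sketch of reading those fields, assuming the `title`, `authors`, `publish_date`, and `text` attributes that Newspaper's `Article` exposes after `parse()`, and the `language` argument for telling it which language a page is in. The names `article_fields` and `fetch_fields` are illustrative helpers, not part of the library.

```python
from types import SimpleNamespace


def article_fields(article):
    """Collect the commonly used fields of a parsed article into a dict."""
    return {
        "title": article.title,
        "authors": list(article.authors),
        "publish_date": article.publish_date,
        "text": article.text,
    }


def fetch_fields(url, language="zh"):
    """Download and parse one page with Newspaper (needs network access)."""
    from newspaper import Article  # imported lazily: third-party dependency

    article = Article(url, language=language)
    article.download()
    article.parse()
    return article_fields(article)


# Offline demonstration with a stand-in object that has the same attributes.
demo = SimpleNamespace(title="Example", authors=["A. Writer"],
                       publish_date=None, text="Body text.")
print(article_fields(demo)["title"])
```

Calling `fetch_fields(url)` on a real page returns the same dictionary shape; fields the library cannot find are typically left empty rather than raising an error.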

Installation

pip install newspaper3k

Note that the package name is newspaper3k, while the import statement is import newspaper.
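Because the package name and the import name differ, a quick check can confirm that the install worked. A minimal sketch using only the standard library; `is_installed` is an illustrative helper, not part of Newspaper.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


# The import name is "newspaper", even though the pip package is "newspaper3k".
print("newspaper importable:", is_installed("newspaper"))
```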

Installation log:

C:\Users>pip install newspaper3k
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting newspaper3k
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/d7/b9/51afecb35bb61b188a4b44868001de348a0e8134b4dfa00ffc191567c4b9/newspaper3k-0.2.8-py3-none-any.whl (211 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 211.1/211.1 kB 1.8 MB/s eta 0:00:00
Collecting beautifulsoup4>=4.4.1 (from newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/b1/fe/e8c672695b37eecc5cbf43e1d0638d88d66ba3a44c4d321c796f4e59167f/beautifulsoup4-4.12.3-py3-none-any.whl (147 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 147.9/147.9 kB 9.2 MB/s eta 0:00:00
Requirement already satisfied: Pillow>=3.3.0 in d:\program files\python\lib\site-packages (from newspaper3k) (10.2.0)
Requirement already satisfied: PyYAML>=3.11 in d:\program files\python\lib\site-packages (from newspaper3k) (6.0.1)
Collecting cssselect>=0.9.2 (from newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/06/a9/2da08717a6862c48f1d61ef957a7bba171e7eefa6c0aa0ceb96a140c2a6b/cssselect-1.2.0-py2.py3-none-any.whl (18 kB)
Collecting lxml>=3.6.0 (from newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/02/59/e1fbe2514d8ab39977b72e77f98d0fa49772f61e938049baf151b307a4f0/lxml-5.1.0-cp312-cp312-win_amd64.whl (3.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.9/3.9 MB 11.3 MB/s eta 0:00:00
Collecting nltk>=3.2.1 (from newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/a6/0a/0d20d2c0f16be91b9fa32a77b76c60f9baf6eba419e5ef5deca17af9c582/nltk-3.8.1-py3-none-any.whl (1.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.5/1.5 MB 19.2 MB/s eta 0:00:00
Requirement already satisfied: requests>=2.10.0 in d:\program files\python\lib\site-packages (from newspaper3k) (2.31.0)
Collecting feedparser>=5.2.1 (from newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/7c/d4/8c31aad9cc18f451c49f7f9cfb5799dadffc88177f7917bc90a66459b1d7/feedparser-6.0.11-py3-none-any.whl (81 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81.3/81.3 kB 4.4 MB/s eta 0:00:00
Collecting tldextract>=2.0.1 (from newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/fc/6d/8eaafb735b39c4ab3bb8fe4324ef8f0f0af27a7df9bb4cd503927bd5475d/tldextract-5.1.2-py3-none-any.whl (97 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 97.6/97.6 kB 5.8 MB/s eta 0:00:00
Collecting feedfinder2>=0.0.4 (from newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/35/82/1251fefec3bb4b03fd966c7e7f7a41c9fc2bb00d823a34c13f847fd61406/feedfinder2-0.0.4.tar.gz (3.3 kB)
  Preparing metadata (setup.py) ... done
Collecting jieba3k>=0.35.1 (from newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/a9/cb/2c8332bcdc14d33b0bedd18ae0a4981a069c3513e445120da3c3f23a8aaa/jieba3k-0.35.1.zip (7.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.4/7.4 MB 15.8 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Requirement already satisfied: python-dateutil>=2.5.3 in d:\program files\python\lib\site-packages (from newspaper3k) (2.8.2)
Collecting tinysegmenter==0.3 (from newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/17/82/86982e4b6d16e4febc79c2a1d68ee3b707e8a020c5d2bc4af8052d0f136a/tinysegmenter-0.3.tar.gz (16 kB)
  Preparing metadata (setup.py) ... done
Collecting soupsieve>1.2 (from beautifulsoup4>=4.4.1->newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/4c/f3/038b302fdfbe3be7da016777069f26ceefe11a681055ea1f7817546508e3/soupsieve-2.5-py3-none-any.whl (36 kB)
Requirement already satisfied: six in d:\program files\python\lib\site-packages (from feedfinder2>=0.0.4->newspaper3k) (1.16.0)
Collecting sgmllib3k (from feedparser>=5.2.1->newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/9e/bd/3704a8c3e0942d711c1299ebf7b9091930adae6675d7c8f476a7ce48653c/sgmllib3k-1.0.0.tar.gz (5.8 kB)
  Preparing metadata (setup.py) ... done
Collecting click (from nltk>=3.2.1->newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/00/2e/d53fa4befbf2cfa713304affc7ca780ce4fc1fd8710527771b58311a3229/click-8.1.7-py3-none-any.whl (97 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 97.9/97.9 kB 5.5 MB/s eta 0:00:00
Collecting joblib (from nltk>=3.2.1->newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/10/40/d551139c85db202f1f384ba8bcf96aca2f329440a844f924c8a0040b6d02/joblib-1.3.2-py3-none-any.whl (302 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 302.2/302.2 kB 19.5 MB/s eta 0:00:00
Collecting regex>=2021.8.3 (from nltk>=3.2.1->newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/1d/af/4bd17254cdda1d8092460ee5561f013c4ca9c33ecf1aab81b44280327cab/regex-2023.12.25-cp312-cp312-win_amd64.whl (268 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 268.9/268.9 kB 17.2 MB/s eta 0:00:00
Collecting tqdm (from nltk>=3.2.1->newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/2a/14/e75e52d521442e2fcc9f1df3c5e456aead034203d4797867980de558ab34/tqdm-4.66.2-py3-none-any.whl (78 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.3/78.3 kB 4.3 MB/s eta 0:00:00
Requirement already satisfied: charset-normalizer<4,>=2 in d:\program files\python\lib\site-packages (from requests>=2.10.0->newspaper3k) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in d:\program files\python\lib\site-packages (from requests>=2.10.0->newspaper3k) (3.6)
Requirement already satisfied: urllib3<3,>=1.21.1 in d:\program files\python\lib\site-packages (from requests>=2.10.0->newspaper3k) (2.1.0)
Requirement already satisfied: certifi>=2017.4.17 in d:\program files\python\lib\site-packages (from requests>=2.10.0->newspaper3k) (2023.11.17)
Collecting requests-file>=1.4 (from tldextract>=2.0.1->newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/10/fd/321e33597e09cb4368d361b0b6c6573ef45d5f693acef41ba33673a55b7c/requests_file-2.0.0-py2.py3-none-any.whl (4.2 kB)
Collecting filelock>=3.0.8 (from tldextract>=2.0.1->newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/81/54/84d42a0bee35edba99dee7b59a8d4970eccdd44b99fe728ed912106fc781/filelock-3.13.1-py3-none-any.whl (11 kB)
Collecting colorama (from click->nltk>=3.2.1->newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/d1/d6/3965ed04c63042e047cb6a3e6ed1a63a35087b6a609aa3a15ed8ac56c221/colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Building wheels for collected packages: tinysegmenter, feedfinder2, jieba3k, sgmllib3k
  Building wheel for tinysegmenter (setup.py) ... done
  Created wheel for tinysegmenter: filename=tinysegmenter-0.3-py3-none-any.whl size=13568 sha256=d393a7188655925876d6346456dfefb3c3c78d46928be57e0c464f6dbd4a02a2
  Stored in directory: c:\users\boyso\appdata\local\pip\cache\wheels\b2\9d\99\03ac91b1b064af680304b0051f838ec5b8b6f1507e5f3dd39e
  Building wheel for feedfinder2 (setup.py) ... done
  Created wheel for feedfinder2: filename=feedfinder2-0.0.4-py3-none-any.whl size=3359 sha256=da89dae9448e7f383243238a35d45989a93841af7e7d0b3da14a1c4c403f8c2a
  Stored in directory: c:\users\boyso\appdata\local\pip\cache\wheels\97\c2\39\7f52b924caacbee2039a186a0ac787062f9eaeb165d73e4e94
  Building wheel for jieba3k (setup.py) ... done
  Created wheel for jieba3k: filename=jieba3k-0.35.1-py3-none-any.whl size=7398388 sha256=cb8053d445dc4c9cb499865e105bba8a69f65022ce15b205da8537ce382779a2
  Stored in directory: c:\users\boyso\appdata\local\pip\cache\wheels\4a\90\9d\d6bbab88e3ba8442ab9ff803197693859ef98b623f0123c68f
  Building wheel for sgmllib3k (setup.py) ... done
  Created wheel for sgmllib3k: filename=sgmllib3k-1.0.0-py3-none-any.whl size=6060 sha256=6b82531de468b4b36ab4bc73ca2bf3532f0e75233140d1ed66f5ec5862ed2c8b
  Stored in directory: c:\users\boyso\appdata\local\pip\cache\wheels\4f\48\8e\790492080c5e85ff08704e66c6e1cc9bd96ac391bb890426bf
Successfully built tinysegmenter feedfinder2 jieba3k sgmllib3k
Installing collected packages: tinysegmenter, sgmllib3k, jieba3k, soupsieve, regex, lxml, joblib, filelock, feedparser, cssselect, colorama, tqdm, requests-file, click, beautifulsoup4, tldextract, nltk, feedfinder2, newspaper3k
Successfully installed beautifulsoup4-4.12.3 click-8.1.7 colorama-0.4.6 cssselect-1.2.0 feedfinder2-0.0.4 feedparser-6.0.11 filelock-3.13.1 jieba3k-0.35.1 joblib-1.3.2 lxml-5.1.0 newspaper3k-0.2.8 nltk-3.8.1 regex-2023.12.25 requests-file-2.0.0 sgmllib3k-1.0.0 soupsieve-2.5 tinysegmenter-0.3 tldextract-5.1.2 tqdm-4.66.2

Hands-on

1. Scraping articles from CSDN

from bs4 import BeautifulSoup
from newspaper import Article

usr = 'boysoft2002'  # can be any other user ID; note this is the ID, not the nickname
url = 'https://blog.csdn.net/' + usr

# Download the user's blog home page and collect every <a> tag on it.
article = Article(url)
article.download()
soup = BeautifulSoup(article.html, 'html.parser')
article_links = soup.find_all('a')

for link in article_links:
    link = link.get('href')
    # Only follow links that point to this user's article pages.
    if link and link.startswith(url + '/article/details'):
        article = Article(link)
        article.download()
        article.parse()
        # [:-7] drops the last 7 characters of the page title (the site-name suffix).
        print(f"Title:{article.title[:-7]}")
        print(f"Link:{article.url}")
        print(f"Text:{article.text}\n")

print('Completed!')

Output:

Title: Git 分布式版本控制系统基本概念和操作命令
Link: https://blog.csdn.net/boysoft2002/article/details/136970364
Text: Squeezed text(158 lines).

Title: python共有26个内置类,你知道几个?
Link: https://blog.csdn.net/boysoft2002/article/details/136953633
Text: 本文要介绍的 Python 内置函数属于核心语言的一部分。我们在介绍数据类型、控制结构的文章中已经逐渐学习了 Python 的基本语法,不过还是需要一个时机去学习 Python 中零落无序的内置函数,这就是本期文章的目的。Python 官方提供了 68 个内置函数,这些内置函数主要提供简单且基础的功能,实用性高。需要提醒的是,我们在使用自定义函数时,应该尽量避免函数名与内置函数的名称一样,否则有可能导致程序异常。下面我们以处理数据为目的,分类别向大家介绍 Python 中常用的内置函数。

Title: Help on built-in functions in module builtins (74)
Link: https://blog.csdn.net/boysoft2002/article/details/136951153
Text: 这是Python解释器给出的print()函数的帮助信息。
print()函数是Python内置函数之一,它可以将指定的参数打印输出到标准输出流(默认为sys.stdout)或指定的文件流中。print()函数的参数可以是任意类型的对象,包括字符串、数字、列表、元组、字典等等。
print()函数的参数包括:
- *args:表示可变参数,可以传入任意个参数,多个参数之间用逗号隔开。
- sep:表示输出多个参数时,参数之间的分隔符,默认值是一个空格。
- end:表示输出结束时的字符,默认值是一个换行符。
- file:表示输出的目标文件流,默认值是sys.stdout,即标准输出流。
- flush:表示是否立即刷新输出流,默认值是False,即不立即刷新。
例如,下面的语句将会输出三个字符串,并且每个字符串之间用逗号隔开,最后不换行:
```
print("hello", "world", "Python", sep=", ", end="")
```

Title: python 教你如何创建一个自定义库 colorlib.py
Link: https://blog.csdn.net/boysoft2002/article/details/136861675
Text: Squeezed text(1311 lines).

Title: Http 超文本传输协议基本概念学习摘录
Link: https://blog.csdn.net/boysoft2002/article/details/136851562
Text: Squeezed text(238 lines).

Title: python自定义日历库,与对应calendar库函数功能基本一致
Link: https://blog.csdn.net/boysoft2002/article/details/136823417
Text: Squeezed text(497 lines).

Title: python calendar内置日历库函数方法
Link: https://blog.csdn.net/boysoft2002/article/details/136770420
Text: Squeezed text(409 lines).

Title: Python 一步一步教你用pyglet制作汉诺塔游戏(终篇)
Link: https://blog.csdn.net/boysoft2002/article/details/136008639
Text: Squeezed text(322 lines).

Title: Python 一步一步教你用pyglet制作汉诺塔游戏(续)
Link: https://blog.csdn.net/boysoft2002/article/details/136634444
Text: Squeezed text(368 lines).

Title: Python 一步一步教你用pyglet制作汉诺塔游戏
Link: https://blog.csdn.net/boysoft2002/article/details/136598320
Text: Squeezed text(191 lines).

Title: Python 初步了解urllib库:网络请求的利器
Link: https://blog.csdn.net/boysoft2002/article/details/136589553
Text: Squeezed text(208 lines).

Title: Python 一步一步教你用pyglet仿制鸿蒙系统里的时钟
Link: https://blog.csdn.net/boysoft2002/article/details/136578359
Text: ... (omitted) ...

Title: Python 一步一步教你用pyglet制作可播放音乐的扬声器类
Link: https://blog.csdn.net/boysoft2002/article/details/136522563
Text: ... (omitted) ...

Title: python INI文件操作与configparser内置库
Link: https://blog.csdn.net/boysoft2002/article/details/136546933
Text: Squeezed text(570 lines).

Title: Pandas DataFrame 基本操作实例100个
Link: https://blog.csdn.net/boysoft2002/article/details/136437876
Text: Squeezed text(498 lines).

Title: Pyglet图形界面版2048游戏——详尽实现教程(上)
Link: https://blog.csdn.net/boysoft2002/article/details/136404961
Text: Squeezed text(308 lines).

Title: 常用SQL查询方法与实例
Link: https://blog.csdn.net/boysoft2002/article/details/136380309
Text: Squeezed text(277 lines).

Title: python 小游戏《2048》字符版非图形界面
Link: https://blog.csdn.net/boysoft2002/article/details/136329625
Text: Squeezed text(511 lines).

Title: python|闲谈2048小游戏和数组的旋转及翻转和转置
Link: https://blog.csdn.net/boysoft2002/article/details/136329641
Text: Squeezed text(919 lines).

Completed!

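Note: "Squeezed text(xxx lines)." appears because the output is too long and IDLE collapses it; clicking the text on the yellow background expands it. Expanding every entry uses a lot of memory, though, so the code below is modified to write the output to a text file, which can then be opened and read at leisure.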

from bs4 import BeautifulSoup
from newspaper import Article

usr = 'boysoft2002'
url = 'https://blog.csdn.net/' + usr

article = Article(url)
article.download()
soup = BeautifulSoup(article.html, 'html.parser')
article_links = soup.find_all('a')

index = 0
# The with-block closes the file even if a download or parse fails midway.
with open('csdnDoc.txt', 'w', encoding='utf-8') as file:
    for link in article_links:
        link = link.get('href')
        if link and link.startswith(url + '/article/details'):
            article = Article(link)
            article.download()
            article.parse()
            index += 1
            # Write to the file instead of the console; number each article.
            print(f"Title-{index:02}:{article.title[:-7]}", file=file)
            print(f"Link:{article.url}", file=file)
            print(f"Text:{article.text}\n", file=file)

print('Completed!')
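Both loops above download every matching `<a>` tag they meet, so a page that links to the same article more than once would fetch it repeatedly. One possible refinement, sketched here with a hypothetical helper `unique_article_links` that is not part of Newspaper or BeautifulSoup, filters the `href` values by prefix and drops repeats while preserving order:

```python
def unique_article_links(hrefs, prefix):
    """Return hrefs that start with prefix, without duplicates, in first-seen order."""
    seen = set()
    result = []
    for href in hrefs:
        # Skip missing hrefs (None or '') and links outside the article path.
        if not href or not href.startswith(prefix):
            continue
        if href in seen:
            continue
        seen.add(href)
        result.append(href)
    return result


# Offline demonstration with made-up href values.
prefix = 'https://blog.csdn.net/boysoft2002/article/details'
hrefs = [
    None,
    prefix + '/1',
    'https://blog.csdn.net/other/article/details/9',
    prefix + '/2',
    prefix + '/1',  # repeated link
]
print(unique_article_links(hrefs, prefix))
```

In the scripts above it could be fed with `(a.get('href') for a in soup.find_all('a'))` and the same `url + '/article/details'` prefix.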

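As the code shows, reading page content with the newspaper library is quite convenient: there was no need to set headers, IP proxies, and so on. For comparison, here is the code that fetches the links with requests: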

import requests
from bs4 import BeautifulSoup

url = 'https://blog.csdn.net/boysoft2002'

# With requests, a browser-like User-Agent has to be supplied by hand.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

response = requests.get(url, headers=headers)
response.raise_for_status()  # raise an exception on 4xx/5xx responses

soup = BeautifulSoup(response.text, 'html.parser')
article_links = soup.find_all('a')

# ......
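Newspaper can also carry these request details when a site does need them. The sketch below assumes newspaper3k's `Config` object with its `browser_user_agent` and `request_timeout` attributes, and the `config` argument of `Article`. The helper names `make_config` and `download_with_config` are illustrative, not part of the library.

```python
def make_config(user_agent, timeout_seconds):
    """Build a Newspaper Config carrying a custom User-Agent and timeout."""
    from newspaper import Config  # imported lazily: third-party dependency

    config = Config()
    config.browser_user_agent = user_agent
    config.request_timeout = timeout_seconds
    return config


def download_with_config(url, config):
    """Download and parse one page using the given Config (needs network access)."""
    from newspaper import Article

    article = Article(url, config=config)
    article.download()
    article.parse()
    return article
```

A call such as `download_with_config(url, make_config(headers['User-Agent'], 10))` would reuse the same User-Agent string as the requests version.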

2. Reading NetEase News content

from bs4 import BeautifulSoup
from newspaper import Article

url = 'https://news.163.com/world/'

# Download the world-news index page and collect every <a> tag on it.
article = Article(url)
article.download()
soup = BeautifulSoup(article.html, 'html.parser')
article_links = soup.find_all('a')

for link in article_links:
    link = link.get('href')
    # Article pages on this site contain '/article' in their URL.
    if link and '/article' in link:
        article = Article(link)
        article.download()
        article.parse()
        print(f"Title:{article.title}")
        print(f"Link:{article.url}")
        print(f"Text:{article.text}\n")

print('Completed!')

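The news titles and links are all retrieved, but relatively little of the body text is extracted. Readers can try other news sites to see how well they scrape.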
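When comparing sites, it helps to count how many pages actually yield body text. A rough sketch, assuming a caller-supplied function `get_text` (hypothetical) that returns the extracted text for a link, for example by wrapping the `Article` download and parse steps used above:

```python
def extraction_report(links, get_text):
    """Count pages with text, pages with empty text, and pages that failed."""
    report = {"with_text": 0, "empty": 0, "failed": 0}
    for link in links:
        try:
            text = get_text(link)
        except Exception:  # broad on purpose: one bad page should not stop the run
            report["failed"] += 1
            continue
        if text and text.strip():
            report["with_text"] += 1
        else:
            report["empty"] += 1
    return report


# Offline demonstration with canned results instead of real downloads.
def fake_get_text(link):
    pages = {"a": "Some body text.", "b": "   "}
    return pages[link]  # raises KeyError for unknown links


print(extraction_report(["a", "b", "c"], fake_get_text))
```

A high "empty" count on a site suggests that its article body is not where Newspaper's extractor expects it, for instance because it is loaded by JavaScript.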

Summary

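Overall, Newspaper is a Python library well suited to beginners and to news-scraping tasks. It is easy to use, feature-rich, and helps you quickly extract the information you need from news sites. For more complex projects, or for sites with strong anti-scraping measures, however, you may run into various bugs or have requests rejected outright by the target site. In those cases you need to combine it with other tools or frameworks to get more stable and efficient scraping.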
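One simple measure for transient failures is to retry with exponential backoff. This is a general technique rather than a Newspaper feature, and it will not get past a site that is deliberately blocking scrapers. In the sketch below, `fetch_with_retry` and all of its parameters are hypothetical names, and `fetch` stands for any function that downloads one URL:

```python
import time


def fetch_with_retry(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); after a failure wait base_delay * 2**n, then try again."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)


# Offline demonstration: a fake fetch that fails twice, then succeeds.
calls = []
delays = []


def flaky(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


# Record the delays instead of actually sleeping.
result = fetch_with_retry(flaky, "https://example.com", sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the demonstration instant; in real use the default `time.sleep` applies the waits.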
