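Scraping articles from a blog site with Python

Preface

When it comes to scraping articles, Python is the most convenient choice. It offers many scraping libraries that are quick and easy to use. The tools are numerous but differ little from one another, so mastering one of them is enough.

Below I use Juejin as an example to show how to scrape the articles on a site: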

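Analyzing the site structure

Open the browser's developer console. Analysis shows that Juejin differs from Jianshu: its article links are all requested dynamically through an API rather than served as static pages by an Nginx server.

Since we want many articles rather than a single one, we must first collect the article links and then scrape the articles one by one from those links.

Getting the article links

In the Recommended tab, scrolling down to load more articles makes it easy to capture the API endpoint and its request parameters. We only need to copy them and replay the request from Python.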

Request endpoint:

https://api.juejin.cn/recommend_api/v1/article/recommend_cate_feed?aid=xxx&uuid=xxx

Request parameters:

{"id_type":2,"sort_type":200,"cate_id":"6809637769959178254","cursor":"1","limit":20}

The cursor field is the page number. By incrementing it in a loop we can keep fetching more data.
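As a minimal sketch of that paging scheme, the request body for each page can be built from a template. The helper name build_body is hypothetical; the field values are the ones captured above, and the sketch assumes, as observed above, that cursor is simply the page number sent as a string.

```python
import json

# Field values copied from the captured request; only cursor changes per page.
BODY_TEMPLATE = {
    "id_type": 2,
    "sort_type": 200,
    "cate_id": "6809637769959178254",
    "cursor": "1",
    "limit": 20,
}


def build_body(page):
    """Return the JSON request body for the given 1-based page number.

    build_body is a hypothetical helper, not part of any library.
    """
    body = dict(BODY_TEMPLATE)
    body["cursor"] = str(page)  # the captured request sends cursor as a string
    return json.dumps(body)


if __name__ == "__main__":
    for page in range(1, 4):
        print(build_body(page))
```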

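Response data:

Each item in the response's data list carries an article_info object with an article_id field. This article_id is the value we need: appending it to https://juejin.cn/post/ gives the article's actual address, for example: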

https://juejin.cn/post/7016520448204603423
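The mapping from a response to article addresses can be sketched as a small function. It assumes the response shape read by the script in this article (a top-level data list whose items hold article_info.article_id); extract_links is a hypothetical name, and the sample response is hand-written and trimmed to the fields the function reads.

```python
POST_URL_PREFIX = "https://juejin.cn/post/"


def extract_links(response_json):
    """Build article URLs from a decoded API response.

    Items that lack article_info.article_id are skipped.
    """
    links = []
    for item in response_json.get("data") or []:
        article_id = (item.get("article_info") or {}).get("article_id")
        if article_id:
            links.append(POST_URL_PREFIX + article_id)
    return links


if __name__ == "__main__":
    # Hand-written sample, not a real API response
    sample = {"data": [{"article_info": {"article_id": "7016520448204603423"}}]}
    print(extract_links(sample))
```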

With the analysis done, here is the implementation:

# coding:utf-8

import codecs
import json
import time

import requests

# File that stores every article link collected so far
cache_file_name = 'temp_juejin.txt'

# Links that are already in the cache file
cache = []


def loadCache():
    with codecs.open(cache_file_name, "r", "utf-8") as fr:
        for line in fr:
            # Strip the trailing newline: a URL that contains one returns 404
            cache.append(line.strip())


def startScrape():
    apiUrl = 'https://api.juejin.cn/recommend_api/v1/article/recommend_cate_feed?aid=2608&uuid=7023196943133656589'
    HEADERS = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:11.0) Gecko/20100101 Firefox/11.0',
        'Accept-Encoding': 'gzip,deflate,sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Content-type': 'application/json; charset=UTF-8',
        'accept': 'application/json, text/plain, */*',
    }
    for index in range(1000):
        # cursor is the page number, sent as a string
        body = {"id_type": 2, "sort_type": 200, "cate_id": "6809637769959178254", "cursor": str(index + 1), "limit": 20}
        print(body)
        r = requests.post(url=apiUrl, headers=HEADERS, data=json.dumps(body))
        print(r.status_code)

        res = json.loads(r.content)
        items = res.get("data")
        if not items:
            # No more data: stop paging
            break

        with codecs.open(cache_file_name, "a", "utf-8") as f:
            for item in items:
                article_id = item["article_info"]["article_id"]
                link = "https://juejin.cn/post/" + article_id
                print(link)
                # Only record links that are not in the cache yet
                if link not in cache:
                    cache.append(link)
                    f.write(link + "\n")
                    print("Added a new link")
        # Pause between requests
        time.sleep(2)


def job():
    loadCache()
    startScrape()


if __name__ == '__main__':
    job()

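Before running this code, create a file named temp_juejin.txt in the same directory; it stores all the article links that are collected. For beginners, two things deserve attention: the JSON conversion and the request headers. If the body is not converted with json.dumps, the request fails. If the Content-type and accept headers are missing or wrong, the request succeeds but does not return the expected data. Many people overlook this.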
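To see why the json.dumps step matters, compare two encodings of the same body. When requests is handed a plain dict as data it form-encodes it, which is the format urlencode produces below, whereas this API is sent JSON text. The snippet only illustrates the two wire formats with the standard library; it sends nothing.

```python
import json
from urllib.parse import urlencode

body = {"id_type": 2, "sort_type": 200, "cate_id": "6809637769959178254", "cursor": "1", "limit": 20}

# Form encoding: what a plain dict turns into when sent as form data
form_encoded = urlencode(body)
# JSON text: what json.dumps produces, matching Content-type: application/json
json_encoded = json.dumps(body)

print(form_encoded)
print(json_encoded)
```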

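Once the script has run, temp_juejin.txt holds one article link per line.

With the links in hand, the next step is to scrape the articles one by one.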
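The link-collecting script keeps the links it has already seen in a list and checks each new link against it. Here is a minimal sketch of the same idea with a set, which keeps the membership check cheap as the file grows. load_links and add_link are hypothetical names, and unlike the script this version tolerates a missing file instead of requiring it to be created by hand.

```python
import os
import tempfile


def load_links(path):
    """Return the set of links stored in path, one per line (empty if missing)."""
    if not os.path.exists(path):
        return set()
    with open(path, "r", encoding="utf-8") as fr:
        return {line.strip() for line in fr if line.strip()}


def add_link(path, seen, link):
    """Append link to path unless it was seen before; return True if added."""
    if link in seen:
        return False
    with open(path, "a", encoding="utf-8") as fw:
        fw.write(link + "\n")
    seen.add(link)
    return True


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "temp_juejin.txt")
    seen = load_links(demo_path)
    print(add_link(demo_path, seen, "https://juejin.cn/post/7016520448204603423"))
    print(add_link(demo_path, seen, "https://juejin.cn/post/7016520448204603423"))
```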

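Scraping the articles

For parsing the HTML I use PyQuery here. For its usage, see the article "Using PyQuery, a Python scraping framework" (《Python爬虫框架之PyQuery的使用》).

Next we scrape each article's title and content: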

# -*- coding:utf-8 -*-

import requests
from pyquery import PyQuery as pq
import codecs
import os

# Path of the current file
current_path = os.path.abspath(__file__)
# Directory containing the current file
father_path = os.path.dirname(current_path)



headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
    'cookie': '__cfduid=d89ead99eeea979ea1f2a1a6243d186461600935008; Hm_lvt_2374bfdfe14a279e4a045267051b54e1=1600935010,1601459114; __yjsv3_shitong=1.0_7_8b1bac638e380ca12f87734ab2405afe2e94_300_1601470572844_223.104.3.46_aaa564dc; cf_chl_1=04a386244ad8d71; cf_chl_prog=x17; cf_clearance=f4af2dbdded649a7cf11a5a52d168289e98e68c6-1601470577-0-1zd4e21871z8a534313z279abd70-150; Hm_lpvt_2374bfdfe14a279e4a045267051b54e1=1601470578',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
   
    }

session = requests.Session()
session.headers.update(headers)

origin_url_cache_file_name="temp_juejin.txt"

originUrlCache = []

# Load the file's contents into a list, one entry per line
def loadCache(filename, filter_text):
    cache = []
    with codecs.open(filename, "r", "utf-8") as fr:
        if filter_text:
            for line in fr:
                if filter_text in line:
                    cache.append(line.replace("\n", ""))
        else:
            for line in fr:
                cache.append(line.replace("\n", ""))
        return cache



def get_article_by_url(url):
    rep = session.get(url)
    d=pq(rep.text)
    title = d('h1').text()
    content = d('.markdown-body').html()
    return title, content
    
def startScrape():
    for link in originUrlCache:
   
        print(link + "\n")
        title, content = get_article_by_url(link)
        print(title + "\n")
        print(content)
        
          



if __name__ == '__main__':

    # Load the URLs to scrape into memory
    originUrlCache=loadCache("{parent}/{filename}".format(parent=father_path,filename=origin_url_cache_file_name), None)
    # Start scraping the articles
    startScrape()

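However, the content printed is always None. A look at the console shows that Juejin renders the article page with JavaScript, so PyQuery cannot find what we want in the raw HTML. What now?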
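The effect is easy to reproduce offline. The sketch below parses a hand-written HTML shell of the kind a JavaScript-rendered page might return to a plain GET, then the same page after rendering, and looks for an element with the markdown-body class using the standard library's HTMLParser. Both HTML strings are illustrative assumptions, not Juejin's actual responses, and ClassFinder and has_class are hypothetical names.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Record whether any start tag carries the wanted CSS class."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.found = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.wanted in classes:
            self.found = True


def has_class(html, wanted):
    finder = ClassFinder(wanted)
    finder.feed(html)
    return finder.found


# Illustrative shell: the article body is filled in later by the script
RAW_HTML = '<html><body><div id="app"></div><script src="/app.js"></script></body></html>'
# The same page after a browser has run the script (again illustrative)
RENDERED_HTML = '<html><body><div id="app"><div class="markdown-body"><p>Hello</p></div></div></body></html>'

if __name__ == "__main__":
    print(has_class(RAW_HTML, "markdown-body"))
    print(has_class(RENDERED_HTML, "markdown-body"))
```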

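Think about it: if we can get the HTML after rendering has finished and then search it with PyQuery, the problem is solved.

The question is how to obtain the rendered page source. A plain GET request will not do; we need to simulate a browser rendering the page.

This calls for a web automation framework, the well-known selenium. It can drive a real browser to visit pages, locate elements, and even click. Here we only use it to obtain the page source. For details on selenium, see "Introduction to and usage of the web automation framework selenium" (《Web自动化框架selenium的介绍与使用》).

The code then becomes: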

# -*- coding:utf-8 -*-

from pyquery import PyQuery as pq
import codecs
import os

from selenium import webdriver


chrome_options = webdriver.ChromeOptions()
# Run Chrome in headless mode (no visible window)
chrome_options.add_argument('--headless')
# Without this option, locating elements sometimes fails
chrome_options.add_argument('--disable-gpu')

# `options=` replaces the deprecated `chrome_options=` keyword
browser = webdriver.Chrome(options=chrome_options)

# Path of the current file
current_path = os.path.abspath(__file__)
# Directory containing the current file
father_path = os.path.dirname(current_path)




origin_url_cache_file_name="temp_juejin.txt"

originUrlCache = []

# Load the file's contents into a list, one entry per line
def loadCache(filename, filter_text):
    cache = []
    with codecs.open(filename, "r", "utf-8") as fr:
        if filter_text:
            for line in fr:
                if filter_text in line:
                    cache.append(line.replace("\n", ""))
        else:
            for line in fr:
                cache.append(line.replace("\n", ""))
        return cache



def get_article_by_url(url):
    browser.get(url)
    d = pq(browser.page_source)
    title = d('h1').text()
    content = d('.markdown-body').html()
    return title, content
    
def startScrape():
    for link in originUrlCache:
   
        print(link + "\n")
        title, content = get_article_by_url(link)
        print(title + "\n")
        print(content)
        
          



if __name__ == '__main__':

    # Load the URLs to scrape into memory
    originUrlCache=loadCache("{parent}/{filename}".format(parent=father_path,filename=origin_url_cache_file_name), None)
    # Scrape the articles, then shut the browser down
    try:
        startScrape()
    finally:
        browser.quit()
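One caveat with reading page_source right after browser.get: the page's script may not have finished rendering yet. A hedged sketch of an explicit wait using selenium's WebDriverWait follows. The function name fetch_rendered_html and the 10-second timeout are assumptions, and the selenium imports sit inside the function only so that the snippet can be loaded without a browser installed.

```python
def fetch_rendered_html(browser, url, selector=".markdown-body", timeout=10):
    """Open url and return the page source once selector is present.

    Raises selenium's TimeoutException if the element never appears.
    """
    # Imported here so the module loads even where selenium is not installed
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    browser.get(url)
    # Block until the article container exists in the DOM
    WebDriverWait(browser, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, selector))
    )
    return browser.page_source
```

With the browser object created in the script, get_article_by_url could pass the result of fetch_rendered_html(browser, url) to pq instead of reading page_source directly.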

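This code requires Google Chrome to be installed.

If running it reports "This version of ChromeDriver only supports Chrome version", the browser and driver versions do not match, and you need to download the driver that matches your browser.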

First check your Chrome version (for example on the chrome://version page), then download the matching driver from the ChromeDriver download page.

Unpack the downloaded driver into the following directory:

Win: copy the driver into the Python installation directory
Mac: copy the driver into /usr/local/bin
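A quick way to check that the driver ended up somewhere the system can find it is to look it up on PATH. This sketch uses shutil.which from the standard library. The executable name chromedriver is the usual one, but treat it as an assumption for your platform; find_driver is a hypothetical helper.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver executable on PATH, or None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_driver()
    if path:
        print("driver found at", path)
    else:
        print("driver not found on PATH")
```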

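At this point we have successfully scraped the title and content of the given article URLs.

Now that you have the data, you can save it locally or upload it to your WordPress site.

For uploading articles to WordPress, see "How to save articles collected with Python to WordPress" (《如何将python采集到的文章保存到wordpress》).

Uploading articles to WordPress

To finish what we started, here is the final code, based on the code above, with the WordPress upload added: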

# -*- coding:utf-8 -*-

import time
import codecs
import os

from pyquery import PyQuery as pq
from selenium import webdriver
from wordpress_xmlrpc import Client, WordPressPost
from wordpress_xmlrpc.methods.posts import NewPost


chrome_options = webdriver.ChromeOptions()
# Run Chrome in headless mode (no visible window)
chrome_options.add_argument('--headless')
# Without this option, locating elements sometimes fails
chrome_options.add_argument('--disable-gpu')

# `options=` replaces the deprecated `chrome_options=` keyword
browser = webdriver.Chrome(options=chrome_options)

# Path of the current file
current_path = os.path.abspath(__file__)
# Directory containing the current file
father_path = os.path.dirname(current_path)

pushed_cache_file_name="temp_juejin_pushed_url.txt"
origin_url_cache_file_name="temp_juejin.txt"




pushedCache = []
originUrlCache = []
wp = Client('http://your-domain/xmlrpc.php', 'your WordPress username', 'your WordPress password')

# Load the file's contents into a list, one entry per line
def loadCache(filename, filter_text):
    cache = []
    with codecs.open(filename, "r", "utf-8") as fr:
        if filter_text:
            for line in fr:
                if filter_text in line:
                    cache.append(line.replace("\n", ""))
        else:
            for line in fr:
                cache.append(line.replace("\n", ""))
        return cache



def push_article(post_title,post_content_html):
    post = WordPressPost()
    post.title = post_title
    post.slug = post_title
    post.content = post_content_html
    post.terms_names = {
      'post_tag': post_title.split(" "),
      'category': ["itarticle"]
        }
    post.post_status = 'publish'
    wp.call(NewPost(post))

def get_article_by_url(url):
    # The browser is reused for every article, so it is not closed here
    browser.get(url)
    d = pq(browser.page_source)
    title = d('h1').text()
    content = d('.markdown-body').html()
    return title, content
    
def startScrape():
    for link in originUrlCache:
        # Skip links that were already pushed to WordPress
        if link not in pushedCache:
            print(link + "\n")
            title, content = get_article_by_url(link)
            print(title + "\n")
            print(content)
            push_article(title,content)
            time.sleep(2)
            # Record the link so it is not pushed again on the next run
            with codecs.open("{parent}/{filename}".format(parent=father_path,filename=pushed_cache_file_name), 'a', "utf-8") as fw:
                fw.write(link + "\n")
          



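if __name__ == '__main__':
    pushed_path = "{parent}/{filename}".format(parent=father_path,filename=pushed_cache_file_name)
    # Load the URLs that were already pushed; the file does not exist on the first run
    pushedCache=loadCache(pushed_path, None) if os.path.exists(pushed_path) else []
    # Load the URLs to scrape into memory
    originUrlCache=loadCache("{parent}/{filename}".format(parent=father_path,filename=origin_url_cache_file_name), None)
    # Scrape and upload the articles, then shut the browser down
    try:
        startScrape()
    finally:
        browser.quit()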

Just replace the domain, username, and password with your own.
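In push_article the tags come from splitting the title on single spaces, which yields empty tags when the title contains consecutive spaces and repeats a tag when a word occurs twice. Here is a small sketch of a more forgiving split. title_to_tags is a hypothetical helper, not part of wordpress_xmlrpc.

```python
def title_to_tags(title):
    """Split a title on whitespace into unique, non-empty tags, keeping order."""
    tags = []
    for word in title.split():  # split() with no argument drops empty strings
        if word not in tags:
            tags.append(word)
    return tags


if __name__ == "__main__":
    print(title_to_tags("Python  scraping with Python"))
```

Its result could be assigned to the post_tag entry of post.terms_names in place of post_title.split(" ").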

Note that the wordpress_xmlrpc library must be installed:

pip install python-wordpress-xmlrpc

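Addendum

The article links in the earlier step can also be collected directly with selenium. It is slower than calling the API, but when the API is encrypted, selenium offers a quick way around it.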
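A rough sketch of that alternative: once selenium has produced the rendered listing page (for example via browser.page_source), the article links can be pulled out of its anchors. The assumption here, which should be checked against the real page, is that article anchors have an href beginning with /post/. The sample HTML is hand-written, and PostLinkParser and extract_post_links are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://juejin.cn"


class PostLinkParser(HTMLParser):
    """Collect absolute URLs of anchors whose href starts with /post/."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.startswith("/post/"):
            url = urljoin(BASE_URL, href)
            if url not in self.links:  # keep each article once
                self.links.append(url)


def extract_post_links(page_source):
    parser = PostLinkParser()
    parser.feed(page_source)
    return parser.links


if __name__ == "__main__":
    sample = '<a href="/post/7016520448204603423">A</a><a href="/user/1">B</a>'
    print(extract_post_links(sample))
```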

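This is an original article by the author. Please credit the source when reposting. Thank you.

乱码三千: accumulating knowledge bit by bit. Welcome to the 乱码三千 tech blog.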