A Handy Crawler Tool: Scrapy

2021-08-06 2022-04-24

Notes from learning the Scrapy crawling framework.

Installing Scrapy

pip install Scrapy==2.5.0

Creating a Scrapy project

scrapy startproject first_scrapy  # first_scrapy is the project name

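Scrapy project layout

spiders the spider programs that do the actual crawling
items.py defines the data to be scraped or stored
middlewares.py defines the spider middlewares (between the spiders and the engine) and the downloader middlewares (between the engine and the downloader)
pipelines.py defines how items are processed afterwards, e.g. cleaning them or storing them in a database
settings.py Scrapy project settings
scrapy.cfg Scrapy project configuration file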

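Creating a spider

cd first_scrapy
# first_scrapy is the crawler project
scrapy genspider example example.com
# example is the name of the spider
# example.com is the site to crawl
# this automatically creates example.py, the spider's main file, in the spiders directory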

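Extracting content

Using CSS selectors

response.css uses Scrapy's built-in CSS selector support to extract content from the page
::text gets the text content
::attr(attribute-name) gets the value of an attribute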

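Using XPath

response.xpath uses Scrapy's built-in XPath support to extract content from the page (in Chrome, the XPath Helper extension can be used to test expressions)
Example

titles = response.xpath(
    "//a[@class='js-auto_break_title']/text()"  # select elements by tag and attribute to get the matching content
).getall()
# text() gets the text content
# /@href gets the link address; href is the attribute name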

Saving content

Saving to MySQL

Configuration

Configure the connection details in settings.py

ITEM_PIPELINES = {
   'first_scrapy.pipelines.NewsScraperPipeline': 300,
}

# MySQL Settings
MYSQL_HOST = '47.244.167.216'
MYSQL_USER = 'learn'
MYSQL_DATABASE = 'insidedb'
MYSQL_PASSWORD = 'Learn1Scrapy@.'

items.py defines the fields to be saved

import scrapy


class NewsScraperItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    post_title = scrapy.Field()
    post_date = scrapy.Field()
    post_author = scrapy.Field()

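pipelines.py sets up the database connection and the insert logic

import pymysql

# Assumes the project package is first_scrapy, as in ITEM_PIPELINES above
from first_scrapy import settings


class NewsScraperPipeline:
    """
    Use MySQL to store information.
    """
    def __init__(self):
        # Open one connection for the lifetime of the pipeline
        self.connect = pymysql.connect(
            host=settings.MYSQL_HOST,
            user=settings.MYSQL_USER,
            password=settings.MYSQL_PASSWORD,
            database=settings.MYSQL_DATABASE,
            charset='utf8'
        )
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        # Parameterized query: pymysql escapes the values
        sql = "INSERT INTO posts(post_title, post_date, post_author) VALUES(%s,%s,%s)"
        data = (item['post_title'], item['post_date'], item['post_author'])
        self.cursor.execute(sql, data)
        return item

    def close_spider(self, spider):
        # Commit all inserts once the spider finishes, then release the connection
        self.connect.commit()
        self.connect.close()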
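
An alternative worth considering is to read the connection details through Scrapy's `from_crawler` hook instead of importing the settings module directly. The sketch below is one possible arrangement, not the original author's code: the class name `MySQLPipeline` is hypothetical, the `posts` table is assumed to exist, and `pymysql` is imported lazily in `open_spider` so the class can be loaded without a database driver.

```python
class MySQLPipeline:
    """Sketch of a pipeline that receives its MySQL settings via from_crawler."""

    def __init__(self, host, user, password, database):
        self.params = dict(host=host, user=user, password=password,
                           database=database, charset="utf8mb4")
        self.connect = None
        self.cursor = None

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook and exposes the project settings on the crawler
        s = crawler.settings
        return cls(
            host=s.get("MYSQL_HOST"),
            user=s.get("MYSQL_USER"),
            password=s.get("MYSQL_PASSWORD"),
            database=s.get("MYSQL_DATABASE"),
        )

    def open_spider(self, spider):
        import pymysql  # lazy import: only needed once a spider actually runs
        self.connect = pymysql.connect(**self.params)
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        sql = "INSERT INTO posts(post_title, post_date, post_author) VALUES(%s,%s,%s)"
        self.cursor.execute(sql, (item["post_title"], item["post_date"], item["post_author"]))
        self.connect.commit()  # commit per item so a crash loses at most one row
        return item

    def close_spider(self, spider):
        self.connect.close()
```

Opening the connection in `open_spider` rather than `__init__` keeps construction cheap and ties the connection to the spider's lifetime.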

Saving data to CSV

scrapy crawl inside -o posts.csv  # export with Scrapy's built-in CSV feed exporter; posts.csv is the output file
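
The same export can probably also be configured once in settings.py through the `FEEDS` setting, so the `-o` flag is not needed on every run. This is a sketch under the assumption that the items carry the three `post_*` fields used earlier; the column order is an illustrative choice.

```python
# settings.py (fragment)
FEEDS = {
    "posts.csv": {
        "format": "csv",
        "encoding": "utf8",
        # Column order in the CSV file; assumes the item fields defined in items.py
        "fields": ["post_title", "post_date", "post_author"],
    },
}
```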

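Scraping dynamic content with Selenium

import random

from selenium import webdriver

# Assumes the project package is first_scrapy and that settings.py defines a USER_AGENTS list
from first_scrapy import settings

options = webdriver.ChromeOptions()
# Hide the "Chrome is being controlled by automated test software" notice
options.add_experimental_option("excludeSwitches", ['enable-automation', 'load-extension'])
options.add_argument(f"user-agent={random.choice(settings.USER_AGENTS)}")  # pick a random user agent

options.add_argument('start-maximized')
options.add_argument('enable-automation')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--no-sandbox')
options.add_argument('--disable-browser-side-navigation')
options.add_argument('window-size=1920,1080')
options.add_argument('--disable-gpu')

# Important configuration
driver = webdriver.Chrome(options=options)  # pass the options in, otherwise they are never applied
script = '''
Object.defineProperty(navigator, 'webdriver', {
    get: () => undefined
})
'''
# Registered before any page script runs, so navigator.webdriver reads as undefined
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {"source": script})
url = "https://example.com"  # placeholder; replace with the page to crawl
driver.get(url)  # a page on the target domain must be loaded before cookies can be set
driver.add_cookie({'name': 'xx', 'value': 'xx'})  # add cookie content
driver.refresh()  # reload so the next request carries the cookie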