Python Web Scraping Notes 02 (Scraping Web Page Data with selenium)

2022/7/14 1:20:35


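Python Web Scraping Notes 02 (Scraping Web Page Data with selenium)
- 1.1, Libraries used
- 1.2, Workflow
- 1.3, Functions used
- 1.3, Example: using selenium to fetch administrative division data from the website of the Ministry of Civil Affairs of the People's Republic of China
- 1.4, Optimization
  - 1.4.1, Problem description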

1.1, Libraries used

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.select import Select

1.2, Workflow

# 1. Open the browser (this form shows the browser window)
driver = webdriver.Chrome()
# To run without a visible browser window, start Chrome in headless mode instead:
# option = webdriver.ChromeOptions()
# option.add_argument("--headless")
# driver = webdriver.Chrome(options=option)
# 2. Open the page by URL
driver.get('http://xzqh.mca.gov.cn/map')
# 3. Operate on the loaded page
s1 = Select(driver.find_element(by=By.NAME, value='shengji'))
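
The workflow above never closes the browser. One way to make sure it is closed even when a later step fails is a small context manager. This is only a minimal sketch: `quitting` is a hypothetical helper name, not part of selenium, and it assumes nothing beyond the object having a `quit()` method, which the driver returned by `webdriver.Chrome()` does.

```python
from contextlib import contextmanager


@contextmanager
def quitting(driver):
    """Yield the driver and always call driver.quit() afterwards."""
    try:
        yield driver
    finally:
        # Runs on normal exit and when the with-block raises.
        driver.quit()
```

With it, the steps would be written inside `with quitting(webdriver.Chrome()) as driver:` and the browser would be shut down when the block ends.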

1.3, Functions used

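1, driver.find_element(by=By.<STRATEGY>, value='<VALUE>') and driver.find_elements(by=By.<STRATEGY>, value='<VALUE>')
# Purpose: locate elements that match the given locator
# Example: driver.find_element(by=By.NAME, value='shengji')
# driver.find_element(by=By.CLASS_NAME, value="info_table")
# Return type: find_element returns a single WebElement (the first match); find_elements returns a list of WebElement objects
2, Select(ELEMENT)
# Purpose: wrap the given <select> element in a Select object
# Example: s = Select(driver.find_element(by=By.NAME, value='shengji'))
# The options of the select are available as s.options[i]
# Example: province = s.options[i].text.split('（')[0]
# An option is chosen with s.select_by_index() (or s.select_by_value())
# Example: s.select_by_index(i)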
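
The `text.split('（')[0]` idiom keeps only the part of an option label that comes before the full-width parenthesis. The sketch below shows it on plain strings. The sample labels are made up for illustration (the real ones come from the page), and `build_index` is a hypothetical helper name.

```python
def build_index(labels):
    """Map the name part of each label (the text before '（') to its position."""
    return {label.split('（')[0]: i for i, label in enumerate(labels)}


# Made-up labels in the assumed "name（suffix）" shape; not copied from the site.
sample = ['请选择', '湖北省（鄂）', '湖南省（湘）']
index = build_index(sample)
print(index['湖南省'])  # position of the third label: 2
```

With selenium, the labels would come from `[o.text for o in s.options]`, and the looked-up position could then be passed to `s.select_by_index()`.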

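1.3, Example: using selenium to fetch administrative division data from the website of the Ministry of Civil Affairs of the People's Republic of China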

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.select import Select
import time

# Open the browser
driver = webdriver.Chrome()
# The browser can instead be started without a graphical window:
# option = webdriver.ChromeOptions()
# option.add_argument("--headless")
# driver = webdriver.Chrome(options=option)

driver.get('http://xzqh.mca.gov.cn/map')
# Get the province <select> element
s1 = Select(driver.find_element(by=By.NAME, value='shengji'))
# Map each province name to the index of its option
provinces = {}
for index, option in enumerate(s1.options):
    provinces[option.text.split('（')[0]] = index

# Provinces to query
targets = ['湖北省', '湖南省', '四川省']
for name in targets:
    index = provinces[name]
    # Locate the <select> element again, since the page has been navigated since the last lookup
    s1 = Select(driver.find_element(by=By.NAME, value='shengji'))
    # Select the wanted province
    s1.select_by_index(index)
    # Get the submit button
    button = driver.find_element(by=By.CLASS_NAME, value='select_bn')
    # Click it to go to the province page
    button.click()
    # Pause to give the page time to load
    time.sleep(2)
    # Get the table element
    table = driver.find_element(by=By.CLASS_NAME, value="info_table")
    # Get the area elements
    areas = table.find_elements(by=By.NAME, value='hidzxs')
    for area in areas:
        print(name + ' ' + area.get_property('value'), area.get_property('alt'))
    # Go back to the previous page
    driver.back()

# Close the browser when done
driver.quit()
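
The fixed two-second sleep in the example either wastes time on a fast connection or is too short on a slow one. An alternative is to poll until the element can be found. The sketch below is library-independent and `wait_for` is a hypothetical helper name. Selenium ships its own explicit wait, `WebDriverWait` in `selenium.webdriver.support.ui`, which covers the same need and would normally be the first choice. The sketch only illustrates the polling idea.

```python
import time


def wait_for(find, timeout=10.0, interval=0.2):
    """Call find() until it stops raising; re-raise once timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return find()
        except Exception:
            # Broad on purpose for this sketch; with selenium one would
            # typically catch NoSuchElementException only.
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval)
```

In the example, the sleep and the table lookup could then become `table = wait_for(lambda: driver.find_element(by=By.CLASS_NAME, value="info_table"))`. This assumes the element is absent until the new page has loaded.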

1.4, Optimization

1.4.1, Problem description

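With the approach above, the script is slow whether or not the browser window is shown. The cause is Selenium's page load strategy.

selenium has three page load strategies:

Strategy	Ready state	Notes
normal	complete	Used by default; waits for all resources to finish downloading
eager	interactive	DOM access is ready, but other resources (such as images) may still be loading
none	Any	Does not block WebDriver at all

Usage: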

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.page_load_strategy = 'eager'  # choose the strategy here
driver = webdriver.Chrome(options=options)
driver.get("http://www.google.com")
driver.quit()

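When no strategy is chosen, the default normal strategy is used: driver.get() returns only after all resources have finished loading, which is why it is slow.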
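
Whether the load strategy really is the bottleneck on a given site can be checked by timing `driver.get()` under each setting. Below is a minimal sketch of such a timer. `timed` is a hypothetical helper name, the labels are arbitrary, and the `time.sleep` call only stands in for the page load so that the block runs without a browser.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock seconds spent inside the with-block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


results = {}
with timed('demo', results):
    time.sleep(0.05)  # stand-in for driver.get(url)
print(results)
```

Running the same page once with `page_load_strategy = 'normal'` and once with `'eager'`, each wrapped in `timed(...)`, gives two numbers to compare. With `'eager'`, elements that depend on late-loading resources may not be present when `driver.get()` returns, so pairing it with a wait is probably needed.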
