
Python: Anti-Scraping, XPath Parser, and Proxy IPs


1. Automatic Login

1) Automatic login with requests

Steps:

Step 1: Log in to the target website manually.

Step 2: Grab the cookie information from the site after logging in.

Step 3: Attach that cookie value to the headers when sending requests.

# Log in to Zhihu using the cookie copied from a logged-in browser session
import requests

headers = {
    'cookie': '_xsrf=p6x9UwvzRn32qWVoFyBaD4XEA8AGIEa3; _zap=19602ad1-626f-420a-ae86-9426c03f1721; Hm_lvt_98beee57fd2ef70ccdd5ca52b9740c49=1680319020; d_c0=ANBXognIjhaPTuFh6al-dhqCYaAaJS5kG08=|1680319019; gdxidpyhxdE=mLGLh6nWCUlW%2BkWJRMSOlbPgwJwEkGKvoLB%5CDwA88f1Gase4JRiL%5CNAO%2FwCKYTX51eDsxeCJTqGdqu4V2u3xOheyUMK2WIR8w%2Fqz0%2F3T9hU0NEkMGQeod60kUS8DuoOtAxu3U%2B%2Bur8%2Br%5CkmyJ5a6vbsZWmm8x6W%2BLIuMNWUV5Wol6eT3%3A1680319921123; YD00517437729195%3AWM_NI=Gk2NR%2Bvd2a7H5U28K9kTmYZGcGxv4i9vNkxHRdLwZzYjtebbmFCRe%2Fv%2BoIS1Yq07y5EJl9Rag0chkvLE1QyRbQeMfdVbvwnD0RviO%2F7yZq4B%2FlmL7Z1Ga4dqumsvVWwEOVk%3D; YD00517437729195%3AWM_NIKE=9ca17ae2e6ffcda170e2e6eebac54ab2f5a4a3bc4181b48ba3c55e839f8fb0d85cf4aa97daeb46f5aabba8f42af0fea7c3b92a85b5feafd121b6bcfe82b644f78bfed9ca33a98ca3bad94da199b897b763fbaa828cfb4492bc8498d9729bb6bc86e16aaf9a8db3e825b6f5a1aad23b9baf00d8fc3f8796a4dad16ba5f0ad89ea7e909bffb2fc5aa28fabaee57096beba98ed50f48eba8ace25f5a79692b74aedb4a3adb652f399afd5e77bb7b3bd97d55db59b81b7ea37e2a3; YD00517437729195%3AWM_TID=d8d4rn39BaFEFVBVFQKEKh6cplJ77qDq; Hm_lpvt_98beee57fd2ef70ccdd5ca52b9740c49=1680319035; captcha_session_v2=2|1:0|10:1680319034|18:captcha_session_v2|88:YXUydDJKQ1hyTWZ4SVlQeDNKQko4eTlQS2hBV210Q2I3MEEvaVh6TkhaNGNwTC9vc3ZVbXNrV05mUkR4UmdQbQ==|41309f72b1a177584d79526caefa60a33597e3e5623ef7f86eb98d82d937da5d; __snaker__id=Mbfbw4MqXaiZJqmI; captcha_ticket_v2=2|1:0|10:1680319288|17:captcha_ticket_v2|704:eyJ2YWxpZGF0ZSI6IkNOMzFfMTZSdVNwY0FwbDZMejVJSllWclVUU0J6eTgtRnVadENKc1Itay0wZEI1X2I2RlVmLXBEOXFudHF6RUJnNXUyMmN2Ny0xemdfNWVNaHVKaG9Va1ZCS0lDY2NldVM1cm5tUFBWaWwwZGtYOG5iNXN0c2UtT1NjRk9hd3lZN2hUZXQ2NkZweXdkcElPdXlxek5rQkc4Lm4xZFYwMDdQX29BRW44aWFpT3dJaW1XMTFsbDlCbjllUHZnc3RaeEM0ZUM2WUpES0R0dUpxLlNjNVBncGV2bGpqaHB5Z3FsZWR2cXd4cDBGUjdWZjJROU1rNXVDMDJCMTJPVEVsS0dJWUE2QUF1bVJ1VkVneTlmeEdmdDZ1dzZVV0tjaDRZWWt0LWloNzg3aEtYVkUyaHVLcHJ5TE04TkU4U1VDV21Sd2pLTDRZaXFwTlE4MjJvMHdmdzBBMDJWLnBILU50TnNyS3cudGJwYlkyWlhhbDI1MktZdEpNMVp0bWJTMW9MeFJGWDFjaUFSUHlpRi5vVU1tdHQ0ZUxzMURzUUdTTmJsSFZrbFJWRmRQWEhxbnktZU8ycGhubTRWaS5PaWhqMm51bng5WWdKVFNMRGQtbC5sQTV6d1ZZUDgwTlZWc3hSTW1LY3N5UmotOHpmUWw4cVhjbUp4akhiZmJIYi1fSnZpMyJ9|31c79c4cf61774a15336dbd47264e24052e90e278c7b7ad326a7f771283e2674; z_c0=2|1:0|10:1680319303|4:z_c0|92:Mi4xNFRCVEN3QUFBQUFBMEZlaUNjaU9GaVlBQUFCZ0FsVk5SX0VVWlFBYlBJY1djM1g4enhnQm9RTHE4NjlJcGJnaTFn|d67d02d7c6c6adbb7386ef1e5112961db215d79ee7f9c5b7dcbda1922dbcdb20; q_c1=3557e163dacf470885650132315f9a99|1680319303000|1680319303000; tst=r; KLBRSID=2177cbf908056c6654e972f5ddc96dc2|1680319315|1680319018',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36'
}
# The headers (including the cookie) must be passed along with the request
response = requests.get('https://www.zhihu.com/signin?next=%2F', headers=headers)
print(response.text)

2) Getting cookies with Selenium

# Automatic login for Taobao
from selenium.webdriver import Chrome

# 1. Create a browser and open the website that needs automatic login
b = Chrome()
b.get('https://www.taobao.com')

# 2. Leave enough time to finish the login manually
#    (the window that b points to must show the logged-in page)
input('是否已经完成登录:')

# 3. Get the cookie information after a successful login
result = b.get_cookies()
# print(result)

# 4. Save the cookies to a local file
with open('files/淘宝cookie.txt', 'w', encoding='utf-8', newline='') as f:
    f.write(str(result))

print('___________结束__________')
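A side note on the save format: writing str(result) and reading it back with eval() works, but saving the cookie list as JSON lets you read it back without eval(). A minimal sketch, assuming a hypothetical files/淘宝cookie.json path alongside the original text file:

import json

# Save the cookie list as JSON instead of a str() dump
with open('files/淘宝cookie.json', 'w', encoding='utf-8') as f:
    json.dump(result, f, ensure_ascii=False)

# When loading it back later, use json.load instead of eval
with open('files/淘宝cookie.json', encoding='utf-8') as f:
    result = json.load(f)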

3) Using the cookies obtained with Selenium

from selenium.webdriver import Chrome

# 1. Create a browser and open the website that needs automatic login
b = Chrome()
b.get('https://www.taobao.com')

# 2. Read the locally saved cookies
with open('files/淘宝cookie.txt', encoding='utf-8') as f:
    result = eval(f.read())

# 3. Add the cookies to the browser
for x in result:
    b.add_cookie(x)

# 4. Reload the page; it should now show the logged-in state
b.get('https://www.taobao.com')
input('end:')

2. Proxy IPs

Proxies are for dealing with an IP address that has already been blocked. The proxy IPs here come from Jiguang IP (极光IP); a Baidu search will find it, and once you buy a plan you can use it as follows.

1) Proxies with requests

import requests

headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}

# The proxy IP used here
proxies = {
    'https': '114.238.85.97:4515'
}
# Send the request through the proxy IP
res = requests.get('https://movie.douban.com/top250?start=0&filter=', headers=headers, proxies=proxies)
print(res.text)
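Purchased proxy IPs tend to expire quickly, so in practice a request is often retried with a different address from a small pool when one fails. A minimal sketch of that idea; the proxy addresses in the list are hypothetical placeholders:

import random
import requests

headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}

# Hypothetical pool of proxy addresses bought from the provider
proxy_pool = ['114.238.85.97:4515', '117.57.92.11:4056', '182.34.103.22:4215']


def get_with_proxy(url, retries=3):
    # Try up to `retries` randomly chosen proxies before giving up
    last_error = None
    for _ in range(retries):
        ip = random.choice(proxy_pool)
        proxies = {'http': ip, 'https': ip}
        try:
            return requests.get(url, headers=headers, proxies=proxies, timeout=5)
        except requests.RequestException as e:
            last_error = e
    raise last_error


res = get_with_proxy('https://movie.douban.com/top250?start=0&filter=')
print(res.status_code)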

2) Proxies with Selenium

from selenium.webdriver import Chrome, ChromeOptions

options = ChromeOptions()
# Set the proxy
options.add_argument('--proxy-server=http://114.238.85.97:4515')

b = Chrome(options=options)
b.get('https://movie.douban.com/top250?start=0&filter=')
input()

3. XPath Parser

With import json, JSON data can be converted directly into Python data.
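A quick sketch of what that conversion looks like in both directions (json.loads for JSON → Python, json.dumps for Python → JSON):

import json

json_text = '{"name": "lily", "age": 18, "is_ad": true, "car": null}'

# JSON string -> Python dict (true/null become True/None)
data = json.loads(json_text)
print(data)  # {'name': 'lily', 'age': 18, 'is_ad': True, 'car': None}

# Python dict -> JSON string
print(json.dumps(data))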

XPath is a method for parsing web page (HTML) or XML data; it locates tags (elements) through paths.

Python data: {'name': 'lily', 'age': 18, 'is_ad': True}
JSON data: {"name": "lily", "age": 18, "is_ad": true, "car": null}
XML data:
<name>lily</name>
<age>18</age>
<is_ad>是</is_ad>
<car_no></car_no>

1) A few common concepts

Tree: the whole web page (or XML document) forms a tree structure
Element (node): each tag in the HTML tree
Root node: the first node of the tree
Content: the text inside a tag
Attribute: an attribute of a tag
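The data.html file used in the examples below is not shown in the original article, so here is a minimal stand-in that illustrates these terms; its tags, ids and classes are assumptions chosen to match the later XPath examples:

from lxml import etree

# A made-up document: <html> is the root node, every tag is an element,
# the text inside a tag is its content, and href/id/class are attributes
html = '''
<html>
  <body>
    <div>
      <a href="https://www.baidu.com">baidu</a>
      <a href="https://www.taobao.com">taobao</a>
    </div>
    <span>
      <p id="p1">p1</p>
      <p class="c1">p2</p>
      <p data="5">p3</p>
    </span>
  </body>
</html>
'''
root = etree.HTML(html)                           # root node
print(root.xpath('/html/body/div/a/text()'))      # contents: ['baidu', 'taobao']
print(root.xpath('/html/body/div/a/@href'))       # attributes: the two href values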

2) XPath syntax

1. Getting tags

1) Absolute path: starts with /, writing the path level by level down from the root node
2) Relative path: starts with '.' or '..', where '.' is the current node and '..' is the parent of the current node. Note: if a path starts with ./, the ./ can be omitted
3) Full path: starts with //, matching the specified tag anywhere in the document regardless of its position

2. Getting tag content

Append '/text()' to the end of the path that gets the tag.

3. Getting tag attributes

Append '/@attribute_name' to the end of the path that gets the tag.

4. Example

First, import the package:

from lxml import etree
1. Create the tree structure and get the root node:
html = open('data.html', encoding='utf-8').read()
root = etree.HTML(html)
2. Get tags by path

node_object.xpath(path) — gets all tags matching the path; the return value is a list whose elements are node objects

1) Absolute path
result = root.xpath('/html/body/div/a')
print(result)

2) Getting tag content
result = root.xpath('/html/body/div/a/text()')
print(result)

3) Getting tag attributes
result = root.xpath('/html/body/div/a/@href')
print(result)

4) An absolute path gives the same result no matter which node .xpath() is called on
div = root.xpath('/html/body/div')[0]
result = div.xpath('/html/body/div/a/text()')
print(result)

5) Relative paths
result = root.xpath('./body/div/a/text()')
print(result)
result = div.xpath('a/text()')
print(result)

6) Full paths
result = root.xpath('//a/text()')
print(result)
result = root.xpath('//div/a/text()')
print(result)
3. Adding predicates (conditions) — written in [] after a node in the path

1) Position predicates
[N] — the Nth matching tag
[last()] — the last matching tag
[last()-N] — the matching tag counted back from the end (last()-1 is the second to last)
[position()>N], [position()>=N], [position()<N], [position()<=N] — for example, with 5 matches in total and N=3, position()>N gets the 4th and the 5th

result = root.xpath('//span/p[2]/text()')
print(result)

2) Getting the last p tag, and the tags after a given position
result = root.xpath('//span/p[last()]/text()')
print(result)
result = root.xpath('//span/p[position()>2]/text()')
print(result)
3) Attribute predicates
[@attribute_name=attribute_value]

result = root.xpath('//span/p[@id="p1"]/text()')
print(result)
result = root.xpath('//span/p[@class="c1"]/text()')
print(result)
result = root.xpath('//span/p[@data="5"]/text()')
print(result)
4. Wildcards
In XPath, * stands for any tag or any attribute.

result = root.xpath('//span/*/text()')
print(result)
result = root.xpath('//span/*[@class="c1"]/text()')
print(result)
result = root.xpath('//span/span/@*')
print(result)

Get every tag in the whole page whose class is c1:
result = root.xpath('//*[@class="c1"]/text()')
print(result)
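The examples above parse a local data.html; in a real crawl the HTML returned by requests is fed straight into etree.HTML. A minimal sketch, reusing the Douban Top 250 page from the proxy section; the class names in the path are assumptions about that page's structure and may need adjusting:

import requests
from lxml import etree

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36'}

# Fetch the page, then parse the response text instead of a local file
resp = requests.get('https://movie.douban.com/top250?start=0&filter=', headers=headers)
root = etree.HTML(resp.text)

# Hypothetical path to the film titles; adjust it to the page's real structure
titles = root.xpath('//div[@class="hd"]/a/span[1]/text()')
print(titles)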

3) Case study

Chengdu Lianjia second-hand housing data

1. Crawl the required information district by district.
2. Locate the information on the page (ul > li).
3. Choose the request tool: requests > selenium.
4. Choose the parser: regular expressions > xpath > bs4.
5. Identify the site's anti-scraping mechanisms and take the corresponding countermeasures.
6. Crawl a single page of listings.
7. Crawl multiple pages following the site's pagination rules.
8. Persist the data (save it to a spreadsheet, a database, ...).

import requests
from bs4 import BeautifulSoup
import json
import csv
from tqdm import tqdm
import os


# Request function: fetch a URL and return the parsed page
def get_requests(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36 Edg/95.0.1020.30'}
    resp = requests.get(url=url, headers=headers)
    if resp.status_code == 200:
        soup = BeautifulSoup(resp.text, 'lxml')
        return soup
    else:
        print(resp.status_code)


# Get each district's name and link
def country(html_code):
    # a tags that hold the districts
    a_list = html_code.select('body > div:nth-child(12) > div > div.position > dl:nth-child(2) > dd > div:nth-child(1) > div > a')
    dict_1 = {}
    for i in a_list:
        # District name
        name = i.text
        # District link
        href = 'https://cd.lianjia.com' + i.attrs['href']
        dict_1[name] = href
    return dict_1


# Get the total number of second-hand housing pages for a district
def get_page_num(html_code):
    page_dict = html_code.select_one('#content > div.leftContent > div.contentBottom.clear > div.page-box.fr > div').attrs['page-data']
    # The attribute holds a JSON string; parse it and take the total page count
    num = json.loads(page_dict)['totalPage']
    return num


# Main function
def main(url):
    # Request the Lianjia second-hand housing start page
    code_1 = get_requests(url)
    # Get the district names and links
    country_dict = country(code_1)
    for key, value in country_dict.items():
        # Request the district link
        code_2 = get_requests(value)
        # Total number of pages for this district
        total_page = get_page_num(code_2)
        # Data persistence: one CSV file per district
        f = open(f'数据/成都{key}二手房信息.csv', 'w', encoding='utf-8', newline='')
        mywrite = csv.writer(f)
        # Column names
        col_name = ['行政区', '标题', '小区', '街道', '户型', '面积', '装修', '单价', '总价']
        mywrite.writerow(col_name)
        for i in tqdm(range(1, total_page + 1), desc=key):
            # Build the link for page i
            new_href = value + f'pg{i}/'
            html_code = get_requests(new_href)
            li_list = html_code.select('#content > div.leftContent > ul > li > div.info.clear')
            for li in li_list:
                # Listing title
                title = li.select_one('div.info.clear > div.title > a').text
                # Address (neighbourhood and street)
                address_1 = li.select_one('div.info.clear > div.flood > div > a:nth-child(2)').text
                address_2 = li.select_one('div.info.clear > div.flood > div > a:nth-child(3)').text
                # Basic information
                info = li.select_one('div.info.clear > div.address > div').text
                info_list = info.split('|')
                # Layout
                house_type = info_list[0].strip()
                # Area
                area = info_list[1].strip()
                # Decoration
                decorate = info_list[3].strip()
                # Unit price
                unit_price = li.select_one('div.info.clear > div.priceInfo > div.unitPrice > span').text
                # Total price
                total_price = li.select_one('div.info.clear > div.priceInfo > div.totalPrice.totalPrice2 > span').text
                data = [key, title, address_1, address_2, house_type, area, decorate, unit_price, total_price]
                mywrite.writerow(data)
        f.close()


# Execution starts here
URL = 'https://cd.lianjia.com/ershoufang/rs/'
# Create the data folder if it does not exist yet, then start crawling
if not os.path.exists('数据'):
    os.mkdir('数据')
main(URL)