[Python] [Advanced] 15. Scraping Housing Data with a Python Crawler

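15. Scraping Housing Data with a Python Crawler

Use a Python crawler to scrape housing listing information from https://bj.lianjia.com/ershoufang/rs/.

15.1 Program flow analysis

After opening the site, the work proceeds in three steps:

Identify the URL pattern;
Determine the XPath expressions for the data to extract;
Write the Python crawler.

Collected page URLs:

https://bj.lianjia.com/ershoufang/pg1/
https://bj.lianjia.com/ershoufang/pg2/
https://bj.lianjia.com/ershoufang/pg3/
https://bj.lianjia.com/ershoufang/pg4/

It follows that page n has the URL https://bj.lianjia.com/ershoufang/pgn/, where n stands for the page number.

15.2 Determining the XPath expressions

Use the browser's Inspect Element tool to obtain the element structure of a listing: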
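The URL pattern can be sketched in a few lines of Python. This is a minimal illustration; the names BASE_URL and page_urls are hypothetical helpers introduced here, not part of the original program.

```python
# URL template taken from the pattern above; {} is the page number
BASE_URL = 'https://bj.lianjia.com/ershoufang/pg{}/'


def page_urls(first, last):
    """Return the listing-page URLs for pages first..last (inclusive)."""
    return [BASE_URL.format(n) for n in range(first, last + 1)]


# Print the URLs of the first four pages
for url in page_urls(1, 4):
    print(url)
```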

<div class="info clear">
  <div class="title">
    <a class="" href="https://bj.lianjia.com/ershoufang/101118680128.html" target="_blank" data-log_index="1" data-el="ershoufang" data-housecode="101118680128" data-is_focus="" data-sl="">育新花园北里 2室1厅 南 北</a>
    <!-- 拆分标签 只留一个优先级最高的标签-->
    <span class="goodhouse_tag tagBlock">必看好房</span>
  </div>
  <div class="flood">
    <div class="positionInfo">
      <span class="positionIcon"></span>
      <a href="https://bj.lianjia.com/xiaoqu/11000000000380/" target="_blank" data-log_index="1" data-el="region">育新花园北里 </a>
      -
      <a href="https://bj.lianjia.com/ershoufang/daxingxinjichangyangfangbieshuqu/" target="_blank">大兴新机场洋房别墅区</a>
    </div>
  </div>
  <div class="address">
    <div class="houseInfo"><span class="houseIcon"></span>2室1厅 | 73.26平米 | 南 北 | 简装 | 18层  | 板楼</div>
  </div>
  <div class="followInfo"><span class="starIcon"></span>0人关注 / 4天以前发布</div>
  <div class="tag">
    <span class="isVrFutureHome">VR看装修</span>
    <span class="taxfree">房本满五年</span>
    <span class="haskey">随时看房</span>
  </div>
  <div class="priceInfo">
    <div class="totalPrice totalPrice2"><i> </i><span class="">158</span><i>万</i></div>
    <div class="unitPrice" data-hid="101118680128" data-rid="11000000000380" data-price="21568"><span>21,568元/平</span></div>
  </div>
</div>

15.2.1 Determining the base expression

The listing information to scrape is contained in the corresponding <div> tags, as shown below:

<div class="positionInfo">..</div>
<div class="address">...</div>
<div class="priceInfo">...</div>

A quick inspection shows that every page contains 30 listings, and the parent node of each listing is:

<div class="info clear"></div>

<ul class="sellListContent" log-mod="list">
<li class="clear LOGVIEWDATA LOGCLICKDATA">
Listing info..
</li>
</ul>

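Next, locate the element above with the browser's developer tools, then scroll the mouse wheel. You will notice that the class attribute of the li tag changes, as follows: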

<ul class="sellListContent" log-mod="list">
<li class="clear LOGCLICKDATA">
Listing info..
</li>
</ul>

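The change is triggered by a JS event, so the match has to be made against the page source rather than the live DOM shown in the developer tools.

Open the page source and use Ctrl+F to search for the class value before and after the change. Only the following value appears in the source: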

class="clear LOGVIEWDATA LOGCLICKDATA"

The base XPath expression is therefore:

//ul[@class="sellListContent"]/li[@class="clear LOGVIEWDATA LOGCLICKDATA"]
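As a rough check of the base expression, the sketch below runs it with lxml against a trimmed-down stand-in for the page source. SAMPLE_HTML is a hypothetical fragment written for this illustration, not the real page. It also shows that the class value seen in the developer tools after scrolling matches nothing in such a source.

```python
from lxml import etree

# Hypothetical, trimmed-down stand-in for the page source (two listings only)
SAMPLE_HTML = '''
<ul class="sellListContent" log-mod="list">
  <li class="clear LOGVIEWDATA LOGCLICKDATA"><div class="info clear">listing 1</div></li>
  <li class="clear LOGVIEWDATA LOGCLICKDATA"><div class="info clear">listing 2</div></li>
</ul>
'''

BASE_XPATH = '//ul[@class="sellListContent"]/li[@class="clear LOGVIEWDATA LOGCLICKDATA"]'

tree = etree.HTML(SAMPLE_HTML)

# @class="..." compares the whole attribute string, so it must match exactly
nodes = tree.xpath(BASE_XPATH)
print(len(nodes))

# The value shown in the developer tools after scrolling is absent from the source
changed = tree.xpath('//ul[@class="sellListContent"]/li[@class="clear LOGCLICKDATA"]')
print(len(changed))
```

Because the comparison is an exact string match, the expression would stop matching if the site changed the order or number of class names; contains(@class, "...") is a looser alternative.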

15.2.2 Determining the expressions for the extracted fields

Based on the page's element structure, the XPath expressions for the fields to scrape are:

Community name: name_list=h.xpath('.//a[@data-el="region"]/text()')
House description: info_list=h.xpath('.//div[@class="houseInfo"]/text()')
Address: address_list=h.xpath('.//div[@class="positionInfo"]/a/text()')
Unit price: price_list=h.xpath('.//div[@class="unitPrice"]/span/text()')

The house description mainly contains the following information:

<div class="address"><div class="houseInfo"><span class="houseIcon"></span>2室1厅 | 88.62平米 | 北 南 | 简装 | 顶层(共6层) | 2004年建 | 板楼</div>
</div>

The matched info_list therefore has to be processed before it yields the data we want, as shown below:
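To see what these four expressions return, the following sketch evaluates them against a single listing node. LISTING_HTML is an abridged copy of the element structure shown in section 15.2, with attributes removed for brevity; it is an assumption made for illustration and may not match the live page.

```python
from lxml import etree

# One listing node, abridged from the element structure shown in section 15.2
LISTING_HTML = '''
<li class="clear LOGVIEWDATA LOGCLICKDATA">
  <div class="info clear">
    <div class="flood"><div class="positionInfo">
      <a data-el="region">育新花园北里 </a> - <a>大兴新机场洋房别墅区</a>
    </div></div>
    <div class="address"><div class="houseInfo"><span class="houseIcon"></span>2室1厅 | 73.26平米 | 南 北 | 简装 | 18层  | 板楼</div></div>
    <div class="priceInfo">
      <div class="unitPrice"><span>21,568元/平</span></div>
    </div>
  </div>
</li>
'''

# h plays the role of one node returned by the base expression
h = etree.HTML(LISTING_HTML).xpath('//li')[0]

# The leading "." makes each expression relative to the current listing node
name_list = h.xpath('.//a[@data-el="region"]/text()')
info_list = h.xpath('.//div[@class="houseInfo"]/text()')
address_list = h.xpath('.//div[@class="positionInfo"]/a/text()')
price_list = h.xpath('.//div[@class="unitPrice"]/span/text()')

print(name_list)
print(info_list)
print(address_list)
print(price_list)
```

Each call returns a list of strings, so the caller still has to check for an empty list and strip whitespace before using the values.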

# Layout + area + orientation + decoration + floor + ...    ['2室1厅 | 88.62平米 | 北 南 | 简装 | 顶层(共6层) | 2004年建 | 板楼']
info_list=h.xpath('.//div[@class="houseInfo"]/text()')
if info_list:
    # Process the list data
    L=info_list[0].split('|')
    # ['2室1厅 ', ' 88.62平米 ', ' 北 南 ', ' 简装 ', ' 顶层(共6层) ', ' 2004年建 ', ' 板楼']
    if len(L) >= 5:
        item['model']=L[0].strip()
        item['area']=L[1].strip()
        item['direction']=L[2].strip()
        item['perfect']=L[3].strip()
        item['floor']=L[4].strip()
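The same processing can be packaged as a standalone function that is easy to try on a sample string. This is only a sketch; parse_house_info is a hypothetical name, and the field keys are the ones used in the snippet above.

```python
def parse_house_info(text):
    """Split a houseInfo string into named fields; return {} if it has fewer than 5 parts."""
    parts = [p.strip() for p in text.split('|')]
    if len(parts) < 5:
        return {}
    keys = ('model', 'area', 'direction', 'perfect', 'floor')
    # zip stops at the shorter sequence, so extra parts (build year, building type) are ignored
    return dict(zip(keys, parts))


sample = '2室1厅 | 88.62平米 | 北 南 | 简装 | 顶层(共6层) | 2004年建 | 板楼'
result = parse_house_info(sample)
print(result)
```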

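15.2.3 Improving scraping efficiency

To make scraping more reliable and reduce the impact of network fluctuations, we can set a rule: each request gets a 3-second timeout, a page whose request fails is tried up to three times, and if every attempt fails the crawler moves on to the next page.

The requests.get() method provides a timeout parameter for setting the timeout. It also offers other useful parameters, such as auth (user authentication), verify (certificate verification), and proxies (proxy IPs), which are covered in later chapters.

15.3 Writing the program code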
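The retry rule can be sketched as a small loop that is independent of the crawler class. The names fetch_with_retry, requests_fetch, and flaky_fetch are hypothetical helpers introduced for this illustration. The fetch function is passed in as a parameter, so the retry logic can be tried without network access; requests_fetch shows how requests.get() with timeout=3 might be plugged in.

```python
import time


def fetch_with_retry(url, fetch, retries=3, delay=0.0):
    """Call fetch(url) up to `retries` times; return its result, or None if every attempt fails."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:
            print(f'attempt {attempt} failed: {exc}')
            time.sleep(delay)
    return None


def requests_fetch(url):
    """One possible fetch function: a GET request with a 3-second timeout."""
    import requests  # imported lazily so the retry helper has no hard dependency
    res = requests.get(url, timeout=3)
    return res.text


# Offline demonstration: a fake fetch that fails twice, then succeeds
attempts = {'count': 0}


def flaky_fetch(url):
    attempts['count'] += 1
    if attempts['count'] < 3:
        raise TimeoutError('simulated timeout')
    return '<html>ok</html>'


html = fetch_with_retry('https://bj.lianjia.com/ershoufang/pg1/', flaky_fetch)
print(html, attempts['count'])
```

With a real crawl, fetch_with_retry(url, requests_fetch) would apply the same rule; a None result signals that the page should be skipped.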

# coding:utf8
import random
import time

import requests
from lxml import etree
# Package that supplies User-Agent strings
from fake_useragent import UserAgent


class LinajiaSpider(object):
    def __init__(self):
        self.url = 'https://bj.lianjia.com/ershoufang/pg{}/'
        # Counts the requests made for the current page; starts at 1
        self.blog = 1

    # Pick a random User-Agent
    def get_header(self):
        # Instantiate the UA object
        ua = UserAgent()
        headers = {'User-Agent': ua.random}
        return headers

    # Send the request
    def get_html(self, url):
        # With a 3-second timeout, try a failing page up to three times
        if self.blog <= 3:
            try:
                res = requests.get(url=url, headers=self.get_header(), timeout=3)
                html = res.text
                return html
            except Exception as e:
                print(e)
                self.blog += 1
                # Return the retry's result; without `return`, a successful retry would be discarded
                return self.get_html(url)
        # All three attempts failed
        return None

    # Parse the page and extract the data
    def parse_html(self, url):
        html = self.get_html(url)
        if html:
            p = etree.HTML(html)
            # Base XPath expression: list of the 30 listing node objects
            h_list = p.xpath('//ul[@class="sellListContent"]/li[@class="clear LOGVIEWDATA LOGCLICKDATA"]')
            # Iterate over every listing node
            for h in h_list:
                item = {}
                # Community name
                name_list = h.xpath('.//a[@data-el="region"]/text()')
                # Guard against an empty list
                item['name'] = name_list[0].strip() if name_list else None
                # Layout + area + orientation + decoration.. ['2室1厅 | 88.62平米 | 北 南 | 简装 | 顶层(共6层) | 2004年建 | 板楼']
                info_list = h.xpath('.//div[@class="houseInfo"]/text()')
                # Guard against an empty list
                if info_list:
                    L = info_list[0].split('|')
                    # ['2室1厅 ', ' 88.62平米 ', ' 北 南 ', ' 简装 ', ' 顶层(共6层) ', ' 2004年建 ', ' 板楼']
                    if len(L) >= 5:
                        item['model'] = L[0].strip()
                        item['area'] = L[1].strip()
                        item['direction'] = L[2].strip()
                        item['perfect'] = L[3].strip()
                        item['floor'] = L[4].strip()
                # District + total price + unit price
                # positionInfo holds two links (community and district); join both
                address_list = h.xpath('.//div[@class="positionInfo"]/a/text()')
                item['address'] = ' - '.join(a.strip() for a in address_list) if address_list else None
                # The sample markup uses class="totalPrice totalPrice2", so match with contains()
                total_list = h.xpath('.//div[contains(@class, "totalPrice")]/span/text()')
                item['total_list'] = total_list[0].strip() if total_list else None
                price_list = h.xpath('.//div[@class="unitPrice"]/span/text()')
                item['price_list'] = price_list[0].strip() if price_list else None
                print(item)

    # Entry point
    def run(self):
        try:
            for i in range(1, 101):
                url = self.url.format(i)
                self.parse_html(url)
                time.sleep(random.randint(1, 3))
                # Reset self.blog after each page
                self.blog = 1
        except Exception as e:
            print('An error occurred', e)


if __name__ == '__main__':
    spider = LinajiaSpider()
    spider.run()
