RE Regular Expressions (Using Python Web Scraping as the Example)

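Regular expressions (the re module) are a way of operating on strings. When scraping web pages, they let us extract exactly the data we want.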

Getting to know re

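1.re at a glance
2.re1
3.re quantifiers - - special symbols (control how much is matched)
4.re wildcard classes (control what is matched)
5.re exercises
6.Other re metacharacters
7.re greedy / non-greedy modes
8.re escape characters
9.Several re functions
10.Processing strings with re
11.Processing scraped HTML strings with re
12.Scraping a paginated site with re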

1.re at a glance


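A reminder about range(), which is used later to build page URLs:
for i in range(0, 3):
    i takes the values 0 to 2 in turn, 3 iterations in total
for i in range(0, 25, 1):
    i takes the values 0 to 24 in turn, 25 iterations in total

【1】 Matching a single character
match() matches only at the start of the string, and each token below matches exactly one character.
result = re.match(r'[-\d]', text)
print(result.group())
. (dot): matches any single character except a newline; pass re.DOTALL to make it match newlines too
\d: matches any single digit
\D: matches any single non-digit
\s: matches a whitespace character  Note: \n, \t and \r all count as whitespace
\w (lowercase): matches a-z, A-Z, digits and the underscore
\W: matches every character that \w does not match
[] : ->> a set; any one of the characters listed inside the brackets matches
With [] in hand, the shorthand classes can be written out:
\d  ->>  [0-9]
\D  ->>  [^0-9]
\w  ->>  [0-9a-zA-Z_]
\W  ->>  [^0-9a-zA-Z_]
[\d\D], [\w\W]  -->  match any character at all

【2】 Quantifiers
Appending a special symbol to the item being matched changes how many times it is matched.
Asterisk (*): matches zero or more
Plus (+): matches one or more
Question mark (?): matches either zero or one
text = '-a158-5555-6582'
With ?, the [] set is matched zero times or once.
The set accepts -, a digit, or a; matching starts at the beginning, and matching zero times is not an error.
result = re.match(r'[-a\d]?', text)

【3】 Other metacharacters
(1) Matching from the start versus scanning the whole string
re.match(): [must] match at the very beginning of the string
re.search(): scans the string from left to right and returns the first hit; later occurrences are not returned
text = 'aapythpyon'
result = re.match('y', text)    # None: text does not start with 'y', so result.group() raises AttributeError
result = re.search('py', text)  # py
(2) Uses of ^
a. inside brackets it negates the set
b. outside brackets it anchors the match to the start of the string
text = 'pppypthon'
result = re.search(r'[^\d]+', text)  # pppypthon
result = re.search('^p+', text)      # ppp
(3) ...$: requires the match to end the string
Extract data that ends with com; if the string does not end that way, search() returns None.
text = 'python123@163.com'
result = re.search(r'[\w]+@[a-z0-9]+[.]com$', text)
(4) |: matches any one of several expressions or strings
Putting https|http|ftp|file inside [] does not work, because the whole thing is then read as one set of characters:
(1) [] treats its contents as individual characters
(2) () treats the alternatives as separate strings

【4】 findall
string8 = "{ymd:'2018-01-01',tianqi:'晴',aqiInfo:'轻度污染'}," \
          "{ymd:'2018-01-02',tianqi:'阴~小雨',aqiInfo:'优'}," \
          "{ymd:'2018-01-03',tianqi:'小雨~中雨',aqiInfo:'优'}," \
          "{ymd:'2018-01-04',tianqi:'中雨~小雨',aqiInfo:'优'}"
() means: return only what the parenthesised part of the pattern matched.
print(re.findall("tianqi:'(.*?)'", string8))
Note:
With re.findall there is no need to call print(result.group());
findall returns a list, so print the result of re.findall('') directly.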
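
One point in the tour above is easy to trip over: when nothing matches, re.match() and re.search() return None rather than raising, and the error only appears once .group() is called on that None. Below is a minimal sketch of a guard. The helper name first_match is hypothetical, introduced here only for illustration.

```python
import re


def first_match(pattern, text):
    """Return the text of the first match, or None when nothing matches."""
    m = re.search(pattern, text)
    return m.group() if m else None


text = 'aapythpyon'
print(re.match('y', text))        # None: text does not start with 'y'
print(first_match('py', text))    # py
print(first_match(r'\d+', text))  # None: there are no digits in text
```

Checking the match object before calling .group() keeps a scraper running when one page happens to lack the expected data.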

2.re1

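import re

# 1. Match a single character: match() matches only one character here!
# Matching starts at the beginning of the string.
text = 'cpython'
result = re.match('py', text)
# print(result)          # None: 'cpython' does not start with 'py'
# print(result.group())  # AttributeError, because result is None

# Dot (.) matches any single character
# Tips: 1. it cannot match a newline  2. matching starts at the beginning of the string
text = 'cpython'
result = re.match('.', text)
# c

# \d: matches any single digit
# 1. matches digits only, nothing else  2. starts at the beginning  3. matches exactly one character
text = "211python"
result = re.match(r'\d', text)
# 2

# \D: matches anything except a digit
# 1. matches non-digits only  2. starts at the beginning  3. matches exactly one character
text = "python"
result = re.match(r'\D', text)
# p

# \s: matches a whitespace character
# 1. starts at the beginning  2. \n, \t and \r all count as whitespace  3. the s must be lowercase
text = '\npython'
result = re.match(r'\s', text)
# prints a newline

# \w (lowercase): matches a-z, A-Z, digits and the underscore
# 1. lowercase w  2. starts at the beginning
# 3. nothing else matches, except that Chinese characters do match while Chinese punctuation does not
text = "_python"
result = re.match(r'\w', text)
# _

# \W: matches every character that \w does not match
# 1. uppercase W  2. starts at the beginning
text = '--python'
result = re.match(r'\W', text)
# -

# [] : ->> a set; any one of the characters listed inside the brackets matches
# Tips: 1. everything inside [] is a candidate
# 2. several entries inside [] mean "or": the match succeeds if the character is any one of them
# 3. matching starts at the beginning  4. every demo in this section matches a single character only
text = ' thon'
result = re.match(r'[-\s]', text)
# matches the space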
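
The two tips about newlines and whitespace can be checked directly. The sketch below shows the dot failing on a leading newline, succeeding once re.DOTALL is passed, and \s accepting each whitespace character mentioned above. The variable names are chosen here for illustration.

```python
import re

text = '\npython'

# By default the dot refuses to match the leading newline.
plain = re.match('.', text)
# With re.DOTALL the dot matches the newline as well.
dotall = re.match('.', text, re.DOTALL)

print(plain)                 # None
print(repr(dotall.group()))  # '\n'

# \s accepts every one of these whitespace characters.
whitespace = [ch for ch in ' \n\t\r' if re.match(r'\s', ch)]
print(len(whitespace))       # 4
```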

3.re quantifiers - - special symbols (control how much is matched)

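import re

# Asterisk (*): matches zero or more
text = '158-5555-6582'
# Without *, matching starts at the beginning and takes only the first character.
result = re.match(r'[\d]', text)
# 1
# With *, matching starts at the beginning and repeats the [] set zero or more times.
result = re.match(r'[\d]*', text)
# 158
# With *, the set here accepts either - or \d at every step.
result = re.match(r'[-\d]*', text)
# 158-5555-6582
# With *, zero repetitions (!) are allowed, so this still succeeds.
result = re.match(r'[-]*', text)
# matches the empty string, so an empty string is printed

# Plus (+): matches one or more
text = 'a158-5555-6582'
# With +, the [] set must match at least once.
result = re.match(r'[\d]+', text)
# print(result.group())  # AttributeError: text starts with 'a', so result is None
# With +, the set repeats one or more times until a character fails to match.
result = re.match(r'[a\d]+', text)
# print(result.group())  # a158

# Question mark (?): matches either zero or one
text = '-a158-5555-6582'
# With ?, the [] set is matched zero times or once.
# The set accepts -, a digit, or a; matching starts at the beginning, and matching zero times is not an error.
result = re.match(r'[-a\d]?', text)
# -

# {m}: matches exactly m times
text = '158-5555-6582'
# With {k}, the [] set must match k times in a row from the beginning.
# If any of those k characters fails to match, match() returns None.
result = re.match(r'[\d]{3}', text)
# 158

# {m,n}: matches between m and n times,
# and by default takes as many as possible
text = '158-5m55-6582'
result = re.match(r'[-\d]{2,4}', text)
# 158-
result = re.match(r'[\d]{2,4}', text)
# 158
result = re.match(r'[-\d]{2,6}', text)
# 158-5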
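
To compare the quantifiers side by side, the sketch below runs a handful of patterns against the same text and records either the matched string or None. The pattern list is chosen here for illustration and is not part of the original demos.

```python
import re

text = '158-5m55-6582'
patterns = [r'\d', r'\d*', r'\d+', r'-?', r'\d{3}', r'\d{4}',
            r'[-\d]{2,4}', r'[-\d]{2,6}']

results = {}
for pattern in patterns:
    m = re.match(pattern, text)
    # Store None for a failed match instead of calling .group() on it.
    results[pattern] = m.group() if m else None
    print(f'{pattern!r:>14} -> {results[pattern]!r}')
```

For this text, \d{4} fails because the fourth character is a hyphen, -? succeeds with an empty string, and [-\d]{2,4} greedily takes four characters, 158-.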

4.re wildcard classes (control what is matched)

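import re

# \d ==> [0-9]: matches any digit
# Combined with * (zero or more) to match a run of digits and hyphens
text = '158-5555-6582'
result = re.match('[-0-9]*', text)
# print(result.group())  # 158-5555-6582

# \D ==> [^0-9]: matches any non-digit
# Combined with + (one or more): one or more non-digit characters
# Combined with * (zero or more): zero or more non-digit characters
text = '158-5555-6582'
result = re.match('[^0-9]*', text)
# Matches the empty string: nothing is consumed, but no error is raised

# \w ==> [0-9a-zA-Z_]: matches any digit, letter or underscore
# - is outside the set
text = '158-5555-6582'
result = re.match('[0-9a-zA-Z_]+', text)
# 158

# \W ==> [^0-9a-zA-Z_]: matches anything that is not a digit, letter or underscore
text = '我158-5555-6582'
result = re.match('[^0-9a-zA-Z_]+', text)
# 我

# [\d\D], [\w\W]: match any character
text = '---------123_145\n45 \t 678中文。'
result = re.match(r'[\w\W]+', text)
print(result.group())  # the whole string is printed

# Dot (.): matches any single character
text = 'python12--//'
result = re.match('[.]+', text)
# print(result.group())  # AttributeError: result is None
# [Question] If . matches every character, shouldn't it cover all of text once + is added?
# [Cause] [.] matches a literal dot only; with + it needs one or more dots, and text does not start with a dot, so match() returns None.
# Without the brackets, . goes back to matching any character.
result = re.match('.+', text)
print(result.group())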
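
The equivalence between \w and [0-9a-zA-Z_] holds for ASCII text only. For str patterns in Python 3, \w is Unicode-aware by default, which is why Chinese characters match it; passing re.ASCII restores the narrow definition. A short sketch of the difference, with variable names chosen here for illustration:

```python
import re

text = '我158-5555-6582'

# The explicit set does not contain 我, so nothing matches at the start.
explicit = re.match('[0-9a-zA-Z_]+', text)
# \w is Unicode-aware for str patterns, so 我 counts as a word character.
unicode_w = re.match(r'\w+', text)
# re.ASCII restricts \w to [a-zA-Z0-9_], the same as the explicit set.
ascii_w = re.match(r'\w+', text, re.ASCII)

print(explicit)           # None
print(unicode_w.group())  # 我158
print(ascii_w)            # None
```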

5.re exercises

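import re

# Validate a mobile phone number
# 1. must be 11 digits
# 2. the first digit must be 1            1
# 3. the second digit must be 3-9         [3456789]
# 4. digits 3 to 11 are unrestricted      [0-9]
text = '18726981556'
result = re.match('1[3-9][0-9]{9}', text)
# print(result.group())  # 18726981556

# Validate an email address
# ...@xxx.com
# python123@163.com
# 1. user name (the part before @)  ==> letters, digits and underscores
# 2. domain (the part after @)      ==> digits and letters (usually lowercase)
text = 'p_ython123@qq.com'
result = re.match(r'\w+@[0-9a-z]+[.]com', text)
# match succeeds

# Validate a simplified ID card number [18 characters]
# Format of the ID number:
# first 17 characters: [0-9]
# 18th character:      [0-9xX]
text = '342423199805200591'
result = re.match('[0-9]{17}[0-9xX]', text)
# print(result.group())  # match succeeds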
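
Because re.match() anchors only the start of the string, the phone pattern above also accepts a string with extra digits on the end. For validation, re.fullmatch() requires the whole string to fit. The sketch below assumes a hypothetical helper named is_mobile.

```python
import re

MOBILE = re.compile(r'1[3-9][0-9]{9}')


def is_mobile(text):
    """True only when the whole string is an 11-digit mobile number."""
    return MOBILE.fullmatch(text) is not None


too_long = '18726981556' + '0000'
print(MOBILE.match(too_long) is not None)  # True: match() ignores the trailing digits
print(is_mobile(too_long))                 # False
print(is_mobile('18726981556'))            # True
```

The same effect can be had with match() by ending the pattern with $, as section 6 shows for the email example.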

6.Other re metacharacters

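import re

# 1. Matching from the start versus scanning the whole string
# re.match(): [must] match at the very beginning of the string
# re.search(): scans the string from left to right and returns the first hit; later occurrences are not returned
text = 'aapythpyon'
result = re.match('y', text)  # None: text does not start with 'y', so result.group() would raise AttributeError
result = re.search('py', text)
# py

# 2. Uses of ^
# (1) inside brackets it negates the set
# (2) outside brackets it anchors the match to the start of the string
text = 'pppypthon'
result = re.search(r'[^\d]+', text)
# pppypthon
result = re.search('^p+', text)
# ppp

# 3. $: requires the match to end the string
# Extract data that ends with com; if the string does not end that way, search() returns None
text = 'python123@163.com'
result = re.search(r'[\w]+@[a-z0-9]+[.]com$', text)
# match succeeds

# 4. |: matches any one of several expressions or strings
# Putting https|http|ftp|file inside [] turns the whole thing into one set of single characters
# (1) [] treats its contents as individual characters
# (2) () treats the alternatives as separate strings
text = 'https://www.baidu.com/'
result = re.search('https|http|ftp|file', text)
# https
result = re.search('[https|http|ftp|file]', text)
# h
result = re.search('[https|http|ftp|file]+', text)
# https
result = re.search('(https|http|ftp|file)', text)
# https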
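
Two details of | are worth seeing in isolation: alternatives are tried from left to right, so their order matters, and parentheses limit how far the alternation reaches. The sketch below is illustrative; the second pattern, which also captures the host name, is an addition made here and not part of the original demos.

```python
import re

url = 'https://www.baidu.com/'

# Alternatives are tried left to right, and the first one that fits wins.
short_first = re.search('http|https', url).group()
long_first = re.search('https|http', url).group()
print(short_first)  # http
print(long_first)   # https

# Parentheses limit the alternation, so '://' applies to every scheme.
m = re.search(r'^(https|http|ftp|file)://([^/]+)', url)
print(m.group(1))  # https
print(m.group(2))  # www.baidu.com
```

Listing the longer alternative first (https before http) avoids cutting the scheme short.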

7.re greedy / non-greedy modes

import re

# Greedy mode: the regex matches as many characters as possible [this is the default]
text = 'python'
# With +, the [] set matches one or more times and keeps going until a character fails to match.
result = re.match('[a-z]+', text)
# python

# Non-greedy mode: the regex matches as few characters as possible [append ?]
text = 'python'
result = re.match('[a-z]+?', text)
# p

text = """
<tr class="兰智数加学院">
<tr class="1">shujia1</tr>
<tr class="2">shujia2</tr>
<tr class="3">shujia3</tr>
<tr class="4">shujia4</tr>
<tr class="5">shujia5</tr>
<tr class="6">爬虫</tr>
"""
result = re.match(r'\n<tr[\d\D]+>', text)
# print(result.group())  # prints everything
# Why does r'\n<tr[\d\D]+>' match everything? In the default greedy mode, the final > lands on the last > in the text.
result = re.match(r'\n<tr[\d\D]+?>', text)
# print(result.group())  # <tr class="兰智数加学院">

string8 = "{ymd:'2018-01-01',tianqi:'晴',aqiInfo:'轻度污染'}," \
          "{ymd:'2018-01-02',tianqi:'阴~小雨',aqiInfo:'优'}," \
          "{ymd:'2018-01-03',tianqi:'小雨~中雨',aqiInfo:'优'}," \
          "{ymd:'2018-01-04',tianqi:'中雨~小雨',aqiInfo:'优'}"
# () returns what the parenthesised part of the pattern matched
print(re.findall("tianqi:'(.*?)'", string8))
# With re.findall there is no need to call print(result.group());
# print the list returned by re.findall('') directly; the inner parentheses select what we get back
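
The difference between the two modes is easiest to see when the same tag repeats on one line. The sketch below uses a made-up HTML fragment and also shows a third option, a negated set, which stops at the next < without relying on non-greedy matching.

```python
import re

html = '<b>one</b><b>two</b>'  # made-up fragment for illustration

greedy = re.findall('<b>(.*)</b>', html)      # runs to the last </b>
lazy = re.findall('<b>(.*?)</b>', html)       # stops at the first </b> each time
negated = re.findall('<b>([^<]*)</b>', html)  # cannot cross a < at all

print(greedy)   # ['one</b><b>two']
print(lazy)     # ['one', 'two']
print(negated)  # ['one', 'two']
```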

8.re escape characters

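import re

# Escape character: \
# Escaping makes a symbol keep its literal meaning; know both what a symbol means before escaping and what it means after.
pi = '3....1415926*'
# Escape . and * so they become ordinary characters
result = re.match(r'\d\.+\d+\*+', pi)
# .   ->> matches any character
# \.  after escaping: a literal dot; writing [.] works too
# *   ->> matches zero or more of the preceding item
# \*  after escaping: a literal asterisk; use [*] or \*
print(result.group())  # 3....1415926*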
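
Two standard-library conveniences help with escaping: raw strings (r'...') keep backslashes intact so the pattern reads as written, and re.escape() escapes every special character in a literal piece of text. A minimal sketch, where the sample text is made up for illustration:

```python
import re

# A raw string keeps the backslash; the two spellings below are the same pattern.
print(r'\d' == '\\d')  # True

literal = '1+1=2'
sentence = 'so 1+1=2'

# Used directly as a pattern, + is a quantifier, so the literal text is not found.
unescaped = re.search(literal, sentence)
# re.escape() backslash-escapes the special characters for us.
escaped = re.search(re.escape(literal), sentence)

print(unescaped)        # None
print(escaped.group())  # 1+1=2
```

re.escape() is useful when part of a pattern comes from user input or scraped data rather than being typed by hand.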

9.Several re functions

import re

# In Python, arguments can be inserted into a string with format():
# message = "hello,i am no.{}".format(x)
x = "x"
message = "hello,i am no.{}".format(x)
print(message)

# group(): wrap the parts you want in () and read them back by number
text = 'my email is 2781162818@qq.com and PYTHON123@163.com'
result = re.search(r'[\s\w]+\s(\w+@[0-9a-z]+\.com)[\s\w]+\s(\w+@[0-9a-z]+\.com)', text)
# print(result.group())
# my email is 2781162818@qq.com and PYTHON123@163.com
# print(result.group(1))
# 2781162818@qq.com
# print(result.group(2))
# PYTHON123@163.com

# findall(): finds every substring in the whole string that satisfies the pattern
# [the result is a list]
text = 'my email is 2781162818@qq.com and PYTHON123@163.com'
result = re.findall(r'\s(\w+@[0-9a-z]+\.com)', text)
# print(result)
# ['2781162818@qq.com', 'PYTHON123@163.com']

# sub('a', 'b', text): replaces text
# [every substring that matches is replaced with the text you supply]
text = 'my email is 2781162818@qq.com and PYTHON123@163.com'
result = re.sub(r'\s(\w+@[0-9a-z]+\.com)', ' xxx', text)
print(result)

# split(): splits a string
text = 'my email is 2781162818@qq.com and PYTHON123@163.com'
# Split on spaces
result = re.split(' ', text)
print(result)
# Split on anything that is not \w: a space is not \w, @ is not \w, and . is not \w either
result = re.split(r'[^\w]', text)
print(result)

# compile(): precompiles a regular expression so it can be stored and reused;
# with re.VERBOSE the pattern can also carry comments
text = 'my email is 2781162818@qq.com and PYTHON123@163.com'
r = re.compile(r"""
    \s          # the space before the address
    (\w+        # first part of the address, before the @
    @           # the @ sign
    [0-9a-z]+   # second part, after the @ and before the dot
    \.com)      # the end of the address
""", re.VERBOSE)
result = re.findall(r, text)
print(result)
# ['2781162818@qq.com', 'PYTHON123@163.com']

# The pattern that extracts both addresses at once, this time with comments
text = 'my email is 2781162818@qq.com and PYTHON123@163.com'
r = re.compile(r"""
    [\sa-z]+    # whitespace and lowercase letters before the first address: my email is
    \s          # the space right before 278
    (\w+        # first address, the part before the @
    @           # the @ sign
    [0-9a-z]+   # first address, the part after the @
    \.com)      # the end of the first address
    [\sa-z]+    # whitespace and lowercase letters before the second address: and
    \s          # the space right before PYTHON
    (\w+        # second address, the part before the @
    @           # the @ sign
    [0-9a-z]+   # second address, the part after the @
    \.com)      # the end of the second address
""", re.VERBOSE)
result = re.findall(r, text)
print(result)
# [('2781162818@qq.com', 'PYTHON123@163.com')]
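
Numbered groups get hard to follow once a pattern has several of them. As a sketch of two related features of the re module, the example below names the groups with (?P<name>...), walks the matches with finditer(), and passes a function to sub(). The group names user and domain are chosen here for illustration.

```python
import re

EMAIL = re.compile(r'(?P<user>\w+)@(?P<domain>[0-9a-z]+\.com)')
text = 'my email is 2781162818@qq.com and PYTHON123@163.com'

# finditer() yields one match object per hit, so named groups stay available.
pairs = [(m.group('user'), m.group('domain')) for m in EMAIL.finditer(text)]
print(pairs)

# sub() also accepts a function: here it hides the user part of each address.
masked = EMAIL.sub(lambda m: '***@' + m.group('domain'), text)
print(masked)
```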

10.Processing strings with re

text = """
<ul class="ullist" padding="1" spacing="1"><li><div id="top"><span class="position" width="350">职位名称</span><span>职位类别</span><span>人数</span><span>地点</span><span>发布时间</span></div><div id="even"><span class="l square"><a target="_blank" href="position_detail.php?id=33824&amp;keywords=python&amp;tid=87&amp;lid=2218">python开发工程师</a></span><span>技术类</span><span>2</span><span>合肥</span><span>2018-10-23</span></div><div id="odd"><span class="l square"><a target="_blank" href="position_detail.php?id=29938&amp;keywords=python&amp;tid=87&amp;lid=2218">python后端</a></span><span>技术类</span><span>2</span><span>合肥</span><span>2018-10-23</span></div><div id="even"><span class="l square"><a target="_blank" href="position_detail.php?id=31236&amp;keywords=python&amp;tid=87&amp;lid=2218">高级Python开发工程师</a></span><span>技术类</span><span>2</span><span>合肥</span><span>2018-10-23</span></div><div id="odd"><span class="l square"><a target="_blank" href="position_detail.php?id=31235&amp;keywords=python&amp;tid=87&amp;lid=2218">python架构师</a></span><span>技术类</span><span>1</span><span>合肥</span><span>2018-10-23</span></div><div id="even"><span class="l square"><a target="_blank" href="position_detail.php?id=34531&amp;keywords=python&amp;tid=87&amp;lid=2218">Python数据开发工程师</a></span><span>技术类</span><span>1</span><span>合肥</span><span>2018-10-23</span></div><div id="odd"><span class="l square"><a target="_blank" href="position_detail.php?id=34532&amp;keywords=python&amp;tid=87&amp;lid=2218">高级图像算法研发工程师</a></span><span>技术类</span><span>1</span><span>合肥</span><span>2018-10-23</span></div><div id="even"><span class="l square"><a target="_blank" href="position_detail.php?id=31648&amp;keywords=python&amp;tid=87&amp;lid=2218">高级AI开发工程师</a></span><span>技术类</span><span>4</span><span>合肥</span><span>2018-10-23</span></div><div id="odd"><span class="l square"><a target="_blank" 
href="position_detail.php?id=32218&amp;keywords=python&amp;tid=87&amp;lid=2218">后台开发工程师</a></span><span>技术类</span><span>1</span><span>合肥</span><span>2018-10-23</span></div><div id="even"><span class="l square"><a target="_blank" href="position_detail.php?id=32217&amp;keywords=python&amp;tid=87&amp;lid=2218">Python开发（自动化运维方向）</a></span><span>技术类</span><span>1</span><span>合肥</span><span>2018-10-23</span></div><div id="odd"><span class="l square"><a target="_blank" href="position_detail.php?id=34511&amp;keywords=python&amp;tid=87&amp;lid=2218">Python数据挖掘讲师 </a></span><span>技术类</span><span>1</span><span>合肥</span><span>2018-10-23</span></div></li>
</ul>
"""
import re

# 1. Get all div tags
result = re.findall(r'<div[\d\D]*?</div>', text)
# print(result)
# or
result = re.findall('<div.*?</div>', text, re.DOTALL)

# 2. Get div tags that carry a given attribute (div tags with an id attribute)
result = re.findall(r'<div\sid.*?</div>', text, re.DOTALL)

# 3. Get all tags with id=even
result = re.findall(r'<div\sid="even".*?</div>', text, re.DOTALL)

# 4. Get the value of a tag attribute
# Get every id value
result = re.findall('<div id="(.*?)".*?</div>', text, re.DOTALL)
print(result)

# 5. Get the href value of each a tag
result = re.findall('<a.*?href="(.*?)">', text, re.DOTALL)
# print(result)

# 6. All job details inside the divs
result = re.findall('<span>(.*?)</span>', text, re.DOTALL)
print(result)

# 7. Get the job titles
result = re.findall('<a.*?>(.*?)</a>', text, re.DOTALL)
print(result)
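
The steps above pull out one field at a time. When a pattern contains several groups, findall() returns one tuple per match, which keeps the fields of each row together. The sketch below runs on a hypothetical two-row sample shaped like the listing above; the field names in fields are assumptions made here for illustration.

```python
import re

# Hypothetical two-row sample in the same shape as the listing above.
html = (
    '<div id="even"><span class="l square">'
    '<a target="_blank" href="position_detail.php?id=1">python developer</a></span>'
    '<span>tech</span><span>2</span><span>Hefei</span><span>2018-10-23</span></div>'
    '<div id="odd"><span class="l square">'
    '<a target="_blank" href="position_detail.php?id=2">python backend</a></span>'
    '<span>tech</span><span>1</span><span>Hefei</span><span>2018-10-24</span></div>'
)

# Six groups: link, title, then the four plain <span> cells of the row.
ROW = re.compile(
    r'<div id="(?:even|odd)">.*?<a[^>]*href="(.*?)">(.*?)</a></span>'
    r'<span>(.*?)</span><span>(.*?)</span><span>(.*?)</span><span>(.*?)</span></div>',
    re.DOTALL,
)

fields = ['href', 'title', 'category', 'headcount', 'city', 'date']
jobs = [dict(zip(fields, row)) for row in ROW.findall(html)]
for job in jobs:
    print(job)
```

The header row (id="top") is skipped automatically because the pattern only accepts even and odd rows.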

11.Processing scraped HTML strings with re

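The content fetched with the requests library is an HTML string, and regular expressions operate on exactly that: strings.
So regular expressions can be applied directly to extract data from the HTML string.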

import requests
import re

url = "https://s.weibo.com/top/summary?cate=realtimehot"
headers = {
    'cookie': 'SUB=_2AkMTRi1qf8NxqwFRmPEVzmLha4R1yQ3EieKlGtyxJRMxHRl-yT9kqmsmtRB6OMYDhdCTGr2FB95K0HLKHeAZHPKKREb3; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9WF8HDv2g_bdeD8IhYlDFZ.M; _s_tentry=-; Apache=2377795257586.004.1679467161310; SINAGLOBAL=2377795257586.004.1679467161310; ULV=1679467161315:1:1:1:2377795257586.004.1679467161310:',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36'
}
response = requests.get(url, headers=headers)
content = response.content.decode('utf8')
print(type(content))  # <class 'str'>

# Get the names on the hot-search list
names = re.findall('<td class="td-02">.*?<a.*?>(.*?)</a>', content, re.DOTALL)[1:]
# print(names)

# Get the heat values
hots = re.findall('<td class="td-02">.*?<span>(.*?)</', content, re.DOTALL)
# print(hots)

# Store the data
sinas = []
for name, hot in zip(names, hots):
    sina = {"name": name, "hot": hot}
    sinas.append(sina)
print(sinas)
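
Pairing two separate findall() results with zip() works only while both lists stay aligned, which is presumably why the names list is sliced with [1:]. The offline sketch below uses a hypothetical, simplified sample (the real page markup may differ) to show the misalignment and one way around it: capturing both fields in a single pattern.

```python
import re

# Hypothetical, simplified rows; the real page markup may differ.
sample = '''
<td class="td-02"><a href="/weibo?q=pinned">Pinned topic</a></td>
<td class="td-02"><a href="/weibo?q=a">Topic A</a><span>12345</span></td>
<td class="td-02"><a href="/weibo?q=b">Topic B</a><span>678</span></td>
'''

names = re.findall(r'<td class="td-02">.*?<a.*?>(.*?)</a>', sample, re.DOTALL)
hots = re.findall(r'<td class="td-02">.*?<span>(.*?)</', sample, re.DOTALL)
print(len(names), len(hots))  # 3 2: the first row has a name but no heat value

# Capturing both fields in one pattern keeps each name with its own value.
ROW = re.compile(r'<td class="td-02">\s*<a[^>]*>([^<]*)</a>\s*<span>([^<]*)</span>')
sinas = [{"name": name, "hot": hot} for name, hot in ROW.findall(sample)]
print(sinas)
```

With the single pattern, a row that lacks a heat value is simply skipped instead of shifting every later pair by one.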

12.Scraping a paginated site with re

import re
import requests
import csv

gushis = []
urls = []
# Build the URL of each page to fetch
for i in range(0, 5, 1):
    url = 'https://www.gushiwen.cn/default_{}.aspx'.format(i)
    urls.append(url)
    # print(url)

# Define the request headers
headers = {"cookie" : "login=flase; Hm_lvt_9007fab6814e892d3020a64454da5a55=1679637970; __bid_n=187123b38accfde6624207; FPTOKEN=/iNBklILJISaHg5CmgUVlxivXpv2j8GpwuWrSVKewp1C1HAJE873KXSPPU2Wh6ScBaR1VTAH0m+o44lxRanXXJICZc5mUXYJyNY6+YTF25f9/qE9DUYCzxes7r0Xfkzw0qtfDIW9gWbtt37qnkAYymMequLn1jAyYdzl3Q8M8vctJvoKbEZlf4RLlc16+cT4+aIJiHKDbpe0GKunIpw/71nWFSJgRB7FiSx5ucE07KBux6wfEyuIBxeHp3Ujnx8uvaZQVLZCPjfEwnnTiBw0Py1647QplmV8Qd60x1XLo1huueuZB/k8kL8fzD4q3Lx0jdViFQ8LpQa7xr7vJf1MYLIcGUyyRm5EzWxfp3oJ3PBUR9LN4iG4IFTOMlqYw52yq1cDcIW915wgWN0Oy8oiiWZGgNXdH53GeI0o15VjRGdP6GaFoebQa0RT7tDbM23/|U0Mzy8G2Y2S5UU23cAStWAVr0Z0bxruHz6SZnhkZe9Q=|10|46f2a7e0241d7c98763a077979d80ff8; Hm_lpvt_9007fab6814e892d3020a64454da5a55=1679643073","user-agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36"
}

for url in urls:
    response = requests.get(url, headers=headers)
    content = response.content.decode('utf8')
    # Poem titles
    titles = re.findall('<b>(.*?)</b>', content, re.DOTALL)
    # print(titles)
    # Authors
    authors = re.findall('<p class="source".*?<a.*?>(.*?)</a>', content, re.DOTALL)
    # Dynasties
    dynasties = re.findall('<p class="source".*?<a.*?<a.*?>(.*?)</a>', content, re.DOTALL)
    # Poem bodies
    poems = re.findall('<div class="contson".*?>(.*?)</div>', content, re.DOTALL)
    new_poems = []
    for poem in poems:
        # Strip the remaining tags, then all whitespace (including the full-width space)
        new_poem = re.sub('<.*?>', '', poem)
        new_poem = re.sub(r'[\s\u3000]', '', new_poem)
        new_poems.append(new_poem)
    # print(new_poems)
    # Store the data (keys: 诗名 = title, 作者 = author, 朝代 = dynasty, 内容 = content)
    for title, author, dynasty, new_poem in zip(titles, authors, dynasties, new_poems):
        gushi = {"诗名": title, "作者": author, "朝代": dynasty, "内容": new_poem}
        gushis.append(gushi)

# Write the results to a local file
with open("gushis.csv", "w", encoding='utf8', newline="") as f:
    # Field names
    fieldnames = ['诗名', "作者", "朝代", "内容"]
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    # print(gushis)
    writer.writerows(gushis)
print("Data written successfully!")
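
The cleaning and CSV steps can be tried without any network access. The sketch below applies the same two re.sub() calls to a hypothetical fragment and writes the row to an in-memory buffer instead of gushis.csv. The helper name clean_poem and the English field names are assumptions made here for illustration.

```python
import csv
import io
import re


def clean_poem(fragment):
    """Strip tags, then strip whitespace including the full-width space."""
    no_tags = re.sub(r'<.*?>', '', fragment)
    return re.sub(r'[\s\u3000]', '', no_tags)


# Hypothetical fragment standing in for one scraped poem body.
raw = '\n<p>床前明月光，<br />疑是地上霜。</p>\u3000\n'
rows = [{"title": "静夜思", "content": clean_poem(raw)}]

buffer = io.StringIO()  # in-memory stand-in for the gushis.csv file
writer = csv.DictWriter(buffer, fieldnames=["title", "content"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Swapping the buffer for open("gushis.csv", "w", encoding='utf8', newline="") gives the file-writing version used above.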
