Python Study Notes 15: Python and the Web

Python and the Web

Screen Scraping

# A simple screen scraping program
from urllib.request import urlopen
import re

# Raw string, so that \d reaches the regex engine unchanged
p = re.compile(r'<a href="(/jobs/\d+)/">(.*?)</a>')
text = urlopen('http://python.org/jobs').read().decode()
for url, name in p.findall(text):
    print('{} ({})'.format(name, url))

# Sample output:
Python Developer (/jobs/7209)
Python developer (/jobs/7208)
🛠 Experienced Data Engineer (/jobs/7200)
🤖 Experienced Machine Learning Engineer (/jobs/7199)
Lead / Senior Python Software Engineer (/jobs/7198)
Senior Back-End Developer (/jobs/7197)
IT Specialist (Data Science) (/jobs/7196)
Software Engineer (Mid/Senior) (/jobs/7194)
Python Back-end Developer - Planning and Tasking Team (/jobs/7193)
Senior Python Developer (/jobs/7187)
Health / Intermountain Healthcare (/jobs/7185)
Senior Python Developer (/jobs/7184)
Senior Python Developer (/jobs/7181)
Principal Python Engineer (/jobs/7180)
Senior Backend Software Engineer (f/m/d) (/jobs/7177)
Senior Software Engineer @ Omnipresent (/jobs/7175)
Principal Software Engineer @ Omnipresent (/jobs/7174)
Senior Software Engineer - Python (/jobs/7173)
Sr. Python Developer (/jobs/7171)
Senior Python Software Engineer (/jobs/7170)
Python Engineer @ Aeguana (/jobs/7165)
Algorithms Engineer (/jobs/7164)
Experienced Django Developer (Python) (/jobs/7163)
Python Software Engineer (/jobs/7162)
Senior Python/Django Engineer (/jobs/7129)

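Drawbacks

  • Regular expressions are not easy to read.
  • The approach cannot cope with HTML peculiarities such as CDATA sections and character entities (for example &amp;).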
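
The drawbacks above can be seen on a few hand-written links. This is only a sketch: the three HTML snippets are made up for illustration, and the pattern is the one from the scraper above written as a raw string.

```python
import re
from html import unescape

# Same pattern as the scraper above, written as a raw string.
pattern = re.compile(r'<a href="(/jobs/\d+)/">(.*?)</a>')

# Hypothetical snippets, invented for this illustration.
plain = '<a href="/jobs/1/">Python Developer</a>'
reordered = '<a class="job" href="/jobs/2/">Data Engineer</a>'
with_entity = '<a href="/jobs/3/">R&amp;D Engineer</a>'

# The plain link matches as intended.
print(pattern.findall(plain))

# An extra attribute before href is enough to make the pattern miss the link.
print(pattern.findall(reordered))

# The entity comes back undecoded; html.unescape has to be applied by hand.
matches = pattern.findall(with_entity)
print(matches)
print([unescape(name) for _, name in matches])
```

A real parser treats attribute order and entities as part of the markup, which is why the later sections move away from regular expressions.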

Tidy and XHTML Parsing

  1. What is Tidy?

Tidy is a tool for repairing ill-formed and sloppy HTML.

Malformed HTML code

<h1>Pet Shop 
<h2>Complaints</h3> 
<p>There is <b>no <i>way</b> at all</i> we can accept returned 
parrots. 
<h1><i>Dead Pets</h1>
<p>Our pets may tend to rest at times, but rarely die within the 
warranty period. 
<i><h2>News</h2></i> 
<p>We have just received <b>a really nice parrot. 
<p>It's really nice.</b> 
<h3><hr>The Norwegian Blue</h3> 
<h4>Plumage and <hr>pining behavior</h4> 
<a href="#norwegian-blue">More information<a> 
<p>Features: 
<body> 
<li>Beautiful plumage

The version repaired by Tidy

<!DOCTYPE html> 
<html> 
<head> 
<title></title> 
</head> 
<body> 
<h1>Pet Shop</h1> 
<h2>Complaints</h2> 
<p>There is <b>no <i>way</i></b> <i>at all</i> we can accept 
returned parrots.</p> 
<h1><i>Dead Pets</i></h1> 
<p><i>Our pets may tend to rest at times, but rarely die within the 
warranty period.</i></p> 
<h2><i>News</i></h2> 
<p>We have just received <b>a really nice parrot.</b></p> 
<p><b>It's really nice.</b></p> 
<hr> 
<h3>The Norwegian Blue</h3> 
<h4>Plumage and</h4> 
<hr> 
<h4>pining behavior</h4> 
<a href="#norwegian-blue">More information</a> 
<p>Features:</p> 
<ul> 
<li>Beautiful plumage</li> 
</ul> 
</body> 
</html>

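Of course, Tidy cannot fix every problem in an HTML file, but it does make sure the file is well formed (that is, all elements are nested correctly), which makes parsing much easier.

  2. Getting Tidy
$ pip install pytidylib

For example, suppose you have a messy HTML file (messy.html) and the command-line version of Tidy is on your execution path. The following program runs Tidy on that file and prints the result: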

from subprocess import Popen, PIPE

text = open('messy.html').read()
# Start the command-line tidy program and talk to it through pipes
tidy = Popen('tidy', stdin=PIPE, stdout=PIPE, stderr=PIPE)
tidy.stdin.write(text.encode())
tidy.stdin.close()
print(tidy.stdout.read().decode())

  3. Why XHTML?

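The main difference between XHTML and older forms of HTML is that XHTML is very strict and requires every element to be closed explicitly (at least as far as our current purposes are concerned).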
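
One way to see what this strictness means in practice is to feed markup to an XML parser, which rejects anything that is not well formed. The sketch below uses the standard library module xml.etree.ElementTree; the helper name is_well_formed and both fragments are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True

# Hypothetical fragments, invented for this illustration.
sloppy = '<ul><li>Beautiful plumage</ul>'        # <li> is never closed
strict = '<ul><li>Beautiful plumage</li></ul>'   # every element is closed

print(is_well_formed(sloppy))
print(is_well_formed(strict))
```

Only the second fragment is accepted, because every element in it is closed explicitly.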

A very simple way to parse the well-formed XHTML that Tidy produces is the HTMLParser class from the standard library module html.parser.

  4. Using HTMLParser

Callback methods in HTMLParser

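Callback method                   When it is called
handle_starttag(tag, attrs)       When a start tag is found. attrs is a sequence of (name, value) pairs.
handle_startendtag(tag, attrs)    For empty tags. By default, handles the start and end tags separately.
handle_endtag(tag)                When an end tag is found.
handle_data(data)                 For textual data.
handle_charref(ref)               For character references of the form &#ref;.
handle_entityref(name)            For entity references of the form &name;.
handle_comment(data)              For comments; called with only the comment contents.
handle_decl(decl)                 For declarations of the form <!...>.
handle_pi(data)                   For processing instructions.
unknown_decl(data)                When an unknown declaration is read.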
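
To get a feel for when these callbacks fire, it can help to record them on a tiny document. The following is a minimal sketch: the class name EventLogger and the input string are made up. It passes convert_charrefs=False because, with the default setting in current Python 3, references such as &amp; are decoded and delivered through handle_data instead of handle_charref and handle_entityref.

```python
from html.parser import HTMLParser

class EventLogger(HTMLParser):
    """Record each callback as a tuple instead of acting on it."""

    def __init__(self):
        # convert_charrefs=False keeps handle_entityref/handle_charref active.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.events.append(('startend', tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        self.events.append(('data', data))

    def handle_entityref(self, name):
        self.events.append(('entityref', name))

    def handle_comment(self, data):
        self.events.append(('comment', data))

logger = EventLogger()
# Hypothetical input, invented for this illustration.
logger.feed('<p class="x">Fish &amp; chips<br/><!-- note --></p>')
logger.close()

for event in logger.events:
    print(event)
```

Each printed tuple corresponds to one row of the table above, in the order the parser met the markup.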

# A screen scraping program using the HTMLParser module
from urllib.request import urlopen
from html.parser import HTMLParser

def isjob(url):
    # A job URL looks like /jobs/<digits>/
    try:
        a, b, c, d = url.split('/')
    except ValueError:
        return False
    return a == d == '' and b == 'jobs' and c.isdigit()

class Scraper(HTMLParser):
    in_link = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        url = attrs.get('href', '')
        if tag == 'a' and isjob(url):
            self.url = url
            self.in_link = True
            self.chunks = []

    def handle_data(self, data):
        # Collect text only while inside a job link
        if self.in_link:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self.in_link:
            print('{} ({})'.format(''.join(self.chunks), self.url))
            self.in_link = False

text = urlopen('http://python.org/jobs').read().decode()
parser = Scraper()
parser.feed(text)
parser.close()

# Sample output:
Python Developer (/jobs/7209/)
Python developer (/jobs/7208/)
🛠 Experienced Data Engineer (/jobs/7200/)
🤖 Experienced Machine Learning Engineer (/jobs/7199/)
Lead / Senior Python Software Engineer (/jobs/7198/)
Senior Back-End Developer (/jobs/7197/)
IT Specialist (Data Science) (/jobs/7196/)
Software Engineer (Mid/Senior) (/jobs/7194/)
Python Back-end Developer - Planning and Tasking Team (/jobs/7193/)
Senior Python Developer (/jobs/7187/)
Health / Intermountain Healthcare (/jobs/7185/)
Senior Python Developer (/jobs/7184/)
Senior Python Developer (/jobs/7181/)
Principal Python Engineer (/jobs/7180/)
Senior Backend Software Engineer (f/m/d) (/jobs/7177/)
Senior Software Engineer @ Omnipresent (/jobs/7175/)
Principal Software Engineer @ Omnipresent (/jobs/7174/)
Senior Software Engineer - Python (/jobs/7173/)
Sr. Python Developer (/jobs/7171/)
Senior Python Software Engineer (/jobs/7170/)
Python Engineer @ Aeguana (/jobs/7165/)
Algorithms Engineer (/jobs/7164/)
Experienced Django Developer (Python) (/jobs/7163/)
Python Software Engineer (/jobs/7162/)
Senior Python/Django Engineer (/jobs/7129/)

Beautiful Soup

Beautiful Soup is a small but excellent module for parsing the sloppy, ill-formed HTML you are likely to meet on the web.

# A screen scraping program using BeautifulSoup
from urllib.request import urlopen
from bs4 import BeautifulSoup

text = urlopen('http://python.org/jobs').read()
soup = BeautifulSoup(text, 'html.parser')
jobs = set()
# Each job title is an h2 element containing a link
for job in soup.body.section('h2'):
    jobs.add('{} ({})'.format(job.a.string, job.a['href']))

print('\n'.join(sorted(jobs, key=str.lower)))

# Sample output:
Algorithms Engineer (/jobs/7164/)
Experienced Django Developer (Python) (/jobs/7163/)
Health / Intermountain Healthcare (/jobs/7185/)
IT Specialist (Data Science) (/jobs/7196/)
Lead / Senior Python Software Engineer (/jobs/7198/)
Principal Python Engineer (/jobs/7180/)
Principal Software Engineer @ Omnipresent (/jobs/7174/)
Python Back-end Developer - Planning and Tasking Team (/jobs/7193/)
Python developer (/jobs/7208/)
Python Developer (/jobs/7209/)
Python Engineer @ Aeguana (/jobs/7165/)
Python Software Engineer (/jobs/7162/)
Senior Back-End Developer (/jobs/7197/)
Senior Backend Software Engineer (f/m/d) (/jobs/7177/)
Senior Python Developer (/jobs/7181/)
Senior Python Developer (/jobs/7184/)
Senior Python Developer (/jobs/7187/)
Senior Python Software Engineer (/jobs/7170/)
Senior Python/Django Engineer (/jobs/7129/)
Senior Software Engineer - Python (/jobs/7173/)
Senior Software Engineer @ Omnipresent (/jobs/7175/)
Software Engineer (Mid/Senior) (/jobs/7194/)
Sr. Python Developer (/jobs/7171/)
🛠 Experienced Data Engineer (/jobs/7200/)
🤖 Experienced Machine Learning Engineer (/jobs/7199/)

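Creating Dynamic Web Pages with CGI

The Common Gateway Interface (CGI) is a standard mechanism by which a web server can pass queries (typically supplied through a web form) to a dedicated program (such as a Python program you write) and display the results as a web page.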
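
As a rough sketch of that mechanism: for a GET request the server puts the query into the standard CGI environment variable QUERY_STRING, runs the script, and sends whatever the script prints (headers, a blank line, then the body) back to the browser. The function name build_response and the name parameter are made up for illustration.

```python
import os
from urllib.parse import parse_qs

def build_response(environ):
    """Build a plain-text CGI response from a CGI-style environment mapping."""
    # parse_qs maps each field name to a list of values.
    query = parse_qs(environ.get('QUERY_STRING', ''))
    # 'name' is a hypothetical form field; fall back to 'world' if it is absent.
    name = query.get('name', ['world'])[0]
    headers = 'Content-Type: text/plain; charset=utf-8\n'
    body = 'Hello, {}!\n'.format(name)
    # A blank line separates the headers from the body.
    return headers + '\n' + body

if __name__ == '__main__':
    # When run by a web server, the real query arrives via os.environ.
    print(build_response(os.environ), end='')
```

Calling build_response({'QUERY_STRING': 'name=Gumby'}) by hand is an easy way to try the logic without a server.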

Step 1: Prepare the web server

$ python3 -m http.server --cgi
Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...

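If you now point your browser to http://127.0.0.1:8000 or http://localhost:8000, you will see a listing of the directory the server was started in. The server also prints information about each connection.

A CGI program must also be placed in a directory that is accessible via the web. In addition, it must be identified as a CGI script so that the web server does not serve its source code as an ordinary page. There are two common ways of doing this:

  • Put the script in a subdirectory named cgi-bin.
  • Give the script file the extension .cgi.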

P.S. Opening a specific port

# Check the firewall status
systemctl status firewalld
# Check whether the port is already open
firewall-cmd --query-port=8000/tcp
# Open the given port
firewall-cmd --permanent --add-port=8000/tcp
# Close the given port
firewall-cmd --permanent --remove-port=8000/tcp
# Restart the firewall after adding the port
systemctl restart firewalld

Step 2: Add the #! line

Add the following as the very first line of the script (with no blank line before it):

#!/usr/bin/python3

Step 3: Set the file permissions

chmod 755 hello.cgi

A simple CGI script

#!/usr/bin/python3

print('Content-type: text/plain')
print()  # Print a blank line to end the headers

print('Hello, world!')

P.S. The script needs to be placed in the cgi-bin directory.

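Summary

Screen scraping: automatically downloading web pages and extracting information from them. The Tidy program and its library version are useful tools for repairing ill-formed HTML, which can then be parsed with an HTML parser. Another option is Beautiful Soup, which copes even with messy input.

CGI: the Common Gateway Interface is a way of creating dynamic web pages by having the web server run your programs, communicate with them, and display the results. The cgi and cgitb modules can be used for writing CGI scripts. CGI scripts are usually invoked from HTML forms.

Flask: a simple web framework that lets you publish your code as a web application without worrying too much about the web part.

Web application frameworks: for developing large, complex web applications in Python, a web application framework is all but essential. Flask is a good choice for simple projects; for larger ones, you may want to consider Django or TurboGears.

Web services: web services are to programs what web pages are to people. You can think of them as a way of doing network programming at a higher level of abstraction. Common web service standards include RSS (and its relatives RDF and Atom), XML-RPC, and SOAP.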
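
To make the XML-RPC item a little more concrete, the sketch below uses the standard library module xmlrpc.client to show what a call looks like on the wire, without contacting any server. The method name add and its arguments are made up for illustration.

```python
import xmlrpc.client

# Marshal a call to a hypothetical remote procedure add(2, 3) into XML.
request_xml = xmlrpc.client.dumps((2, 3), methodname='add')
print(request_xml)

# A server would decode the same XML back into arguments and a method name.
params, method = xmlrpc.client.loads(request_xml)
print(method, params)
```

The XML document is what travels over HTTP; on both sides the program only sees ordinary Python values.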