
Downloading data - urllib / requests / aiohttp / httpx.
Parsing data - re / lxml / beautifulsoup4 / pyquery.
Caching and persistence - mysqlclient / sqlalchemy / peewee / redis / pymongo.
Generating digital signatures - hashlib.
Serialization and compression - pickle / json / zlib.
Scheduling - multiprocessing / threading / concurrent.futures.
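
As a rough illustration of two of the roles listed above, the sketch below uses hashlib to derive a fixed-length digest from a URL (handy as a cache key or for de-duplication) and pickle with zlib to serialize and compress a page record. The helper names `url_digest`, `pack`, and `unpack` and the sample record are hypothetical, not part of the original text.

```python
import hashlib
import pickle
import zlib


def url_digest(url: str) -> str:
    """Return the SHA-1 hex digest of a URL (hypothetical helper)."""
    return hashlib.sha1(url.encode('utf-8')).hexdigest()


def pack(obj) -> bytes:
    """Serialize an object with pickle, then compress it with zlib."""
    return zlib.compress(pickle.dumps(obj))


def unpack(blob: bytes):
    """Reverse pack(): decompress, then deserialize."""
    return pickle.loads(zlib.decompress(blob))


# Sample record; the URL and HTML are made up for illustration.
record = {'url': 'https://example.com/', 'html': '<p>hello</p>' * 100}
key = url_digest(record['url'])
blob = pack(record)
print(key, len(blob))
```

Only unpickle data you produced yourself, since pickle can execute arbitrary code when loading untrusted input.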

HTML pages

<!DOCTYPE html>
<html>
 <head>
  <title>Home</title>
  <style type="text/css">
   /* CSS omitted here */
  </style>
 </head>
 <body>
  <div class="wrapper">
   <header>
    <h1>Yoko's Kitchen</h1>
    <nav>
     <ul>
      <li><a href="" class="current">Home</a></li>
      <li><a href="">Classes</a></li>
      <li><a href="">Catering</a></li>
      <li><a href="">About</a></li>
      <li><a href="">Contact</a></li>
     </ul>
    </nav>
   </header>
   <section class="courses">
    <article>
     <figure>
      <img src="images/bok-choi.jpg" alt="Bok Choi" />
      <figcaption>Bok Choi</figcaption>
     </figure>
     <hgroup>
      <h2>Japanese Vegetarian</h2>
      <h3>Five week course in London</h3>
     </hgroup>
     <p>A five week introduction to traditional Japanese vegetarian meals, teaching you a selection of rice and noodle dishes.</p>
    </article>
    <article>
     <figure>
      <img src="images/teriyaki.jpg" alt="Teriyaki sauce" />
      <figcaption>Teriyaki Sauce</figcaption>
     </figure>
     <hgroup>
      <h2>Sauces Masterclass</h2>
      <h3>One day workshop</h3>
     </hgroup>
     <p>An intensive one-day course looking at how to create the most delicious sauces for use in a range of Japanese cookery.</p>
    </article>
   </section>
   <aside>
    <section class="popular-recipes">
     <h2>Popular Recipes</h2>
     <a href="">Yakitori (grilled chicken)</a>
     <a href="">Tsukune (minced chicken patties)</a>
     <a href="">Okonomiyaki (savory pancakes)</a>
     <a href="">Mizutaki (chicken stew)</a>
    </section>
    <section class="contact-details">
     <h2>Contact</h2>
     <p>Yoko's Kitchen<br>
      27 Redchurch Street<br>
      Shoreditch<br>
      London E2 7DP</p>
    </section>
   </aside>
   <footer>
    © 2011 Yoko's Kitchen
   </footer>
  </div>
  <script>
   /* JavaScript omitted here */
  </script>
 </body>
</html>

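An HTML page like the one above usually consists of three parts: tags, which carry the content; CSS (Cascading Style Sheets), which controls how the page is rendered; and JavaScript, which controls interactive behavior. You can usually get a page's code and inspect its structure by choosing "View Page Source" from the browser's context menu; the browser's developer tools give you more detail.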
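
To make the tag structure concrete, the sketch below walks a small HTML fragment with the standard library's html.parser and records each opening tag and the text of each link. The `TagCollector` class and the sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect opening tag names and link texts (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.link_texts = []
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == 'a':
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self._in_link = False

    def handle_data(self, data):
        # Keep text only while inside an <a> element.
        if self._in_link:
            self.link_texts.append(data)


# Sample markup modeled on a navigation list; made up for illustration.
sample = ('<nav><ul><li><a href="">Home</a></li>'
          '<li><a href="">Classes</a></li></ul></nav>')
collector = TagCollector()
collector.feed(sample)
print(collector.tags)
print(collector.link_texts)
```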

Fetching pages with requests

In the previous lesson's code we used the third-party library requests to fetch pages. Below we look at how to use requests in more detail.

GET and POST requests.

import requests

# GET request: inspect the status code, headers, cookies, and decoded body.
resp = requests.get('http://www.baidu.com/index.html')
print(resp.status_code)
print(resp.headers)
print(resp.cookies)
print(resp.content.decode('utf-8'))

# POST request with form data; httpbin echoes the request back as JSON.
resp = requests.post('http://httpbin.org/post', data={'name': 'Hao', 'age': 40})
print(resp.text)
data = resp.json()
print(type(data))

URL parameters and request headers.

resp = requests.get(
    url='https://movie.douban.com/top250',
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/83.0.4103.97 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;'
                  'q=0.9,image/webp,image/apng,*/*;'
                  'q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    }
)
print(resp.status_code)
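
The request above passes only headers. For the URL-parameter half of the heading, requests accepts a `params` dictionary and appends it to the URL as a query string. The sketch below shows the equivalent encoding with the standard library so that it runs without network access; the `build_url` helper is hypothetical.

```python
from urllib.parse import urlencode


def build_url(base: str, params: dict) -> str:
    """Append URL-encoded query parameters to a base URL (hypothetical helper)."""
    return f'{base}?{urlencode(params)}'


# The Douban Top250 examples in this text page through results with a start offset.
url = build_url('https://movie.douban.com/top250', {'start': 25})
print(url)

# With requests, the same request could be written as:
# requests.get('https://movie.douban.com/top250', params={'start': 25})
```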

Complex POST requests (file upload).

# Open the file in a with block so that it is closed after the upload.
with open('data.xlsx', 'rb') as file:
    resp = requests.post(
        url='http://httpbin.org/post',
        files={'file': file},
    )
print(resp.text)

Working with cookies.

cookies = {'key1': 'value1', 'key2': 'value2'}
resp = requests.get('http://httpbin.org/cookies', cookies=cookies)
print(resp.text)

jar = requests.cookies.RequestsCookieJar()
jar.set('tasty_cookie', 'yum', domain='httpbin.org', path='/cookies')
jar.set('gross_cookie', 'blech', domain='httpbin.org', path='/elsewhere')
resp = requests.get('http://httpbin.org/cookies', cookies=jar)
print(resp.text)

Setting a proxy server.
```
requests.get('https://www.taobao.com', proxies={
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
})
```
Note: For more about the requests library, we strongly recommend reading its official documentation yourself.

Setting a request timeout.

requests.get('https://github.com', timeout=10)

Page parsing

Comparison of parsing approaches

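Parsing approach	Module	Speed	Difficulty	Notes
Regular expressions	re	Fast	Hard	Common regular expressions; online regular expression testers
XPath	lxml	Fast	Moderate	Requires C library dependencies; the only parser listed here that supports XML
CSS selectors	bs4 / pyquery	Varies	Easy	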

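Note: The parsers BeautifulSoup can use are html.parser from the Python standard library, lxml's HTML parser, lxml's XML parser, and html5lib.

Parsing pages with regular expressions

If regular expressions are entirely new to you, we recommend first reading the "Regular Expressions 30-Minute Tutorial" (正则表达式30分钟入门教程), and then our earlier article on using regular expressions in Python.

The example below shows how to use a regular expression to extract the Chinese movie titles from "Douban Movie Top250".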

import random
import re
import time

import requests

# Capture the text of the title span that directly follows each link tag.
PATTERN = re.compile(r'<a[^>]*?>\s*<span class="title">(.*?)</span>')

for page in range(10):
    resp = requests.get(
        url=f'https://movie.douban.com/top250?start={page * 25}',
        headers={
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/83.0.4103.97 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;'
                      'q=0.9,image/webp,image/apng,*/*;'
                      'q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        },
    )
    items = PATTERN.findall(resp.text)
    for item in items:
        print(item)
    # Pause 1 to 5 seconds between pages to avoid hammering the site.
    time.sleep(random.randint(1, 5))
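
Because the live page can change or reject automated requests, it may help to try a pattern on a fixed snippet first. The sketch below assumes each movie link wraps a span with the class title; both the sample markup and the pattern are assumptions for illustration, not Douban's confirmed markup.

```python
import re

# Assumed pattern: an <a> tag followed by a <span class="title"> element.
PATTERN = re.compile(r'<a[^>]*?>\s*<span class="title">(.*?)</span>')

# Hand-written sample; the real page markup may differ.
sample_html = '''
<a href="https://movie.douban.com/subject/1/">
    <span class="title">Movie A</span>
    <span class="title">&nbsp;/&nbsp;Alias A</span>
</a>
<a href="https://movie.douban.com/subject/2/">
    <span class="title">Movie B</span>
</a>
'''

# Only the span directly after each <a> tag is captured.
titles = PATTERN.findall(sample_html)
print(titles)
```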

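XPath parsing and lxml

XPath is a syntax for finding information in an XML document. It uses path expressions to select a node or a set of nodes in the document. XPath nodes include elements, attributes, text, namespaces, processing instructions, comments, and the root node.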


<bookstore>
    <book>
      <title lang="eng">Harry Potter</title>
      <price>29.99</price>
    </book>
    <book>
      <title lang="zh">三國演義</title>
      <price>39.95</price>
    </book>
</bookstore>

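For the XML file above, we can use the XPath expressions below to select nodes in the document.

Path expression	Result
bookstore	Selects all nodes named bookstore.
/bookstore	Selects the root element bookstore. Note: a path that starts with a slash ( / ) always represents an absolute path to an element.
bookstore/book	Selects all book elements that are children of bookstore.
//book	Selects all book elements, wherever they are in the document.
bookstore//book	Selects all book elements that are descendants of bookstore, wherever they sit beneath it.
//@lang	Selects all attributes named lang.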
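
A few of these path expressions can be tried without installing anything. The sketch below uses the standard library's xml.etree.ElementTree rather than lxml; it supports only a subset of XPath and requires paths relative to an element, so the expressions are adapted accordingly.

```python
import xml.etree.ElementTree as ET

XML = '''
<bookstore>
    <book>
        <title lang="eng">Harry Potter</title>
        <price>29.99</price>
    </book>
    <book>
        <title lang="zh">三國演義</title>
        <price>39.95</price>
    </book>
</bookstore>
'''

root = ET.fromstring(XML)  # root is the bookstore element

# Direct book children of bookstore (bookstore/book).
books = root.findall('book')
# book elements at any depth below bookstore (bookstore//book).
all_books = root.findall('.//book')
# ElementTree cannot return attribute nodes, so //@lang is emulated by
# selecting elements that carry the attribute and reading its value.
langs = [el.get('lang') for el in root.findall('.//*[@lang]')]

print(len(books), len(all_books), langs)
```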

XPath expressions can also use predicates.

Path expression	Result
/bookstore/book[1]	Selects the first book element that is a child of bookstore.
/bookstore/book[last()]	Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]	Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()<3]	Selects the first two book elements that are children of bookstore.
//title[@lang]	Selects all title elements that have an attribute named lang.
//title[@lang='eng']	Selects all title elements whose lang attribute has the value eng.
/bookstore/book[price>35.00]	Selects all book elements of bookstore whose price element has a value greater than 35.00.
/bookstore/book[price>35.00]/title	Selects all title elements of the book elements of bookstore whose price element has a value greater than 35.00.
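
The predicates can be sketched the same way. This example again uses the standard library's xml.etree.ElementTree instead of lxml; it handles positional and attribute predicates, but not numeric comparisons such as price>35.00, so that filter is written in plain Python here.

```python
import xml.etree.ElementTree as ET

XML = '''
<bookstore>
    <book>
        <title lang="eng">Harry Potter</title>
        <price>29.99</price>
    </book>
    <book>
        <title lang="zh">三國演義</title>
        <price>39.95</price>
    </book>
</bookstore>
'''

root = ET.fromstring(XML)  # root is the bookstore element

first = root.find('book[1]/title').text        # like /bookstore/book[1]
last = root.find('book[last()]/title').text    # like /bookstore/book[last()]
eng = [t.text for t in root.findall(".//title[@lang='eng']")]

# Emulates /bookstore/book[price>35.00]/title with a Python filter.
pricey = [book.find('title').text for book in root.findall('book')
          if float(book.find('price').text) > 35.00]

print(first, last, eng, pricey)
```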

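XPath also supports wildcards, as shown below.

Path expression	Result
/bookstore/*	Selects all child elements of bookstore.
//*	Selects all elements in the document.
//title[@*]	Selects all title elements that have at least one attribute.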

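To select several node sets at once, combine paths with the | operator.

Path expression	Result
//book/title | //book/price	Selects all title and price elements of book elements.
//title | //price	Selects all title and price elements in the document.
/bookstore/book/title | //price	Selects all title elements of the book elements of bookstore, plus all price elements in the document.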

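Note: The examples above come from the XPath tutorial on the Runoob (菜鸟教程) website; interested readers can read the original.

If you are not yet comfortable with XPath syntax, Chrome can generate the XPath of an element for you: in the developer tools, right-click the element in the Elements panel and choose Copy > Copy XPath.

The example below shows how to use XPath to extract the Chinese movie titles from "Douban Movie Top250".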

from lxml import etree

import requests

for page in range(10):
    resp = requests.get(
        url=f'https://movie.douban.com/top250?start={page * 25}',
        headers={
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/83.0.4103.97 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;'
                      'q=0.9,image/webp,image/apng,*/*;'
                      'q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        }
    )
    # Parse the response into an element tree, then select the first
    # span inside each movie link using an absolute path.
    html = etree.HTML(resp.text)
    spans = html.xpath('/html/body/div[3]/div[1]/div/div[1]/ol/li/div/div[2]/div[1]/a/span[1]')
    for span in spans:
        print(span.text)

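Using BeautifulSoup

BeautifulSoup is a Python library for extracting data from HTML or XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document.

Traversing the document tree

Getting tags
Getting tag attributes
Getting tag content
Getting child (and descendant) nodes
Getting parent/ancestor nodes
Getting sibling nodes

Searching the tree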

find / find_all
select_one / select
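
The traversal and search operations listed above can be sketched on a small fragment. This example assumes beautifulsoup4 is installed and uses the built-in html.parser so that lxml is not required; the sample markup is made up for illustration.

```python
import bs4

HTML = '''
<div class="info">
    <h2 id="heading">Popular Recipes</h2>
    <a href="/yakitori" class="recipe">Yakitori</a>
    <a href="/tsukune" class="recipe">Tsukune</a>
</div>
'''

soup = bs4.BeautifulSoup(HTML, 'html.parser')

# Traversing the tree
h2 = soup.find('h2')                    # get a tag
heading_id = h2['id']                   # get a tag attribute
heading_text = h2.text                  # get tag content
div = h2.parent                         # get the parent node
children = [c.name for c in div.children if isinstance(c, bs4.Tag)]
sibling = h2.find_next_sibling('a')     # get the next sibling <a> tag

# Searching the tree
links = soup.find_all('a', class_='recipe')   # by tag name and class
first = soup.select_one('div.info > a')       # CSS selector, first match
hrefs = [a['href'] for a in soup.select('a.recipe')]

print(heading_id, heading_text, children, hrefs)
```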

Note: See BeautifulSoup's official documentation for more.

The example below shows how to use CSS selectors to extract the Chinese movie titles from "Douban Movie Top250".

import random
import time

import bs4
import requests

for page in range(10):
    resp = requests.get(
        url=f'https://movie.douban.com/top250?start={page * 25}',
        headers={
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/83.0.4103.97 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;'
                      'q=0.9,image/webp,image/apng,*/*;'
                      'q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        },
    )
    soup = bs4.BeautifulSoup(resp.text, 'lxml')
    # Select each movie link, then the first element with class "title" inside it.
    elements = soup.select('.info>div>a')
    for element in elements:
        span = element.select_one('.title')
        print(span.text)
    # Pause up to 5 seconds between pages.
    time.sleep(random.random() * 5)

Example - getting question links from Zhihu Explore

import re
from urllib.parse import urljoin

import bs4
import requests


def main():
    headers = {'user-agent': 'Baiduspider'}
    base_url = 'https://www.zhihu.com/'
    resp = requests.get(urljoin(base_url, 'explore'), headers=headers)
    soup = bs4.BeautifulSoup(resp.text, 'lxml')
    href_regex = re.compile(r'^/question')
    links_set = set()
    for a_tag in soup.find_all('a', {'href': href_regex}):
        if 'href' in a_tag.attrs:
            href = a_tag.attrs['href']
            full_url = urljoin(base_url, href)
            links_set.add(full_url)
    print('Total %d question pages found.' % len(links_set))
    print(links_set)


if __name__ == '__main__':
    main()
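
The link handling in this example can be checked offline. The sketch below applies the same `^/question` filter and urljoin call to a hand-written list of hrefs, which stands in for the links found on a page; the href values are made up for illustration.

```python
import re
from urllib.parse import urljoin

base_url = 'https://www.zhihu.com/'
href_regex = re.compile(r'^/question')

# Hand-written hrefs standing in for the links found on a page.
hrefs = ['/question/1', '/question/1', '/people/someone', '/question/2/answer/3']

# Keep only question links, resolve them against the base URL,
# and let the set drop duplicates.
links_set = {urljoin(base_url, href) for href in hrefs if href_regex.match(href)}
print(sorted(links_set))
```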

Data collection and parsing