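A hands-on crawler walkthrough | Scraping images from 500px
Goal
Scrape images from the 500px website and save them locally.
Project setup
Software: PyCharm
Third-party libraries: requests, fake_useragent
Site URL: https://500px.com/popular
Site analysis
When you first pick up a site, check whether the target site loads its content statically or dynamically.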

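The page has a scroll bar on the right. Scrolling down shows that there are no page numbers and that more content loads automatically, which usually suggests the site loads its content dynamically. To confirm, open the developer tools, copy one of the image links, press Ctrl+U to view the page source, press Ctrl+F to open the search box, and paste the link in. The link is nowhere in the source, which confirms that the content is loaded dynamically.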
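The Ctrl+U check can also be scripted. The following is a minimal sketch, assuming the raw page source has already been fetched into a string (for example with requests.get(url).text). The function name link_in_static_source, the sample markup, and the example.com link are illustrative and not part of the original project.

```python
def link_in_static_source(page_source: str, link: str) -> bool:
    """Return True if the link appears verbatim in the raw page source."""
    # HTML often escapes "&" as "&amp;", so check both spellings.
    escaped = link.replace("&", "&amp;")
    return link in page_source or escaped in page_source


# Hypothetical page source: the markup ships an empty container that a
# script fills in later, so the image link is absent from the raw HTML.
sample_source = (
    '<html><body><div id="root"></div>'
    '<script src="/app.js"></script></body></html>'
)
sample_link = "https://example.com/photo/1.jpg"

found = link_in_static_source(sample_source, sample_link)
print("static" if found else "probably loaded dynamically")
```

A missing link is a strong hint rather than proof, since a page can also embed the same image under a different URL.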


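The image links can be found among the requests captured in the developer tools. Scrolling down triggers another request that loads the next page of content.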


This request address is the real URL behind the page.

Copy the first few addresses and compare them:
First: https://api.500px.com/v1/photos?rpp=50&feature=popular&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50
Second: https://api.500px.com/v1/photos?rpp=50&feature=popular&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=All+photographers%2CPulse&exclude=&personalized_categories=&page=2&rpp=50
The first page has page=1, the second has page=2, and so on. The only parameter also differs between the two URLs, but testing showed that leaving it empty causes no problems. That gives the pattern for every page: only the page number needs to change.
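Comparing long URLs by eye is error-prone, so the comparison can be done with the standard library instead. This is a small sketch using urllib.parse; the helper name differing_params is an assumption, and the two URLs are shortened copies of the captured ones (most image_size entries are dropped to keep the example readable).

```python
from urllib.parse import parse_qs, urlsplit


def differing_params(url_a: str, url_b: str) -> dict:
    """Map each query parameter whose values differ to its (a, b) values."""
    query_a = parse_qs(urlsplit(url_a).query, keep_blank_values=True)
    query_b = parse_qs(urlsplit(url_b).query, keep_blank_values=True)
    return {
        key: (query_a.get(key), query_b.get(key))
        for key in sorted(set(query_a) | set(query_b))
        if query_a.get(key) != query_b.get(key)
    }


# Shortened copies of the two captured URLs.
first = ("https://api.500px.com/v1/photos?rpp=50&feature=popular"
         "&image_size%5B%5D=1&image_size%5B%5D=2048"
         "&only=&exclude=&page=1&rpp=50")
second = ("https://api.500px.com/v1/photos?rpp=50&feature=popular"
          "&image_size%5B%5D=1&image_size%5B%5D=2048"
          "&only=All+photographers%2CPulse&exclude=&page=2&rpp=50")

diff = differing_params(first, second)
for name, (value_a, value_b) in diff.items():
    print(name, value_a, value_b)
```

keep_blank_values=True matters here: without it, parse_qs drops empty parameters such as only= and the comparison would report them as missing rather than empty.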
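Anti-scraping analysis
Repeated requests from the same IP address risk getting blocked. As a mitigation, fake_useragent is used here to generate a random User-Agent request header, so the requests do not all present the same client signature.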
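One way to apply this is to build the headers in a small helper and call it for each request. The sketch below is an assumption rather than the article's code: random_headers and FALLBACK_USER_AGENTS are hypothetical names, and the fallback strings are only illustrative values used when fake_useragent is not installed or cannot load its data.

```python
import random

# Illustrative fallback strings only; any realistic browser User-Agent works.
FALLBACK_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def random_headers() -> dict:
    """Build request headers with a randomly chosen User-Agent."""
    try:
        from fake_useragent import UserAgent
        user_agent = UserAgent().random
    except Exception:
        # fake_useragent is missing or could not load its data.
        user_agent = random.choice(FALLBACK_USER_AGENTS)
    return {"User-Agent": user_agent}


headers = random_headers()
print(headers["User-Agent"])
```

Calling random_headers() once per request gives each request its own User-Agent, whereas building the headers once in __init__ reuses a single value for the whole run.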
Implementation
1. Import the third-party libraries, define a class that inherits from object, and give it an __init__ method and a main method, each taking self.
import requests
from fake_useragent import UserAgent

filename = 0  # running counter used to name the saved images


class photo_spider(object):
    def __init__(self):
        # API URL template; {} is filled in with the page number
        self.url = 'https://api.500px.com/v1/photos?rpp=50&feature=popular&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page={}&rpp=50'
        ua = UserAgent()
        # Pick a random User-Agent for the request headers
        self.headers = {
            'User-Agent': ua.random
        }

    def main(self):
        pass


if __name__ == '__main__':
    spider = photo_spider()
    spider.main()
2. Send the request and fetch the page data.
    def get_html(self, url):
        response = requests.get(url, headers=self.headers)
        html = response.json()  # the dynamically loaded data arrives as JSON
        return html
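response.json() raises an exception when the body is not JSON, which can happen if the server answers with an HTML error or block page. The sketch below shows one way to fail with a readable message instead. It is an assumption, not part of the original code: parse_photos_body is a hypothetical helper, and its two arguments stand in for response.status_code and response.text.

```python
import json


def parse_photos_body(status_code: int, body: str) -> dict:
    """Decode a response body, raising a readable error if it is unusable."""
    if status_code != 200:
        raise RuntimeError(f"unexpected HTTP status {status_code}")
    try:
        data = json.loads(body)
    except json.JSONDecodeError as exc:
        raise RuntimeError("response body is not JSON") from exc
    if not isinstance(data, dict) or "photos" not in data:
        raise RuntimeError("JSON has no 'photos' key")
    return data


# Hypothetical bodies standing in for response.status_code and response.text.
ok = parse_photos_body(200, '{"photos": []}')
print(ok)

try:
    parse_photos_body(403, "<html>blocked</html>")
except RuntimeError as err:
    print("request failed:", err)
```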
3. Extract the image URLs and save the images to a local folder.
    def get_imageUrl(self, html):
        global filename
        content_list = html['photos']
        for content in content_list:
            # each photo carries a list of image URLs
            image_url = content['image_url']
            # print(image_url[8])
            imageUrl = image_url[8]
            r = requests.get(imageUrl, headers=self.headers)
            # the target folder must already exist
            with open('F:/pycharm文件/photo/' + str(filename) + '.jpg', 'wb') as f:
                f.write(r.content)
                filename += 1
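A note on imageUrl = image_url[8]: each photo's image_url field holds several URLs, so one of them has to be picked by index, and index 8 is the one used here.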
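A fixed index raises IndexError if a photo ever comes back with fewer URLs. The following sketch separates the URL selection from the download and skips such photos. The helper name pick_image_urls and the sample payload are hypothetical; the payload only mimics the shape the article relies on, where payload['photos'][n]['image_url'] is a list.

```python
def pick_image_urls(payload: dict, index: int = 8) -> list:
    """Collect one URL per photo, skipping photos that lack the index."""
    urls = []
    for photo in payload.get("photos", []):
        candidates = photo.get("image_url", [])
        if index < len(candidates):
            urls.append(candidates[index])
    return urls


# Hypothetical payload with the same shape as the API response.
sample = {
    "photos": [
        {"image_url": [f"https://example.com/a/{i}.jpg" for i in range(11)]},
        {"image_url": ["https://example.com/b/0.jpg"]},  # too short, skipped
    ]
}

print(pick_image_urls(sample))
```

Keeping the selection in its own function also makes it easy to try a different index without touching the download loop.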

4. Fetch multiple pages and call the functions.
    def main(self):
        start = int(input('Start page: '))
        end = int(input('End page: '))
        for page in range(start, end + 1):
            print('Page %s' % page)
            url = self.url.format(page)  # fill {} with the page number
            html = self.get_html(url)
            self.get_imageUrl(html)
            print('Page %s done' % page)
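The page loop can be checked without any network access by generating the URLs on their own. This is a minimal sketch: page_urls is a hypothetical helper, the template is a shortened stand-in for self.url, and the rule that pages start at 1 is an assumption based on the captured URLs.

```python
def page_urls(template: str, start: int, end: int):
    """Yield (page, url) pairs for every page from start to end inclusive."""
    # Assumption: the API numbers pages from 1, as in the captured URLs.
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    for page in range(start, end + 1):
        yield page, template.format(page)


# Shortened stand-in for self.url; the real template has more parameters.
template = "https://api.500px.com/v1/photos?feature=popular&page={}&rpp=50"

pairs = list(page_urls(template, 2, 4))
for page, url in pairs:
    print(page, url)
```

Validating the range up front means a mistyped start or end page fails immediately instead of silently downloading nothing.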
Results
Open the local folder F:/pycharm文件/photo/ to see the downloaded images.

Complete code
import requests
from fake_useragent import UserAgent

filename = 0  # running counter used to name the saved images


class photo_spider(object):
    def __init__(self):
        # API URL template; {} is filled in with the page number
        self.url = 'https://api.500px.com/v1/photos?rpp=50&feature=popular&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page={}&rpp=50'
        ua = UserAgent()
        # Pick a random User-Agent for the request headers
        self.headers = {
            'User-Agent': ua.random
        }

    def get_html(self, url):
        response = requests.get(url, headers=self.headers)
        html = response.json()  # the dynamically loaded data arrives as JSON
        return html

    def get_imageUrl(self, html):
        global filename
        content_list = html['photos']
        for content in content_list:
            # each photo carries a list of image URLs
            image_url = content['image_url']
            # print(image_url[8])
            imageUrl = image_url[8]
            r = requests.get(imageUrl, headers=self.headers)
            # the target folder must already exist
            with open('F:/pycharm文件/photo/' + str(filename) + '.jpg', 'wb') as f:
                f.write(r.content)
                filename += 1

    def main(self):
        start = int(input('Start page: '))
        end = int(input('End page: '))
        for page in range(start, end + 1):
            print('Page %s' % page)
            url = self.url.format(page)  # fill {} with the page number
            html = self.get_html(url)
            self.get_imageUrl(html)


if __name__ == '__main__':
    spider = photo_spider()
    spider.main()
