
        Scraping the Qidian Novel Site with Python

        4,347 characters, about a 9-minute read


        2020-12-24 07:25

        Goal

        Scrape a xianxia (immortal-hero fantasy) novel, download it, and save it locally as txt files. This example uses “大周仙吏”.

        Project setup

        Software: PyCharm

        Third-party libraries: requests, fake_useragent, lxml

        Site URL: https://book.qidian.com

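        Site analysis

        Open the site and go to the novel's catalog.

        The URL becomes: https://book.qidian.com/info/1020580616#Catalog

        To check whether the page is statically loaded, press Ctrl+U to view the page source, press Ctrl+F to open the search box, and search for 第一章 ("Chapter 1").

        The text is found in the source, so the page is statically loaded.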
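
The same check can be scripted. The following is a minimal sketch rather than part of the original tutorial: the helper name `is_statically_loaded` and both sample HTML strings are hypothetical. It only tests whether a marker such as 第一章 already appears in the raw HTML returned by the server, before any JavaScript runs.

```python
def is_statically_loaded(raw_html: str, marker: str = "第一章") -> bool:
    """Return True if the marker text is already present in the raw HTML."""
    return marker in raw_html


# Hypothetical page sources, standing in for what Ctrl+U would show.
static_page = '<ul class="cf"><li><a href="//example.com/c1">第一章 Sample</a></li></ul>'
dynamic_page = '<div id="app"></div><script src="bundle.js"></script>'

print(is_statically_loaded(static_page))   # True
print(is_statically_loaded(dynamic_page))  # False
```

If the marker is missing from the raw HTML but visible in the browser, the content is probably loaded by a script, and a plain HTTP request would not be enough.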

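        Anti-scraping analysis

        Repeated requests from the same IP address risk getting that address blocked. To look less like a script, this example uses fake_useragent to generate a random User-Agent request header. Note that this changes only the reported browser, not the IP address.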
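
To illustrate the idea without any third-party dependency, here is a small sketch that picks a User-Agent at random for every request. The `USER_AGENTS` pool and the `random_headers` helper are assumptions made for this example; in the tutorial itself the strings come from fake_useragent.

```python
import random

# Hypothetical pool of User-Agent strings; the tutorial gets these from fake_useragent.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/14.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0",
]


def random_headers() -> dict:
    """Build request headers with a User-Agent picked at random on each call."""
    return {"User-Agent": random.choice(USER_AGENTS)}


headers = random_headers()
print(headers["User-Agent"])
```

Passing `headers=random_headers()` to each `requests.get` call would give every request a fresh value, instead of fixing one value when the spider is created.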

        Code implementation

        1. Import the third-party libraries and define a class that inherits from object, with an __init__ method and a main method that each take self.
        import requests
        from fake_useragent import UserAgent
        from lxml import etree
        class qidian(object):
            def __init__(self):
                self.url = 'https://book.qidian.com/info/1020580616#Catalog'
                # verify_ssl=False is kept from the original; newer fake_useragent releases may not accept it
                ua = UserAgent(verify_ssl=False)
                # One random User-Agent is enough; the original loop only kept its last value
                self.headers = {
                    'User-Agent': ua.random
                }
            def main(self):
                pass
        if __name__ == '__main__':
            spider = qidian()
            spider.main()
        2. Send the request and fetch the page.
            def get_html(self, url):
                # Fetch the page with the random User-Agent and decode the body as UTF-8
                response = requests.get(url, headers=self.headers)
                html = response.content.decode('utf-8')
                return html
        3. Get the link to each chapter.
        import requests
        from lxml import etree
        from fake_useragent import UserAgent
        class qidian(object):
            def __init__(self):
                self.url = 'https://book.qidian.com/info/1020580616#Catalog'
                # verify_ssl=False is kept from the original; newer fake_useragent releases may not accept it
                ua = UserAgent(verify_ssl=False)
                # One random User-Agent is enough; the original loop only kept its last value
                self.headers = {
                    'User-Agent': ua.random
                }
            def get_html(self, url):
                response = requests.get(url, headers=self.headers)
                html = response.content.decode('utf-8')
                return html
            def parse_html(self, html):
                target = etree.HTML(html)
                links = target.xpath('//ul[@class="cf"]/li/a/@href')  # chapter links
                names = target.xpath('//ul[@class="cf"]/li/a/text()')  # chapter titles
                for link, name in zip(links, names):
                    # The hrefs lack a scheme, so prepend 'https:'
                    print(name + '\t' + 'https:' + link)
            def main(self):
                url = self.url
                html = self.get_html(url)
                self.parse_html(html)
        if __name__ == '__main__':
            spider = qidian()
            spider.main()

        Printed result: one line per chapter, with the chapter title, a tab, and the full chapter URL.
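
The tutorial builds each chapter URL by prepending `'https:'`, which implies that the hrefs are protocol-relative (`//host/path`). As an alternative sketch, `urllib.parse.urljoin` from the standard library resolves such links against the catalog URL and also copes with absolute or root-relative hrefs. The `absolute_link` helper and the sample href are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://book.qidian.com/info/1020580616#Catalog"


def absolute_link(href: str, base: str = BASE_URL) -> str:
    """Resolve an href against the catalog URL; '//host/path' inherits the https scheme."""
    return urljoin(base, href)


# Hypothetical href in the protocol-relative form implied by the tutorial's code.
print(absolute_link("//read.qidian.com/chapter/example"))
# https://read.qidian.com/chapter/example
```

Inside `parse_html`, `absolute_link(link)` could stand in for `'https:' + link`.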

        4. Follow each link and fetch the content of each chapter.
            def parse_html(self, html):
                target = etree.HTML(html)
                links = target.xpath('//ul[@class="cf"]/li/a/@href')
                for link in links:
                    host = 'https:' + link
                    # Request the chapter page and parse it
                    res = requests.get(host, headers=self.headers)
                    c = res.content.decode('utf-8')
                    page = etree.HTML(c)
                    names = page.xpath('//span[@class="content-wrap"]/text()')  # chapter title
                    results = page.xpath('//div[@class="read-content j_readContent"]/p/text()')  # paragraphs
                    for name in names:
                        print(name)
                    for result in results:
                        print(result)

        Printed result: each chapter title followed by its paragraphs. (The output is long, so only part of it was shown.)
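
Text nodes returned by a `p/text()` XPath often carry leading indentation or whitespace-only entries. The sketch below shows one way to tidy such a list before printing or saving it. The `clean_paragraphs` helper and the sample data are hypothetical, and the assumption that chapter paragraphs start with full-width spaces (U+3000) is not confirmed by the original.

```python
def clean_paragraphs(raw_paragraphs):
    """Strip surrounding whitespace (including U+3000) and drop empty paragraphs."""
    cleaned = []
    for text in raw_paragraphs:
        text = text.strip()  # str.strip() also removes U+3000 ideographic spaces
        if text:
            cleaned.append(text)
    return cleaned


# Hypothetical sample shaped like the list returned by the p/text() XPath.
sample = ["\u3000\u3000First paragraph.", "\n", "\u3000\u3000Second paragraph.  "]
print(clean_paragraphs(sample))
# ['First paragraph.', 'Second paragraph.']
```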

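        5. Save the chapters locally as txt files.
        # Append each paragraph to a file named after the chapter; UTF-8 keeps the Chinese text intact
        with open('F:/pycharm文件/document/' + name + '.txt', 'a', encoding='utf-8') as f:
            for result in results:
                # print(result)
                f.write(result + '\n')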

        Result: opening the output directory shows the saved txt files, one per chapter title.
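
Chapter titles can contain characters that are not allowed in file names, and the hard-coded `F:/` path only exists on the author's machine. The following sketch shows one possible way to handle both. The `safe_filename` and `save_chapter` helpers are hypothetical, and the demo writes to a temporary directory with made-up chapter data.

```python
import re
import tempfile
from pathlib import Path


def safe_filename(name: str) -> str:
    """Replace characters that Windows forbids in file names and trim whitespace."""
    return re.sub(r'[\\/:*?"<>|]', "_", name).strip()


def save_chapter(out_dir: Path, name: str, paragraphs) -> Path:
    """Write one chapter to '<out_dir>/<name>.txt' as UTF-8, one paragraph per line."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / (safe_filename(name) + ".txt")
    with open(path, "w", encoding="utf-8") as f:
        for paragraph in paragraphs:
            f.write(paragraph + "\n")
    return path


# Demo with a temporary directory and made-up chapter data.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_chapter(Path(tmp), "Chapter 1: Sample?", ["line one", "line two"])
    print(saved.name)  # Chapter 1_ Sample_.txt
```

This sketch opens the file in `"w"` mode, so re-running it overwrites a chapter instead of appending duplicate text, which is what the tutorial's `'a'` mode would do.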

        Full code

        import requests
        from lxml import etree
        from fake_useragent import UserAgent
        class qidian(object):
            def __init__(self):
                self.url = 'https://book.qidian.com/info/1020580616#Catalog'
                # verify_ssl=False is kept from the original; newer fake_useragent releases may not accept it
                ua = UserAgent(verify_ssl=False)
                # One random User-Agent is enough; the original loop only kept its last value
                self.headers = {
                    'User-Agent': ua.random
                }
            def get_html(self, url):
                # Fetch a page and decode the body as UTF-8
                response = requests.get(url, headers=self.headers)
                html = response.content.decode('utf-8')
                return html
            def parse_html(self, html):
                target = etree.HTML(html)
                links = target.xpath('//ul[@class="cf"]/li/a/@href')  # chapter links
                for link in links:
                    host = 'https:' + link
                    # Request the chapter page and parse it
                    res = requests.get(host, headers=self.headers)
                    c = res.content.decode('utf-8')
                    page = etree.HTML(c)
                    names = page.xpath('//span[@class="content-wrap"]/text()')  # chapter title
                    results = page.xpath('//div[@class="read-content j_readContent"]/p/text()')  # paragraphs
                    for name in names:
                        print(name)
                        # Append the paragraphs to a txt file named after the chapter
                        with open('F:/pycharm文件/document/' + name + '.txt', 'a', encoding='utf-8') as f:
                            for result in results:
                                # print(result)
                                f.write(result + '\n')
            def main(self):
                url = self.url
                html = self.get_html(url)
                self.parse_html(html)
        if __name__ == '__main__':
            spider = qidian()
            spider.main()

