1. <strong id="7actg"></strong>
    2. <table id="7actg"></table>

    3. <address id="7actg"></address>
      <address id="7actg"></address>
      1. <object id="7actg"><tt id="7actg"></tt></object>

        用Python爬取芒騰B愛(ài)知微幾大平臺(tái)的彈幕、評(píng)論

        共 6197字,需瀏覽 13分鐘

         ·

        2021-11-15 17:51

        大家好,歡迎來(lái)到 Crossin的編程教室 !

        學(xué)習(xí)爬蟲(chóng)和數(shù)據(jù)分析的時(shí)候,經(jīng)常需要用到網(wǎng)絡(luò)上的彈幕、評(píng)論等數(shù)據(jù)。今天就講解下如何用python爬取芒果TV、騰訊視頻、B站、愛(ài)奇藝、知乎、微博這幾個(gè)平臺(tái)的彈幕和評(píng)論。這類(lèi)數(shù)據(jù)一般用于娛樂(lè)、輿情分析,比如新出一部火爆的電影,爬取彈幕評(píng)論分析他為什么這么火;微博又出大瓜,爬取底下評(píng)論看看網(wǎng)友怎么說(shuō)等等。

        本文共涉及六個(gè)平臺(tái),十個(gè)爬蟲(chóng)案例,介紹順序?yàn)椋?span style="color: rgb(0, 0, 0);font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, "PingFang SC", Cambria, Cochin, Georgia, Times, "Times New Roman", serif;font-size: 16px;text-align: left;">芒果TV、騰訊視頻、B站、愛(ài)奇藝、知乎、微博。完整的實(shí)戰(zhàn)源碼已在文中,我們廢話(huà)不多說(shuō),下面開(kāi)始操作!
        特別說(shuō)明:本文內(nèi)容僅做編程技術(shù)交流,相關(guān)數(shù)據(jù)和代碼不可用于商業(yè)用途,否則后果自負(fù)。

        芒果TV

        這里以電影《懸崖之上》為例,講解如何爬取芒果TV視頻的彈幕和評(píng)論。

        網(wǎng)頁(yè)地址:

        https://www.mgtv.com/b/335313/12281642.html?fpa=15800&fpos=8&lastp=ch_movie

        彈幕

        ??分析網(wǎng)頁(yè)

        彈幕數(shù)據(jù)所在的文件是動(dòng)態(tài)加載的,需要進(jìn)入瀏覽器的開(kāi)發(fā)者工具進(jìn)行抓包,得到彈幕數(shù)據(jù)所在的真實(shí)url。當(dāng)視頻播放一分鐘它就會(huì)更新一個(gè)json數(shù)據(jù)包,里面包含我們需要的彈幕數(shù)據(jù)。得到的真實(shí)url:

        https://bullet-ali.hitv.com/bullet/2021/08/14/005323/12281642/0.json
        https://bullet-ali.hitv.com/bullet/2021/08/14/005323/12281642/1.json

        可以發(fā)現(xiàn),每條url的差別在于后面的數(shù)字,首條url為0,后面的逐步遞增。視頻一共120:20分鐘,向上取整,也就是121條數(shù)據(jù)包。

        ??實(shí)戰(zhàn)代碼

        import?requests
        import?pandas?as?pd

        headers?=?{
        ????'user-agent':?'Mozilla/5.0?(Windows?NT?10.0;?Win64;?x64)?AppleWebKit/537.36?(KHTML,?like?Gecko)?Chrome/91.0.4472.124?Safari/537.36'
        }
        df?=?pd.DataFrame()
        for?e?in?range(0,?121):
        ????print(f'正在爬取第{e}頁(yè)')
        ????resposen?=?requests.get(f'https://bullet-ali.hitv.com/bullet/2021/08/3/004902/12281642/{e}.json',?headers=headers)
        ????#?直接用json提取數(shù)據(jù)
        ????for?i?in?resposen.json()['data']['items']:
        ????????ids?=?i['ids']??#?用戶(hù)id
        ????????content?=?i['content']??#?彈幕內(nèi)容
        ????????time?=?i['time']??#?彈幕發(fā)生時(shí)間
        ????????#?有些文件中不存在點(diǎn)贊數(shù)
        ????????try:??
        ????????????v2_up_count?=?i['v2_up_count']
        ????????except:
        ????????????v2_up_count?=?''
        ????????text?=?pd.DataFrame({'ids':?[ids],?'彈幕':?[content],?'發(fā)生時(shí)間':?[time]})
        ????????df?=?pd.concat([df,?text])
        df.to_csv('懸崖之上.csv',?encoding='utf-8',?index=False)

        結(jié)果展示:

        評(píng)論

        ??分析網(wǎng)頁(yè)

        芒果TV視頻的評(píng)論需要拉取到網(wǎng)頁(yè)下面進(jìn)行查看。評(píng)論數(shù)據(jù)所在的文件依然是動(dòng)態(tài)加載的,進(jìn)入開(kāi)發(fā)者工具,按下列步驟進(jìn)行抓包:Network→js,最后點(diǎn)擊查看更多評(píng)論。加載出來(lái)的依然是js文件,里面包含評(píng)論數(shù)據(jù)。得到的真實(shí)url:

        https://comment.mgtv.com/v4/comment/getCommentList?page=1&subjectType=hunantv2014&subjectId=12281642&callback=jQuery1820749973529821774_1628942431449&_support=10000000&_=1628943290494
        https://comment.mgtv.com/v4/comment/getCommentList?page=2&subjectType=hunantv2014&subjectId=12281642&callback=jQuery1820749973529821774_1628942431449&_support=10000000&_=1628943296653

        其中有差別的參數(shù)有page_,page是頁(yè)數(shù),_是時(shí)間戳;url中的時(shí)間戳刪除后不影響數(shù)據(jù)完整性,但里面的callback參數(shù)會(huì)干擾數(shù)據(jù)解析,所以進(jìn)行刪除。最后得到url:

        https://comment.mgtv.com/v4/comment/getCommentList?page=1&subjectType=hunantv2014&subjectId=12281642&_support=10000000

        數(shù)據(jù)包中每頁(yè)包含15條評(píng)論數(shù)據(jù),評(píng)論總數(shù)是2527,得到最大頁(yè)為169。??實(shí)戰(zhàn)代碼

        import?requests
        import?pandas?as?pd

        headers?=?{
        ????'user-agent':?'Mozilla/5.0?(Windows?NT?10.0;?Win64;?x64)?AppleWebKit/537.36?(KHTML,?like?Gecko)?Chrome/91.0.4472.124?Safari/537.36'
        }
        df?=?pd.DataFrame()
        for?o?in?range(1,?170):
        ????url?=?f'https://comment.mgtv.com/v4/comment/getCommentList?page={o}&subjectType=hunantv2014&subjectId=12281642&_support=10000000'
        ????res?=?requests.get(url,?headers=headers).json()
        ????for?i?in?res['data']['list']:
        ????????nickName?=?i['user']['nickName']??#?用戶(hù)昵稱(chēng)
        ????????praiseNum?=?i['praiseNum']??#?被點(diǎn)贊數(shù)
        ????????date?=?i['date']??#?發(fā)送日期
        ????????content?=?i['content']??#?評(píng)論內(nèi)容
        ????????text?=?pd.DataFrame({'nickName':?[nickName],?'praiseNum':?[praiseNum],?'date':?[date],?'content':?[content]})
        ????????df?=?pd.concat([df,?text])
        df.to_csv('懸崖之上.csv',?encoding='utf-8',?index=False)

        結(jié)果展示:

        騰訊視頻

        這里以爬取電影《革命者》為例,講解如何爬取騰訊視頻的彈幕和評(píng)論。

        網(wǎng)頁(yè)地址:

        https://v.qq.com/x/cover/mzc00200m72fcup.html

        彈幕

        ??分析網(wǎng)頁(yè)

        依然進(jìn)入瀏覽器的開(kāi)發(fā)者工具進(jìn)行抓包,當(dāng)視頻播放30秒它就會(huì)更新一個(gè)json數(shù)據(jù)包,里面包含我們需要的彈幕數(shù)據(jù)。得到真實(shí)url:

        https://mfm.video.qq.com/danmu?otype=json&callback=jQuery19109541041335587612_1628947050538&target_id=7220956568%26vid%3Dt0040z3o3la&session_key=0%2C32%2C1628947057×tamp=15&_=1628947050569
        https://mfm.video.qq.com/danmu?otype=json&callback=jQuery19109541041335587612_1628947050538&target_id=7220956568%26vid%3Dt0040z3o3la&session_key=0%2C32%2C1628947057×tamp=45&_=1628947050572

        其中有差別的參數(shù)有timestamp_。_是時(shí)間戳。timestamp是頁(yè)數(shù),首條url為15,后面以公差為30遞增,公差是以數(shù)據(jù)包更新時(shí)長(zhǎng)為基準(zhǔn),而最大頁(yè)數(shù)為視頻時(shí)長(zhǎng)7245秒。依然刪除不必要參數(shù),得到url:

        https://mfm.video.qq.com/danmu?otype=json&target_id=7220956568%26vid%3Dt0040z3o3la&session_key=0%2C18%2C1628418094×tamp=15&_=1628418086509

        ??實(shí)戰(zhàn)代碼

        import?pandas?as?pd
        import?time
        import?requests

        headers?=?{
        ????'User-Agent':?'Googlebot'
        }
        #?初始為15,7245?為視頻秒長(zhǎng),鏈接以三十秒遞增
        df?=?pd.DataFrame()
        for?i?in?range(15,?7245,?30):
        ????url?=?"https://mfm.video.qq.com/danmu?otype=json&target_id=7220956568%26vid%3Dt0040z3o3la&session_key=0%2C18%2C1628418094×tamp={}&_=1628418086509".format(i)
        ????html?=?requests.get(url,?headers=headers).json()
        ????time.sleep(1)
        ????for?i?in?html['comments']:
        ????????content?=?i['content']
        ????????print(content)
        ????????text?=?pd.DataFrame({'彈幕':?[content]})
        ????????df?=?pd.concat([df,?text])
        df.to_csv('革命者_(dá)彈幕.csv',?encoding='utf-8',?index=False)

        結(jié)果展示:

        評(píng)論

        ??分析網(wǎng)頁(yè)

        騰訊視頻評(píng)論數(shù)據(jù)在網(wǎng)頁(yè)底部,依然是動(dòng)態(tài)加載的,需要按下列步驟進(jìn)入開(kāi)發(fā)者工具進(jìn)行抓包:點(diǎn)擊查看更多評(píng)論后,得到的數(shù)據(jù)包含有我們需要的評(píng)論數(shù)據(jù),得到的真實(shí)url:

        https://video.coral.qq.com/varticle/6655100451/comment/v2?callback=_varticle6655100451commentv2&orinum=10&oriorder=o&pageflag=1&cursor=0&scorecursor=0&orirepnum=2&reporder=o&reppageflag=1&source=132&_=1628948867522
        https://video.coral.qq.com/varticle/6655100451/comment/v2?callback=_varticle6655100451commentv2&orinum=10&oriorder=o&pageflag=1&cursor=6786869637356389636&scorecursor=0&orirepnum=2&reporder=o&reppageflag=1&source=132&_=1628948867523

        url中的參數(shù)callback以及_刪除即可。重要的是參數(shù)cursor,第一條url參數(shù)cursor是等于0的,第二條url才出現(xiàn),所以要查找cursor參數(shù)是怎么出現(xiàn)的。經(jīng)過(guò)我的觀察,cursor參數(shù)其實(shí)是上一條url的last參數(shù):??實(shí)戰(zhàn)代碼

        import?requests
        import?pandas?as?pd
        import?time
        import?random

        headers?=?{
        ????'User-Agent':?'Mozilla/5.0?(Windows?NT?10.0;?Win64;?x64)?AppleWebKit/537.36?(KHTML,?like?Gecko)?Chrome/91.0.4472.124?Safari/537.36'
        }
        df?=?pd.DataFrame()
        a?=?1
        #?此處必須設(shè)定循環(huán)次數(shù),否則會(huì)無(wú)限重復(fù)爬取
        # 281為參照數(shù)據(jù)包中的oritotal,數(shù)據(jù)包中一共10條數(shù)據(jù),循環(huán)280次得到2800條數(shù)據(jù),但不包括底下回復(fù)的評(píng)論
        #?數(shù)據(jù)包中的commentnum,是包括回復(fù)的評(píng)論數(shù)據(jù)的總數(shù),而數(shù)據(jù)包都包含10條評(píng)論數(shù)據(jù)和底下的回復(fù)的評(píng)論數(shù)據(jù),所以只需要把2800除以10取整數(shù)+1即可!
        while?a?281:
        ????if?a?==?1:
        ????????url?=?'https://video.coral.qq.com/varticle/6655100451/comment/v2?orinum=10&oriorder=o&pageflag=1&cursor=0&scorecursor=0&orirepnum=2&reporder=o&reppageflag=1&source=132'
        ????else:
        ????????url?=?f'https://video.coral.qq.com/varticle/6655100451/comment/v2?orinum=10&oriorder=o&pageflag=1&cursor={cursor}&scorecursor=0&orirepnum=2&reporder=o&reppageflag=1&source=132'
        ????res?=?requests.get(url,?headers=headers).json()
        ????cursor?=?res['data']['last']
        ????for?i?in?res['data']['oriCommList']:
        ????????ids?=?i['id']
        ????????times?=?i['time']
        ????????up?=?i['up']
        ????????content?=?i['content'].replace('\n',?'')
        ????????text?=?pd.DataFrame({'ids':?[ids],?'times':?[times],?'up':?[up],?'content':?[content]})
        ????????df?=?pd.concat([df,?text])
        ????a?+=?1
        ????time.sleep(random.uniform(2,?3))
        ????df.to_csv('革命者_(dá)評(píng)論.csv',?encoding='utf-8',?index=False)

        效果展示:

        B站

        這里以爬取視頻《“ 這是我見(jiàn)過(guò)最拽的一屆中國(guó)隊(duì)奧運(yùn)冠軍”》為例,講解如何爬取B站視頻的彈幕和評(píng)論。

        網(wǎng)頁(yè)地址:

        https://www.bilibili.com/video/BV1wq4y1Q7dp

        彈幕

        ??分析網(wǎng)頁(yè)

        B站視頻的彈幕不像騰訊視頻那樣,播放視頻就會(huì)觸發(fā)彈幕數(shù)據(jù)包,他需要點(diǎn)擊網(wǎng)頁(yè)右側(cè)的彈幕列表行的展開(kāi),然后點(diǎn)擊查看歷史彈幕獲得視頻彈幕開(kāi)始日到截至日鏈接:鏈接末尾以oid以及開(kāi)始日期來(lái)構(gòu)成彈幕日期url:

        https://api.bilibili.com/x/v2/dm/history/index?type=1&oid=384801460&month=2021-08

        在上面的的基礎(chǔ)之上,點(diǎn)擊任一有效日期即可獲得這一日期的彈幕數(shù)據(jù)包,里面的內(nèi)容目前是看不懂的,之所以確定它為彈幕數(shù)據(jù)包,是因?yàn)辄c(diǎn)擊了日期他才加載出來(lái),且鏈接與前面的鏈接具有相關(guān)性:得到的url:

        https://api.bilibili.com/x/v2/dm/web/history/seg.so?type=1&oid=384801460&date=2021-08-08

        url中的oid為視頻彈幕鏈接的id值;data參數(shù)為剛才的的日期,而獲得該視頻全部彈幕內(nèi)容,只需要更改data參數(shù)即可。而data參數(shù)可以從上面的彈幕日期url獲得,也可以自行構(gòu)造;網(wǎng)頁(yè)數(shù)據(jù)格式為json格式

        ??實(shí)戰(zhàn)代碼

        import?requests
        import?pandas?as?pd
        import?re

        def?data_resposen(url):
        ????headers?=?{
        ????????"cookie":?"你的cookie",
        ????????"user-agent":?"Mozilla/5.0?(Windows?NT?10.0;?Win64;?x64)?AppleWebKit/537.36?(KHTML,?like?Gecko)?Chrome/88.0.4324.104?Safari/537.36"
        ????}
        ????resposen?=?requests.get(url,?headers=headers)
        ????return?resposen

        def?main(oid,?month):
        ????df?=?pd.DataFrame()
        ????url?=?f'https://api.bilibili.com/x/v2/dm/history/index?type=1&oid={oid}&month={month}'
        ????list_data?=?data_resposen(url).json()['data']??#?拿到所有日期
        ????print(list_data)
        ????for?data?in?list_data:
        ????????urls?=?f'https://api.bilibili.com/x/v2/dm/web/history/seg.so?type=1&oid={oid}&date={data}'
        ????????text?=?re.findall(".*?([\u4E00-\u9FA5]+).*?",?data_resposen(urls).text)
        ????????for?e?in?text:
        ????????????print(e)
        ????????????data?=?pd.DataFrame({'彈幕':?[e]})
        ????????????df?=?pd.concat([df,?data])
        ????df.to_csv('彈幕.csv',?encoding='utf-8',?index=False,?mode='a+')

        if?__name__?==?'__main__':
        ????oid?=?'384801460'??#?視頻彈幕鏈接的id值
        ????month?=?'2021-08'??#?開(kāi)始日期
        ????main(oid,?month)

        結(jié)果展示:

        評(píng)論

        ??分析網(wǎng)頁(yè)

        B站視頻的評(píng)論內(nèi)容在網(wǎng)頁(yè)下方,進(jìn)入瀏覽器的開(kāi)發(fā)者工具后,只需要向下拉取即可加載出數(shù)據(jù)包:得到真實(shí)url:

        https://api.bilibili.com/x/v2/reply/main?callback=jQuery1720034332372316460136_1629011550479&jsonp=jsonp&next=0&type=1&oid=589656273&mode=3&plat=1&_=1629012090500
        https://api.bilibili.com/x/v2/reply/main?callback=jQuery1720034332372316460136_1629011550483&jsonp=jsonp&next=2&type=1&oid=589656273&mode=3&plat=1&_=1629012513080
        https://api.bilibili.com/x/v2/reply/main?callback=jQuery1720034332372316460136_1629011550484&jsonp=jsonp&next=3&type=1&oid=589656273&mode=3&plat=1&_=1629012803039

        兩條urlnext參數(shù),以及_callback參數(shù)。_callback一個(gè)是時(shí)間戳,一個(gè)是干擾參數(shù),刪除即可。next參數(shù)第一條為0,第二條為2,第三條為3,所以第一條next參數(shù)固定為0,第二條開(kāi)始遞增;網(wǎng)頁(yè)數(shù)據(jù)格式為json格式。

        ??實(shí)戰(zhàn)代碼

        import?requests
        import?pandas?as?pd

        df?=?pd.DataFrame()
        headers?=?{
        ????'user-agent':?'Mozilla/5.0?(Windows?NT?10.0;?Win64;?x64)?AppleWebKit/537.36?(KHTML,?like?Gecko)?Chrome/86.0.4240.111?Safari/537.36'}
        try:
        ????a?=?1
        ????while?True:
        ????????if?a?==?1:
        ?????????#?刪除不必要參數(shù)得到的第一條url
        ????????????url?=?f'https://api.bilibili.com/x/v2/reply/main?&jsonp=jsonp&next=0&type=1&oid=589656273&mode=3&plat=1'
        ????????else:
        ????????????url?=?f'https://api.bilibili.com/x/v2/reply/main?&jsonp=jsonp&next={a}&type=1&oid=589656273&mode=3&plat=1'
        ????????print(url)
        ????????html?=?requests.get(url,?headers=headers).json()
        ????????for?i?in?html['data']['replies']:
        ????????????uname?=?i['member']['uname']??#?用戶(hù)名稱(chēng)
        ????????????sex?=?i['member']['sex']??#?用戶(hù)性別
        ????????????mid?=?i['mid']??#?用戶(hù)id
        ????????????current_level?=?i['member']['level_info']['current_level']??#?vip等級(jí)
        ????????????message?=?i['content']['message'].replace('\n',?'')??#?用戶(hù)評(píng)論
        ????????????like?=?i['like']??#?評(píng)論點(diǎn)贊次數(shù)
        ????????????ctime?=?i['ctime']??#?評(píng)論時(shí)間
        ????????????data?=?pd.DataFrame({'用戶(hù)名稱(chēng)':?[uname],?'用戶(hù)性別':?[sex],?'用戶(hù)id':?[mid],
        ?????????????????????????????????'vip等級(jí)':?[current_level],?'用戶(hù)評(píng)論':?[message],?'評(píng)論點(diǎn)贊次數(shù)':?[like],
        ?????????????????????????????????'評(píng)論時(shí)間':?[ctime]})
        ????????????df?=?pd.concat([df,?data])
        ????????a?+=?1
        except?Exception?as?e:
        ????print(e)
        df.to_csv('奧運(yùn)會(huì).csv',?encoding='utf-8')
        print(df.shape)

        結(jié)果展示,獲取的內(nèi)容不包括二級(jí)評(píng)論,如果需要,可自行爬取,操作步驟差不多:

        愛(ài)奇藝

        這里以爬取電影《哥斯拉大戰(zhàn)金剛》為例,講解如何爬愛(ài)奇藝視頻的彈幕和評(píng)論。

        網(wǎng)頁(yè)地址:

        https://www.iqiyi.com/v_19rr0m845o.html

        彈幕

        ??分析網(wǎng)頁(yè)

        愛(ài)奇藝視頻的彈幕依然是要進(jìn)入開(kāi)發(fā)者工具進(jìn)行抓包,得到一個(gè)br壓縮文件,點(diǎn)擊可以直接下載,里面的內(nèi)容是二進(jìn)制數(shù)據(jù),視頻每播放一分鐘,就加載一條數(shù)據(jù)包:得到url,兩條url差別在于遞增的數(shù)字,60為視頻每60秒更新一次數(shù)據(jù)包:

        https://cmts.iqiyi.com/bullet/64/00/1078946400_60_1_b2105043.br
        https://cmts.iqiyi.com/bullet/64/00/1078946400_60_2_b2105043.br

        br文件可以用brotli庫(kù)進(jìn)行解壓,但實(shí)際操作起來(lái)很難,特別是編碼等問(wèn)題,難以解決;在直接使用utf-8進(jìn)行解碼時(shí),會(huì)報(bào)以下錯(cuò)誤:

        UnicodeDecodeError:?'utf-8'?codec?can't?decode?byte?0x91?in?position?52:?invalid?start?byte

        在解碼中加入ignore,中文不會(huì)亂碼,但html格式出現(xiàn)亂碼,數(shù)據(jù)提取依然很難:

        decode("utf-8",?"ignore")

        有興趣的小伙伴可以對(duì)上面的內(nèi)容做進(jìn)一步處理,本文就不再繼續(xù)深入。這里采用另一個(gè)方法,對(duì)得到url進(jìn)行修改成以下鏈接而獲得.z壓縮文件:

        https://cmts.iqiyi.com/bullet/64/00/1078946400_300_1.z

        之所以如此更改,是因?yàn)檫@是愛(ài)奇藝以前的彈幕接口鏈接,他還未刪除或修改,目前還可以使用。該接口鏈接中1078946400是視頻id;300是以前愛(ài)奇藝的彈幕每5分鐘會(huì)加載出新的彈幕數(shù)據(jù)包,5分鐘就是300秒,《哥斯拉大戰(zhàn)金剛》時(shí)長(zhǎng)112.59分鐘,除以5向上取整就是23;1是頁(yè)數(shù);64為id值的第7為和第8為數(shù)。

        ??實(shí)戰(zhàn)代碼

        import?requests
        import?pandas?as?pd
        from?lxml?import?etree
        from?zlib?import?decompress??#?解壓

        df?=?pd.DataFrame()
        for?i?in?range(1,?23):
        ????url?=?f'https://cmts.iqiyi.com/bullet/64/00/1078946400_300_{i}.z'
        ????bulletold?=?requests.get(url).content??#?得到二進(jìn)制數(shù)據(jù)
        ????decode?=?decompress(bulletold).decode('utf-8')??#?解壓解碼
        ????with?open(f'{i}.html',?'a+',?encoding='utf-8')?as?f:??#?保存為靜態(tài)的html文件
        ????????f.write(decode)

        ????html?=?open(f'./{i}.html',?'rb').read()??#?讀取html文件
        ????html?=?etree.HTML(html)??#?用xpath語(yǔ)法進(jìn)行解析網(wǎng)頁(yè)
        ????ul?=?html.xpath('/html/body/danmu/data/entry/list/bulletinfo')
        ????for?i?in?ul:
        ????????contentid?=?''.join(i.xpath('./contentid/text()'))
        ????????content?=?''.join(i.xpath('./content/text()'))
        ????????likeCount?=?''.join(i.xpath('./likecount/text()'))
        ????????print(contentid,?content,?likeCount)
        ????????text?=?pd.DataFrame({'contentid':?[contentid],?'content':?[content],?'likeCount':?[likeCount]})
        ????????df?=?pd.concat([df,?text])
        df.to_csv('哥斯拉大戰(zhàn)金剛.csv',?encoding='utf-8',?index=False)

        結(jié)果展示:

        評(píng)論

        ??分析網(wǎng)頁(yè)

        愛(ài)奇藝視頻的評(píng)論在網(wǎng)頁(yè)下方,依然是動(dòng)態(tài)加載的內(nèi)容,需要進(jìn)入瀏覽器的開(kāi)發(fā)者工具進(jìn)行抓包,當(dāng)網(wǎng)頁(yè)下拉取時(shí),會(huì)加載一條數(shù)據(jù)包,里面包含評(píng)論數(shù)據(jù):得到的真實(shí)url:

        https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&authcookie=null&business_type=17&channel_id=1&content_id=1078946400&hot_size=10&last_id=&page=&page_size=10&types=hot,time&callback=jsonp_1629025964363_15405
        https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&authcookie=null&business_type=17&channel_id=1&content_id=1078946400&hot_size=0&last_id=7963601726142521&page=&page_size=20&types=time&callback=jsonp_1629026041287_28685
        https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&authcookie=null&business_type=17&channel_id=1&content_id=1078946400&hot_size=0&last_id=4933019153543021&page=&page_size=20&types=time&callback=jsonp_1629026394325_81937

        第一條url加載的是精彩評(píng)論的內(nèi)容,第二條url開(kāi)始加載的是全部評(píng)論的內(nèi)容。經(jīng)過(guò)刪減不必要參數(shù)得到以下url:

        https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&business_type=17&content_id=1078946400&last_id=&page_size=10
        https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&business_type=17&content_id=1078946400&last_id=7963601726142521&page_size=20
        https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&business_type=17&content_id=1078946400&last_id=4933019153543021&page_size=20

        區(qū)別在于參數(shù)last_idpage_size。page_size在第一條url中的值為10,從第二條url開(kāi)始固定為20。last_id在首條url中值為空,從第二條開(kāi)始會(huì)不斷發(fā)生變化,經(jīng)過(guò)我的研究,last_id的值就是從前一條url中的最后一條評(píng)論內(nèi)容的用戶(hù)id(應(yīng)該是用戶(hù)id);網(wǎng)頁(yè)數(shù)據(jù)格式為json格式。

        ??實(shí)戰(zhàn)代碼

        import?requests
        import?pandas?as?pd
        import?time
        import?random


        headers?=?{
        ????'User-Agent':?'Mozilla/5.0?(Windows?NT?10.0;?Win64;?x64)?AppleWebKit/537.36?(KHTML,?like?Gecko)?Chrome/91.0.4472.124?Safari/537.36'
        }
        df?=?pd.DataFrame()
        try:
        ????a?=?0
        ????while?True:
        ????????if?a?==?0:
        ????????????url?=?'https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&business_type=17&content_id=1078946400&page_size=10'
        ????????else:
        ????????????#?從id_list中得到上一條頁(yè)內(nèi)容中的最后一個(gè)id值
        ????????????url?=?f'https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&business_type=17&content_id=1078946400&last_id={id_list[-1]}&page_size=20'
        ????????print(url)
        ????????res?=?requests.get(url,?headers=headers).json()
        ????????id_list?=?[]??#?建立一個(gè)列表保存id值
        ????????for?i?in?res['data']['comments']:
        ????????????ids?=?i['id']
        ????????????id_list.append(ids)
        ????????????uname?=?i['userInfo']['uname']
        ????????????addTime?=?i['addTime']
        ????????????content?=?i.get('content',?'不存在')??#?用get提取是為了防止鍵值不存在而發(fā)生報(bào)錯(cuò),第一個(gè)參數(shù)為匹配的key值,第二個(gè)為缺少時(shí)輸出
        ????????????text?=?pd.DataFrame({'ids':?[ids],?'uname':?[uname],?'addTime':?[addTime],?'content':?[content]})
        ????????????df?=?pd.concat([df,?text])
        ????????a?+=?1
        ????????time.sleep(random.uniform(2,?3))
        except?Exception?as?e:
        ????print(e)
        df.to_csv('哥斯拉大戰(zhàn)金剛_評(píng)論.csv',?mode='a+',?encoding='utf-8',?index=False)

        結(jié)果展示:

        知乎

        這里以爬取知乎熱點(diǎn)話(huà)題《如何看待網(wǎng)傳騰訊實(shí)習(xí)生向騰訊高層提出建議頒布拒絕陪酒相關(guān)條令?》為例,講解如爬取知乎回答。

        網(wǎng)頁(yè)地址:

        https://www.zhihu.com/question/478781972

        ??分析網(wǎng)頁(yè)

        經(jīng)過(guò)查看網(wǎng)頁(yè)源代碼等方式,確定該網(wǎng)頁(yè)回答內(nèi)容為動(dòng)態(tài)加載的,需要進(jìn)入瀏覽器的開(kāi)發(fā)者工具進(jìn)行抓包。進(jìn)入Noetwork→XHR,用鼠標(biāo)在網(wǎng)頁(yè)向下拉取,得到我們需要的數(shù)據(jù)包:得到的真實(shí)url:

        https://www.zhihu.com/api/v4/questions/478781972/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=5&offset=0&platform=desktop&sort_by=default
        https://www.zhihu.com/api/v4/questions/478781972/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=5&offset=5&platform=desktop&sort_by=default

        url有很多不必要的參數(shù),大家可以在瀏覽器中自行刪減。兩條url的區(qū)別在于后面的offset參數(shù),首條url的offset參數(shù)為0,第二條為5,offset是以公差為5遞增;網(wǎng)頁(yè)數(shù)據(jù)格式為json格式。

        ??實(shí)戰(zhàn)代碼

        import?requests
        import?pandas?as?pd
        import?re
        import?time
        import?random

        df?=?pd.DataFrame()
        headers?=?{
        ????'user-agent':?'Mozilla/5.0?(Windows?NT?10.0;?Win64;?x64)?AppleWebKit/537.36?(KHTML,?like?Gecko)?Chrome/81.0.4044.138?Safari/537.36'
        }
        for?page?in?range(0,?1360,?5):
        ????url?=?f'https://www.zhihu.com/api/v4/questions/478781972/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=5&offset={page}&platform=desktop&sort_by=default'
        ????response?=?requests.get(url=url,?headers=headers).json()
        ????data?=?response['data']
        ????for?list_?in?data:
        ????????name?=?list_['author']['name']??#?知乎作者
        ????????id_?=?list_['author']['id']??#?作者id
        ????????created_time?=?time.strftime("%Y-%m-%d?%H:%M:%S",?time.localtime(list_['created_time']?))?#?回答時(shí)間
        ????????voteup_count?=?list_['voteup_count']??#?贊同數(shù)
        ????????comment_count?=?list_['comment_count']??#?底下評(píng)論數(shù)
        ????????content?=?list_['content']??#?回答內(nèi)容
        ????????content?=?''.join(re.findall("[\u3002\uff1b\uff0c\uff1a\u201c\u201d\uff08\uff09\u3001\uff1f\u300a\u300b\u4e00-\u9fa5]",?content))??#?正則表達(dá)式提取
        ????????print(name,?id_,?created_time,?comment_count,?content,?sep='|')
        ????????dataFrame?=?pd.DataFrame(
        ????????????{'知乎作者':?[name],?'作者id':?[id_],?'回答時(shí)間':?[created_time],?'贊同數(shù)':?[voteup_count],?'底下評(píng)論數(shù)':?[comment_count],
        ?????????????'回答內(nèi)容':?[content]})
        ????????df?=?pd.concat([df,?dataFrame])
        ????time.sleep(random.uniform(2,?3))
        df.to_csv('知乎回答.csv',?encoding='utf-8',?index=False)
        print(df.shape)

        結(jié)果展示:

        微博

        這里以爬取微博熱搜《霍尊手寫(xiě)道歉信》為例,講解如何爬取微博評(píng)論。

        網(wǎng)頁(yè)地址:

        https://m.weibo.cn/detail/4669040301182509

        ??分析網(wǎng)頁(yè)

        微博評(píng)論是動(dòng)態(tài)加載的,進(jìn)入瀏覽器的開(kāi)發(fā)者工具后,在網(wǎng)頁(yè)上向下拉取會(huì)得到我們需要的數(shù)據(jù)包:得到真實(shí)url:

        https://m.weibo.cn/comments/hotflow?id=4669040301182509&mid=4669040301182509&max_id_type=0
        https://m.weibo.cn/comments/hotflow?id=4669040301182509&mid=4669040301182509&max_id=3698934781006193&max_id_type=0

        兩條url區(qū)別很明顯,首條url是沒(méi)有參數(shù)max_id的,第二條開(kāi)始max_id才出現(xiàn),而max_id其實(shí)是前一條數(shù)據(jù)包中的max_id:但有個(gè)需要注意的是參數(shù)max_id_type,它其實(shí)也是會(huì)變化的,所以我們需要從數(shù)據(jù)包中獲取max_id_type:

        ??實(shí)戰(zhàn)代碼

        import?re
        import?requests
        import?pandas?as?pd
        import?time
        import?random

        df?=?pd.DataFrame()
        try:
        ????a?=?1
        ????while?True:
        ????????header?=?{
        ????????????'User-Agent':?'Mozilla/5.0?(Windows?NT?6.1;?WOW64)?AppleWebKit/537.36?(KHTML,?like?Gecko)?Chrome/38.0.2125.122?UBrowser/4.0.3214.0?Safari/537.36'
        ????????}
        ????????resposen?=?requests.get('https://m.weibo.cn/detail/4669040301182509',?headers=header)
        ????????#?微博爬取大概幾十頁(yè)會(huì)封賬號(hào)的,而通過(guò)不斷的更新cookies,會(huì)讓爬蟲(chóng)更持久點(diǎn)...
        ????????cookie?=?[cookie.value?for?cookie?in?resposen.cookies]??#?用列表推導(dǎo)式生成cookies部件
        ????????headers?=?{
        ?????????#?登錄后的cookie,?SUB用登錄后的
        ????????????'cookie':?f'WEIBOCN_FROM={cookie[3]};?SUB=;?_T_WM={cookie[4]};?MLOGIN={cookie[1]};?M_WEIBOCN_PARAMS={cookie[2]};?XSRF-TOKEN={cookie[0]}',
        ????????????'referer':?'https://m.weibo.cn/detail/4669040301182509',
        ????????????'User-Agent':?'Mozilla/5.0?(Windows?NT?6.1;?WOW64)?AppleWebKit/537.36?(KHTML,?like?Gecko)?Chrome/38.0.2125.122?UBrowser/4.0.3214.0?Safari/537.36'
        ????????}
        ????????if?a?==?1:
        ????????????url?=?'https://m.weibo.cn/comments/hotflow?id=4669040301182509&mid=4669040301182509&max_id_type=0'
        ????????else:
        ????????????url?=?f'https://m.weibo.cn/comments/hotflow?id=4669040301182509&mid=4669040301182509&max_id={max_id}&max_id_type={max_id_type}'

        ????????html?=?requests.get(url=url,?headers=headers).json()
        ????????data?=?html['data']
        ????????max_id?=?data['max_id']??#?獲取max_id和max_id_type返回給下一條url
        ????????max_id_type?=?data['max_id_type']
        ????????for?i?in?data['data']:
        ????????????screen_name?=?i['user']['screen_name']
        ????????????i_d?=?i['user']['id']
        ????????????like_count?=?i['like_count']??#?點(diǎn)贊數(shù)
        ????????????created_at?=?i['created_at']??#?時(shí)間
        ????????????text?=?re.sub(r'<[^>]*>',?'',?i['text'])??#?評(píng)論
        ????????????print(text)
        ????????????data_json?=?pd.DataFrame({'screen_name':?[screen_name],?'i_d':?[i_d],?'like_count':?[like_count],?'created_at':?[created_at],'text':?[text]})
        ????????????df?=?pd.concat([df,?data_json])
        ????????time.sleep(random.uniform(2,?7))
        ????????a?+=?1
        except?Exception?as?e:
        ????print(e)

        df.to_csv('微博.csv',?encoding='utf-8',?mode='a+',?index=False)
        print(df.shape)

        結(jié)果展示:

        以上便是今天的全部?jī)?nèi)容了,如果文章對(duì)你有幫助,歡迎轉(zhuǎn)發(fā)/點(diǎn)贊/收藏~

        作者:小刀

        來(lái)源:Python與Excel之交

        _往期文章推薦_

        Python抓取網(wǎng)頁(yè)亂碼原因分析




        如需了解付費(fèi)精品課程教學(xué)答疑服務(wù)
        請(qǐng)?jiān)?strong style="max-width: 100%;overflow-wrap: break-word !important;box-sizing: border-box !important;">Crossin的編程教室內(nèi)回復(fù): 666

        瀏覽 79
        點(diǎn)贊
        評(píng)論
        收藏
        分享

        手機(jī)掃一掃分享

        分享
        舉報(bào)
        評(píng)論
        圖片
        表情
        推薦
        點(diǎn)贊
        評(píng)論
        收藏
        分享

        手機(jī)掃一掃分享

        分享
        舉報(bào)
        1. <strong id="7actg"></strong>
        2. <table id="7actg"></table>

        3. <address id="7actg"></address>
          <address id="7actg"></address>
          1. <object id="7actg"><tt id="7actg"></tt></object>
            啊轻点灬太粗嗯太深了视频在线观看 | 另类丅s人妖一区二区 | 小早川怜子乳巨码无在线 | 国产A片电影 | 男人添女荫道口免费视频 | 天天谢天天拍夜夜爽 | 成人视频在线播放 | wwwsesese | 五月天色婷婷婷 | 成人在线性爱视频 |