arXiv API + GitHub Actions: Fetching arXiv Paper Abstracts Automatically Every Day

        2021-11-20 02:28

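        Hello everyone. Today I would like to share a practical tool that helps you keep track of the latest papers. I have also set up an instance of my own: https://github.com/DWCTOD/cv-arxiv-daily

        Anyone who follows the latest developments in academia is probably familiar with arXiv. It is the world's largest open-access platform for scholarly articles and currently hosts nearly 2 million articles across 8 subject areas [1]. Researchers often post forthcoming papers on arXiv so that peers can read and comment on them, which has greatly promoted openness and collaboration in academia.

        The sheer number of articles is overwhelming and makes it hard to quickly find those in your own area of interest. The author recently combined the arXiv API [2] with GitHub Actions [3] to automatically fetch articles on a chosen topic from arXiv every day and publish them on GitHub.

        First, the end result: the README.md on the GitHub page lists the latest articles about SLAM in the form of a table.

        Too long to read? Skip straight to the end of the article, where the code is provided.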

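        Introduction to the arXiv API

        Basic syntax

        The arXiv API [2] gives programmatic access to the millions of e-prints hosted on arXiv.org. The arXiv API user manual [2] describes the basic query syntax; a query that follows this syntax returns the metadata of the matching papers, including title, authors, abstract, comments, and so on. An API call has the following form: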

        http://export.arxiv.org/api/{method_name}?{parameters}

        Taking method_name=query as an example, to find papers by the author Adrian DelMaestro whose title contains checkerboard, we can write:

        http://export.arxiv.org/api/query?search_query=au:del_maestro+AND+ti:checkerboard

        Here the prefix au stands for author and ti for title; + is the encoding of a space (a URL cannot contain spaces). The available prefixes are:

        prefix    explanation
        ti        Title
        au        Author
        abs       Abstract
        co        Comment
        jr        Journal Reference
        cat       Subject Category
        rn        Report Number
        id        Id (use id_list instead)
        all       All of the above

        In addition, AND is a Boolean operator. The API's query method supports three Boolean operators: AND, OR, and ANDNOT.
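Writing such query strings by hand becomes error-prone once there are several terms. The following is a minimal sketch, using only the Python standard library, that assembles the same request. The helper name build_query_url is hypothetical (it is not part of any library); start and max_results are the paging parameters described in the API user manual [2].

```python
from urllib.parse import urlencode

API_ENDPOINT = "http://export.arxiv.org/api/query"

def build_query_url(terms, operator="AND", start=0, max_results=10):
    """Join (prefix, value) pairs with a Boolean operator and URL-encode them.

    build_query_url is a hypothetical helper written for this illustration.
    """
    if operator not in ("AND", "OR", "ANDNOT"):
        raise ValueError(f"unsupported operator: {operator}")
    search_query = f" {operator} ".join(f"{prefix}:{value}" for prefix, value in terms)
    # urlencode turns each space into '+', matching the hand-written URL above;
    # safe=":" keeps the colon between prefix and value readable.
    params = urlencode(
        {"search_query": search_query, "start": start, "max_results": max_results},
        safe=":",
    )
    return f"{API_ENDPOINT}?{params}"

url = build_query_url([("au", "del_maestro"), ("ti", "checkerboard")])
print(url)
```

The resulting URL can be passed to any HTTP client; only the search_query part differs from one search to the next.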

        The search results are returned as an Atom feed. Any language that can make HTTP requests and parse Atom feeds can call the API. Taking Python as an example:

        import urllib.request as libreq
        with libreq.urlopen('http://export.arxiv.org/api/query?search_query=au:del_maestro+AND+ti:checkerboard') as url:
            r = url.read()
        print(r)

        The printed result contains the paper metadata. The next task is to parse this data and write out the fields we care about in some format.
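To make the parsing step concrete, here is a minimal sketch that reads an Atom feed with the standard library's xml.etree.ElementTree. SAMPLE_FEED is a hand-written, shortened stand-in for an API response (its ID, title, and author names are placeholders, not real output), and parse_feed is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Atom elements live in this XML namespace.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Placeholder feed written for this example; a real response has more fields.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <published>2021-01-01T00:00:00Z</published>
    <title>An Example
      Title</title>
    <summary>A placeholder abstract.</summary>
    <author><name>First Author</name></author>
    <author><name>Second Author</name></author>
  </entry>
</feed>
"""

def parse_feed(xml_text):
    """Return one dict per <entry> with id, published date, title and authors."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        papers.append({
            "id": entry.findtext("atom:id", default="", namespaces=ATOM_NS),
            "published": entry.findtext("atom:published", default="", namespaces=ATOM_NS),
            # Titles may be wrapped over several lines; collapse the whitespace.
            "title": " ".join(title.split()),
            "authors": [
                name.text
                for name in entry.findall("atom:author/atom:name", ATOM_NS)
            ],
        })
    return papers

papers = parse_feed(SAMPLE_FEED)
print(papers[0]["title"], "by", ", ".join(papers[0]["authors"]))
```

The bytes returned by url.read() in the snippet above can be passed to ET.fromstring in the same way.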

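        A first try with arxiv.py

        Someone has already done this parsing for us, so there is no need to reinvent the wheel, and the query interface is more elegant as well. The package recommended here is arxiv.py [5].

        First, install arxiv.py:

        pip install arxiv

        Then simply import arxiv in a Python script.

        The following script searches for the keyword SLAM, asks for 10 results, and sorts them by submission date:

        import arxiv

        search = arxiv.Search(
          query = "SLAM",
          max_results = 10,
          sort_by = arxiv.SortCriterion.SubmittedDate
        )
        # Note: arxiv.py 2.x deprecates Search.results() in favor of arxiv.Client().results(search).
        for result in search.results():
          print(result.entry_id, '->', result.title)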

        In the script above, (Search).results() returns the paper metadata, which arxiv.py has already parsed, so attributes such as result.title can be accessed directly. The other available attributes are:

        element           explanation
        entry_id          A url http://arxiv.org/abs/{id}.
        updated           When the result was last updated.
        published         When the result was originally published.
        title             The title of the result.
        authors           The result's authors, as arxiv.Authors.
        summary           The result abstract.
        comment           The authors' comment if present.
        journal_ref       A journal reference if present.
        doi               A URL for the resolved DOI to an external resource if present.
        primary_category  The result's primary arXiv category. See arXiv: Category Taxonomy [4].
        categories        All of the result's categories. See arXiv: Category Taxonomy [4].
        links             Up to three URLs associated with this result, as arxiv.Links.
        pdf_url           A URL for the result's PDF if present. Note: this URL also appears among result.links.

        The search script prints the following to the terminal:

        http://arxiv.org/abs/2110.11040v1 -> InterpolationSLAM: A Novel Robust Visual SLAM System in Rotational Motion
        http://arxiv.org/abs/2110.10329v1 -> SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training
        http://arxiv.org/abs/2110.09156v1 -> Enhancing exploration algorithms for navigation with visual SLAM
        http://arxiv.org/abs/2110.08977v1 -> Accurate and Robust Object-oriented SLAM with 3D Quadric Landmark Construction in Outdoor Environment
        http://arxiv.org/abs/2110.08639v1 -> Partial Hierarchical Pose Graph Optimization for SLAM
        http://arxiv.org/abs/2110.07546v1 -> Active SLAM over Continuous Trajectory and Control: A Covariance-Feedback Approach
        http://arxiv.org/abs/2110.06541v2 -> Collaborative Radio SLAM for Multiple Robots based on WiFi Fingerprint Similarity
        http://arxiv.org/abs/2110.05734v1 -> Learning Efficient Multi-Agent Cooperative Visual Exploration
        http://arxiv.org/abs/2110.03234v1 -> Self-Supervised Depth Completion for Active Stereo
        http://arxiv.org/abs/2110.02593v1 -> InterpolationSLAM: A Novel Robust Visual SLAM System in Rotating Scenes

        The next script, daily_arxiv.py, fetches papers about SLAM from arXiv, turns the publication date, title, first author, and arXiv link of each paper into a Markdown table, and writes the table to README.md.

        import datetime
        import requests
        import json
        import arxiv
        import os

        def get_authors(authors, first_author=False):
            """Return all author names joined by ", ", or only the first author's name."""
            if first_author:
                return str(authors[0])
            return ", ".join(str(author) for author in authors)

        def sort_papers(papers):
            """Return a copy of papers ordered by key (arXiv ID), newest first."""
            return {key: papers[key] for key in sorted(papers, reverse=True)}

        def get_daily_papers(topic, query="slam", max_results=2):
            """
            @param topic: str, used as the heading of the table
            @param query: str, the search query sent to arXiv
            @return: dict of the form {topic: {paper_key: markdown_table_row}}
            """

            # output
            content = dict()

            search_engine = arxiv.Search(
                query = query,
                max_results = max_results,
                sort_by = arxiv.SortCriterion.SubmittedDate
            )

            for result in search_engine.results():

                paper_id       = result.get_short_id()
                paper_title    = result.title
                paper_url      = result.entry_id

                paper_abstract = result.summary.replace("\n", " ")
                paper_authors  = get_authors(result.authors)
                paper_first_author = get_authors(result.authors, first_author=True)
                primary_category = result.primary_category

                publish_time = result.published.date()

                print("Time = ", publish_time,
                      " title = ", paper_title,
                      " author = ", paper_first_author)

                # strip the version suffix, eg: 2108.09112v1 -> 2108.09112
                ver_pos = paper_id.find('v')
                if ver_pos == -1:
                    paper_key = paper_id
                else:
                    paper_key = paper_id[0:ver_pos]

                # one Markdown table row per paper
                content[paper_key] = f"|**{publish_time}**|**{paper_title}**|{paper_first_author} et al.|[{paper_id}]({paper_url})|\n"
            data = {topic: content}

            return data

        def update_json_file(filename, data_all):
            """Merge the newly fetched papers into the JSON file, keyed by topic and paper key."""
            with open(filename, "r") as f:
                content = f.read()
                if not content:
                    m = {}
                else:
                    m = json.loads(content)

            json_data = m.copy()

            # update papers in each keywords
            for data in data_all:
                for keyword in data.keys():
                    papers = data[keyword]

                    if keyword in json_data.keys():
                        json_data[keyword].update(papers)
                    else:
                        json_data[keyword] = papers

            with open(filename, "w") as f:
                json.dump(json_data, f)

        def json_to_md(filename):
            """
            @param filename: str, path of the JSON file produced by update_json_file
            @return None
            """

            DateNow = datetime.date.today()
            DateNow = str(DateNow)
            DateNow = DateNow.replace('-', '.')

            with open(filename, "r") as f:
                content = f.read()
                if not content:
                    data = {}
                else:
                    data = json.loads(content)

            md_filename = "README.md"

            # opening with "w" creates README.md or empties an existing one,
            # then the fresh content is written into it
            with open(md_filename, "w") as f:

                f.write("## Updated on " + DateNow + "\n\n")

                for keyword in data.keys():
                    day_content = data[keyword]
                    if not day_content:
                        continue
                    # the head of each part
                    f.write(f"## {keyword}\n\n")
                    f.write("|Publish Date|Title|Authors|PDF|\n" + "|---|---|---|---|\n")
                    # sort papers by date
                    day_content = sort_papers(day_content)

                    for _, v in day_content.items():
                        if v is not None:
                            f.write(v)

                    f.write("\n")
            print("finished")

        if __name__ == "__main__":

            data_collector = []
            keywords = dict()
            keywords["SLAM"] = "SLAM"

            for topic, keyword in keywords.items():

                print("Keyword: " + topic)
                data = get_daily_papers(topic, query = keyword, max_results = 10)
                data_collector.append(data)
                print("\n")

            # update README.md file
            json_file = "cv-arxiv-daily.json"
            # create an empty JSON file on the first run only; "not" is required here,
            # because the bitwise "~" of a bool is always truthy and would empty the file on every run
            if not os.path.exists(json_file):
                with open(json_file, 'w') as a:
                    print("create " + json_file)
            # update json data
            update_json_file(json_file, data_collector)
            # json data to markdown
            json_to_md(json_file)

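        The key points of the script above are:

        1. Both the topic and the search keyword are SLAM, and the 10 most recent papers are returned;
        2. Note that the topic is used as the second-level heading above the table, whereas the keyword is what is actually searched. Pay particular attention to the search format for keywords that contain spaces: camera localization has to be written as \"camera Localization\", where \" is an escaped double quote. Readers can add keywords of their own interest following this rule;
        3. The paper list is ordered by arXiv ID in descending order, so the most recently submitted papers come first;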
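The version-less paper key is what keeps the table free of duplicates when the script runs every day. The sketch below isolates that idea. strip_version and merge_papers are hypothetical helper names, the table rows are placeholder strings, and the v2 revision is invented for the illustration.

```python
import re

def strip_version(short_id):
    """'2108.09112v1' -> '2108.09112'; IDs without a version come back unchanged."""
    return re.sub(r"v\d+$", "", short_id)

def merge_papers(stored, fetched):
    """Merge two {topic: {paper_key: row}} dicts without modifying either input.

    Rows from `fetched` overwrite rows in `stored` that share the same key.
    """
    merged = {topic: dict(rows) for topic, rows in stored.items()}
    for topic, rows in fetched.items():
        merged.setdefault(topic, {}).update(rows)
    return merged

# Yesterday's JSON content (placeholder row text).
stored = {"SLAM": {"2110.11040": "|row for v1|\n"}}
# Today's fetch: a revised version of the same paper plus one new paper.
fetched = {
    "SLAM": {
        strip_version("2110.11040v2"): "|row for v2|\n",
        strip_version("2110.10329v1"): "|another row|\n",
    }
}

merged = merge_papers(stored, fetched)
print(sorted(merged["SLAM"], reverse=True))  # ['2110.11040', '2110.10329']
```

Because both versions of a paper map to the same key, the revised row replaces the old one instead of adding a second table entry.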

        This may look like the job is done, but two problems remain: 1. the script has to be run by hand every time; 2. the result can only be viewed locally. To run the script automatically every day and keep a GitHub repository in sync, GitHub Actions comes in handy.

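        Introduction to GitHub Actions

        To restate, the goal is to use GitHub Actions to fetch papers about SLAM from arXiv automatically every day and to publish their publication date, title, author, and link as a Markdown table on GitHub.

        What is GitHub Actions?

        GitHub Actions is GitHub's continuous integration service, launched in October 2018.

        The official explanation [3] reads: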

        GitHub Actions help you automate tasks within your software development life cycle. GitHub Actions are event-driven, meaning that you can run a series of commands after a specified event has occurred. For example, every time someone creates a pull request for a repository, you can automatically run a command that executes a software testing script.

        In short, GitHub Actions is driven by events and automates tasks.

        Basic concepts

        GitHub Actions has some terminology of its own [10], [9].

        1. workflow: one run of the continuous integration process is a workflow;
        2. job: a workflow consists of one or more jobs, meaning that a single continuous integration run can complete several tasks;
        3. step: each job consists of several steps that are carried out one after another;
        4. action: each step can execute one or more commands (actions) in sequence;

        Deployment

        Log in to your GitHub account, create a new repository such as cv-arxiv-daily, click Actions, and then click Set up this workflow.

        These steps create a new file named blank.yml in the directory .github/workflows/. Note that this directory must not be changed: it holds the workflows to be executed, and GitHub Actions automatically detects the yml workflow files in this folder and runs them according to their rules.

        This blank.yml implements the simplest possible workflow: printing Hello, world!.

        Note that GitHub Actions workflows have a syntax of their own. For reasons of space it is not covered here; see [9] for details.

        To run the Python script daily_arxiv.py from the previous section automatically, we arrive at the following workflow configuration, cv-arxiv-daily.yml. Note that the two environment variables GITHUB_USER_NAME and GITHUB_USER_EMAIL have to be replaced with your own ID and e-mail address.

        # name of workflow
        name: Run Arxiv Papers Daily

        # Controls when the workflow will run
        on:
          # Allows you to run this workflow manually from the Actions tab
          workflow_dispatch:
          schedule:
            - cron:  "0 12 * * *"  # Runs at 12:00 UTC every day
        env:

          GITHUB_USER_NAME: your_github_id # your github id
          GITHUB_USER_EMAIL: your_email_addr # your email address


        # A workflow run is made up of one or more jobs that can run sequentially or in parallel
        jobs:
          # This workflow contains a single job called "build"
          build:
            name: update
            # The type of runner that the job will run on
            runs-on: ubuntu-latest

            # Steps represent a sequence of tasks that will be executed as part of the job
            steps:
              - name: Checkout
                uses: actions/checkout@v2

              - name: Set up Python Env
                uses: actions/setup-python@v1
                with:
                  python-version: 3.6

              - name: Install dependencies
                run: |
                  python -m pip install --upgrade pip
                  pip install arxiv
                  pip install requests

              - name: Run daily arxiv
                run: |
                  python daily_arxiv.py

              - name: Push new cv-arxiv-daily.md
                uses: github-actions-x/[email protected]
                with:
                  github-token: ${{ secrets.GITHUB_TOKEN }}
                  commit-message: "Github Action Automatic Update CV Arxiv Papers"
                  files: README.md cv-arxiv-daily.json
                  rebase: 'true'
                  name: ${{ env.GITHUB_USER_NAME }}
                  email: ${{ env.GITHUB_USER_EMAIL }}

        Here, workflow_dispatch means that the workflow can be started manually with a click, and schedule [7] means that it runs on a timer; for the exact rules see Events that trigger workflows [8].

        The schedule uses cron syntax, which has 5 fields separated by spaces:

        ┌───────────── minute (0 - 59)
        │ ┌───────────── hour (0 - 23)
        │ │ ┌───────────── day of the month (1 - 31)
        │ │ │ ┌───────────── month (1 - 12 or JAN-DEC)
        │ │ │ │ ┌───────────── day of the week (0 - 6 or SUN-SAT)
        │ │ │ │ │
        │ │ │ │ │
        │ │ │ │ │
        * * * * *

        Additional syntax:

        Operator  Description           Example
        *         Any value             * * * * * runs every minute of every day.
        ,         Value list separator  2,10 4,5 * * * runs at minute 2 and 10 of the 4th and 5th hour of every day.
        -         Range of values       0 4-6 * * * runs at minute 0 of the 4th, 5th, and 6th hour.
        /         Step values           20/15 * * * * runs every 15 minutes starting from minute 20 through 59 (minutes 20, 35, and 50).
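To see how these operators combine, the following sketch expands a single cron field into the values it matches. expand_field is a hypothetical helper written for this illustration, not part of GitHub Actions or any cron library; it covers only the four operators in the table above and does not handle names such as JAN-DEC or SUN-SAT.

```python
def expand_field(field, low, high):
    """Expand one cron field into the sorted list of values it matches.

    Supports '*', comma-separated lists, 'a-b' ranges and '/step' values.
    low and high are the bounds of the field (e.g. 0 and 59 for minutes).
    """
    values = set()
    for part in field.split(","):
        base, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if base == "*":
            start, end = low, high
        elif "-" in base:
            start_text, end_text = base.split("-")
            start, end = int(start_text), int(end_text)
        else:
            start = int(base)
            # 'n/step' runs from n to the end of the range; a plain 'n' is just n.
            end = high if step_text else start
        if step < 1 or not (low <= start <= end <= high):
            raise ValueError(f"invalid cron field: {field!r}")
        values.update(range(start, end + 1, step))
    return sorted(values)

print(expand_field("20/15", 0, 59))   # [20, 35, 50]
print(expand_field("2,10", 0, 59))    # [2, 10]
print(expand_field("4-6", 0, 23))     # [4, 5, 6]
print(len(expand_field("*", 0, 59)))  # 60
```

The last line shows why the minute field matters: * there matches all 60 minutes of the hour, whereas a fixed minute such as 0 matches exactly one.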

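        The key points of the workflow above are:

        1. The event is triggered at 12:00 UTC every day and runs the workflow;
        2. There is a single job named build, which runs in the virtual machine environment ubuntu-latest;
        3. The first step checks out the source code, using the action actions/checkout@v2;
        4. The second step sets up the Python environment, using the action actions/setup-python@v1 with Python version 3.6;
        5. The third step installs the dependencies: it upgrades pip, then installs the arxiv.py and requests libraries;
        6. The fourth step runs the daily_arxiv.py script, which generates the temporary json file and the corresponding README.md;
        7. The fifth step pushes the changes to this repository, using the action github-actions-x/[email protected] [11]. The parameters to configure include the commit message commit-message, the files to commit files, the GitHub user name name, and the e-mail address email;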
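Because GitHub Actions evaluates schedule times in UTC, it helps to convert the trigger time to local time when choosing the hour. The sketch below assumes, purely for illustration, a reader in the UTC+8 time zone; local_run_time is a hypothetical helper name and the date used is arbitrary.

```python
from datetime import datetime, timedelta, timezone

# Assumption for illustration: the reader's local time zone is UTC+8.
LOCAL_TZ = timezone(timedelta(hours=8))

def local_run_time(hour_utc, minute_utc, tz=LOCAL_TZ):
    """Return the local wall-clock time 'HH:MM' of a daily UTC schedule."""
    # The calendar date is arbitrary; only the time of day matters here.
    run_utc = datetime(2021, 11, 20, hour_utc, minute_utc, tzinfo=timezone.utc)
    return run_utc.astimezone(tz).strftime("%H:%M")

print(local_run_time(12, 0))  # 20:00
```

A fixed offset is used instead of a named time zone so that the example runs without any time zone database; regions with daylight saving time would need a proper zone definition.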

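        Once the workflow has been deployed successfully, a json file and a README.md file are generated in the GitHub repo, and the article list described at the beginning of this post becomes visible. The log of each run can be inspected in the GitHub Actions backend.

        Summary

        This article introduced a way to fetch arXiv papers automatically every day with GitHub Actions, which makes it fairly convenient to obtain and preview the latest papers of interest. The example is easy to modify: readers can select the topics they care about by adding entries to keywords. All the code in this article is open source; the address is given at the end of the article.

        The latest code adds retrieval of the source code that accompanies arXiv papers, several more keywords, and automatic deployment to a GitHub Pages site.

        The approach described here still has a few issues: 1. the generated json file is a temporary file and could be removed as an optimization; 2. README.md grows over time, so an archiving feature may be added later; 3. not everyone browses GitHub every day, so a feature that sends the articles to a personal mailbox will be added later.

        Code: github.com/Vincentqyw/cv-arxiv-daily

        Feel free to fork & star and build your own paper search tool :)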

        References

        [1]: About arXiv, https://arxiv.org/about

        [2]: arXiv API User's Manual, https://arxiv.org/help/api/user-manual

        [3]: GitHub Actions, https://docs.github.com/en/actions/learn-github-actions

        [4]: arXiv Category Taxonomy, https://arxiv.org/category_taxonomy

        [5]: Python wrapper for the arXiv API, https://github.com/lukasschwab/arxiv.py

        [6]: Full package documentation: arxiv.arxiv, http://lukasschwab.me/arxiv.py/index.html

        [7]: GitHub Actions on.schedule, https://docs.github.com/en/actions/learn-github-actions/workflow-syntax-for-github-actions#onschedule

        [8]: GitHub Actions Events that trigger workflows, https://docs.github.com/en/actions/learn-github-actions/events-that-trigger-workflows#scheduled-events

        [9]: Workflow syntax for GitHub Actions, https://docs.github.com/en/actions/learn-github-actions/workflow-syntax-for-github-actions

        [10]: GitHub Actions getting-started tutorial (in Chinese), http://www.ruanyifeng.com/blog/2019/09/getting-started-with-github-actions.html

        [11]: Git commit and push, https://github.com/github-actions-x/commit

        [12]: Generate a list of papers daily arxiv, https://github.com/zhuwenxing/daily_arxiv


        -END-
