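arXiv API + GitHub Actions: Automatically Fetching arXiv Paper Abstracts Every Day
Hello everyone. Today I would like to share a practical tool that helps you find the latest research papers. I have also set up an instance of my own: https://github.com/DWCTOD/cv-arxiv-daily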

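Anyone who follows the latest developments in academia is probably familiar with arXiv. It is the world's largest open-access platform for scholarly articles and currently hosts nearly 2 million articles across 8 subject areas[1]. Researchers often post their forthcoming papers on arXiv to share them with their peers, which has greatly promoted openness and collaboration in academia.
The sheer number of articles is overwhelming and makes it hard to find the ones in your own field quickly. I recently used the arXiv API[2] together with GitHub Actions[3] to fetch articles on a chosen topic from arXiv automatically every day and publish them on GitHub.
Here is the final result first. The figure below shows the README.md on the GitHub page, which lists the latest articles on SLAM in a table.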

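Too long to read? Skip to the end of the article, where the code is posted!
Introduction to the arXiv API
Basic syntax
The arXiv API[2] gives programmatic access to the millions of e-prints hosted on arXiv.org. Its user manual[2] describes the basic query syntax; a query written in that syntax returns the metadata of the matching papers, including title, authors, abstract, and comments. An API call has the following format:
http://export.arxiv.org/api/{method_name}?{parameters}
Taking method_name=query as an example, to find articles by the author Adrian DelMaestro whose title contains checkerboard, we can write:
http://export.arxiv.org/api/query?search_query=au:del_maestro+AND+ti:checkerboard
Here the prefix au stands for author and ti for title, and + encodes a space (since a URL may not contain spaces). The full list of prefixes is: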
| prefix | explanation |
|---|---|
| ti | Title |
| au | Author |
| abs | Abstract |
| co | Comment |
| jr | Journal Reference |
| cat | Subject Category |
| rn | Report Number |
| id | Id (use id_list instead) |
| all | All of the above |
In addition, AND denotes the logical AND. The API's query method supports the Boolean operators AND, OR, and ANDNOT.
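As a small illustration of the syntax above, a query URL can be assembled from (prefix, value) pairs instead of being written by hand. This is only a sketch: the helper name build_query_url is hypothetical and not part of the arXiv API or of the original code.

```python
from urllib.parse import urlencode

BASE_URL = "http://export.arxiv.org/api/query"

def build_query_url(terms, operator="AND"):
    """Join (prefix, value) pairs with a Boolean operator and return a query URL.

    Hypothetical helper for illustration only.
    """
    search_query = f" {operator} ".join(f"{prefix}:{value}" for prefix, value in terms)
    # urlencode encodes spaces as '+'; safe=":" keeps the prefix separator readable
    params = urlencode({"search_query": search_query}, safe=":")
    return f"{BASE_URL}?{params}"

url = build_query_url([("au", "del_maestro"), ("ti", "checkerboard")])
print(url)
```

With the two terms from the example, the helper reproduces the URL shown earlier; passing operator="OR" or operator="ANDNOT" switches the Boolean operator.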
The results of the search above are returned as an Atom feed, so any language that can make HTTP requests and parse Atom feeds can call the API. In Python, for example:
import urllib.request as libreq
with libreq.urlopen('http://export.arxiv.org/api/query?search_query=au:del_maestro+AND+ti:checkerboard') as url:
    r = url.read()
print(r)
The printed result contains the papers' metadata, so the next task is to parse this data and write out the information we care about in some format.
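To give a feel for what that parsing involves, here is a minimal sketch using only the standard library. The feed string is a hypothetical, heavily shortened stand-in for a real response (a real entry carries many more fields), and parse_entries is an illustrative name, not an existing API.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical, shortened feed; IDs, titles and names are placeholders.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00001v1</id>
    <title>An Example
      Title</title>
    <summary>Example abstract.</summary>
    <author><name>First Author</name></author>
    <author><name>Second Author</name></author>
  </entry>
</feed>"""

def parse_entries(feed_xml):
    """Extract id, title and author names from every Atom entry."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        entries.append({
            "id": entry.findtext("atom:id", default="", namespaces=ATOM_NS),
            # titles may be wrapped across lines, so collapse the whitespace
            "title": " ".join(title.split()),
            "authors": [a.findtext("atom:name", default="", namespaces=ATOM_NS)
                        for a in entry.findall("atom:author", ATOM_NS)],
        })
    return entries

entries = parse_entries(SAMPLE_FEED)
print(entries[0]["title"], entries[0]["authors"])
```

The same function should also accept the bytes returned by url.read(), since ET.fromstring takes either text or bytes.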
A first try with arxiv.py
Someone has already written this parsing for us, so there is no need to reinvent the wheel, and the query interface is more elegant as well. The package we recommend here is arxiv.py[5].
First install arxiv.py:
pip install arxiv
Then simply import arxiv in your Python script.
To search for the keyword SLAM, request 10 results, and sort them by submission date, the script is:
import arxiv
search = arxiv.Search(
  query = "SLAM",
  max_results = 10,
  sort_by = arxiv.SortCriterion.SubmittedDate
)
for result in search.results():
  print(result.entry_id, '->', result.title)
In the script above, (Search).results() returns the papers' metadata, already parsed by arxiv.py, so fields such as result.title can be accessed directly. The other available fields are:
| element | explanation |
|---|---|
| entry_id | A url http://arxiv.org/abs/{id}. |
| updated | When the result was last updated. |
| published | When the result was originally published. |
| title | The title of the result. |
| authors | The result's authors, as arxiv.Authors. |
| summary | The result abstract. |
| comment | The authors' comment if present. |
| journal_ref | A journal reference if present. |
| doi | A URL for the resolved DOI to an external resource if present. |
| primary_category | The result's primary arXiv category. See arXiv: Category Taxonomy[4]. |
| categories | All of the result's categories. See arXiv: Category Taxonomy. |
| links | Up to three URLs associated with this result, as arxiv.Links. |
| pdf_url | A URL for the result's PDF if present. Note: this URL also appears among result.links. |
The search script above prints the following in the terminal:
http://arxiv.org/abs/2110.11040v1 -> InterpolationSLAM: A Novel Robust Visual SLAM System in Rotational Motion
http://arxiv.org/abs/2110.10329v1 -> SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training
http://arxiv.org/abs/2110.09156v1 -> Enhancing exploration algorithms for navigation with visual SLAM
http://arxiv.org/abs/2110.08977v1 -> Accurate and Robust Object-oriented SLAM with 3D Quadric Landmark Construction in Outdoor Environment
http://arxiv.org/abs/2110.08639v1 -> Partial Hierarchical Pose Graph Optimization for SLAM
http://arxiv.org/abs/2110.07546v1 -> Active SLAM over Continuous Trajectory and Control: A Covariance-Feedback Approach
http://arxiv.org/abs/2110.06541v2 -> Collaborative Radio SLAM for Multiple Robots based on WiFi Fingerprint Similarity
http://arxiv.org/abs/2110.05734v1 -> Learning Efficient Multi-Agent Cooperative Visual Exploration
http://arxiv.org/abs/2110.03234v1 -> Self-Supervised Depth Completion for Active Stereo
http://arxiv.org/abs/2110.02593v1 -> InterpolationSLAM: A Novel Robust Visual SLAM System in Rotating Scenes
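As a sketch of how these parsed fields might be combined into one Markdown table row, the snippet below uses a stand-in object instead of a live arxiv.py result so that it runs without network access. The function name to_markdown_row is hypothetical, and the author names and date are placeholders rather than real metadata.

```python
from datetime import datetime
from types import SimpleNamespace

def to_markdown_row(result):
    """Format one result as a '|date|title|first author|link|' table row."""
    short_id = result.entry_id.rsplit("/", 1)[-1]  # e.g. 2110.08639v1
    first_author = str(result.authors[0])
    return (f"|**{result.published.date()}**|**{result.title}**|"
            f"{first_author} et.al.|[{short_id}]({result.entry_id})|")

# Stand-in for an arxiv.py result; authors and date are placeholders.
fake_result = SimpleNamespace(
    entry_id="http://arxiv.org/abs/2110.08639v1",
    title="Partial Hierarchical Pose Graph Optimization for SLAM",
    authors=["A. Author", "B. Author"],
    published=datetime(2021, 10, 16),
)

row = to_markdown_row(fake_result)
print(row)
```

A real result exposes the same attribute names (entry_id, title, authors, published), so the function could be applied to it unchanged under that assumption.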
The next script, daily_arxiv.py, fetches papers on SLAM from arXiv, builds a Markdown table from each paper's publication date, title, authors, and link, and writes it to README.md.
import datetime
import requests
import json
import arxiv
import os

def get_authors(authors, first_author=False):
    """Return all authors as one comma-separated string, or only the first author."""
    if not first_author:
        output = ", ".join(str(author) for author in authors)
    else:
        output = str(authors[0])
    return output

def sort_papers(papers):
    """Return a copy of papers ordered by key (paper ID), largest first."""
    output = dict()
    keys = list(papers.keys())
    keys.sort(reverse=True)
    for key in keys:
        output[key] = papers[key]
    return output

def get_daily_papers(topic, query="slam", max_results=2):
    """
    @param topic: str
    @param query: str
    @return data: dict, {topic: {paper_key: markdown table row}}
    """
    # output
    content = dict()

    search_engine = arxiv.Search(
        query = query,
        max_results = max_results,
        sort_by = arxiv.SortCriterion.SubmittedDate
    )
    for result in search_engine.results():
        paper_id       = result.get_short_id()
        paper_title    = result.title
        paper_url      = result.entry_id
        paper_abstract = result.summary.replace("\n", " ")
        paper_authors  = get_authors(result.authors)
        paper_first_author = get_authors(result.authors, first_author=True)
        primary_category = result.primary_category
        publish_time = result.published.date()
        print("Time = ", publish_time,
              " title = ", paper_title,
              " author = ", paper_first_author)
        # eg: 2108.09112v1 -> 2108.09112
        ver_pos = paper_id.find('v')
        if ver_pos == -1:
            paper_key = paper_id
        else:
            paper_key = paper_id[0:ver_pos]
        content[paper_key] = f"|**{publish_time}**|**{paper_title}**|{paper_first_author} et.al.|[{paper_id}]({paper_url})|\n"
    data = {topic: content}

    return data

def update_json_file(filename, data_all):
    """Merge the newly fetched papers into the json file, keyed by topic."""
    with open(filename, "r") as f:
        content = f.read()
        if not content:
            m = {}
        else:
            m = json.loads(content)

    json_data = m.copy()

    # update papers in each keywords
    for data in data_all:
        for keyword in data.keys():
            papers = data[keyword]
            if keyword in json_data.keys():
                json_data[keyword].update(papers)
            else:
                json_data[keyword] = papers
    with open(filename, "w") as f:
        json.dump(json_data, f)

def json_to_md(filename):
    """
    @param filename: str
    @return None
    """

    DateNow = datetime.date.today()
    DateNow = str(DateNow)
    DateNow = DateNow.replace('-', '.')

    with open(filename, "r") as f:
        content = f.read()
        if not content:
            data = {}
        else:
            data = json.loads(content)
    md_filename = "README.md"

    # opening with "w" creates README.md, or clears it if it already exists
    with open(md_filename, "w") as f:

        f.write("## Updated on " + DateNow + "\n\n")

        for keyword in data.keys():
            day_content = data[keyword]
            if not day_content:
                continue
            # the head of each part
            f.write(f"## {keyword}\n\n")
            f.write("|Publish Date|Title|Authors|PDF|\n" + "|---|---|---|---|\n")
            # sort papers by date
            day_content = sort_papers(day_content)

            for _, v in day_content.items():
                if v is not None:
                    f.write(v)
            f.write("\n")
    print("finished")

if __name__ == "__main__":
    data_collector = []
    keywords = dict()
    keywords["SLAM"] = "SLAM"

    for topic, keyword in keywords.items():

        print("Keyword: " + topic)
        data = get_daily_papers(topic, query=keyword, max_results=10)
        data_collector.append(data)
        print("\n")
    # create the json file on the first run only
    json_file = "cv-arxiv-daily.json"
    if not os.path.exists(json_file):
        with open(json_file, 'w') as a:
            print("create " + json_file)
    # update json data
    update_json_file(json_file, data_collector)
    # json data to markdown
    json_to_md(json_file)
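The key points of the script above are:
The topic and the keyword are both SLAM, and the 10 most recent articles are returned.
Note that the topic is used as the second-level heading above each table, whereas the keyword is what is actually searched for. Pay particular attention to the format of keywords that contain spaces: camera localization must be written as "\"camera localization\"", where \" is an escaped quotation mark. You can add keywords of your own interest following these rules.
The paper list is sorted by the time of publication on arXiv, with the newest first.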
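The points above can be sketched in a few lines. The "Camera Localization" entry is a hypothetical extra topic added for illustration, and strip_version and newest_first are illustrative names that mirror, but are not copied from, the logic in daily_arxiv.py.

```python
import re

# topic (used as the section heading) -> query string sent to arXiv
keywords = {
    "SLAM": "SLAM",
    # inner quotes keep the two words together as one phrase
    "Camera Localization": "\"camera localization\"",
}

def strip_version(paper_id):
    """'2108.09112v1' -> '2108.09112'; IDs without a version pass through."""
    return re.sub(r"v\d+$", "", paper_id)

def newest_first(papers):
    """Return a dict ordered by paper key, largest key first."""
    return dict(sorted(papers.items(), key=lambda kv: kv[0], reverse=True))

papers = {
    strip_version("2110.02593v1"): "row for the older paper\n",
    strip_version("2110.11040v1"): "row for the newer paper\n",
}
ordered = newest_first(papers)
print(list(ordered))
```

Because new-style arXiv identifiers start with the year and month (YYMM), sorting the keys in descending order puts the most recent submissions first, and stripping the version suffix means a revised paper overwrites its earlier row instead of adding a duplicate.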
This may look like the job is done, but two problems remain: 1. the script has to be run manually every time; 2. the result can only be viewed locally. To run the script automatically every day and sync the result to a GitHub repository, GitHub Actions comes in handy.
Introduction to GitHub Actions
To restate, our goal is to use GitHub Actions to fetch papers on SLAM from arXiv automatically every day, and to publish each paper's publication date, title, authors, and link as a Markdown table on GitHub.
What is GitHub Actions?
GitHub Actions is GitHub's continuous integration service, launched in October 2018.
The official description[3] reads:
“GitHub Actions help you automate tasks within your software development life cycle. GitHub Actions are event-driven, meaning that you can run a series of commands after a specified event has occurred. For example, every time someone creates a pull request for a repository, you can automatically run a command that executes a software testing script.”
In short, GitHub Actions is driven by events and automates tasks.
Basic concepts
GitHub Actions has some terminology of its own[10],[9].
workflow: one run of the continuous integration process is a workflow.
job: a workflow consists of one or more jobs, meaning that a single continuous integration run can complete several tasks.
step: each job consists of multiple steps, which are completed one after another.
action: each step can execute one or more commands (actions) in sequence.

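Deployment
Log in to your GitHub account and create a new repository, for example cv-arxiv-daily. Click Actions, then click Set up this workflow, as shown in the figure below:
After these steps, a new file named blank.yml is created (as shown in the figure below) in the directory .github/workflows/. This directory must not be changed: it holds the workflows to be executed, and GitHub Actions automatically detects the yml workflow files in it and runs them according to their rules.
This blank.yml implements the simplest possible workflow: printing Hello, world!.
Note that GitHub Actions workflows have a syntax of their own. For reasons of space it is not covered here; see the workflow syntax reference[9] for details.
To run the Python script daily_arxiv.py from the previous section automatically, we arrive at the following workflow configuration, cv-arxiv-daily.yml. Replace the values of the two environment variables GITHUB_USER_NAME and GITHUB_USER_EMAIL with your own ID and email address.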
# name of workflow
name: Run Arxiv Papers Daily
# Controls when the workflow will run
on:
  # Allows you to run this workflow manually from the Actions tab
  workflow_dispatch:
  schedule:
    - cron: "0 12 * * *"  # Runs at 12:00 UTC every day
env:
  GITHUB_USER_NAME: your_github_id # your github id
  GITHUB_USER_EMAIL: your_email_addr # your email address


# A workflow run is made up of one or more jobs that can run sequentially or in parallel
jobs:
  # This workflow contains a single job called "build"
  build:
    name: update
    # The type of runner that the job will run on
    runs-on: ubuntu-latest

    # Steps represent a sequence of tasks that will be executed as part of the job
    steps:
      - name: Checkout
        uses: actions/checkout@v2

      - name: Set up Python Env
        uses: actions/setup-python@v1
        with:
          python-version: 3.6
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install arxiv
          pip install requests

      - name: Run daily arxiv
        run: |
          python daily_arxiv.py

      - name: Push new cv-arxiv-daily.md
        uses: github-actions-x/[email protected]
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          commit-message: "Github Action Automatic Update CV Arxiv Papers"
          files: README.md cv-arxiv-daily.json
          rebase: 'true'
          name: ${{ env.GITHUB_USER_NAME }}
          email: ${{ env.GITHUB_USER_EMAIL }}
Here, workflow_dispatch means the workflow can be started manually with a click, and schedule[7] means it runs on a timer; for the exact rules see Events that trigger workflows[8].
The schedule uses cron syntax, which has five fields separated by spaces, as follows:
┌───────────── minute (0 - 59)
│ ┌───────────── hour (0 - 23)
│ │ ┌───────────── day of the month (1 - 31)
│ │ │ ┌───────────── month (1 - 12 or JAN-DEC)
│ │ │ │ ┌───────────── day of the week (0 - 6 or SUN-SAT)
│ │ │ │ │
│ │ │ │ │
│ │ │ │ │
* * * * *
Additional syntax:
| Operator | Description | Example |
|---|---|---|
| * | Any value | * * * * * runs every minute of every day. |
| , | Value list separator | 2,10 4,5 * * * runs at minute 2 and 10 of the 4th and 5th hour of every day. |
| - | Range of values | 0 4-6 * * * runs at minute 0 of the 4th, 5th, and 6th hour. |
| / | Step values | 20/15 * * * * runs every 15 minutes starting from minute 20 through 59 (minutes 20, 35, and 50). |
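To make the operators in the table concrete, the sketch below expands a single cron field into the values it matches. It is a simplified illustration written for this article, not GitHub's scheduler: it handles only numeric values and ignores names such as JAN or SUN, and the function names are hypothetical.

```python
def expand_field(field, lo, hi):
    """Expand one numeric cron field into the sorted list of matching values."""
    values = set()
    for part in field.split(","):            # ',' separates list items
        base, sep, step = part.partition("/")
        step = int(step) if sep else 1       # '/' introduces a step value
        if base == "*":
            start, end = lo, hi              # '*' covers the whole range
        elif "-" in base:
            start, end = (int(x) for x in base.split("-"))
        elif sep:
            start, end = int(base), hi       # 'start/step' runs to the upper bound
        else:
            start = end = int(base)
        values.update(range(start, end + 1, step))
    return sorted(values)

def fires_at(cron, hour, minute):
    """Check only the minute and hour fields of a five-field cron expression."""
    minute_field, hour_field = cron.split()[:2]
    return (minute in expand_field(minute_field, 0, 59)
            and hour in expand_field(hour_field, 0, 23))

print(expand_field("20/15", 0, 59))
print(fires_at("* 12 * * *", 12, 30))
print(fires_at("0 12 * * *", 12, 30))
```

The last two lines show why the minute field matters: `* 12 * * *` matches every minute of hour 12, whereas `0 12 * * *` matches only 12:00.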
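The key points of the workflow above are:
The workflow is triggered every day at 12:00 UTC.
It has a single job named build, which runs on the ubuntu-latest virtual machine.
Step 1 checks out the source code, using the action actions/checkout@v2.
Step 2 sets up the Python environment, using the action actions/setup-python@v1 with Python 3.6.
Step 3 installs the dependencies: it upgrades pip, then installs the arxiv.py library and the requests library.
Step 4 runs the daily_arxiv.py script, which generates the intermediate JSON file and the corresponding README.md.
Step 5 pushes the changes to this repository, using the github-actions-x/commit action[11]. The parameters to configure include the commit message (commit-message), the files to commit (files), the GitHub user name (name), and the email address (email).
Once the workflow is deployed successfully, it generates a JSON file and a README.md file in the GitHub repo, and you will see the article list shown at the beginning of this post. The log in the GitHub Actions backend looks like this:
Summary
This article presented a way to fetch arXiv papers automatically every day using GitHub Actions, which makes it fairly convenient to obtain and preview the latest articles of interest. The example is easy to modify: readers can select the topics they care about by adding entries to keywords. All the code in this article is open source; the address is given at the end.
The latest version of the code adds fetching of the source code for arXiv papers, several more keywords, and automatic deployment to a GitHub Pages site.
The method described here also has a few shortcomings: 1. the generated JSON file is an intermediate file and could be deleted as an optimization; 2. README.md grows over time, so an archiving feature could be added later; 3. not everyone browses GitHub every day, so sending the articles to a personal mailbox is planned for the future.
Code: github.com/Vincentqyw/cv-arxiv-daily
Feel free to fork & star, and build your own paper search tool :)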
References
[1]: About arXiv, https://arxiv.org/about
[2]: arXiv API User's Manual, https://arxiv.org/help/api/user-manual
[3]: GitHub Actions, https://docs.github.com/en/actions/learn-github-actions
[4]: arXiv Category Taxonomy, https://arxiv.org/category_taxonomy
[5]: Python wrapper for the arXiv API, https://github.com/lukasschwab/arxiv.py
[6]: Full package documentation: arxiv.arxiv, http://lukasschwab.me/arxiv.py/index.html
[7]: GitHub Actions on.schedule, https://docs.github.com/en/actions/learn-github-actions/workflow-syntax-for-github-actions#onschedule
[8]: GitHub Actions Events that trigger workflows, https://docs.github.com/en/actions/learn-github-actions/events-that-trigger-workflows#scheduled-events
[9]: Workflow syntax for GitHub Actions, https://docs.github.com/en/actions/learn-github-actions/workflow-syntax-for-github-actions
[10]: GitHub Actions 入門教程 (Getting started with GitHub Actions), http://www.ruanyifeng.com/blog/2019/09/getting-started-with-github-actions.html
[11]: Git commit and push, https://github.com/github-actions-x/commit
[12]: Generate a list of papers daily arxiv, https://github.com/zhuwenxing/daily_arxiv



