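A One-Article Guide | The fork System Call
Introduction
fork is the standard Unix system call for duplicating a process, but Linux, BSD, and other operating systems implement more than this one call. Linux provides three: fork, vfork, and clone. Strictly speaking, clone is the call that can create lightweight processes (also called threads, that is, processes that share resources with their creator), while vfork creates a child that temporarily shares the parent's address space.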
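| System call | Description |
|---|---|
| fork | The child created by fork is a full copy of the parent: it duplicates the parent's resources, including the memory contents and the contents of task_struct |
| vfork | The child created by vfork shares the parent's address space (including the data segment), and the child runs before the parent, which stays suspended until the child calls exec or exits |
| clone | Threads on Linux are usually created through the pthread library, but Linux also exposes the underlying system call for creating threads: clone |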
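The entry points of the fork, vfork, and clone system calls are sys_fork, sys_vfork, and sys_clone respectively. Their definitions are architecture-dependent, because the way parameters are passed between user space and kernel space differs from one architecture to another.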
System call parameter passing
System calls are implemented differently from C library functions. An ordinary C function (for example under the i386 cdecl convention) passes its arguments by pushing their values onto the process stack. A system call, however, is a special kind of function call that traps from user mode into kernel mode, so there is no single user-mode or kernel-mode stack that caller and callee could both use for argument passing. System calls therefore pass their arguments in CPU registers: before making the call, user space writes the arguments into registers, and before the actual service routine runs, the kernel copies the register contents onto the kernel stack.
Different architectures may therefore use different mechanisms or different registers to pass parameters. The job of the entry functions above is to extract the information supplied by user space from the processor registers and then call the architecture-independent _do_fork (do_fork in earlier kernels), which performs the actual process duplication.
In other words, Linux splits each of these system calls into an architecture-dependent layer and an architecture-independent layer: the former is responsible for extracting the architecture-specific parameters, and the latter carries out the real work according to those parameters.
Implementation of the fork, vfork, and clone system calls
About do_fork and _do_fork
Linux 2.5.32 added the TLS (Thread Local Storage) mechanism, and the clone flag CLONE_SETTLS takes a parameter that sets the thread's local storage area. sys_clone accordingly gained an int parameter to carry the corresponding tls_val. sys_clone goes through do_fork to call copy_process, which duplicates the process and calls the architecture-specific copy_thread. copy_thread then pulls the system call arguments, including the tls value, out of the pt_regs register list, and this can lead to surprises:
only one code path into copy_thread can pass the CLONE_SETTLS flag, and that code path comes from sys_clone with its architecture-specific argument-passing order.
As noted above, sys_clone and its siblings extract the architecture-specific parameters from registers, so everything should be architecture-independent by the time do_fork is reached. The tls value that sys_clone needs for CLONE_SETTLS, however, was still an architecture-dependent parameter, and that is where the problem arises.
Linux 4.2 therefore introduced a new option, CONFIG_HAVE_COPY_THREAD_TLS, together with a new function, copy_thread_tls, which accepts the TLS parameter as an additional unsigned long argument (the size of a system call argument). The TLS parameter of sys_clone was changed to unsigned long and is now passed down to copy_thread_tls.
/* http://lxr.free-electrons.com/source/include/linux/sched.h?v=4.5#L2646 */
extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *, unsigned long);
extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
/* TLS (Thread Local Storage) was added in Linux 2.5.32.
 * Linux 4.2 added support for passing the CLONE_SETTLS value as a C argument;
 * the underlying _do_fork implements it. */
#ifndef CONFIG_HAVE_COPY_THREAD_TLS
/* For compatibility with architectures that call do_fork directly rather than
 * using the syscall entry points below. */
long do_fork(unsigned long clone_flags,
             unsigned long stack_start,
             unsigned long stack_size,
             int __user *parent_tidptr,
             int __user *child_tidptr)
{
        return _do_fork(clone_flags, stack_start, stack_size,
                        parent_tidptr, child_tidptr, 0);
}
#endif
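As we can see, newer kernels pass clone's TLS setting through the tls parameter, so _do_fork has replaced the old do_fork.
The old do_fork is defined only in the following situation:
The architecture does not support passing the TLS value as a C parameter and instead reads it from the pt_regs register list
That is, the CONFIG_HAVE_COPY_THREAD_TLS macro is not defined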
| Parameter | Description |
|---|---|
| clone_flags | Same as the flags argument of clone(). It controls attributes of the duplication and describes which resources the child inherits from the parent. The 4 bytes of this value fall into two parts: the lowest byte is the signal number sent to the parent when the child terminates, usually SIGCHLD; the remaining three bytes are a bitwise OR of clone flags. The clone flags let the caller choose which of the parent's resources are copied |
| stack_start | Same as the stack_start argument of clone(): the address of the child's user-mode stack |
| regs | (Older do_fork only.) A pointer to the register set that holds the call arguments in raw form. Its type is the architecture-specific struct pt_regs, which stores all registers in the order in which they are saved on the kernel stack during a system call. When a process switches from user mode to kernel mode, the general-purpose register values are saved in this structure on the kernel-mode stack |
| stack_size | Size of the user-mode stack. Usually unnecessary and set to 0 |
| parent_tidptr | Same as the ptid argument of clone(): a user-space address in the parent where the child's thread ID is stored. Meaningful only when CLONE_PARENT_SETTID is set |
| child_tidptr | Same as the ctid argument of clone(): a user-space address in the child where the child's thread ID is stored. Meaningful only when CLONE_CHILD_SETTID is set |
The individual clone_flags values were given in a table that is not reproduced here.

Implementation of sys_fork
Across architectures, sys_fork is distinguished from the other entry points mainly by the set of flags it passes. On most architectures a typical fork implementation looks like the following.
Early implementation
| Architecture | Implementation |
|---|---|
| arm | arch/arm/kernel/sys_arm.c, line 239 |
| i386 | arch/i386/kernel/process.c, line 710 |
| x86_64 | arch/x86_64/kernel/process.c, line 706 |
asmlinkage long sys_fork(struct pt_regs regs)
{
    return do_fork(SIGCHLD, regs.rsp, &regs, 0);
}
New version
http://lxr.free-electrons.com/source/kernel/fork.c?v=4.5#L1785
#ifdef __ARCH_WANT_SYS_FORK
SYSCALL_DEFINE0(fork)
{
#ifdef CONFIG_MMU
    return _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0);
#else
    /* can not support in nommu mode */
    return -EINVAL;
#endif
}
#endif
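As we can see, the only flag used is SIGCHLD, which means that the SIGCHLD signal is sent to notify the parent once the child terminates.
Thanks to copy-on-write (COW), the parent's and the child's stacks initially sit at the same virtual addresses and share the same physical pages. As soon as either process writes to its stack, the COW mechanism gives that process its own private copy of the affected page.
If do_fork succeeds, the PID of the new process is returned as the result of the system call; otherwise an error code is returned.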
Implementation of sys_vfork
Early implementation
| Architecture | Implementation |
|---|---|
| arm | arch/arm/kernel/sys_arm.c, line 254 |
| i386 | arch/i386/kernel/process.c, line 737 |
| x86_64 | arch/x86_64/kernel/process.c, line 728 |
asmlinkage long sys_vfork(struct pt_regs regs)
{
    return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs.rsp, &regs, 0);
}
New version
http://lxr.free-electrons.com/source/kernel/fork.c?v=4.5#L1797
#ifdef __ARCH_WANT_SYS_VFORK
SYSCALL_DEFINE0(vfork)
{
    return _do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, 0,
                    0, NULL, NULL, 0);
}
#endif
As we can see, sys_vfork differs only slightly from sys_fork: it passes the additional flags CLONE_VFORK | CLONE_VM.
Implementation of sys_clone
Early implementation
| Architecture | Implementation |
|---|---|
| arm | arch/arm/kernel/sys_arm.c, line 247 |
| i386 | arch/i386/kernel/process.c, line 715 |
| x86_64 | arch/x86_64/kernel/process.c, line 711 |
sys_clone is implemented much like the system calls above; the real difference lies in how do_fork is called:
asmlinkage int sys_clone(struct pt_regs regs)
{
    /* The next four lines are specific to i386; other architectures
     * extract these arguments in their own way. */
    unsigned long clone_flags;
    unsigned long newsp;

    clone_flags = regs.ebx;
    newsp = regs.ecx;
    if (!newsp)
        newsp = regs.esp;
    return do_fork(clone_flags, newsp, &regs, 0);
}
New version
http://lxr.free-electrons.com/source/kernel/fork.c?v=4.5#L1805
#ifdef __ARCH_WANT_SYS_CLONE
#ifdef CONFIG_CLONE_BACKWARDS
SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
                 int __user *, parent_tidptr,
                 unsigned long, tls,
                 int __user *, child_tidptr)
#elif defined(CONFIG_CLONE_BACKWARDS2)
SYSCALL_DEFINE5(clone, unsigned long, newsp, unsigned long, clone_flags,
                 int __user *, parent_tidptr,
                 int __user *, child_tidptr,
                 unsigned long, tls)
#elif defined(CONFIG_CLONE_BACKWARDS3)
SYSCALL_DEFINE6(clone, unsigned long, clone_flags, unsigned long, newsp,
                int, stack_size,
                int __user *, parent_tidptr,
                int __user *, child_tidptr,
                unsigned long, tls)
#else
SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
                 int __user *, parent_tidptr,
                 int __user *, child_tidptr,
                 unsigned long, tls)
#endif
{
        return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls);
}
#endif
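As we can see, the flags of sys_clone are no longer hard-coded; they are passed into the system call as register arguments, so they have to be extracted first.
In addition, clone does not simply inherit a copy of the parent's stack: the caller can specify a new stack address. This is needed when creating threads, because a thread may share the parent's address space while its own stack has to live at a different address within it.
Two user-space pointers (parent_tidptr and child_tidptr) are also specified; they are used to communicate with the thread library.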
Process creation flow
The _do_fork flow
_do_fork and do_fork differ very little in how they duplicate a process; the only subtle difference is in how the TLS value is handled.
Every fork-style mechanism for duplicating (creating) a process ultimately calls _do_fork in kernel/fork.c, an architecture-independent function,
defined at http://lxr.free-electrons.com/source/kernel/fork.c?v=4.2#L1679
_do_fork starts by calling copy_process, which does the actual work of creating the new process and copies the parent's data according to the given flags. In order, _do_fork performs the following steps:
Call copy_process to create a copy of the process information for the child
If this is a vfork (CLONE_VFORK is set), initialize the completion structure
Call wake_up_new_task to hand the child to the scheduler so that it can be assigned a CPU
If this is a vfork, make the parent wait until the child has called exec to replace its address space, or has exited
An early flowchart from Professional Linux Kernel Architecture is essentially the same and can serve as a reference

long _do_fork(unsigned long clone_flags,
              unsigned long stack_start,
              unsigned long stack_size,
              int __user *parent_tidptr,
              int __user *child_tidptr,
              unsigned long tls)
{
    struct task_struct *p;
    int trace = 0;
    long nr;

    /*
     * Determine whether and which event to report to ptracer.  When
     * called from kernel_thread or CLONE_UNTRACED is explicitly
     * requested, no event is reported; otherwise, report if the event
     * for the type of forking is enabled.
     */
    if (!(clone_flags & CLONE_UNTRACED)) {
        if (clone_flags & CLONE_VFORK)
            trace = PTRACE_EVENT_VFORK;
        else if ((clone_flags & CSIGNAL) != SIGCHLD)
            trace = PTRACE_EVENT_CLONE;
        else
            trace = PTRACE_EVENT_FORK;

        if (likely(!ptrace_event_enabled(current, trace)))
            trace = 0;
    }

    /* Duplicate the process descriptor; copy_process() returns a task_struct pointer */
    p = copy_process(clone_flags, stack_start, stack_size,
                     child_tidptr, NULL, trace, tls);
    /*
     * Do this prior waking up the new thread - the thread pointer
     * might get invalid after that point, if the thread exits quickly.
     */
    if (!IS_ERR(p)) {
        struct completion vfork;
        struct pid *pid;

        trace_sched_process_fork(current, p);

        /* Look up the pid of the newly created process */
        pid = get_task_pid(p, PIDTYPE_PID);
        nr = pid_vnr(pid);

        if (clone_flags & CLONE_PARENT_SETTID)
            put_user(nr, parent_tidptr);

        /* For vfork(), initialize the completion used to wait for the child */
        if (clone_flags & CLONE_VFORK) {
            p->vfork_done = &vfork;
            init_completion(&vfork);
            get_task_struct(p);
        }

        /* Hand the child to the scheduler so it can be given a CPU and run */
        wake_up_new_task(p);

        /* forking complete and child started to run, tell ptracer */
        if (unlikely(trace))
            ptrace_event_pid(trace, pid);

        /* For vfork(), put the parent to sleep until the child is done */
        if (clone_flags & CLONE_VFORK) {
            if (!wait_for_vfork_done(p, &vfork))
                ptrace_event_pid(PTRACE_EVENT_VFORK_DONE, pid);
        }

        put_pid(pid);
    } else {
        nr = PTR_ERR(p);
    }
    return nr;
}
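The copy_process flow
http://lxr.free-electrons.com/source/kernel/fork.c?v=4.5#L1237
Call dup_task_struct to duplicate the current task_struct
Check that the number of processes does not exceed the limits
Initialize the spinlock, pending signals, CPU timers, and so on
Call sched_fork to initialize the scheduling data structures and set the process state to TASK_RUNNING
Copy all process information, including the file system, signal handlers, signals, memory management, and so on
Call copy_thread_tls to initialize the child's kernel stack
Allocate and set a new pid for the new process
An early flowchart from Professional Linux Kernel Architecture is essentially the same and can serve as a reference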

/*
 * This creates a new process as a copy of the old one,
 * but does not actually start it yet.
 *
 * It copies the registers, and all the appropriate
 * parts of the process environment (as per the clone
 * flags). The actual kick-off is left to the caller.
 */
static struct task_struct *copy_process(unsigned long clone_flags,
                    unsigned long stack_start,
                    unsigned long stack_size,
                    int __user *child_tidptr,
                    struct pid *pid,
                    int trace,
                    unsigned long tls)
{
    int retval;
    struct task_struct *p;
    retval = security_task_create(clone_flags);
    if (retval)
        goto fork_out;
    // duplicate the current task_struct
    retval = -ENOMEM;
    p = dup_task_struct(current);
    if (!p)
        goto fork_out;
    ftrace_graph_init_task(p);
    // initialize the rt_mutex (priority-inheritance) state
    rt_mutex_init_task(p);
#ifdef CONFIG_PROVE_LOCKING
    DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled);
    DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled);
#endif
    // check whether the user's process count exceeds the RLIMIT_NPROC limit
    retval = -EAGAIN;
    if (atomic_read(&p->real_cred->user->processes) >=
            task_rlimit(p, RLIMIT_NPROC)) {
        if (p->real_cred->user != INIT_USER &&
            !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
            goto bad_fork_free;
    }
    current->flags &= ~PF_NPROC_EXCEEDED;
    retval = copy_creds(p, clone_flags);
    if (retval < 0)
        goto bad_fork_free;
    /*
     * If multiple threads are within copy_process(), then this check
     * triggers too late. This doesn't hurt, the check is only there
     * to stop root fork bombs.
     */
    // check whether the thread count exceeds max_threads, which depends on the amount of memory
    retval = -EAGAIN;
    if (nr_threads >= max_threads)
        goto bad_fork_cleanup_count;
    delayacct_tsk_init(p);  /* Must remain after dup_task_struct() */
    p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER);
    p->flags |= PF_FORKNOEXEC;
    INIT_LIST_HEAD(&p->children);
    INIT_LIST_HEAD(&p->sibling);
    rcu_copy_process(p);
    p->vfork_done = NULL;
    // initialize the spinlock
    spin_lock_init(&p->alloc_lock);
    // initialize the pending signals
    init_sigpending(&p->pending);
    // initialize the CPU timers
    posix_cpu_timers_init(p);
    // ......
    /* Perform scheduler related setup. Assign this task to a CPU.
     * This initializes the scheduling data structures and sets the
     * process state to TASK_RUNNING.
     */
    retval = sched_fork(clone_flags, p);
    if (retval)
        goto bad_fork_cleanup_policy;
    retval = perf_event_init_task(p);
    if (retval)
        goto bad_fork_cleanup_policy;
    retval = audit_alloc(p);
    if (retval)
        goto bad_fork_cleanup_perf;
    /* copy all the process information: file system, signal handlers,
     * signals, memory management, and so on */
    shm_init_task(p);
    retval = copy_semundo(clone_flags, p);
    if (retval)
        goto bad_fork_cleanup_audit;
    retval = copy_files(clone_flags, p);
    if (retval)
        goto bad_fork_cleanup_semundo;
    retval = copy_fs(clone_flags, p);
    if (retval)
        goto bad_fork_cleanup_files;
    retval = copy_sighand(clone_flags, p);
    if (retval)
        goto bad_fork_cleanup_fs;
    retval = copy_signal(clone_flags, p);
    if (retval)
        goto bad_fork_cleanup_sighand;
    retval = copy_mm(clone_flags, p);
    if (retval)
        goto bad_fork_cleanup_signal;
    retval = copy_namespaces(clone_flags, p);
    if (retval)
        goto bad_fork_cleanup_mm;
    retval = copy_io(clone_flags, p);
    if (retval)
        goto bad_fork_cleanup_namespaces;
    /* Initialize the child's kernel stack.
     * Linux 4.2 added the TLS handling; earlier versions called
     *   retval = copy_thread(clone_flags, stack_start, stack_size, p);
     */
    retval = copy_thread_tls(clone_flags, stack_start, stack_size, p, tls);
    if (retval)
        goto bad_fork_cleanup_io;
    /* allocate a new pid for the new process */
    if (pid != &init_struct_pid) {
        pid = alloc_pid(p->nsproxy->pid_ns_for_children);
        if (IS_ERR(pid)) {
            retval = PTR_ERR(pid);
            goto bad_fork_cleanup_io;
        }
    }
    /* set the child's pid */
    /* ok, now we should be set up.. */
    p->pid = pid_nr(pid);
    if (clone_flags & CLONE_THREAD) {
        p->exit_signal = -1;
        p->group_leader = current->group_leader;
        p->tgid = current->tgid;
    } else {
        if (clone_flags & CLONE_PARENT)
            p->exit_signal = current->group_leader->exit_signal;
        else
            p->exit_signal = (clone_flags & CSIGNAL);
        p->group_leader = p;
        p->tgid = p->pid;
    }
    p->nr_dirtied = 0;
    p->nr_dirtied_pause = 128 >> (PAGE_SHIFT - 10);
    p->dirty_paused_when = 0;
    p->pdeath_signal = 0;
    INIT_LIST_HEAD(&p->thread_group);
    p->task_works = NULL;
    /*
     * Make it visible to the rest of the system, but dont wake it up yet.
     * Need tasklist lock for parent etc handling!
     */
    write_lock_irq(&tasklist_lock);
    /* the process that called fork becomes the parent */
    /* CLONE_PARENT re-uses the old parent */
    if (clone_flags & (CLONE_PARENT|CLONE_THREAD)) {
        p->real_parent = current->real_parent;
        p->parent_exec_id = current->parent_exec_id;
    } else {
        p->real_parent = current;
        p->parent_exec_id = current->self_exec_id;
    }
    spin_lock(&current->sighand->siglock);
    // ......
    return p;
}
The dup_task_struct flow
http://lxr.free-electrons.com/source/kernel/fork.c?v=4.5#L334
static struct task_struct *dup_task_struct(struct task_struct *orig)
{
    struct task_struct *tsk;
    struct thread_info *ti;
    int node = tsk_fork_get_node(orig);
    int err;
    // allocate a task_struct node
    tsk = alloc_task_struct_node(node);
    if (!tsk)
        return NULL;
    // allocate a thread_info node, which contains the kernel stack; ti is the bottom of the stack area
    ti = alloc_thread_info_node(tsk, node);
    if (!ti)
        goto free_tsk;
    // store the bottom of the stack area in the new node's stack field
    tsk->stack = ti;
    // ......
    return tsk;
}
alloc_task_struct_node is called to allocate a task_struct node
alloc_thread_info_node is called to allocate a thread_info node; what is actually allocated is a thread_union, and the address of its bottom (lowest address) is returned in ti
union thread_union {
    struct thread_info thread_info;
    unsigned long stack[THREAD_SIZE/sizeof(long)];
};
Finally, ti, the bottom of the stack area, is assigned to the stack field of the new node
Once dup_task_struct has finished, the child's task_struct is identical to the parent's except for the tsk->stack pointer!
The sched_fork flow
int sched_fork(unsigned long clone_flags, struct task_struct *p)
{
    unsigned long flags;
    int cpu = get_cpu();
    __sched_fork(clone_flags, p);
    // set the child's state to TASK_RUNNING
    p->state = TASK_RUNNING;
    // ......
    // assign a CPU to the child
    set_task_cpu(p, cpu);
    put_cpu();
    return 0;
}
As we can see, sched_fork does two important things:
first, it sets the child's state to TASK_RUNNING;
second, it assigns the child a CPU
The copy_thread and copy_thread_tls flow
As we can see, Linux 4.2 added the copy_thread_tls function and the CONFIG_HAVE_COPY_THREAD_TLS macro
If CONFIG_HAVE_COPY_THREAD_TLS is not defined, however, copy_thread is used by default and copy_thread_tls is defined as a wrapper around copy_thread
#ifdef CONFIG_HAVE_COPY_THREAD_TLS
extern int copy_thread_tls(unsigned long, unsigned long, unsigned long,
            struct task_struct *, unsigned long);
#else
extern int copy_thread(unsigned long, unsigned long, unsigned long,
            struct task_struct *);
/* Architectures that haven't opted into copy_thread_tls get the tls argument
 * via pt_regs, so ignore the tls argument passed via C. */
static inline int copy_thread_tls(
        unsigned long clone_flags, unsigned long sp, unsigned long arg,
        struct task_struct *p, unsigned long tls)
{
    return copy_thread(clone_flags, sp, arg, p);
}
#endif
| Kernel | Implementation |
|---|---|
| 4.5 | arch/x86/kernel/process_32.c, line 132 |
| 4.5 | arch/x86/kernel/process_64.c, line 155 |
Let us look at the 32-bit copy_thread_tls. It differs little from the original copy_thread; only the TLS setup at the end is new
int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
    unsigned long arg, struct task_struct *p, unsigned long tls)
{
    struct pt_regs *childregs = task_pt_regs(p);
    struct task_struct *tsk;
    int err;
    /* locate the child's saved register area */
    p->thread.sp = (unsigned long) childregs;
    p->thread.sp0 = (unsigned long) (childregs+1);
    memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps));
    if (unlikely(p->flags & PF_KTHREAD)) {
        /* kernel thread */
        memset(childregs, 0, sizeof(struct pt_regs));
        p->thread.ip = (unsigned long) ret_from_kernel_thread;
        task_user_gs(p) = __KERNEL_STACK_CANARY;
        childregs->ds = __USER_DS;
        childregs->es = __USER_DS;
        childregs->fs = __KERNEL_PERCPU;
        childregs->bx = sp;     /* function */
        childregs->bp = arg;
        childregs->orig_ax = -1;
        childregs->cs = __KERNEL_CS | get_kernel_rpl();
        childregs->flags = X86_EFLAGS_IF | X86_EFLAGS_FIXED;
        p->thread.io_bitmap_ptr = NULL;
        return 0;
    }
    /* copy the current (parent's) register state to the child */
    *childregs = *current_pt_regs();
    /* the child's eax is set to 0, which is why fork returns 0 in the child */
    childregs->ax = 0;
    if (sp)
        childregs->sp = sp;
    /* the child's ip is set to ret_from_fork, so the child starts executing there */
    p->thread.ip = (unsigned long) ret_from_fork;
    task_user_gs(p) = get_user_gs(current_pt_regs());
    p->thread.io_bitmap_ptr = NULL;
    tsk = current;
    err = -ENOMEM;
    if (unlikely(test_tsk_thread_flag(tsk, TIF_IO_BITMAP))) {
        p->thread.io_bitmap_ptr = kmemdup(tsk->thread.io_bitmap_ptr,
                        IO_BITMAP_BYTES, GFP_KERNEL);
        if (!p->thread.io_bitmap_ptr) {
            p->thread.io_bitmap_max = 0;
            return -ENOMEM;
        }
        set_tsk_thread_flag(p, TIF_IO_BITMAP);
    }
    err = 0;
    /*
     * Set a new TLS for the child thread
     */
    if (clone_flags & CLONE_SETTLS)
        err = do_set_thread_area(p, -1,
            (struct user_desc __user *)tls, 0);
    if (err && p->thread.io_bitmap_ptr) {
        kfree(p->thread.io_bitmap_ptr);
        p->thread.io_bitmap_max = 0;
    }
    return err;
}
The copy_thread code answers two important questions!
First, why does fork return 0 in the child? Because the statement childregs->ax = 0; sets the child's eax to 0
Second, p->thread.ip = (unsigned long) ret_from_fork; sets the child's ip to the address of ret_from_fork, so the child starts executing at ret_from_fork
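Summary
The entry points of the fork, vfork, and clone system calls are sys_fork, sys_vfork, and sys_clone. Their definitions are architecture-dependent, but they all end up calling _do_fork (do_fork in kernels before Linux 4.2). _do_fork duplicates the process information through copy_process and then calls wake_up_new_task to hand the child to the scheduler
dup_task_struct allocates a new kernel stack for the child
sched_fork is called and sets the child's state to TASK_RUNNING
copy_thread(_tls) copies the parent's register context to the child, so that parent and child see the same stack contents,
and sets the child's saved instruction pointer (eip) to the address of ret_from_fork
A new pid is allocated and set for the new process
Finally, the child starts executing at ret_from_fork
The path from process creation to execution is shown in the figure below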

Reposted from:
https://blog.csdn.net/gatieme/article/details/51569932
