Question
How does the kernel perceive CPUs and schedule work onto them? In other words, how does the kernel assign a process to a particular CPU for execution?
LDE (Limited Direct Execution Protocol)
Direct Execution: run user code directly on the CPU; for example, every instruction of main executes on the hardware as-is.
User programs may only execute "unprivileged instructions" (arithmetic, ordinary memory reads and writes, function calls, and so on).
Talking about LDE means talking about user mode and kernel mode. So why is LDE needed at all?
Problem 1: how does Direct Execution enforce privilege control?
Answer: through the switching mechanism between user mode and kernel mode.
The point is to give the operating system control over what a process may do. Imagine a process that could read and write any file or any hardware device without permission checks: that would be frightening.
Implementation: at boot, the OS initializes a trap table that maps "events" to "kernel handler addresses". For example, when a process needs to perform I/O, it enters kernel mode via a trap; the kernel performs the permission checks and then carries out the I/O. This is a bit like AOP: the kernel can run arbitrary logic before and after a system call, and that is the foundation of per-process permission checking.
Events
- Trap: triggered deliberately by the process (e.g. a system call, the x86 `syscall` instruction);
- Interrupt: triggered by an external device (e.g. keyboard input, disk I/O completion, the timer interrupt);
- Exception: triggered by a process error (e.g. division by zero, illegal memory access, executing a privileged instruction).
User mode
Code that runs in user mode is restricted in what it can do. For example, when running in user mode, a process can't issue I/O requests; doing so would result in the processor raising an exception; the OS would then likely kill the process.
Kernel mode
In contrast to user mode is kernel mode, which the operating system (or kernel) runs in. In this mode, code that runs can do what it likes, including privileged operations such as issuing I/O requests and executing all types of restricted instructions.
Problem 2: how is switching between processes implemented?
If a process is currently running on the CPU, the kernel, by definition, is not. So how can the kernel step in and switch processes? It almost sounds like a philosophical question 😄.
Answer: the timer interrupt periodically hands the CPU back to the kernel, which can then decide whether to keep running the current process or switch to another one.
RTFM
| man section | Contents | Typical examples |
|---|---|---|
| 1 | User commands (executables) | man 1 ls (usage of the ls command), man 1 grep |
| 2 | System calls (kernel service interfaces) | man 2 open (parameters and return value of the open() system call), man 2 fork |
| 3 | Library functions (e.g. the C standard library) | man 3 printf (the C printf() function), man 3 strlen |
| 4 | Special files (device files) | man 4 tty (the terminal device file /dev/tty), man 4 null (the /dev/null device) |
| 5 | File formats and configuration files | man 5 passwd (format of the /etc/passwd file), man 5 fstab |
| 6 | Games (rarely used) | man 6 fortune (the fortune toy command) |
| 7 | Miscellaneous (conventions, protocols, macros) | man 7 utf-8 (the UTF-8 encoding), man 7 regex (regular expression syntax) |
| 8 | System administration commands (usually for root) | man 8 sudo (the sudo command), man 8 fdisk (the disk partitioning tool) |
| 9 | Kernel routines (kernel functions, modules; non-standard, present on some systems) | man 9 printk (the kernel print function), man 9 module_init (kernel module initialization) |
C library functions (thin wrappers around the section-2 system calls)
| Function | Purpose | Notes |
|---|---|---|
| fork | Create a new process by duplicating the caller | Returns 0 in the child and the child's PID in the parent |
| exec | Replace the current process image with a new program | A family of variants: execl, execv, execlp, execvp, ... |
| open | Open a file and return a file descriptor | |
| close | Close a file descriptor | |
| dup | Duplicate a file descriptor onto the lowest free number | |
| dup2 | Duplicate a file descriptor onto a chosen number | The standard tool for I/O redirection |
| pipe | Create a one-way channel with a read end and a write end | |
How long does one CPU context switch take?
Conclusion: about 1.4 microseconds on an Apple M4 Mac.
Total time: 296200 microseconds
Iterations: 100000
Average context switch cost: 1.4810 microseconds
I wrote a benchmark to measure it.
Logic: two processes are pinned to the same CPU (via sched_setaffinity, Linux only); two pipes are created; inside a large loop, process 1 writes pipe1 and reads pipe2 while process 2 reads pipe1 and writes pipe2; gettimeofday measures the total elapsed time.
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>
#include <sys/time.h>
#include <sched.h>
#ifdef __APPLE__
#include <mach/mach.h>
#include <mach/thread_policy.h>
#endif
#define ITERATIONS 100000
// Function to bind the current process/thread to a specific CPU core
void bind_to_cpu(int cpu_id) {
#ifdef __linux__
cpu_set_t mask;
CPU_ZERO(&mask);
CPU_SET(cpu_id, &mask);
if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
perror("sched_setaffinity");
exit(EXIT_FAILURE);
}
#elif defined(__APPLE__)
// macOS doesn't have sched_setaffinity, use Mach thread policy
// This is a "hint" to the scheduler, not a strict guarantee like Linux
thread_affinity_policy_data_t policy = { cpu_id };
thread_port_t thread = mach_thread_self();
kern_return_t ret = thread_policy_set(thread, THREAD_AFFINITY_POLICY,
(thread_policy_t)&policy,
THREAD_AFFINITY_POLICY_COUNT);
if (ret != KERN_SUCCESS) {
fprintf(stderr, "Warning: thread_policy_set failed: %d. Continuing without explicit affinity.\n", ret);
// Do not exit, just continue
}
#else
// Fallback for other OSs
fprintf(stderr, "Warning: CPU pinning not supported on this platform.\n");
#endif
}
int main() {
int pipe1[2], pipe2[2];
pid_t pid;
char buf = 'x';
struct timeval start, end;
// Create two pipes
if (pipe(pipe1) == -1 || pipe(pipe2) == -1) {
perror("pipe");
exit(EXIT_FAILURE);
}
// Fork the process
pid = fork();
if (pid == -1) {
perror("fork");
exit(EXIT_FAILURE);
}
// Bind both processes to the same CPU (e.g., CPU 0) to avoid cross-CPU migration costs
// and ensure we are measuring context switch overhead on a single core.
bind_to_cpu(0);
if (pid == 0) { // Child process
// Close unused ends
close(pipe1[1]); // Child reads from pipe1
close(pipe2[0]); // Child writes to pipe2
for (int i = 0; i < ITERATIONS; i++) {
// Read from pipe1 (blocks until parent writes)
if (read(pipe1[0], &buf, 1) != 1) {
perror("child read");
exit(EXIT_FAILURE);
}
// Write to pipe2 (wakes up parent)
if (write(pipe2[1], &buf, 1) != 1) {
perror("child write");
exit(EXIT_FAILURE);
}
}
close(pipe1[0]);
close(pipe2[1]);
exit(EXIT_SUCCESS);
} else { // Parent process
// Close unused ends
close(pipe1[0]); // Parent writes to pipe1
close(pipe2[1]); // Parent reads from pipe2
gettimeofday(&start, NULL);
for (int i = 0; i < ITERATIONS; i++) {
// Write to pipe1 (wakes up child)
if (write(pipe1[1], &buf, 1) != 1) {
perror("parent write");
exit(EXIT_FAILURE);
}
// Read from pipe2 (blocks until child writes)
if (read(pipe2[0], &buf, 1) != 1) {
perror("parent read");
exit(EXIT_FAILURE);
}
}
gettimeofday(&end, NULL);
// Wait for child
wait(NULL);
close(pipe1[1]);
close(pipe2[0]);
long long start_usec = start.tv_sec * 1000000LL + start.tv_usec;
long long end_usec = end.tv_sec * 1000000LL + end.tv_usec;
long long diff_usec = end_usec - start_usec;
// Each iteration involves 2 context switches:
// 1. Parent -> Child (when parent waits for read)
// 2. Child -> Parent (when child waits for read or finishes)
double context_switch_cost = (double)diff_usec / (ITERATIONS * 2);
printf("Total time: %lld microseconds\n", diff_usec);
printf("Iterations: %d\n", ITERATIONS);
printf("Average context switch cost: %.4f microseconds\n", context_switch_cost);
}
return 0;
}
Process scheduling algorithms
Metrics of interest (as averages):
- turnaround time: job completion time minus job arrival time
- response time: time of the job's first run minus job arrival time
scheduling quantum: the scheduling time slice, also called the time slice
| Algorithm | Main strengths (turnaround time / response time) | Main weaknesses (turnaround time / response time) |
|---|---|---|
| SJF (non-preemptive) | Excellent average turnaround time for short jobs | Long jobs can starve (turnaround time grows without bound); uneven response time (fast for short jobs, slow for long ones) |
| STCF (preemptive SJF) | Shortest remaining time runs first; optimal average turnaround time | Long jobs can starve; extra context-switch overhead; response time is excellent for short jobs but terrible for long ones |
| RR (round robin) | Fair response time (all jobs respond evenly); no starvation | Longer average turnaround time (depends on the scheduling quantum); high context-switch overhead (worse the shorter the quantum) |
| MLFQ (multi-level feedback queue) | Serves both short jobs (high priority) and long jobs (low priority); fairly even response time; no starvation (larger scheduling quantum at low priority, plus periodic boosts) | Complex to tune (number of levels, quantum per level); under a dense stream of short jobs, long jobs' turnaround time can still suffer |
FIFO
FIFO: first come, first served (run jobs in arrival order).
Shortest Job First (SJF)
SJF: shortest job first (run the shortest job to completion first).
Shortest Time-to-Completion First (STCF)
STCF: shortest time-to-completion first (whenever a new job arrives, run whichever job has the least remaining time).
Round Robin (RR)
RR: round robin (rotate through the jobs, one scheduling quantum each).
Multi-Level Feedback Queue (MLFQ)
MLFQ: the multi-level feedback queue, the most widely used process scheduling approach today.
Purpose: it combines the strengths of the algorithms above, balancing turnaround time against response time, and keeps scheduling fair so that no job starves.
Rules
- Rule 1: If Priority(A) > Priority(B), A runs (B doesn't).
- Rule 2: If Priority(A) = Priority(B), A & B run in RR.
- Rule 3: When a job enters the system, it is placed at the highest priority (the topmost queue).
- Rule 4: Once a job uses up its time allotment at a given level (regardless of how many times it has given up the CPU), its priority is reduced (i.e., it moves down one queue).
- Rule 5: After some time period S, move all the jobs in the system to the topmost queue.