Virtualization of CPU and memory

Why virtualize CPU and memory#

Virtualizing memory makes application programming simpler. The operating system provides each program with the illusion of a large, private address space, without the program having to consider how the actual small and crowded physical memory is allocated. It also protects each program's memory from being modified by other, possibly malicious, programs.

Virtualizing the CPU allows the operating system to use scheduling algorithms to ensure that the CPU is not exclusively occupied by a single program during multi-program execution.

The three goals of virtual memory are transparency, efficiency, and protection.

Zombie State#

A process can be in a final state where it has exited but has not yet been cleaned up (known as the zombie state in UNIX-based systems). This final state is useful because it allows other processes (usually the parent process that created the process) to check the return code of the process and see if the completed process executed successfully (typically, in UNIX-based systems, a process returns zero when it successfully completes a task, otherwise it returns a non-zero value).

Shell Invocation Process, Fork, Exec#

The shell is also a user program. It first displays a prompt and then waits for user input. You can enter a command (the name of an executable program and any required parameters) to the shell. In most cases, the shell can find this executable program in the file system, create a new process by calling fork(), and execute the executable program by calling one of the variants of exec(). It then waits for the command to complete by calling wait(). After the child process finishes execution, the shell returns from wait() and outputs another prompt, waiting for the user to enter the next command.

The separation of fork() and exec() allows the shell to easily implement many useful features. For example:

wc p3.c > newfile.txt

In the example above, the output of wc is redirected to the file newfile.txt (indicated by the greater-than sign before newfile.txt). The shell implements output redirection by having the child close standard output and open the file newfile.txt before calling exec(); because open() returns the lowest-numbered free descriptor, the file takes over descriptor 1, so the output of wc goes to the file instead of the screen.

Key Underlying Mechanisms of CPU Virtualization - Restricted Direct Execution#

System Call Process, Trap Instruction, Trap Table#

To execute a system call, a program must execute a special trap instruction. This instruction jumps into the kernel and elevates the privilege level to kernel mode. Once inside the kernel, the system can perform any necessary privileged operations (if allowed) to perform the work required by the calling process. After completion, the operating system calls a special return-from-trap instruction, which returns to the user program that initiated the call and lowers the privilege level back to user mode.

There is an important detail that has not been discussed: how does the trap know which code to run inside the OS?

Obviously, the calling process cannot specify the address to jump to (just like when making a procedure call), as this would allow the program to jump to any location in the kernel, which is a bad idea. In fact, this ability would likely allow a clever programmer to make the kernel execute any sequence of code. Therefore, the kernel must carefully control the code executed on a trap.

The kernel achieves this by setting up a trap table at startup. When the machine boots, it executes in privileged (kernel) mode, allowing it to configure the machine hardware as needed. The first thing the operating system does is tell the hardware where to run the trap handlers when certain exceptional events occur.

For example, when a disk interrupt occurs, a keyboard interrupt occurs, or a program makes a system call, where should the code run? The operating system typically notifies the hardware of the location of these trap handlers using some special instruction. Once the hardware is notified, it remembers the locations of these handlers until the machine is rebooted, and the hardware knows what to do (i.e., where to jump) when a system call or other exceptional event occurs.

Finally, one more thing: the ability to execute an instruction to tell the hardware where the trap table is located is a very powerful feature. Therefore, as you might guess, it is also a privileged operation. If you try to execute this instruction in user mode, the hardware will not allow it.

Regaining Control in Cooperative Scheduling Systems#

In cooperative scheduling systems, the operating system regains control of the CPU by waiting for a system call or some kind of illegal operation to occur.

Regaining Control in Preemptive Scheduling Systems#

In preemptive scheduling systems, the operating system regains control of the CPU by waiting for a clock interrupt, system call, or some kind of illegal operation to occur. The scheduling algorithm of the operating system determines which program to run next.

Virtual Memory#

Hardware-Based Dynamic Relocation#

Each CPU requires two registers: a base register and a limit (bounds) register. Programs are written and compiled as if their address space starts at 0. When the program actually runs, the operating system decides where to load it in physical memory and records that starting address in the base register. While the process runs, the hardware translates every memory reference it generates as physical address = virtual address + base, with the limit register used to catch out-of-bounds references. This method leads to internal fragmentation: the space allocated to the process between its stack and heap goes unused.

Segmentation, Generalized Base/Limit#

A typical address space has three different logical segments: code, stack, and heap. These three parts are divided into three segments and placed in physical memory. Each segment has a set of base/limit registers, which are used by the hardware to translate virtual addresses.

For example, with a 14-bit virtual address, the top 2 bits identify the segment to which the address belongs, and the remaining 12 bits represent the offset within the segment. The physical address is then the offset plus the base of the corresponding segment, and the limit register is used to check that the offset stays within the segment's bounds. In addition to the base/limit pairs, the hardware also needs to know the direction of segment growth (e.g., the stack may grow in the opposite direction).

Segmentation leads to the problem of quickly filling physical memory with small holes of free space, making it difficult to allocate to new segments or expand existing ones. This problem is known as external fragmentation.

Paging#

Paging involves dividing a process's address space into fixed-size units called pages. Correspondingly, physical memory is viewed as an array of fixed-size slots called page frames, each of which can hold one virtual page.

To record the location of each virtual page in physical memory, the operating system typically maintains a data structure called a page table for each process.

The main purpose of the page table is to save address translations for each virtual page of the address space, so that we know the location of each page in physical memory. The page table is a per-process data structure.

To perform the translation, we first need to split the virtual address generated by the process into two components: the virtual page number (VPN) and the offset within the page. Assuming the virtual address space of the process is 64 bytes, we need a total of 6 bits for the virtual address (2^6 = 64). If the page size is set to 16 bytes, the offset within the page needs 4 bits (2^4 = 16), and the remaining 2 bits are allocated to the VPN (2^2 can represent 4 pages, 4 * 16 = 64).

When performing address translation, the operating system queries the page table to translate the VPN into a physical frame number (PFN), while keeping the offset unchanged.

The page table is a data structure used to map virtual addresses to physical addresses, so any data structure can be used. The simplest form is called a linear page table, which is an array. The operating system uses the VPN to index this array and looks up the page table entry (PTE) at that index to find the desired physical frame number (PFN).

Paging Problems#

Linear page tables occupy a large amount of memory because they store entries for all virtual pages in the page table, regardless of whether the VPN is mapped to a physical address or not.

In addition, paging requires an additional memory reference for every memory reference (whether it is fetching instructions or explicit loads or stores) to first obtain the address translation from the page table. The additional memory reference overhead is significant and can slow down a process by a factor of two or more.

Therefore, there are two practical problems that must be addressed. Without careful design of hardware and software, page tables can slow down the system and consume too much memory.

Fast Address Translation - TLB#

TLB (Translation Look-aside Buffer) is a hardware-based cache that can solve the problem of slow address translation. When hardware translates a virtual address, it first checks if it is in the TLB. If it is not, it then accesses the page table in memory.

The virtual-to-physical address mappings stored in the TLB are only valid for the current process and are meaningless for other processes. Therefore, when a process context switch occurs, the hardware or operating system (or both) must ensure that the incoming process does not mistakenly use the address mappings of the previous process. There are two common solutions:

  1. Simply flush the TLB on every context switch.
  2. Add an address space identifier (ASID) to each TLB entry, identifying which process it belongs to.

Multi-Level Page Tables#

Linear page tables occupy a large amount of memory. One simple way to shrink the page table is to use larger pages: fewer, larger pages mean fewer PTEs (more address bits go to the offset and fewer to the VPN). But larger pages also waste more space within each page (internal fragmentation). Another approach is to use a hybrid of segmentation and paging.

We mentioned that the page table is a data structure, and we can use a hash table or tree to store the valid page table entries (PTEs). Of course, using a binary tree has a high cost, as it requires multiple memory accesses during lookup.

Modern operating systems often use multi-level page tables. The idea is to chop the linear page table into page-sized units; if an entire unit contains no valid PTEs, that unit is simply never allocated. A new structure called the page directory records, for each unit, whether it is allocated and, if so, where it lives. This is a time-space trade-off: with a two-level table, a TLB miss now costs two memory accesses before the data access itself, one to fetch the page-directory entry and one to fetch the PTE it points to. That is more expensive than a lookup in a simple linear table, but fortunately we have the TLB.
