Learnings from Stanford CS 140
CS 140 Pintos Project review (2024 version)
This post is for computer science students tackling Stanford CS 140 or similar OS projects, or for any professional engineer curious about the practical challenges of kernel development. In this deep dive, you will find my key learnings on implementing core functionalities of Unix-like systems.
Over the past several months, I have been intermittently working through the CS 140 Pintos project and have finally reached the end. This is one of the most challenging projects I have ever undertaken, and in total, I estimate it took me a couple of hundred hours to finish. However, the gain was well worth the effort. Only after this deep dive into the core code of a Unix-like kernel did I begin to appreciate the wisdom behind OS design. Here, I want to share my hard-earned insights from the course.
There’s no magic in the computer world; everything is abstracted upon the interaction of CPU registers, one-dimensional hardware memory, and I/O (like disk sectors). In my view, the sophisticated arrangement of memory (including the stack and heap) is at the core of what enables every operation. After this course, I can finally understand (or at least glimpse) how an OS and computer programs run at the byte level.
Out of respect for the course’s integrity, I will not publish my code online. But I can share what I have done at a high level. I implemented everything, including the extra credits/challenges, and passed all the tests on Bochs and QEMU. (As a side note, I recommend all students test on both simulators, since the failed tests typically vary a lot in my experience. Passing all tests on one simulator doesn’t guarantee the solution is foolproof.) Whenever there were design tradeoffs to make, I aligned with modern Linux’s behavior, which typically makes the implementation more complicated. Passing all the tests, though challenging enough, was not my ultimate goal. Instead, I was always looking for the optimal solution that delivered the best performance (like maximum concurrency and minimal memory usage), which made it even harder to get right. For example, I was able to design the system so that no thread holds a global lock during an I/O operation. This required deep thought into lock granularity and synchronization. The progress (especially debugging) was never easy, but fortunately, it all worked out in the end. Next, I will briefly discuss my learnings from each project.
Pre-step
At the very beginning, the 100+ pages of Pintos documentation can be daunting. Looking back, I realize it’s crucial to read it thoroughly, without missing any details. No word in this document is redundant; it is a dense and informative read, especially in combination with the code.
Without any time pressure, I didn’t hop onto the Project 1 implementation immediately but started by reading through the code from the BIOS loading sequence. This was not an easy journey, and it’s not required for project completion, but it gave me invaluable insight into how an OS runs at the ground level. In my opinion, the assembly code is the most difficult part of the codebase to understand, including:
- Boot sector finding and loading
 - Setting up the page directory in real mode
 - Thread switching
 - Interrupt registration
 - Interrupt entrance, handling, and exit
 
Also, the linker script (kernel.lds.S) is worth a look, as it specifies the OS layout in memory.
At that time, I was mind-blown by the complex offset calculations of different registers. But concepts like context switching and interrupts are at the core of the OS, and the implementation of stack push and pop operations is truly ingenious. For example, I am still amazed by how the switch_threads function call in one thread can end up landing in another.
After chunking through the assembly code, the reading journey became much easier. A noteworthy part is the malloc (or, technically, kmalloc) implementation. It’s quite concise and efficient and gave me good insight into how memory fragmentation can be elegantly addressed.
Once I gained a good understanding of how threads and interrupts are implemented in the codebase, I proceeded to Project 1.
Environment Setup (Click to expand)
I don’t recall all the bugs I encountered when running Pintos on Bochs, but it took me some time to get everything working. Here are a few important things I remember:
- Don’t download the 2020 project code. It’s buggy and doesn’t work on Ubuntu 24; use the 2024 version.
 - Stick to Bochs 2.2.6 and apply the patch. Other versions didn’t work for me.
 - Pay close attention to any binary paths in the codebase; they need to be fixed to point to local binaries.
 
Project 1: Threads
Project 1 is already challenging enough; to some extent, it is harder than the later projects. In particular, priority donation is quite tricky and can go awry in extreme cases. Passing all the tests doesn’t mean that the scheduling decision is foolproof. I remember passing all the tests in Project 1 at first, but only in Project 4 did I happen to derive an elusive edge case where the scheduling was wrong. This overturned my entire original implementation because it couldn’t support such a scenario. The details are listed below.
Priority Donation: A Complex Multi-Lock Scenario
This scenario breaks down a complex case of priority donation to highlight critical behaviors for a robust scheduler:
- Correct Priority Recalculation: A thread’s priority must be correctly recalculated when it holds multiple locks and releases only one.
 - Preemption and Lock Interception: A newly awakened thread is not guaranteed to run immediately.
 - Chained Donations: A waiting thread can receive a priority donation, which must then be propagated up the chain to the lock holder.
 
Thread & Priority Setup
Let’s assume the following initial base priorities for our threads:
- p8 (80) > p7 (70) > p5 (50) > p4 (40) > p3 (30) > p2 (20) > p1 (10) > p6 (5)
 
Event Timeline
| Time | Event | System State | Key Takeaway | 
|---|---|---|---|
| T0 | Initial Condition: Low-priority thread t6 (p6) is created. It runs and acquires l4. | t6 (p6, Running, holds l4) | The system starts. As t6 is the only thread, it runs. | 
| T1 | Thread t1 (p1) is created. Since p1 > p6, it preempts t6. t1 runs and acquires l1. | t1 (p1, Running, holds l1); t6 (p6, Ready, holds l4) | A standard preemption based on base priority. | 
| T2 | Thread t2 (p2) is created. Since p2 > p1, it preempts t1. t2 runs and acquires l5. | t2 (p2, Running, holds l5); t1 (p1, Ready, holds l1); t6 (p6, Ready, holds l4) | Another standard preemption. The highest-priority ready thread always runs. | 
| T3 | t2 (still running) attempts to acquire l1, which is held by t1. t2 blocks. Then, t3 (p3) is created and also attempts to acquire l1, blocking as well. | t1 (p3, Running, holds l1); t2 (p2, Waiting on l1, holds l5); t3 (p3, Waiting on l1); t6 (p6, Ready, holds l4) | With t2 blocked, t1 becomes the highest-priority runnable thread. It inherits the highest priority of its waiters (t3), and its priority becomes p3. | 
| T4 | t1 acquires a second lock, l2. | t1 (p3, Running, holds l1, l2) | A thread with an elevated priority can acquire more locks. | 
| T5 | t4 (p4) and t5 (p5) attempt to acquire l2 and block. | t1 (p5, Running, holds l1, l2); t4 (p4, Waiting on l2); t5 (p5, Waiting on l2) | t1’s priority is now donated by t5, the highest-priority waiter across all its held locks. Its effective priority becomes p5. | 
| T6 | t1 releases lock l1. | t1 (p5, Running, holds l2); t3 (p3, Ready) | Priority Recalculation: t1’s priority remains p5 due to t5 waiting on l2. t1 continues running. Lock l1 is now free, but t2 is still on its waitlist. t3 is moved to the ready list. (t3 is only ‘Ready’ in the sense that it could run if scheduled). | 
| T7 | Chained Donation: A new thread t8 (p8) attempts to acquire l5 (held by t2). t8 blocks. | t2 (p8, Waiting on l1, holds l5); t8 (p8, Waiting on l5) | t8 donates its p8 priority to t2. t2 is now an extremely high-priority waiter for l1. This donation does not propagate further yet, as l1 is currently unowned. | 
| T8 | t7 (p7) attempts to acquire l4 (held by t6). | t6 (p7, Running, holds l4); t1 (p5, Ready, holds l2); t7 (p7, Waiting on l4) | t7 donates p7 to t6. The scheduler sees p7 > p5 and preempts t1 to run t6. | 
| T9 | t6, now running, acquires the free lock l1. | t6 (p8, Running, holds l4, l1); t3 (p3, Waiting on l1); t2 (p8, Waiting on l1) | Immediate Propagation: The moment t6 acquires l1, it inherits priority from its highest waiter, t2 (p8). t6’s effective priority is now max(p7 from t7, p8 from t2) = p8. | 
| T10 | t6 releases its original lock l4. | t6 (p8, Running, holds l1); t7 (p7, Ready) | t7 becomes Ready. t6 recalculates its priority: the p7 donation is gone, but the p8 donation from t2 on l1 remains. t6’s priority is still p8. Since p8 > p7, t6 continues to run. | 
Implementation Implications
The priority scheduler must correctly handle these points:
- Priority is a Function of All Held Locks: A thread’s effective priority is the maximum of its base priority and the priorities of all threads waiting on any lock it holds.
 - Recalculate on Release: When a thread releases a lock, its priority must be recalculated based on the waiters for any locks it still holds. If no locks are held, it reverts to its base priority.
 - Chained Donation & Propagation: A waiting thread (
t2) can itself receive a donation from another new waiter (t8). This new, higher priority must be propagated to the lock holder (t6) as soon as that lock is acquired. This check must happen upon lock acquisition. - Wakeup vs. Run: Moving a thread from a wait queue to the Ready list does not guarantee it runs immediately. This creates a window where another thread can acquire the released lock.
 - Transitional Lock State: Be aware of the time interim after a lock is released but before a waiter acquires it. During this window, the lock is unowned. The OS must handle another, higher-priority thread acquiring the lock in this window.
 
There could be other extreme cases I haven’t thought of. My heuristic was to add as many assertions as possible about thread priority related to thread state and scheduling decisions to capture them.
Project 2: User Programs
Project 2 is about user programs. Starting with this project, I began to tangle with all sorts of intricate concurrency issues. Yet the outcome was pretty fruitful. In this project, I was able to learn:
- How an ELF executable binary is parsed, loaded, and executed
 - How to set up the page directory and stacks for a new process
 - How command-line arguments are copied onto the stack (specifically, copying the 
argvargument requires a***C pointer, which was rare for me) - How common syscalls like 
exec()andwait()are implemented under the hood - …
 
A challenging aspect was the process_wait() implementation. It typically involves a shared semaphore for each parent-child process pair, which must be freed by the last process to exit. This requires careful coordination. My design includes a shared struct between each parent and child process that contains a semaphore, two boolean fields to indicate whether the child or parent has exited, and a lock to guarantee atomic updates to these two fields. This approach ensures the shared struct is safely freed when the last process exits.
There’s not much to say about the coding; after all, it’s all about painful debugging. Honing GDB debugging skills is beneficial. Personally, I feel that gdbgui is easier to use than the command line. There are a few caveats to note based on my experience:
- Always use pointers, especially for lists. The 
=assignment is a copy operation in C. If one is not careful, it could result in a copy at a different address, clobbering thelist_elemoffset calculation. Moreover, it saves memory since it doesn’t copy the whole struct onto the stack. The Pintos documentation provides a code block to detect write issues, as follows:
1 2 3 4 5 6
static bool put_user (uint8_t *udst, uint8_t byte) { int error_code; asm ("movl $1f, %0; movb %b2, %1; 1:" : "=&a" (error_code), "=m" (*udst) : "q" (byte)); return error_code != -1; }
Use it cautiously to ensure it doesn’t tamper with unintended addresses. An overwrite from this code block is hard to catch and can produce seemingly random errors. In fact, this caused me random file data corruption and took me long to debug.
Project 3: Virtual Memory
To me, Project 3 would have been simple enough without sharing read-only pages. However, sharing read-only pages changes everything. It was a significant challenge to think through the complex sequence of operations around page eviction, loading, freeing, sharing, faulting, refilling, writing, and pinning. Besides that, I chose granular locking over a global lock as much as possible to achieve maximum concurrency, which required a strict and thoughtful ordering of granular locking and memory-freeing operations. My final design involves a page_status enum to represent all possible states of a page. This helped me better organize and handle the transitions between states, much like a finite state machine.
This project also introduces a few trade-offs, like when a read-only page should be freed, which prompted me to investigate more about modern Linux’s design decisions and the rationale behind them.
Overall, this project is really informative and gave me a thorough view of virtual memory. I never anticipated so much synchronization consideration around it.
Project 4: File Systems
Finally, Project 4! I don’t need to repeat the cliché about synchronization; it’s always a pain point. Luckily, I had gone through what I consider the hardest part in Project 3. Through this project, I got the chance to implement an EXT-like file system from scratch, which is a standard Linux file system type. After completing it, I am confident anyone would fully understand how the file system and inodes work. Essentially, it’s just an indexing system for disk sectors. Implementing the indexing of doubly-indirect blocks is a bit tricky, especially in the face of limited memory/disk space, where one needs to return how many bytes were successfully written.
There is significant room for improvement beyond the project requirements. I want to note a few possibilities:
- Do not acquire a global lock when doing I/O to achieve maximum concurrency.
 - In the inode struct, there are many opportunities to save space, such as using bits instead of bytes, although the operations become slightly more complex.
 - The logic for copying the current directory path from a parent process to a child process can be made completely recursion-free by using a 
whileloop. - Each process can maintain its open directories using a tree-like structure to avoid storing duplicate path segments.
 - Think about when background threads start and exit. Ideally, they should exit gracefully before the main thread calls 
shutdown. 
Conclusion
Here concludes the long, arduous, but rewarding journey of CS 140. I really appreciate the creators of this Pintos project; it’s well-written and documented, including a comprehensive and sophisticated suite of test cases. Through this course, I hope more computer science students can appreciate the sophistication of operating system design.