## 36 I/O Devices⌗

### 36.1 System Architecture⌗

Devices are connected to the system via:

• memory bus (CPU - main memory)
• general I/O bus: PCI (graphics, high performance devices)
• peripheral bus: SCSI, SATA, USB (slow devices)

In modern systems:

• PCIe Graphics
• memory interconnect
• DMI (Direct Media Interface)
• PCIe
• eSATA

### 36.2 A Canonical Device⌗

A device has two important components:

• interface
• status
• command
• data
• internal structure

### 36.3 The Canonical Protocol⌗

Programmed I/O (PIO): the main CPU is involved with the data movement

### 36.4 Lowering CPU Overhead With Interrupts⌗

Utilize CPU:

• interrupt
• a two-phased approach: polls first, using interrupts if not yet finished
• coalescing

### 36.5 More Efficient Data Movement With DMA⌗

Direct Memory Access (DMA): transfer data between devices and main memory without much CPU intervention

### 36.6 Methods Of Device Interaction⌗

• explicit I/O instructions (on x86, the in and out instructions), usually privileged
• memory-mapped I/O (no new instructions are needed)

### 36.7 Fitting Into The OS: The Device Driver⌗

• file system
• raw interface (support low-level storage management)

### 36.8 Case Study: A Simple IDE Disk Driver⌗

Registers are available by reading or writing to specific “I/O addresses” using (on x86) the in and out I/O instructions.

• wait for drive to be ready
• write parameters to command registers
• start the I/O: issuing read/write to command register
• data transfer (for writes): write data to data port
• handle interrupts

## 37 Hard Disk Drives⌗

### 37.1 The Interface⌗

The drive consists of a large number of sectors (512-byte blocks), each of which can be read or written

View the disk as an array of sectors (扇区)

• multi-sector operations are possible
• only guarantee a single 512-byte write is atomic

Efficiency:

• access closer blocks are faster than farther ones
• sequential read or write are faster than the random access manner

### 37.2 Basic Geometry⌗

• platter: 磁盘
• surface: the 2 sides of a platter
• spindle: 轴
• track: 道, concentric circles of sectors
• disk arm

### 37.3 A Simple Disk Drive⌗

I/O time:

• rotational delay
• seek time
• acceleration
• coasting
• deceleration
• settling (significant)
• transfer

Track skew: sequential reads can be properly serviced when crossing track boundaries

Multi-zoned disk drives:

• a zone is a consecutive set of tracks on a surface
• each zone has the same number of sectors per track
• outer zones have more sectors

Cache (track buffer):

• read: to read in all of the sectors on that track and cache them in its memory
• write
• write back: acknowledge when puts into memory
• write through: acknowledge after actual writes

### 37.4 I/O Time: Doing The Math⌗

The rate of I/O:

$$T_\mathrm{I/O} = T_\mathrm{seek}+T_\mathrm{rotation}+T_\mathrm{transfer}$$

$$R_\mathrm{I/O}=\cfrac{Size_\mathrm{transfer}}{T_\mathrm{I/O}}$$

• a huge gap in drive performance between random and sequential workloads
• a large difference in performance between high-end “performance” drives and low-end “capacity” drives

### 37.5 Disk Scheduling⌗

Principle: shortest job first, SJF

Shortest-seek-time-first (SSTF, also shortest-seek-first or SSF): picking requests on nearest track to complete first

problems:

• the drive geometry is not available to the OS (fix: nearest-block-first, NBF)

#### Elevator algorithm⌗

SCAN: simply moves back and forth across the disks

F-SCAN: freezes the queue when doing a sweep (avoid starvation of far-away requests)

C-SCAN (Circular SCAN): only sweeps from outer-to-inner (more fair to the inner and outer tracks)

Prevent fights from taking place on elevators!

SCAN and SSTF does not adhere to SJF closly: they ignore rotation

#### SPTF: Shortest positioning time first⌗

Also called shortest access time first, SATF

In engineering, it turns out “it depends” is almost always the answer, reflecting that trade-offs are part of the life of the engineer.

On modern drives both seek and rotation are roughly equivalent

SPTF is usually performed inside a drive

#### Other scheduling issues⌗

Where to perform scheduling: The OS scheduler usually picks (what it thinks) some best few requests (say 16) and issues them all to disk; the disk then uses its internal knowledge to service said requests in the best possible (SPTF) order.

I/O merging: merge I/O requests by the OS

Non-work-conserving: waiting for some time for a possible better request

## 38 Redundant Arrays of Inexpensive Disks (RAIDs)⌗

RAID: a technique to use multiple disks in concert to build a faster, bigger, and more reliable disk system

• externally looks like a disk
• internally consisting multiple disks, memory and processor(s)

### 38.1 Interface And RAID Internals⌗

To perform a logical I/O:

• calculate the actual disk(s)
• perform one or more physical I/Os

### 38.2 Fault Model⌗

RAIDs are designed to detect and recover from certain kinds of disk faults.

The fail-stop fault model: a not-working disk is assumed permanently lost

Characteristics:

• capacity
• reliability
• performance

### 38.4 RAID Level 0: Striping⌗

RAID level 0, or striping as it is better known, serves as an excellent upper-bound on performance and capacity.

Striping: no redundancy; spread the blocks of the array across the disks in a round-robin fashion to get the most parrallelism.

Simple striping:

Disk 0Disk 1Disk 2Disk 3
0123
4567
891011
12131415

Stripe: blocks in the same row (0, 1, 2, 3)

Striping with a bigger chunk size (8KB, 2 blocks, a stripe consists of 32KB of data):

Disk 0Disk 1Disk 2Disk 3
0246
1357
8101214
9111315

#### Chunk sizes⌗

• small chunk size: on large files, positioning more on a single disk
• large chunk size: less parallelism, but also less posisioning time

Thus, determining the “best” chunk size is hard to do, as it requires a great deal of knowledge about the workload presented to the disk system.

#### RAID-0 Analysis⌗

• capacity: perfect
• reliability: any disk failure will lead to data loss
• performance: all disks are utilized, often in parallel

#### Evaluating RAID Performances⌗

Two performance metrics:

• single-request latency
• steady-state throughput (critical in high-perfromance environments)

Two types of workloads (considered here):

• sequential: large contiguous chunks
• searching through a large file for a keyword
• random: samll requests, to different random locations on disk
• transactional workloads on a database management system

Denote the sequential and random workload’s throughput as:

• $S$ for sequential
• $R$ for random

### 38.5 RAID Level 1: Mirroring⌗

RAID-10 (or RAID 1+0, stripe of mirrors):

Disk 0Disk 1Disk 2Disk 3
0011
2233
4455
6677

RAID-01

Disk 0Disk 1Disk 2Disk 3
0101
2323
4545
6767
• write: update both

#### Analysis⌗

Consistent-update problem

untimely failures on write: inconsistent copies

solution: write-ahead log and recovery procedure

• capacity: expensive, $N\cdot B/2$ for mirroring level = 2
• reliability: tolerate failure on any disk, up to $N/2$ disks can fail
• single request performance
• read: sames as on a single disk ($R$, writes to two blocks)
• write: slightly higher than writing to a single disk (worst-case seek and rotational delay)
• write: ($\frac{N}{2}\cdot S$, writes to $N$ blocks)
• read: same as write ($\frac{N}{2}\cdot S$, each disk receives a request for every other block)
• read: $N\cdot R$
• write: $\frac{N}{2}\cdot R$

### 38.6 RAID Level 4: Saving Space With Parity⌗

Parity-based approaches attempt to use less capacity and thus overcome the huge space penalty paid by mirrored systems. They do so at a cost, however: performance.

Disk 0Disk 1Disk 2Disk 3Disk 4
0123P0
4567P1
891011P2
12131415P3

Simple function: XOR, which makes a row having even number of 1

#### Analysis⌗

• capacity: $(N-1)\cdot B$
• reliability: tolerates 1 disk failure
• performance
• steady-state: $(N-1)\cdot S$ (full-stripe write: calculate parity bits first then write all disks in parallel)
• random
• read: $(N-1)\cdot R$
• write: $R/2$ (the parity disk is the bottleneck)
• additive parity: read in parallel, calculate XOR, and write in parallel (larger RAIDs require a high number of reads)
• subtractive parity: $P_\mathrm{new}=(C_\mathrm{old}\oplus C_\mathrm{new})\oplus P_\mathrm{old}$ (Why not always better?– because it has to read the parity disk?)

Small-write problem: the parity disk is a bottleneck under small writes which prevents parallelism

### 38.7 RAID Level 5: Rotating Parity⌗

Disk 0Disk 1Disk 2Disk 3Disk 4
0123P0
567P14
1011P289
15P3121314
P416171819

#### Analysis⌗

random write: $\frac{N}{4}\cdot R$ (four I/O operations) (aren’t the reads/writes parallel?–the data disk itself serves as parity disk as well)

## 39 Interlude: Files and Directories⌗

Thus, the OS must take extra care with such a device: this is where users keep data that they really care about.

### 39.1 Files And Directories⌗

• file: linear array of bytes
• low-level name (inode number)
• directory
• low-level name (inode number)
• a list of (user-readable name, low-level name) pairs

### 39.3 Creating Files⌗

Create a new file: call open() and pass O_CREAT flag:

int fd = open("foo", O_CREAT|O_WRONLY|O_TRUNC, S_IRUSR|S_IWUSR);

• O_CREAT: create the file if it does not exist
• O_WRONLY: read only
• O_TRUNC: if exists already, truncates it to a size of zero bytes

### 39.5 Reading And Writing, But Not Sequentially⌗

Reading some random offsets within the document:

off_t lseek(int fildes, off_t offset, int whence);

• fildes: file descriptor
• offset: the file offset to a particular location
• whence
• SEEK_SET: the offset is set to offset bytes
• SEEK_CUR: the offset is set to its current location plus offset bytes
• SEEK_END: the offset is set to the size of the file plus offset bytes

Calling lseek() does not perform a disk seek

The “current offset” tracked by the OS:

• explicitly updates with lseek

The file structure:

struct file {
int ref;
char writable;
struct inode * ip;
uint off;
};


The open file table:

struct {
struct spinlock lock;
struct file file[NFILE];
} ftable;


### 39.6 Shared File Table Entries: fork() And dup()⌗

The open file table entries are shared by 1. the parent and child processes created by fork(); 2. one file descriptor and its duplicates created by dup(), with the reference count incremented.

### 39.7 Writing Immediately With fsync()⌗

Calls fsync() to force all dirty (not yet written) data to disk.

int rc = write(fd, buffer, size); assert(rc == size);


In some cases, it is also needed to fsync() the containing directory to ensure that the file is part of the directory.

Not surprisingly, this type of detail is often overlooked, leading to many application-level bugs.

mmap() and persistent memory

By combining the persistence of files with the access semantics of memory, file-backed memory mappings support a software abstraction called persistent memory.

### 39.8 Renaming Files⌗

rename(char *old, char *new)


rename() call is usually implemented as an atomic call.

An example to use this kind of guarantee:

int fd = open("foo.txt.tmp", O_WRONLY|O_CREAT|O_TRUNC, S_IRUSR|S_IWUSR);
write(fd, buffer, size);
fsync(fd);
close(fd);
rename("foo.txt.tmp", "foo.txt");


### 39.9 Getting Information About Files⌗

Use stat() or fstat() to access metadata of files or directories.

### 39.10 Removing Files⌗

Use unlink() to remove files.

### 39.11 Making Directories⌗

Use mkdir() to create a directory.

An empty directory has two entries:

• refer to itself
• refer to its parent

Use opendir(), readdir() and closedir() to read contents of a directory.

DIR * dp = opendir(".");
assert(dp != NULL);
struct dirent * d;
while ((d = readdir(dp)) != NULL) {
printf("%lu %s\n", (unsigned long) d->d_ino, d->d_name);
}
closedir(dp);


### 39.13 Deleting Directories⌗

Call rmdir() to delete a directory.

Call link() to make an entry in the file system tree, simply creates another name in the directory and refers it to the same inode number.

ln file file2


When creating a file, two things happened:

1. making a structure (the inode) that tracks relevent information

When the file system unlinks a file, it checks the reference count (link count) within the inode number, and only free the node and related data blocks when it reaches zero.

Call symlink() to make an symbolic link (soft link) to a file.

ln -s file file2


• symbolic links is actually a file itself of a different type
• possibility for dangling reference

### 39.16 Permission Bits And Access Control Lists⌗

Permission bits consist of three groupings:

• owner
• group
• other (anyone)

The owner of the file can change the permissions.

But now you have learned an even more important lesson: there can be no secrets in a good marriage, even within the file system.

The Time Of Check To Time Of Use (TOCTTOU) problem

The problem is caused by the gap between the check and the update.

### 39.17 Making And Mounting A File System⌗

And mkfs said, let there be a file system!

Mount: to take an existing directory as a target mount point and paste a new file system onto the directory tree at that point.

Mount unfies all file systems into one tree, making naming uniform and convenient.

## 40 File System Implementation⌗

vsfs (the Very Simple File System)

The file system is pure software; unlike our development of CPU and memory virtualization, we will not be adding hardware features to make some aspect of the file system work better (though we will want to pay attention to device characteristics to make sure the file system works well).

### 40.2 Overall Organization⌗

The disk is divided into:

• data blocks: user data
• inodes
• allocation structures: to track whether inodes or data blocks are free or allocated
• superblock: contains information about the fie system

### 40.3 File Organization: The Inode⌗

Inode (index node): the data structure that holds the metadata of a given file

i-number (low-level name): the number that the inode is implicitly referred to; disk position can be calculated by the number

Calculation:

$$block = (inumber\times sizeof(inode)) / blockSize$$

$$sector = ((block\times blockSize) + inodeStartAddr) / sectorSize$$

• type: directory or file (kept in the directory entry and thus is not found in the inode)
• size
• number of blocks
• protection information
• time
• data block position on disk

#### The Multi-Level Index⌗

Indirect pointer: points to a block that contains pointers which each points to user data

An inode may have some fixed number of direct pointers and a single indirect pointer.

If a file grows large enough, an indirect block is allocated (from the data block) and the inode’s indirect pointer is set to point to it.

Extent-based approaches: a disk pointer plus a length; allow for more than one extent

Linked-based approaches: use a linked list to keep track all blocks of a file

Double indirect pointer: refer to a block that contains indirect pointers

Most files are small: a small number of direct pointers is sufficient

### 40.4 Directory Organization⌗

Deleting a file can leave an empty space in the middle of the directory; a new entry may reuse an old, bigger, deleted entry.

File systems often streat directories as a special type of file, and thus they have inodes in the inode table.

### 40.5 Free Space Management⌗

To track free inode/data blocks:

• use bitmap
• use free list
• use more soplisticated ds, e.g. B-tree

Pre-allocation policy: reserve a sequence of blocks for a newly-created file, to guarantee that a portion of the file is contiguous on the disk.

### 40.6 Access Paths: Reading and Writing⌗

#### Reading A File From Disk⌗

The timeline of reading a file /foo/bar:

open("/foo/bar", O_RDONLY)

• read the root inode (/) (whose i-number is well known; 2 in most UNIX FSs)
• read / data (to look for an entry for foo)
• read foo inode
• read foo data (to look for an entry for bar)
• read bar inode

read(...)

• read bar inode
• read specific block of bar’s data
• write bar inode for last access time

Yes, life can get pretty bad when reading a file; as you’re about to find out, writing out a file (and especially, creating a new one) is even worse.

Reads don’t access allocation structures (e.g. bitmaps); they are only accessed when allocation is needed. There is no need to make sure a block is allocated when the inode already points to it

#### Writing A File To Disk⌗

Each write to a file logically generates 5 I/Os:

• write data bitmap (to mark the newly-allocated block as used)
• write file data
• write file inode

For file creation:

• read inode and inode data traversly to find the containing directory
• read inode bitmap (find a free inode)
• write inode bitmap (to mark the inode allocated)
• read/write inode of the new file (initialize)
• additional I/O might be needed if the directory needs to grow

### 40.7 Caching and Buffering⌗

Most file systems aggressively use system memory (DRAM) to cache important blocks.

Early file systems use a fixed-size cache to hold popular blocks; this static partitioning method can be wasteful.

Modern systems use a dynamic partitioning approach.

Static vs. dynamic partitioning

• static: predictable performance
• dynamic: better utilization; more complex to implement

Buffering: delay writes to reduce I/O

In this case, laziness (in writing blocks to disk) is a virtue.

## 41 Locality and The Fast File System⌗

### 41.1 The Problem: Poor Performance⌗

Main issue of the old UNIX file system: data was spread all over the place, thus had expensive positioning costs

Some problems:

• the inode and the data block are often far away
• the file system is getting more and more fragmented
• the block size is too small, so that transferring data is inefficient

### 41.3 Organizing Structure: The Cylinder Group⌗

Cylinder (柱面): a set of tracks on different surfaces that are the same distance from the center

FFS divides the disk into a number of cylinder groups.

As modern drives only export a logical address space of blocks, modern file systems instead organize the drive into block groups, each of which is a consecutive portion of the disk’s address space. By placing two files within the same group, FFS can ensure that accessing one after the other will not result in long seek time.

FFS needs to have the ability to place files and directories into a groups and track relevant information, thus all the structures that a file system is expected to have is included in each group.

The components of a single cylinder group:

• a copy of the super block
• a per-group inode bitmap and data bitmap
• inode region
• data block region

### 41.4 Policies: How To Allocate Files and Directories⌗

The basic mantra is simple: keep related stuff together (and its corollary, keep unrelated stuff far apart).

A few placement heuristics:

• directories: find the group
• with a low number of allocated directories (to balance directories across groups)
• with a high number of free inodes (to be able to allocate a bunch of files)
• files:
• allocate the data blocks in the same group as its inode
• places all files that are in the same directory in the cylinder group of the directory they are in

### 41.6 The Large-File Exception⌗

Filling an entire block group with large files will prevents related files from being placed within the same group.

FFS’ policy for large files: after some number of blocks are allocated into the first block group, it places the next large chunk of the file in another block group.

To address the performance problem: choosing chunk size carefully. By amotization, if the chunk size is large enough, the seek time will be relatively little.

FFS’s approach

• the first 12 blocks are placed in the same group as the inode
• each subsquent indirect block and the blocks it points to are placed in a different group

Note that the trend in disk drives is that transfer rate improves fairly rapidly, as disk manufacturers are good at cramming more bits into the same surface, but the mechanical aspects of drives related to seeks (disk arm speed and the rate of rotation) improve rather slowly.

### 41.7 A Few Other Things About FFS⌗

Internal fragmentation: 4KB blocks are used by many 2KB files

Sub-blocks: 512-byte blocks can be allocated to files; copy to a 4KB block when the size grows beyond 4KB

To address the performance issue caused by copies: buffer writes, to avoid this in most cases.

A refined disk layout (parameterization): skipping a block when issuing a sequential read (e.g. 0 6 1 7 2 8 3 9 4 10 5 11); to issue the performance problem: using track buffer.

FFS also supports:

• long file names