Driver structure
Vinum can issue multiple disk transfers for a single I/O request:
- As the result of striping or concatenation, the data for a single request may map to more than one drive. In this case, Vinum builds a request structure which issues all necessary I/O requests at one time. This behaviour has had the unexpected effect of highlighting problems with dubious SCSI hardware by imposing heavy activity on the bus.
- As seen above, many RAID-5 operations require a second set of I/O transfers after the initial transfers have completed.
- In case of an I/O failure on a resilient volume, Vinum must reschedule the I/O to a different plex.
The second set of RAID-5 operations and I/O recovery do not match well with the design of UNIX device drivers: typically, the ``top half'' of a UNIX device driver issues I/O commands and returns to the caller. The caller may choose to wait for completion, but one of the most frequent uses of a block device is where the virtual memory subsystem issues writes and does not wait for completion.
UNIX device drivers run in two separate environments. The ``top half'' runs in the process context, while the ``bottom half'' runs in the interrupt context. There are severe restrictions on the functions that the bottom half of the driver can perform.
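The first case in the list above, where one request maps to more than one drive, can be illustrated with a small sketch. This is not Vinum's code: the structure `subrequest`, the function `map_striped`, and the layout (equal-sized drives, stripes assigned round-robin) are assumptions made only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One piece of a request after mapping onto a striped plex (hypothetical). */
struct subrequest {
    unsigned drive;        /* index of the drive that holds this piece */
    uint64_t drive_offset; /* sector offset within that drive */
    uint64_t length;       /* number of sectors in this piece */
};

/*
 * Split a request of `length` sectors starting at plex sector `offset`
 * into per-drive pieces, for a plex striped round-robin across `ndrives`
 * drives with `stripe_size` sectors per stripe.  At most `max` pieces are
 * written to `out`; the return value is the number of pieces needed.
 */
size_t map_striped(uint64_t offset, uint64_t length,
                   uint64_t stripe_size, unsigned ndrives,
                   struct subrequest *out, size_t max)
{
    size_t n = 0;

    assert(stripe_size > 0 && ndrives > 0);
    while (length > 0) {
        uint64_t stripe = offset / stripe_size; /* global stripe number */
        uint64_t within = offset % stripe_size; /* offset inside the stripe */
        uint64_t chunk = stripe_size - within;  /* sectors left in the stripe */

        if (chunk > length)
            chunk = length;
        if (n < max) {
            out[n].drive = (unsigned)(stripe % ndrives);
            out[n].drive_offset = (stripe / ndrives) * stripe_size + within;
            out[n].length = chunk;
        }
        n++;
        offset += chunk;
        length -= chunk;
    }
    return n;
}
```

With a stripe size of 8 sectors on 3 drives, a 12-sector request starting at sector 6 yields three pieces, one per drive. All three would be issued together, which is the kind of simultaneous bus activity the text describes.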
This poses a problem: who issues the second set of requests? The following possibilities, listed in order of increasing desirability, exist:
- The top half can wait for completion of the first set of requests and then launch the second set before returning to the caller. This approach can seriously impact system performance and possibly cause deadlocks.
- In a threaded kernel, the strategy routine can create a thread which waits for completion of the first set of requests and starts the second set without impacting the main thread of the process. At the moment this approach is not possible, since FreeBSD currently does not provide kernel thread support. It also appears likely that it could cause a number of problems in the areas of thread synchronization and performance.
- Ownership of the requests can be ``given'' to another process, which will be awakened when they complete. This process can then issue the second set of requests. This approach is feasible, and it is used by some subsystems, notably NFS. It does not pose the same severe performance penalty of the previous possibility, but it does require that another process be scheduled twice for every I/O.
- The second set of requests can be launched from the ``bottom half'' of the driver. This is potentially dangerous: the interrupt routine must call the start routine. While this is not expressly prohibited, the start routine is normally used by the top half of a driver, and may call functions which are prohibited in the bottom half.
Initially, Vinum used the fourth solution. This worked for most drivers, but some drivers required functions only available in the ``top half'', such as malloc for ISA bounce buffers. Current FreeBSD drivers no longer call these functions, but it is possible that the situation will arise again.
On the other hand, this method does not allow I/O recovery. Vinum now uses a dæmon process for I/O recovery and a couple of other housekeeping activities, such as saving the configuration database. The additional scheduling overhead for these activities is negligible, but it is the reason that the RAID-5 second stage does not use the dæmon.
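The division of labour described above can be sketched as a user-space simulation. The names `request`, `transfer_done`, `launch`, and `daemon_queue` are hypothetical and do not come from the Vinum sources. The sketch assumes that a per-transfer completion routine counts outstanding transfers, starts the second stage itself when the first stage succeeds, and only queues the request for a daemon when recovery is needed.

```c
#include <assert.h>
#include <stddef.h>

enum rq_state { RQ_FIRST_STAGE, RQ_SECOND_STAGE, RQ_RECOVERY, RQ_DONE };

struct request {
    enum rq_state state;
    int active;           /* transfers outstanding in the current stage */
    int second_stage;     /* transfers the second stage needs (0 if none) */
    int error;            /* set when any transfer in the stage failed */
    struct request *next; /* link for the daemon queue */
};

/* Requests waiting for the daemon, which runs in process context. */
static struct request *daemon_queue;

/* Stand-in for handing `count` transfers to the lower-level driver. */
static void launch(struct request *rq, int count, enum rq_state state)
{
    rq->state = state;
    rq->active = count;
}

/*
 * Called once per completed transfer, in what would be interrupt context.
 * It does not sleep or allocate memory: it only updates counters, launches
 * the second stage, or queues the request for the daemon.
 */
void transfer_done(struct request *rq, int failed)
{
    assert(rq->active > 0);
    if (failed)
        rq->error = 1;
    if (--rq->active > 0)
        return;                      /* stage still in progress */
    if (rq->error) {                 /* recovery is left to the daemon */
        rq->state = RQ_RECOVERY;
        rq->next = daemon_queue;
        daemon_queue = rq;
    } else if (rq->state == RQ_FIRST_STAGE && rq->second_stage > 0) {
        launch(rq, rq->second_stage, RQ_SECOND_STAGE);
    } else {
        rq->state = RQ_DONE;
    }
}
```

In this sketch the common RAID-5 path never waits for another process to be scheduled. Only the failure path pays that cost, which matches the trade-off the text describes.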