vinum(4)               FreeBSD Kernel Interfaces Manual               vinum(4)

NAME
     vinum - Logical Volume Manager

SYNOPSIS
     kldload vinum
     kldload Vinum

DESCRIPTION
     vinum is a logical volume manager inspired by, but not derived from, the
     VERITAS Volume Manager. It provides the following features:

     o   It provides device-independent logical disks, called volumes.
         Volumes are not restricted to the size of any disk on the system.

     o   The volumes consist of one or more plexes, each of which maps the
         entire address space of a volume. This represents an implementation
         of RAID-1 (mirroring). Multiple plexes can also be used for:

         o   Increased read throughput. vinum reads data from the least
             active disk, so if a volume has plexes on multiple disks, more
             data can be read in parallel. vinum reads data from only one
             plex, but it writes data to all plexes.

         o   Increased reliability. By storing plexes on different disks,
             data will remain available even if one of the plexes becomes
             unavailable. In comparison with a RAID-4 or RAID-5 plex (see
             below), using multiple plexes requires more storage space, but
             gives better performance, particularly in the case of a drive
             failure.

         o   Additional plexes can be used for on-line data reorganization.
             By attaching an additional plex and subsequently detaching one
             of the older plexes, data can be moved on-line without
             compromising access.

         o   An additional plex can be used to obtain a consistent dump of a
             file system. By attaching an additional plex and detaching at a
             specific time, the detached plex becomes an accurate snapshot
             of the file system at the time of detachment.

     o   Each plex consists of one or more logical disk slices, called
         subdisks. Subdisks are defined as a contiguous block of physical
         disk storage.
         A plex may consist of any reasonable number of subdisks (in other
         words, the real limit is not the number, but other factors, such as
         memory and performance, associated with maintaining a large number
         of subdisks).

     o   A number of mappings between subdisks and plexes are available:

         o   Concatenated plexes consist of one or more subdisks, each of
             which is mapped to a contiguous part of the plex address space.

         o   Striped plexes consist of two or more subdisks of equal size.
             The file address space is mapped in stripes, integral fractions
             of the subdisk size. Consecutive plex address space is mapped
             to stripes in each subdisk in turn. The subdisks of a striped
             plex must all be the same size.

         o   RAID-5 plexes require at least three equal-sized subdisks. They
             resemble striped plexes, except that in each stripe, one
             subdisk stores parity information. This subdisk changes in each
             stripe: in the first stripe, it is the first subdisk, in the
             second it is the second subdisk, etc. In the event of a single
             disk failure, vinum recovers the data based on the information
             stored on the remaining subdisks. This mapping is particularly
             suited to read-intensive access. The subdisks of a RAID-5 plex
             must all be the same size.

         o   RAID-4 plexes are almost identical to RAID-5 plexes. The only
             difference is the manner in which the parity data is stored.
             RAID-4 has no advantage over RAID-5 and should be ignored.

     o   Drives are the lowest level of the storage hierarchy. They
         represent disk special devices.

     o   vinum offers automatic startup. Unlike UNIX file systems, vinum
         volumes contain all the configuration information needed to ensure
         that they are started correctly when the subsystem is enabled. This
         is also a significant advantage over the VERITAS(tm) File System.
         This does not mean that the volumes will be mounted automatically,
         since the standard startup procedures with /etc/fstab perform this
         function.

KERNEL CONFIGURATION
     vinum is supplied as a kernel loadable module (kld), and does not
     require configuration. As with other klds, it is absolutely necessary to
     match the kld to the version of the operating system. Failure to do so
     will cause vinum to issue an error message and terminate.

     It is possible to configure vinum in the kernel, but this is not
     recommended, since some functionality is lost. To do so, add this line
     to the kernel configuration file:

           pseudo-device   vinum

DEBUG OPTIONS
     The current version of vinum, both the kernel module and the user
     program vinum(8), includes significant debugging support. It is not
     recommended to remove this support at the moment.

     vinum previously required matching debug support between the kernel
     module and the userland program. This is no longer required.

     vinum was previously available in two versions: a freely available
     version which did not contain RAID-5 functionality, and a full version
     including RAID-5 functionality, which was available only from Cybernet
     Systems Inc. The present version of vinum includes RAID-5 and RAID-4
     functionality.

RUNNING VINUM
     vinum is part of the base FreeBSD system. It does not require
     installation. To start it, start the vinum(8) program, which loads the
     kld if it is not already present. Before using vinum, it must be
     configured. See vinum(8) for information on how to create a vinum
     configuration.

     Normally, you start a configured version of vinum at boot time. To do
     so, put the following lines in /boot/loader.conf:

           vinum_load="YES"
           vinum.autostart="YES"

     It is also possible to start vinum by putting the following line into
     /etc/rc.conf:

           start_vinum="YES"

     This method is deprecated.

     If vinum is loaded as a kld (the recommended way), the vinum stop
     command will unload it. You can also do this with the kldunload command.

     The kld can only be unloaded when idle, in other words when no volumes
     are mounted and no other instances of the vinum program are active.
     Unloading the kld does not harm the data in the volumes.

CONFIGURING AND STARTING OBJECTS
     Use the vinum(8) utility to configure and start vinum objects.

IOCTL CALLS
     ioctl calls are intended for the use of the vinum configuration program
     only. They are described in the header file /sys/dev/vinum/vinumio.h.

DISK LABELS
     Conventional disk special devices have a disk label in the second sector
     of the device. See disklabel(5) for more details. This disk label
     describes the layout of the partitions within the device. vinum does not
     subdivide volumes, so volumes do not contain a physical disk label. In
     the past, vinum faked a disk label, but this is no longer the case.

     For convenience, vinum implements the ioctl calls DIOCGDINFO (get disk
     label), DIOCGPART (get partition information), DIOCWDINFO (write
     partition information) and DIOCSDINFO (set partition information).
     DIOCGDINFO and DIOCGPART refer to an internal representation of the disk
     label which is not present on the volume. As a result, the -r option of
     disklabel(8), which reads the "raw disk", will fail.

     In general, disklabel(8) serves no useful purpose on a vinum volume. If
     you run it, it will show you three partitions, a, b and c, all the same
     except for the fstype, for example:

           3 partitions:
           #        size   offset    fstype   [fsize bsize bps/cpg]
             a:     2048        0    4.2BSD     1024  8192     0   # (Cyl.    0 - 0)
             b:     2048        0      swap                        # (Cyl.    0 - 0)
             c:     2048        0    unused        0     0         # (Cyl.    0 - 0)

     vinum ignores the DIOCWDINFO and DIOCSDINFO ioctls, since there is
     nothing to change.
     As a result, any attempt to modify the disk label will be silently
     ignored.

MAKING FILE SYSTEMS
     Since vinum volumes do not contain partitions, the names do not need to
     conform to the standard rules for naming disk partitions. For a physical
     disk partition, the last letter of the device name specifies the
     partition identifier (a to h). vinum volumes need not conform to this
     convention, but if they do not, newfs will complain that it cannot
     determine the partition. To solve this problem, use the -v flag to
     newfs. For example, if you have a volume concat, use the following
     command to create a ufs file system on it:

           # newfs -v /dev/vinum/concat

OBJECT NAMING
     Names may contain any non-blank character, but it is recommended to
     restrict them to letters, digits and the underscore character. The names
     of volumes, plexes and subdisks may be up to 64 characters long, and the
     names of drives may be up to 32 characters long. When choosing volume
     and plex names, bear in mind that automatically generated plex and
     subdisk names are longer than the name from which they are derived.

     o   vinum creates device nodes for volumes in the directory /dev/vinum.
         It also creates the subdirectories /dev/vinum/plex,
         /dev/vinum/rplex, /dev/vinum/sd and /dev/vinum/rsd, in which it
         stores device entries for the plexes and subdisks.

     o   In addition, vinum creates two super-devices, /dev/vinum/control and
         /dev/vinum/controld. /dev/vinum/control is used by vinum(8), and
         /dev/vinum/controld is used by the vinum daemon.

     o   Unlike UNIX drives, vinum volumes are not subdivided into
         partitions, and thus do not contain a disk label.
         Unfortunately, this confuses a number of utilities, notably newfs,
         which normally tries to interpret the last letter of a vinum volume
         name as a partition identifier. If you use a volume name which does
         not end in the letters a to c, you must use the -v flag to newfs in
         order to tell it to ignore this convention.

     o   It is not necessary to assign explicit names to plexes. By default,
         a plex name is the name of the volume followed by the letters .p and
         the number of the plex. For example, the plexes of volume vol3 are
         called vol3.p0, vol3.p1 and so on. These names can be overridden,
         but it is not recommended.

     o   Like plexes, subdisks are assigned names automatically, and explicit
         naming is discouraged. A subdisk name is the name of the plex
         followed by the letters .s and a number identifying the subdisk. For
         example, the subdisks of plex vol3.p0 are called vol3.p0.s0,
         vol3.p0.s1 and so on.

     o   By contrast, drives must be named. This makes it possible to move a
         drive to a different location and still recognize it automatically.
         Drive names should not be related to device names, since this could
         be extremely confusing if they are moved elsewhere. Drive names may
         be up to 32 characters long.

OBJECT STATES
     Each vinum object has a state associated with it. vinum uses this state
     to determine the handling of the object.

VOLUME STATES
     Volumes may have the following states:

     down          The volume is completely inaccessible.

     up            The volume is up and at least partially functional. Not
                   all plexes may be available.

PLEX STATES
     Plexes may have the following states:

     referenced    A plex entry which has been referenced as part of a
                   volume, but which is currently not known. vinum knows the
                   name, but nothing else.

     faulty        A plex which has gone completely down because of I/O
                   errors.

     down          A plex which has been taken down by the administrator.

     initializing  A plex whose subdisks are being initialized.

     The remaining states represent plexes which are at least partially up.

     corrupt       A plex entry which is at least partially up. Not all
                   subdisks are available, and an inconsistency has
                   occurred. If no other plex is uncorrupted, the volume is
                   no longer consistent.

     degraded      A RAID-5 or RAID-4 plex entry which is accessible, but one
                   subdisk is down, requiring recovery for many I/O requests.

     flaky         A plex which is really up, but which has a reborn subdisk
                   which we don't completely trust, and which we don't want
                   to read if we can avoid it.

     up            A plex entry which is completely up. All subdisks are up.

SUBDISK STATES
     Subdisks can have the following states:

     uninit        The subdisk has not been configured. This is a transient
                   state during object creation and should never be visible
                   to the user.

     referenced    A subdisk entry which has been referenced as part of a
                   plex, but which is currently not known.

     empty         A subdisk entry which has been created completely. All
                   fields are correct, and the disk has been updated, but the
                   data on the disk is not valid.

     initializing  A subdisk entry which has been created completely and
                   which is currently being initialized.

     initialized   A subdisk entry which has been initialized, but which
                   can't be brought to an ``up'' state because it would cause
                   inconsistencies, for example, after replacing a subdisk in
                   a degraded RAID-5 plex.

     The following states represent invalid data.

     obsolete      A subdisk entry which has been created completely. All
                   fields are correct, the config on disk has been updated,
                   and the data was valid, but since then the drive has been
                   taken down, and as a result updates have been missed.

     stale         A subdisk entry which has been created completely. All
                   fields are correct, the disk has been updated, and the
                   data was valid, but since then the drive has been crashed
                   and updates have been lost.

     The following states represent valid, inaccessible data.

     crashed       A subdisk entry which has been created completely. All
                   fields are correct, the disk has been updated, and the
                   data was valid, but since then the drive has gone down. No
                   attempt has been made to write to the subdisk since the
                   crash, so the data is valid.

     down          A subdisk entry which was up, which contained valid data,
                   and which was taken down by the administrator. The data is
                   valid.

     reviving      The subdisk is currently in the process of being revived.
                   We can write but not read.

     The following states represent accessible subdisks with valid data.

     reborn        A subdisk entry which has been created completely. All
                   fields are correct, the disk has been updated, and the
                   data was valid, but since then the drive has gone down and
                   up again. No updates were lost, but it is possible that
                   the subdisk has been damaged. We won't read from this
                   subdisk if we have a choice. If this is the only subdisk
                   which covers this address space in the plex, we set its
                   state to up under these circumstances, so this status
                   implies that there is another subdisk to fulfil the
                   request.

     up            A subdisk entry which has been created completely. All
                   fields are correct, the disk has been updated, and the
                   data is valid.

DRIVE STATES
     Drives can have the following states:

     referenced    At least one subdisk refers to the drive, but it is not
                   currently accessible to the system. No device name is
                   known.

     down          The drive is not accessible.

     up            The drive is up and running.

BUGS
     1.   vinum is a complicated product. Bugs can be expected. The
          configuration mechanism is not yet fully functional. If you have
          difficulties, please look at the section DEBUGGING PROBLEMS WITH
          VINUM before reporting problems.

     2.   Detection of differences between the version of the kernel and the
          kld is not yet implemented.

DEBUGGING PROBLEMS WITH VINUM
     Solving problems with vinum can be a difficult affair. This section
     suggests some approaches.

   Configuration problems
     It is relatively easy (too easy) to run into problems with the vinum
     configuration. If you do, the first thing you should do is stop
     configuration updates:

           # vinum setdaemon 4

     This will stop updates and any further corruption of the on-disk
     configuration.

     Next, look at the on-disk configuration with the vinum dumpconfig
     command, for example:

           # vinum dumpconfig
           Drive 4:        Device /dev/da3h
                           Created on crash.lemis.com at Sat May 20 16:32:44 2000
                           Config last updated Sat May 20 16:32:56 2000
                           Size:        601052160 bytes (573 MB)
           volume obj state up
           volume src state up
           volume raid state down
           volume r state down
           volume foo state up
           plex name obj.p0 state corrupt org concat vol obj
           plex name obj.p1 state corrupt org striped 128b vol obj
           plex name src.p0 state corrupt org striped 128b vol src
           plex name src.p1 state up org concat vol src
           plex name raid.p0 state faulty org disorg vol raid
           plex name r.p0 state faulty org disorg vol r
           plex name foo.p0 state up org concat vol foo
           plex name foo.p1 state faulty org concat vol foo
           sd name obj.p0.s0 drive drive2 plex obj.p0 state reborn len 409600b driveoffset 265b plexoffset 0b
           sd name obj.p0.s1 drive drive4 plex obj.p0 state up len 409600b driveoffset 265b plexoffset 409600b
           sd name obj.p1.s0 drive drive1 plex obj.p1 state up len 204800b driveoffset 265b plexoffset 0b
           sd name obj.p1.s1 drive drive2 plex obj.p1 state reborn len 204800b driveoffset 409865b plexoffset 128b
           sd name obj.p1.s2 drive drive3 plex obj.p1 state up len 204800b driveoffset 265b plexoffset 256b
           sd name obj.p1.s3 drive drive4 plex obj.p1 state up len 204800b driveoffset 409865b plexoffset 384b

     The configuration on all disks should be the same. If this is not the
     case, please save the output to a file and report the problem.
     There is probably little that can be done to recover the on-disk
     configuration, but if you keep a copy of the files used to create the
     objects, you should be able to re-create them. The create command does
     not change the subdisk data, so this will not cause data corruption. You
     may need to use the resetconfig command if you have this kind of
     trouble.

   Kernel Panics
     In order to analyse a panic which you suspect comes from vinum you will
     need to build a debug kernel. See the online handbook at
     /usr/share/doc/handbook/kerneldebug.html (if installed) or
     http://www.FreeBSD.org/handbook/kerneldebug.html for more details of how
     to do this.

     Perform the following steps to analyse a vinum problem:

     1.   Copy the files /usr/src/sys/modules/vinum/.gdbinit.crash,
          /usr/src/sys/modules/vinum/.gdbinit.kernel,
          /usr/src/sys/modules/vinum/.gdbinit.serial,
          /usr/src/sys/modules/vinum/.gdbinit.vinum and
          /usr/src/sys/modules/vinum/.gdbinit.vinum.paths to the directory in
          which you will be performing the analysis, typically /var/crash.

     2.   Make sure that you build the vinum module with debugging
          information. The standard Makefile builds a module with debugging
          symbols by default. If the version of vinum in /modules does not
          contain symbols, you will not get an error message, but the stack
          trace will not show the symbols. Check the module before starting
          gdb:

                $ file /modules/vinum.ko
                /modules/vinum.ko: ELF 32-bit LSB shared object, Intel 80386,
                  version 1 (FreeBSD), not stripped

          If the output shows that /modules/vinum.ko is stripped, you will
          have to find a version which is not.
          Usually this will be either in
          /usr/obj/sys/modules/vinum/vinum.ko (if you have built vinum with a
          make world) or /usr/src/sys/modules/vinum/vinum.ko (if you have
          built vinum in this directory). Modify the file
          .gdbinit.vinum.paths accordingly.

     3.   Either take a dump or use remote serial gdb to analyse the problem.
          To analyse a dump, say /var/crash/vmcore.5, link
          /var/crash/.gdbinit.crash to /var/crash/.gdbinit and enter:

                # cd /var/crash
                # gdb -k kernel.debug vmcore.5

          This example assumes that you have installed the correct debug
          kernel at /var/crash/kernel.debug. If not, substitute the correct
          name of the debug kernel.

          To perform remote serial debugging, link /var/crash/.gdbinit.serial
          to /var/crash/.gdbinit and enter:

                # cd /var/crash
                # gdb -k kernel.debug

          In this case, the .gdbinit file performs the functions necessary to
          establish connection. The remote machine must already be in debug
          mode: enter the kernel debugger and select gdb. The serial .gdbinit
          file expects the serial connection to run at 38400 bits per second;
          if you run at a different speed, edit the file accordingly (look
          for the remotebaud specification).

          The following example shows a remote debugging session using the
          debug command of vinum(8):

          GDB 4.16 (i386-unknown-freebsd), Copyright 1996 Free Software Foundation, Inc.
          Debugger (msg=0xf1093174 "vinum debug") at ../../i386/i386/db_interface.c:318
          318             in_Debugger = 0;
          #1  0xf108d9bc in vinumioctl (dev=0x40001900, cmd=0xc008464b, data=0xf6dedee0 "",
              flag=0x3, p=0xf68b7940) at /usr/src/sys/modules/Vinum/../../dev/Vinum/vinumioctl.c:102
          102             Debugger ("vinum debug");
          (kgdb) bt
          #0  Debugger (msg=0xf0f661ac "vinum debug") at ../../i386/i386/db_interface.c:318
          #1  0xf0f60a7c in vinumioctl (dev=0x40001900, cmd=0xc008464b, data=0xf6923ed0 "",
              flag=0x3, p=0xf688e6c0) at /usr/src/sys/modules/vinum/../../dev/vinum/vinumioctl.c:109
          #2  0xf01833b7 in spec_ioctl (ap=0xf6923e0c) at ../../miscfs/specfs/spec_vnops.c:424
          #3  0xf0182cc9 in spec_vnoperate (ap=0xf6923e0c) at ../../miscfs/specfs/spec_vnops.c:129
          #4  0xf01eb3c1 in ufs_vnoperatespec (ap=0xf6923e0c) at ../../ufs/ufs/ufs_vnops.c:2312
          #5  0xf017dbb1 in vn_ioctl (fp=0xf1007ec0, com=0xc008464b, data=0xf6923ed0 "",
              p=0xf688e6c0) at vnode_if.h:395
          #6  0xf015dce0 in ioctl (p=0xf688e6c0, uap=0xf6923f84) at ../../kern/sys_generic.c:473
          #7  0xf0214c0b in syscall (frame={tf_es = 0x27, tf_ds = 0x27, tf_edi = 0xefbfcff8,
                tf_esi = 0x1, tf_ebp = 0xefbfcf90, tf_isp = 0xf6923fd4, tf_ebx = 0x2,
                tf_edx = 0x804b614, tf_ecx = 0x8085d10, tf_eax = 0x36, tf_trapno = 0x7,
                tf_err = 0x2, tf_eip = 0x8060a34, tf_cs = 0x1f, tf_eflags = 0x286,
                tf_esp = 0xefbfcf78, tf_ss = 0x27}) at ../../i386/i386/trap.c:1100
          #8  0xf020a1fc in Xint0x80_syscall ()
          #9  0x804832d in ?? ()
          #10 0x80482ad in ?? ()
          #11 0x80480e9 in ?? ()

          When entering from the debugger, it's important that the source of
          frame 1 (listed by the .gdbinit file at the top of the example)
          contains the text

                Debugger ("vinum debug");

          This is an indication that the address specifications are correct.
          If you get some other output, your symbols and the kernel module
          are out of sync, and the trace will be meaningless.

     For an initial investigation, the most important information is the
     output of the bt (backtrace) command above.
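
     If the debugging session has been saved to a file, the frame 1 check
     described above can be roughly automated. The following is only a
     minimal sketch: the function name check_vinum_trace and the idea of a
     saved transcript file are assumptions made for illustration and are not
     part of vinum. It looks for the literal source line anywhere in the
     file, so it does not prove that the line belongs to frame 1.

```sh
#!/bin/sh
# Hypothetical helper, not part of vinum: succeed (exit status 0) if the
# saved gdb transcript named by $1 contains the source line that frame 1
# is expected to show when symbols and kernel module are in sync.
check_vinum_trace() {
    # -F matches the text literally, -q suppresses output so that only
    # the exit status reports the result.
    grep -Fq 'Debugger ("vinum debug");' "$1"
}
```

     A possible use, assuming the transcript was saved as gdb-session.txt,
     is: check_vinum_trace gdb-session.txt || echo "symbols may be out of
     sync". A failing check is only a hint to re-examine the module paths in
     .gdbinit.vinum.paths.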

   Reporting problems with Vinum
     If you find any bugs in vinum, please report them to Greg Lehey. Supply
     the following information:

     o   The output of the vinum list command.

     o   Any messages printed in /var/log/messages. All such messages will be
         identified by the text vinum at the beginning.

     o   If you have a panic, a stack trace as described above.

AUTHOR
     Greg Lehey.

HISTORY
     vinum first appeared in FreeBSD 3.0. The RAID-5 component of vinum was
     developed by Cybernet Inc. www.cybernet.com for its NetMAX product.

SEE ALSO
     vinum(8), disklabel(5), disklabel(8), newfs(8)

FreeBSD 5.0                     5 October 1999                     FreeBSD 5.0