For comments, suggestions, language and spelling issues, please contact the author or drbd-user@lists.linbit.com .
Your mail/file/db server crashed? The whole department is waiting for you to restore the service? The deadline approaches incredibly fast, the only spare system is the one you actually meant to give away to the local primary school last week, and you need to find the courage to tell the team that all you can do is to have last month's backup restored asap, i.e. by tomorrow?
Be prepared. DRBD can save your day, and hair.
This article explains how to provide data redundancy from the ground up in a straightforward way, and how to test the DRBD setup to convince yourself that it really works. The experienced DRBD user will probably find some useful information, too.
Almost every service depends on data. To use the service, the data must be available. Whichever service you want to make highly available, you need to make the data it depends on highly available first. The most natural way to do this, and probably all of you do it on a regular basis, is: you back up your data. In case you lose the active data, you just restore it from the most recent backup, and it's available again.
If the host your service runs on is (temporarily) unusable, you need to replace it with another host which is configured to provide the identical service, and restore the data there. To reduce downtime, you can have the second machine ready for takeover all the time, waiting for the primary machine to die. Whenever you change the data on one machine, you back it up on the other.
You can have the secondary switched off, and just turn it on if the service's primary host goes down. This is typically referred to as cold standby. Or you can have it running, which is called hot standby, and is what DRBD and heartbeat run with. These two machines form the nodes of a high availability (HA) cluster. If one node fails, the other takes over -- this is called failover. The heartbeat package is designed to detect node failures and do failovers automatically.
In any case, if the active node fails, you lose changes made to the data after the most recent backup.
One solution is to use some kind of shared storage device, then both nodes have access to the most recent data when they need it.
This can be simple SCSI sharing, or dual controller RAID arrangements like IBM's ServeRAID, or shared fiber-channel disks, or high end storage like IBM Shark or the various EMC solutions. These systems are relatively costly (ranging from $5K USD to millions of dollars), and unless you get the very most expensive of these systems, they typically have single points of failure (SPOFs) associated with them - whether they're obvious or not. Some provide separate paths to a single shared bus, possibly with a single internal electrical path to access the bus. To the author's knowledge, only the very most expensive of these solutions can match DRBD's lack of SPOFs.
Another way is to have live replication of the data and all changes. This is what DRBD provides. It is a mass storage device, and as such a block device; it is distributed over two machines; and it replicates data. DRBD stands for Distributed Replicated Block Device. In the following, capitals denote the software in general, and drbd means the actual device.
Whenever one node alters the data, i.e. writes to a drbd, these changes are replicated to the other node in real time.
To achieve this, DRBD layers transparently over any standard block device (this is the ``lower level device''), and uses TCP/IP over standard network interfaces for data replication. You may think of it as RAID1 over the network.
No special hardware is required, though it is recommended to have a dedicated (crossover) network link for the data replication. And if you need high write throughput, you should eliminate the bottleneck of 10/100 Mbit Ethernet and use Gigabit Ethernet instead -- to tune it further, you can increase the MTU to something greater than the typical file system block size (say, 5000 bytes).
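For example, on a dedicated crossover link you could raise the MTU like this (a sketch only; the interface name and the value are illustrative, and both nodes have to use the same setting):

paul#  ifconfig eth1 mtu 5000
silas# ifconfig eth1 mtu 5000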
Thus, for the cost of one of the mentioned shared storage solutions you can set up several DRBD clusters, and even support the further development of DRBD for at least one year ;)
Though you can use raw devices for special purposes, the typical direct client to a block device is a filesystem. It is recommended to use one of the journalling filesystems, e.g. ext3 or reiserfs (xfs is not usable with DRBD yet).
When there are (official or unofficial) packages available for your favorite distribution, then you can just install these, and you're done.
SuSE officially does include drbd and heartbeat in its standard distributions, as well as in its fully supported SuSE Linux Enterprise Server (SLES) 8. The most recent ``unofficial'' SuSE packages can be found in Lars Marowsky-Brée's subtree: ftp.suse.com/pub/people/lmb/drbd and its mirrors.
For Debian users, thanks to David Krovich, the currently best resource is probably:
deb http://fsrc.csee.wvu.edu/debian/apt-repository binary/
deb-src http://fsrc.csee.wvu.edu/debian/apt-repository source/
If for some reason you need to compile DRBD from source, you need to get a DRBD source package, or the source tarball from the download section of http://www.drbd.org. Or even check out the latest (but possibly unstable) from CVS:
cvs -d :pserver:anonymous@cvs.drbd.org:/var/lib/cvs/drbd login [<enter>]
cvs -d :pserver:anonymous@cvs.drbd.org:/var/lib/cvs/drbd co drbd
And you need to have the kernel sources matching your running kernel at hand. Make sure the kernel source tree configuration matches the configuration of the running kernel!
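A quick way to cross-check (a sketch; paths may differ on your distribution) is to compare the running kernel version with the version the source tree declares:

# uname -r
# head -4 /usr/src/linux/Makefile    # VERSION, PATCHLEVEL, SUBLEVEL, EXTRAVERSION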
For reference, these are the steps for SuSE:
cd /usr/src/linux
make cloneconfig
make dep
cd /wherever/drbd
make
make install   # for this step you obviously have to be root.
In case you got the source tarball, you should back up the drbd/documentation subdirectory first. Since the sgml/docbook stuff is difficult to get right, the tarball contains ``precompiled'' man pages and documentation, which might be corrupted by an almost, but not quite, matching sgml environment.
Now you need to tell DRBD about its environment. You should find a sample configuration file in /etc/drbd.conf; if not, there is a well commented one in the drbd/scripts subdirectory.
This configuration file divides into at most one global{} section, and ``arbitrarily'' many resource [resource id] {} sections, where [resource id] is typically something like drbd2, but may be any valid identifier (alphanumeric string).
In the global section, you can specify how many drbds you want to be able to configure (minor-count), in case you want to define more resources later without reloading the module (which would interrupt services). And you can disable_io_hints, which prevents a deadlock if you connect DRBD via the loopback network on one box to itself, e.g. for testing/simulation/presentation. It is unlikely that this deadlock can happen at all when using two nodes; and since these io_hints boost performance, keep them enabled.
Each resource{} section further splits into resource settings, partially grouped as disk{} and net{} specific, and node specific settings, which are grouped in on [hostname] {} subsections. Parameters you need to change are hostname, drbd device, the lower level physical disk to use, its virtual disk-size, and Internet address and port number. For further details refer to the sidebar ``drbd.conf details''.
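As a rough skeleton (names and values here are placeholders; the complete, commented example is in the sidebar), a resource section looks like this:

resource drbd0 {
  protocol = C
  disk {
    disk-size = 4194304k
  }
  net {
    # tuning options, see the sidebar
  }
  on paul {
    device  = /dev/nb0
    disk    = /dev/hdc1
    address = 10.4.4.1
    port    = 7788
  }
  on silas {
    device  = /dev/nb0
    disk    = /dev/sdd1
    address = 10.4.4.2
    port    = 7788
  }
}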
Note that you must never access the lower level device while you are using drbd. You do not mount the lower level device any longer; you mount the virtual drbd device!
If you have any troubles setting up DRBD, check http://www.drbd.org, and if that does not help, feel free to subscribe and ask questions on drbd-user@lists.linbit.com .
If you feel that write throughput is way too low, try to identify the bottleneck. Sustained write throughput cannot be better than the minimum of your underlying disk hardware and network. Make sure you enabled DMA mode for IDE disks (hdparm -d1 /dev/hdX). Note that network bandwidth is typically given in bits, not bytes, so 100MBit FastEthernet has a maximum bandwidth of 12.5MB/s, and that's without the protocol overhead, one way, and only your data on the wire.
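To get a rough idea where the bottleneck is, you can measure disk and network separately; a sketch, with device names, addresses and the port chosen arbitrarily:

# raw read throughput of the lower level device (read-only, harmless):
paul#  dd if=/dev/hdc1 of=/dev/null bs=1M count=256

# raw throughput of the replication link (old-style netcat syntax):
silas# nc -l -p 7777 > /dev/null
paul#  dd if=/dev/zero bs=1M count=256 | nc 10.4.4.2 7777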
What does ``bitmap too small!!'' mean? Well, this is caused by a blocksize change during synchronization. Either you mounted a non-4K-blocksize filesystem on a standard kernel, or a non-1K-blocksize fs on a SuSE kernel (which is different in the basic do_open function; the most recent SuSE kernels have this fixed, and a couple of more serious things as well, so please upgrade!).
Avoid this by either not mounting during sync, or by using the ``compatible'' blocksize. If you had the device mounted before the sync started, this is ok. Mounting after the sync has finished is ok, too. Mounting with the ``compatible'' (see above) blocksize during sync isn't a problem either. Only a blocksize change during sync is what drbd does not like.
Now that you have configured it in drbd.conf, let's start it for the first time. Choose one node to start with. I'll call the nodes Paul and Silas, and start with Paul. Load and configure the module:
paul# /etc/init.d/drbd start
You will be prompted to make this node Primary. Say yes, then create a file system on the drbd.
paul# mke2fs -j /dev/nb0
Make an entry into /etc/fstab (on both nodes!), like this:
/dev/nb0 /www  auto defaults,noauto 0 0
/dev/nb1 /mail auto defaults,noauto 0 0
Do the same on the other node:

silas# drbd start

You will notice that it connects with the first node, and starts to sync. This will take a while, especially if you use 100MBit Ethernet and large devices. The device which is the sync target typically blocks in the script until the synchronization is finished.
However, the sync source (Primary) is fully operational during a sync. So back on the first node, let the script mount the device:
paul# /etc/init.d/datadisk start
Start working with this file system, untar the kernel source, put some large PS files there, copy your CVS repositories or something.
When the sync is finished, let's practice a manual failover. Unmount on Paul, and mount on Silas:
paul#  datadisk stop
silas# datadisk start
You should now find the devices mounted on Silas, and all the files and changes you made should be there, too. In fact, the first disk-size blocks of the underlying physical devices are bit for bit identical. If you want, you can verify this e.g. with an md5sum over the complete device:
First stop DRBD on both nodes, the active one first.
silas# datadisk stop
silas# drbd stop
paul#  drbd stop
Then calculate the md5sum.
# dd if=[lower-level device] bs=1k count=[disk-size] | md5sum
or, more naturally, if the lower level devices' sizes are exactly [disk-size]:
# md5sum [lower-level device]
This will of course take a while, too, but it might improve your confidence in DRBD.
Start DRBD again on both nodes. This time there should be no sync. This is the normal situation after an intentional reboot: if both nodes are in Secondary state before the cluster loses connection, there is no need for a sync. See the sidebar ``technical details'' for more information about when DRBD syncs, and why.
Note that the drbd script normally only loads and configures the module. To set the resource Primary and mount the device, you use the datadisk script (which is the same script, but behaves differently depending on how it was invoked).
Up to now we only replicate the data. If one node fails, we need manual intervention. To automate this, you want to have a cluster manager, a monitoring (daemon) process, running; see the heartbeat article in this issue about how to set this up properly.
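With heartbeat, the datadisk script typically appears as a resource in /etc/ha.d/haresources; a minimal sketch (the virtual service IP and the apache resource are placeholders, see the heartbeat article for the real thing):

paul 10.4.4.100 datadisk::drbd0 apache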
Now you can bring down for maintenance the PDC of your Windows network (a Samba server, of course), or your main web, database or file server, without anyone noticing it, since it was HA clustered using heartbeat and DRBD...
Do not attempt to mount a drbd in Secondary state. Though it is possible to mount a Secondary device readonly, changes made to the Primary are mirrored to it underneath the filesystem and buffercache of the Secondary, so you won't see changes on the Secondary. And changing metadata underneath a filesystem is a risky habit, since it may confuse your kernel to death.
Once you have set up DRBD, never -- as in never!! -- bypass it, or access the underlying device directly, unless it is the last chance to recover data after some worst case event.
If you for some reason need to start a cluster in degraded mode, do so with the drbd start and datadisk start commands, then use the services as normal.
After you have rebuilt the other node, to make sure the first sync is in the direction you expect -- and does not overwrite the good data of the degraded primary -- remove the metadata files on the freshly rebuilt node (they can be found in /var/lib/drbd/drbd#); this causes them to be recreated with zeroed out counters. That node will then lose any dispute about who has the good data, and the devices will be resynchronized completely upon the first connect.
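A sketch of that step, assuming Silas is the freshly rebuilt node and DRBD has not been started on it yet:

silas# rm -f /var/lib/drbd/drbd*      # metadata will be recreated with zeroed counters
silas# /etc/init.d/drbd start         # connects and becomes target of a full sync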
DRBD on top of loop device, or vice versa, is expected to deadlock, so don't do that.
You can stack DRBD on top of md; md on top of DRBD is nonsense, however.
DRBD on top of LVM is possible, but you have to be very careful about when and which LVM features you use, and how you do it, otherwise what actually happens does not necessarily match your expectations. Snapshots for example won't know how to notify the filesystem (possibly on the remote node) to flush its journal to disk to make the snapshot consistent. But this might be convenient for test setups, since you can easily create or destroy new drbds.
Drbd as LVM ``physical'' volumes currently does not work at all, due to the transparency of drbd and limitations in LVM. LVM2 could work, but probably needs some major tweaking.
If you are considering stacking DRBD on top of DRBD, think it over again. In a failover case this will cause you more trouble than without it.
The typical use of DRBD and HA clustering is probably two machines connected with normal networks, and one or more crossover cables, a couple of meters apart, within one server room, or at least within the same building. Possibly even a few hundred meters apart.
But you could use DRBD over long distance links, too. When you have the replica several hundred kilometers away in some other data center for Disaster Recovery, your data will survive even a major earthquake at your primary location. You want to use protocol A and a huge sndbuf-size here, and probably adjust the timeout, too.
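A sketch of the relevant settings; the values are purely illustrative and depend entirely on your link:

resource drbd0 {
  protocol = A               # asynchronous: a write does not wait for the WAN round trip
  net {
    sndbuf-size = 1M         # large send buffer to absorb the latency
    timeout     = 120        # 12 seconds (unit is 0.1 seconds)
    connect-int = 20         # seconds
    ping-int    = 20         # seconds
  }
  # ...
}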
Since with DRBD the complete disk content goes over the wire, think about privacy, if this wire is not a crossover cable but the (supposedly hostile) Internet. You should route DRBD traffic through some virtual private network. This can be a full blown IPSec solution. For a more lightweight solution for this specific task have a look at the CIPE project.
Make sure no one other than the partner node can access the DRBD ports, or someone might provoke a connection loss, and then race for the first reconnect, to get a full sync of your disk's content.
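A minimal iptables sketch, assuming 10.4.4.2 is the partner node and 7788/7789 are the DRBD ports as in the example configuration:

iptables -A INPUT -p tcp -s 10.4.4.2 --dport 7788:7789 -j ACCEPT
iptables -A INPUT -p tcp --dport 7788:7789 -j DROP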
To eliminate the most displeasing limitations of drbd-0.6.x, work is underway on drbd-0.7.x. The code has been made more robust against block size changes to support XFS, and to avoid certain nasty side effects.
The Primary node can be the target of an ongoing synchronisation, which makes graceful failover/failback possible during resynchronisation and increases interoperability with heartbeat. As a side effect of the above, together with OpenGFS this probably supports true active/active configurations.
Finally, some variant of an activity log avoids full synchronization in most cases.
Unfortunately, these improvements are still in alpha/early beta phase. But with your ongoing support, the pace of development should increase.
hostname: what hostname -s reports on the respective nodes; case is significant.
device: /dev/nb#, or /dev/nbd/# on devfs; obviously needs to be unique within the configuration.
By the way, nbd is the network block device, the major number of which drbd hijacks for its use. With md (linux software raid) over (e)nbd one can achieve similar functionality as drbd, but the author believes it is much more work to set up properly, and has a design flaw regarding write ordering.
disk: the lower level physical device to use.
address, port: the address and port number used for the replication link on this node. This should not be confused with the administration address of the node, nor with the (typically virtual) service address of the cluster.
protocol: one of A, B or C, determining when a write is reported as complete: with A as soon as it has reached the local disk and has been sent to the network, with B as soon as the partner node acknowledges reception, with C as soon as the partner node reports that the write has reached its disk, too. One might expect B to be a useful compromise between speed and safety, but benchmarks show it is not, so currently there is no reason at all to use this protocol. For long distance links you want protocol A with a large sndbuf-size (in the net{} section).
inittimeout: how long to wait for the partner node during boot. A negative value indicates that drbd should stay WFConnection Secondary/Unknown and just continue the boot process, thus leaving the decision to the cluster manager. This is probably a good idea, since the cluster manager usually has more than one communication path to find out about the partner node's status. When not given, or 0: wait until the partner node shows up, or some operator intervenes; do not time out.
skip-wait, load-only: related boot-time options that skip the wait for the partner node, or only load and configure the module.
fsckcmd: the filesystem check command run before the device is mounted (fsck -p -y in the example configuration); if you do not want any check, use /bin/true.
incon-degr-cmd: the command to run if the node comes up in degraded mode with inconsistent data; the example configuration uses halt -f.
disk{} and net{} settings:
disk-size: the size of the drbd device, which must not be larger than the smaller of the two lower level devices. If you want to avoid problems when drbd needs to start up with the partner node being down, you need to set this.
do-panic: if set, the node deliberately panics when the lower level device reports an IO error, so the surviving peer takes over with good data.
sndbuf-size: the size of the TCP send buffer used for the replication traffic.
sync-min, sync-max, sync-nice: When throughput is below min, the syncer thread is reniced to the highest possible priority. When throughput is above min, priority falls back to low priority (or whatever you configured for nice). When throughput reaches max, the syncer is throttled, so it does not use up all the bandwidth. If you never want to throttle, set max greater than your bandwidth.
These parameters can be changed while the syncer is running with the
drbdsetup /dev/nbX syncer --min 1M --max 100M --nice -15
command.
tl-size: if you see tl-size too small!! messages, you need to increase this value. This is an internal log used to force strict write ordering on both nodes with minimal impact on write throughput.
timeout: how long to wait for an answer from the partner before the connection is considered broken. The unit is 0.1 seconds, thus the default of 60 means 6 seconds.
connect-int, ping-int, ko-count: if you see problems with connection lost/connection established loops, increase timeout, and maybe ping-int. Or change your setup to reduce network latency; make sure full duplex behaves as such; check average round trip times while the network is saturated, check ...
# you can place comments here like in shell scripts,
# the hash mark being comment leader.

# global {
#   minor_count=5
#   disable_io_hints
# }

resource drbd0 {
  protocol = B
  fsckcmd  = fsck -p -y
  # inittimeout=-60
  # skip-wait
  # load-only
  # incon-degr-cmd=halt -f

  disk {
    do-panic
    disk-size = 4194304k
  }

  net {
    # sndbuf-size = 512k
    # skip-sync        # you should _never_ use this, not even know about!
    # sync-nice = -18  # if synchronization is high priority for you
    # sync-min = 4M    # syncer tries hard to not drop below this rate
    # sync-max = 500M  # if you don't care about network saturation
    # -max has to be larger than -min, obviously
    sync-min = 500k
    sync-max = 1M      # maximal average syncer bandwidth
    tl-size = 5000     # transfer log size, ensures strict write ordering
    timeout = 60       # 0.1 seconds
    connect-int = 10   # seconds
    ping-int = 10      # seconds
    ko-count = 4       # if some block send times out this many times,
                       # the peer is considered dead, even if it still
                       # answers ping requests
  }

  on paul {
    device  = /dev/nb0
    disk    = /dev/hdc1   # paul seems to have ide hardware
    address = 10.4.4.1
    port    = 7788
  }

  on silas {
    device  = /dev/nb0
    disk    = /dev/sdd1   # on silas we use some scsi device, maybe
    address = 10.4.4.2
    port    = 7788
  }
}

resource drbd1 {
  protocol=C
  fsckcmd=fsck -p -y

  on paul {
    device=/dev/nb1
    # btw, don't do this.
    # did you notice that in this example we have two drbd devices
    # on the same spindle (hdc)? performance will be bad. if you
    # use several drbd devices, put them on different spindles;
    # different channels/controllers won't be a bad idea for IDE.
    disk=/dev/hdc2
    address=10.4.4.1
    port=7789
  }

  on silas {
    device=/dev/nb1
    # and while we are at it,
    # if drbd throughput is low, please check your disk
    # throughput first. maybe you need to enable DMA? (-> man hdparm)
    disk=/dev/hdc2
    address=10.4.4.2
    port=7789
  }
}
Whenever a higher level application, typically a journalled file system, issues an IO request, the kernel dispatches this request based on the target device's major/minor numbers. If DRBD is registered for this major number, it passes READ requests down the stack to the lower level device locally. WRITE requests are passed down the stack, too, but additionally are sent over to the partner node.
Every time something changes on the local disk, the same changes are made at the same offset on the partner node's device. If some WRITE request is finished locally, a ``write barrier'' is sent over to the partner, to make sure that it is finished there before another request comes in. Since later WRITE requests might depend on successfully finished previous ones, this is needed to assure strict write ordering on both nodes. Thus with protocol C it is guaranteed that after a (f)sync operation both devices are bit-for-bit identical. Just for the blocks affected by the fsync(), of course.
The most important decision that DRBD has to make is: when do we need a synchronization, and does it have to be a full synchronization or just an incremental one?
To make this decision possible, DRBD keeps several event and generation counters in metadata files located in /var/lib/drbd/drbd#.
Since during normal operation DRBD is only an annoying, retarding deadweight, let's have a look at the failure cases.
Say Paul is our primary server, and Silas is standby. Together with heartbeat and the services, both form an HA cluster. This cluster and its nodes are in a certain state, and can have state transitions into some other state.
If Paul and Silas are up and running, this is the normal state. If one of them is down, the cluster is degraded. If both nodes are alive but each believes the other is dead, this is split-brain -- heartbeat tries to avoid this by using as many communication paths as possible.
Typical state changes are degraded -> normal and normal -> degraded.
When Silas was standby and leaves the cluster (for whatever reason: network, power, hardware failure), this is not a real problem, as long as Paul keeps on running. In degraded mode Paul flags all the blocks that have write operations as dirty. Some technician comes by, fixes it, and Silas joins the cluster again. Now Silas needs all the changes made on Paul since Silas left the cluster. Since Paul has its ``block is dirty'' flags, it can do an incremental synchronization (/proc/drbd says ``SyncQuick'').
If Paul failed (or was shut down) while he was alone (1b), the dirty flags are lost, since they are held in RAM only. So unfortunately, the next time both nodes see each other, they perform a full sync (``SyncAll'') from Paul to Silas.
When Paul as active primary node fails, the situation is a bit different. If Silas remains standby (unlikely; 2a), and later Paul comes back, Paul will become the active node again. Since it is unknown which blocks might have been modified on Paul just before the failure, but had not reached Silas because of the crash, this time there is a full sync from Paul to Silas, just to make sure that everything is identical again.
In the more likely case that Silas took over the active role (2b), when Paul comes back, he becomes standby and sync target, this time receiving the full sync from Silas. Why a full sync here, too? Same justification as above, it is not known which blocks might have been modified on Paul immediately before the crash.
If both nodes were down (main power failure or something), after the cluster reboot the situation is similar to 1b/2a: full sync from Paul to Silas.
Now it seems like whenever Paul was down we need a full sync. This is not exactly true. You can stop the services on Paul, unmount the drbd, and make it Secondary. The cluster then is connected, but both nodes are passive/standby.
You can now either shut down both nodes cleanly in any order. When they see each other the next time, there will be no sync at all, since they know from their metadata that both have been Secondary last time, and they belong to the same ``generation''; thus the data is still identical.
Or you can assign Silas the active role now, make drbd Primary on Silas, mount it, and start the services. This way you can bring down Paul for maintenance, when it reboots we have case 1a again, with swapped roles: SyncQuick from Silas to Paul.
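As a command-level sketch of such a planned switch (the service name is a placeholder):

paul#  /etc/init.d/someservice stop
paul#  datadisk stop                  # unmount, become Secondary
silas# datadisk start                 # become Primary, mount
silas# /etc/init.d/someservice start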
If one of the nodes (or the network) fails during a synchronization, this is a double failure, since the first failure caused the sync to happen. Note that double failures are logically impossible to tolerate with double redundancy, so you should treat any failure in the HA cluster as very serious and repair it ASAP, to regain redundancy once more.
Paul is active and has the good data. Silas receives the sync. The cluster is still degraded, since Silas is not yet ready for takeover, it has inconsistent, only partially up-to-date, data.
When Silas fails, the sync has to be restarted. If it was a SyncQuick, it can be restarted at about the place it was interrupted. A SyncAll unfortunately is restarted at the very beginning. (This needs to be improved!)
When Paul fails while being the sync source, we now have a non operational cluster. Silas cannot take over, since it still has inconsistent data. Paul is dead.
If you really need availability, and don't care about possibly inconsistent, out-of-date data, you can tell Silas to become Primary anyway. It will refuse to become Primary at first, but with the explicit operator override
silas# drbdsetup /dev/nb0 primary --do-what-I-say
you can force it to. Since you used brute force, you take the blame.