[lug] Setting up failover in Linux?
nagler at bivio.biz
Mon Apr 30 16:12:26 MDT 2012
Hi Sean (again :-),
> I'm pretty sure that Amazon is not hyping Linux-HA, and I honestly can't
Nothing here says anything about how to build fault tolerant apps.
Moreover, some pretty important concepts are nowhere to be found on
aws.amazon.com (or their search is awful):
I could go on, but I hope you see what I mean.
> monkey" that goes around and kills components to ensure that the
> applications can withstand components failing through their design.
Would be interesting to see what happens when they do "kill 1". :)
While this is a good test, it's not what I'm after. I'm looking for
the system which can sustain a "yum update postgresql", have that stop
in the middle, and survive.
> I guess you're either going to have to trust my 17 years of doing HA
> deployments, or develop your own experience... :-)
Again, I'm not questioning your experience. What I'm asking for is
answers to my questions based on your experience taking into account
my experience. That's a tall order, I have to admit, so I have to fill
in my experience in between your answers. Together, hopefully, we can
come up with a great answer.
> My point was that RAID handles it seamlessly, as if the hardware had not
> failed, without consideration for the lower levels. With Linux-HA systems,
> you have to design the applications and their start/stop to account for
> restarting as part of the failure recovery.
There should be a RAID for fault tolerance for people who manage state
carefully. One of the problems is that we have yet to implement
http://en.wikipedia.org/wiki/X/Open_XA in all databases including file
systems. We handle "start/stop" just fine. Within a request, we
handle all transaction state with ACID guarantees. However, once it hits the
disk, the transaction is not coordinated. What I'm looking for is for
Xapian, ZFS (if you insist), Postfix, and Postgres to all participate
in the transaction. They each have their own mechanisms for
transactions, but they don't have the hooks to coordinate across the
systems. Then I'd like the transaction manager (Pathway/TMF, Tuxedo,
IBM/ESA, whatever) to allow replays of the transactional state. It
doesn't have to be synchronous; rather, it can be completely
asynchronous, so the only price you pay in the replication is the
network and system latency. It is rather odd that nobody has done
this. Perhaps it could be a Summer of Code project. ;-)
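To make the shape of this concrete, here is a minimal sketch of the XA-style two-phase commit a transaction manager would drive across several resource managers. The class and participant names ("postgres", "xapian", "postfix") are hypothetical placeholders, not real client APIs; a real coordinator would also log prepare/commit decisions durably so it can replay them after a crash.

```python
# Sketch of XA-style two-phase commit. All names here are illustrative
# stand-ins, not real driver calls.

class ResourceManager:
    """One participant: a database, mail queue, index, filesystem..."""
    def __init__(self, name):
        self.name = name
        self.prepared = False
        self.committed = False

    def prepare(self, txn_id):
        # Phase 1: persist enough state to guarantee a later commit
        # OR rollback of txn_id, then vote yes/no.
        self.prepared = True
        return True

    def commit(self, txn_id):
        # Phase 2: make the prepared changes durable and visible.
        self.committed = True

    def rollback(self, txn_id):
        self.prepared = False


class TransactionManager:
    """Coordinator: the role Pathway/TMF or Tuxedo plays."""
    def run(self, txn_id, participants):
        votes = [rm.prepare(txn_id) for rm in participants]
        if all(votes):
            for rm in participants:
                rm.commit(txn_id)
            return "committed"
        for rm in participants:
            rm.rollback(txn_id)
        return "rolled back"


rms = [ResourceManager(n) for n in ("postgres", "xapian", "postfix")]
print(TransactionManager().run("txn-1", rms))  # committed
```

The point of the prepare phase is exactly the hook that's missing today: each system already has its own transaction log, but no common protocol to vote and then be told commit or abort.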
>> out the network partition problem for 30 years, and dammit, I still
> Many systems do it via STONITH or Fencing.
This is voodoo at best. If the network is partitioned, you can't send
a message to the server to shut down. The two nodes may be visible to
clients, but they may not be able to talk to each other. Or, the
primary may be isolated for a time, the secondary takes over, and the
primary comes back up. There have been some interesting AWS failures
due exactly to this behavior, except the primary came up as an
entirely different system. :)
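The arithmetic behind the two-node problem is worth spelling out: a node may only promote itself if it can see a strict majority of the voters, and with exactly two nodes there is no majority once the link between them drops. A tiny sketch (the function name is mine, for illustration):

```python
# Why two-node failover is unsafe under partition: promotion requires
# seeing a strict majority of voters. STONITH/fencing tries to paper
# over the two-node case, but can't manufacture a quorum.

def may_become_primary(reachable_voters, total_voters):
    """True only when this node can see a strict majority."""
    return reachable_voters > total_voters // 2

# Two-node cluster, link down: each side sees only itself.
print(may_become_primary(1, 2))  # False -- neither side can act safely

# Three-node cluster, partitioned 2/1:
print(may_become_primary(2, 3))  # True  -- majority side takes over
print(may_become_primary(1, 3))  # False -- isolated node stands down
```

This is why quorum-based designs add a third voter (even just a witness/arbiter node): it turns "both sides think they're primary" into a decidable question.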
> Actually, Tandem found that hardware was getting more reliable and that
> Non-Stop didn't make as much sense as it once did.
You might read the section about "TMF" for what I'm looking for, which
is still in use, and has been around since the 1980s.
> The additional
> complexity of Non-Stop introduced opportunities for operator error that
> weren't there in a more traditional system.
Hmmm... Not sure I heard that one. Operator error happens no matter
what. You don't need to do much to program NonStop in COBOL, for
example, because all variables are part of the transaction and all
servers are stateless.
> Linux-HA and DRBD do not solve that problem... They ensure that system
> failure is detected, resources are started/restarted whenever possible, and
> that data is replicated between hosts.
That, sadly, has been the state of affairs since ISIS in the 1980s.
I think ISIS did a better job of this than other systems, but in the
end, ISIS was too complex.
There are plenty of simple systems like ZooKeeper, Arakoon, Chubby
(not OSS, sadly), and so on. They do a good job with small objects.
However, we allow customers to upload large data files (50 MB is fairly
common), and we can't rely on a system with a maximum datum of 256 KB.
It's frustrating that when you hear about BigTable, they are talking
about the number of rows, not the size of the data.
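The usual workaround for a small per-datum limit is to split the blob into chunks under the cap plus a manifest, at the cost of losing single-object atomicity. A sketch, using a plain dict as a stand-in for the coordination store (the key layout and 256 KB figure from above are illustrative; ZooKeeper's actual default znode cap is governed by jute.maxbuffer, about 1 MB):

```python
# Chunked storage of a large upload in a store with a small per-datum
# cap. The dict and key names are hypothetical stand-ins for real
# client calls.

CHUNK_LIMIT = 256 * 1024  # per-datum cap from the discussion above

def put_blob(store, key, data):
    """Split data into chunks under the limit, plus a chunk-count manifest."""
    chunks = [data[i:i + CHUNK_LIMIT] for i in range(0, len(data), CHUNK_LIMIT)]
    for n, chunk in enumerate(chunks):
        store[f"{key}/chunk-{n}"] = chunk
    store[f"{key}/manifest"] = len(chunks)

def get_blob(store, key):
    """Reassemble the blob from its chunks in order."""
    n = store[f"{key}/manifest"]
    return b"".join(store[f"{key}/chunk-{i}"] for i in range(n))

store = {}
blob = b"x" * (600 * 1024)  # 600 KB, well over the per-datum cap
put_blob(store, "upload-1", blob)
assert get_blob(store, "upload-1") == blob  # round-trips in 3 chunks
```

Note what this doesn't give you: readers can observe a half-written set of chunks unless you write the manifest last and treat it as the commit point, which is exactly the kind of coordination these systems make you build yourself.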