ClusterText

		     IT-Secure.COM A.G.

		 High Availability Cluster

		       Version 3.5
			(C) 2001

		Implementation for PKI CA

         	   Author: Andrei Ryjov

1. Introduction

The main purpose of an HA cluster is to tolerate the inevitable
hardware failures of production systems in such a way that
these failures do not affect the quality of service provided.

Most HA solutions available on the market require proprietary,
non-standard hardware or software, and presume significant
training or special skills on the part of the support staff.

The reason for such complexity is the intention to make HA solutions
more cost-effective (for instance, by running active critical processes
on all nodes simultaneously, thus balancing the load). In addition, HA
vendors usually try to make their solutions as flexible as possible
in order to cover a wider customer community.

The complexity of existing HA solutions makes them less resistant
to HA software bugs, load growth, hardware upgrades and human mistakes.
As a result, significant testing is required after any minor change
in software, hardware, load or personnel. The only way to ensure
proper handling of node failures and a clean switchover is to periodically
simulate various failure scenarios under adequate workload.
Such tests can be a very costly, time-consuming exercise for real
production datacenters and, as a result, most real-life HA clusters
work only by declaration, quite often failing to handle the switchover
when it is really necessary.

IT-Secure's HA cluster has been designed to be as simple as possible,
in order to avoid the need for failure simulation tests after
bringing the cluster into the production environment. This goal is achieved
by minimizing the changes to software and hardware in comparison to
a single-node production host. In fact, the most common IT-Secure HA
cluster installation does not require any changes to the single-node server's
software or hardware. All cluster-related software is located on an external
monitoring and controlling node, which is a lot less critical than the server
itself.

The cluster provides automatic resumption of production services within 5 to 8
minutes after an unrecoverable hardware error (node or disk power-off, CPU module
or RAM crash, etc.). Simultaneous failures of two or more hardware components
are, in general, not covered in the current release.

The cluster is based on two symmetric SPARC servers sharing redundant
disk space, so that either of the two servers can boot from the shared area.
The active (primary) node runs the production jobs (applications,
databases, NFS or other). The standby (secondary) node stays
in Open Boot PROM mode (OBP, "ok" prompt).

All IT-Secure HA clusters are monitored and controlled by daemons
running on a central monitoring host, via console (tty) connections
from a network terminal server (NTS) or via the monitoring host's serial ports.
The monitoring host itself does not need to be highly available,
unless there is a high probability of cluster failure while the monitoring
host is down.

One monitoring host can control multiple clusters; the number
of clusters is limited only by the monitoring host's resources
(CPU power, memory, etc.).

In case of a primary node outage, the daemon sends the "break" signal
to the primary node console, thus dropping it to OBP, and
reconfigures the OBP parameters so that the primary node can no longer
boot automatically.

After that, the daemon reconfigures the OBP parameters of the secondary
node and boots it from the same shared media.
Because the secondary boots from the same root disk that the primary
booted from before, it is indistinguishable from the primary from the client's
side. To the clients, the switchover looks like a primary node reboot.
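
The switchover sequence described above can be sketched as an ordered
plan of console actions. This is an illustrative sketch only, not the
clusterd implementation: the function name failover_plan, the step
tuples, the OBP variables chosen (auto-boot?, boot-device) and the
"shared-root" device alias are assumptions. The text above states only
that the OBP parameters are reconfigured, not which ones.

```python
from typing import List, Tuple

# Each step is (target console, action). "BREAK" stands for a serial break
# signal; the other actions are lines typed at the OBP "ok" prompt.
Step = Tuple[str, str]


def failover_plan(primary: str, secondary: str,
                  boot_alias: str = "shared-root") -> List[Step]:
    """Return the ordered console actions for one switchover.

    Assumptions (not from the source): the OBP variables used here and
    the device alias name are hypothetical placeholders.
    """
    return [
        # 1. Drop the failed primary to the OBP "ok" prompt.
        (primary, "BREAK"),
        # 2. Make sure the primary cannot boot automatically anymore.
        (primary, "setenv auto-boot? false"),
        # 3. Point the secondary at the same shared root disk and boot it.
        (secondary, f"setenv boot-device {boot_alias}"),
        (secondary, "boot"),
    ]


if __name__ == "__main__":
    for console, action in failover_plan("host1", "host2"):
        print(f"{console}: {action}")
```

The order matters: the primary is disabled before the secondary boots,
so that both nodes never run from the shared root disk at the same time.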


2. Hardware requirements and configuration

The specifications for the cluster nodes (primary and secondary machine)
are highly dependent on the production workload. To date, the IT-Secure cluster
has been tested on a very wide range of SPARC machines, including SPARCstation
20 and E10K domains. Clustering brings no additional requirements for
resources like RAM, disk space, network throughput, etc., but may require
additional storage or network interface cards:
Each cluster node must have at least two unused interfaces to the shared
storage, such as daisy-chained SCSI disks or arrays, disk subsystems with
fiber-optic interfaces, etc. The best tested to date are FSBE/S, FSBE/P, SWIFT,
SCSI adapters and FC/OM modules. Configurations with other types of disk adapters
may require additional testing.

The cluster daemon (clusterd) can run on a low-end Solaris SPARC, Solaris Intel,
or Linux machine (the Linux version of clusterd is still at the beta test stage
as of April 2002). The hardware specs for such a monitoring PC or SPARC machine
match the specs required by the operating system; several clusterd instances do
not require any extra hardware resources to run. A typical monitoring host has
64 to 128 MB of RAM and 4 to 10 GB of local disk space. For infrastructures
without Network Terminal Servers or other means of remote console control, the
monitoring host needs 2 serial ports per cluster (one port for each cluster node).
The 32- and 64-port serial multiplexer cards from Magma and Aurora are
well tested under Solaris and Linux.
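
The serial port arithmetic above is easy to check with a few lines of
code. This is a minimal sketch; the function names are hypothetical and
are not part of clusterd.

```python
def serial_ports_needed(clusters: int, nodes_per_cluster: int = 2) -> int:
    """One console line per node, two nodes per cluster by default."""
    if clusters < 0:
        raise ValueError("clusters must be non-negative")
    return clusters * nodes_per_cluster


def clusters_per_card(card_ports: int, nodes_per_cluster: int = 2) -> int:
    """How many whole clusters one serial multiplexer card can serve."""
    return card_ports // nodes_per_cluster
```

Under these assumptions, a 32-port card covers 16 two-node clusters and
a 64-port card covers 32.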

The simplest configuration for one 2-node cluster is shown below:



		     COM2         COM1
      Serial line      -------------     Serial line
  -------------------->| mon. host |<-------------------
  |                    -------------                   |
  |                                                    |
  |                                                    |
  |   ------------                       ------------  |
  |   |  Host1   |------   Shared   -----|  Host2   |  |
  |-->|          |         storage       |          |<-|
      ------------                       ------------



3. Cluster daemon


Clusterd should be started on the monitoring host by hand:
  /etc/init.d/clusterd start

Automatic startup can easily be implemented by linking
the startup script to /etc/rc2.d.
The script starts one clusterd instance for each cluster listed
in $CLUSTER_HOME/CList. It also checks the availability
of all necessary configuration files and binaries.
The startup and activity logs are located in $CLUSTER_HOME/Logs.
Each instance of clusterd reads its individual chat script, which
describes the commands to send to the node consoles and the local shell,
and the reactions to various responses from sh, the primary node console
(Solaris login prompt) and the secondary node console ("ok" prompt).

Clusterd watches for:
 - network availability of primary node (ping)
 - appearance of Solaris login prompt (after simulating RETURN)
 - "panic" and other configurable messages on primary node console.
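
The three checks listed above can be sketched as follows. This is not
clusterd code: the pattern list, the login prompt regular expression and
the rule that combines the checks into a single "healthy" verdict are
assumptions made for illustration. The ping result is passed in as a
boolean so that the sketch stays independent of any particular ping
mechanism.

```python
import re
from typing import Iterable, List

# Hypothetical defaults; the text says only that "panic" and other
# console messages are configurable.
DEFAULT_ALARM_PATTERNS = [r"panic"]
LOGIN_PROMPT = re.compile(r"login:\s*$")


def alarm_lines(console_lines: Iterable[str],
                patterns: Iterable[str] = DEFAULT_ALARM_PATTERNS) -> List[str]:
    """Return the console lines that match any alarm pattern."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    return [ln for ln in console_lines if any(c.search(ln) for c in compiled)]


def primary_looks_healthy(ping_ok: bool, console_lines: List[str]) -> bool:
    """Assumed rule: answers ping, shows a login prompt, no alarm lines."""
    saw_login = any(LOGIN_PROMPT.search(ln) for ln in console_lines)
    return ping_ok and saw_login and not alarm_lines(console_lines)
```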

Specific chat scripts can be written to watch for other events
and send other commands to other channels. The chat script uses a very
simple "finite state machine" language.
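
To illustrate the finite state machine idea, here is a minimal
table-driven sketch. It does not reproduce the actual chat-script
syntax, which is not shown in this document; all state names, event
names and commands below are hypothetical.

```python
from typing import Dict, List, Tuple

# (current state, observed event) -> (next state, command to send)
# An empty command means "send nothing". All names are hypothetical.
Transitions = Dict[Tuple[str, str], Tuple[str, str]]

TRANSITIONS: Transitions = {
    ("watching", "login_prompt"): ("watching", ""),
    ("watching", "ping_timeout"): ("halting_primary", "primary: BREAK"),
    ("watching", "panic_message"): ("halting_primary", "primary: BREAK"),
    ("halting_primary", "ok_prompt"): ("booting_secondary", "secondary: boot"),
    ("booting_secondary", "login_prompt"): ("switched", ""),
}


def run(events: List[str], state: str = "watching") -> Tuple[str, List[str]]:
    """Feed events through the table; unknown events leave the state as is."""
    sent: List[str] = []
    for event in events:
        next_state, command = TRANSITIONS.get((state, event), (state, ""))
        if command:
            sent.append(command)
        state = next_state
    return state, sent
```

Keeping the transitions in a table rather than in code is what makes
such a language easy to extend: watching for a new event only requires
adding a row.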

Normally, clusterd is started as a "trusted" user
who has permission to execute ssh commands on the primary node.


4. Test description

The following crash simulations must be performed before taking each
cluster into production:

 - Secondary node power-off (to check alarming).
 - Attempt to boot the secondary node (it must refuse to boot and
   generate an alarm).
 - Primary node power-off.
 - Primary IP address change.
 - Network cable disconnect.
 - Simulated panic.
 - Multiple cascaded switchovers, multiple power-offs.
 - Disk power-off, disk label corruption.
 - Check of filesystem consistency after 10 cascaded power-offs
   (all filesystems must be journaled/logged).

Each test must be performed under significant system activity
(I/O, NFS operations, DB transactions, SQL client simulation, etc.).


5. Cluster implementation for BCP without SAN

IT-Secure's cluster can also be used for Business Continuity Planning (BCP)
with two geographically remote sites. If there is a Storage Area Network
(SAN) covering both sites, the failover mechanism is no different
from that of a cluster with nearby nodes, and the failover time is
still within the 10-minute range, without any lost transactions.

Without a SAN, periodic snapshots from the active to the passive node must
be transferred between the sites over normal IP links. The frequency
of such snapshots (which affects the number of transactions lost in case of
a failure) and the maximum downtime depend on the amount of transferred data
and the speed of the link. Normally, the downtime (equal to the switchover time)
should not exceed 1 hour.
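
The dependency on data volume and link speed can be made concrete with
a rough estimate. This is a back-of-the-envelope sketch under stated
assumptions: protocol overhead and compression are ignored, the link is
assumed to be dedicated to the transfer, and the function names are
hypothetical.

```python
def transfer_seconds(snapshot_bytes: float, link_bits_per_second: float) -> float:
    """Time to move one snapshot over the link, ignoring protocol overhead."""
    if link_bits_per_second <= 0:
        raise ValueError("link speed must be positive")
    return snapshot_bytes * 8 / link_bits_per_second


def worst_case_loss_seconds(snapshot_interval_s: float,
                            snapshot_bytes: float,
                            link_bits_per_second: float) -> float:
    """Upper bound on the age of the newest complete snapshot on the
    passive site: a failure just before a transfer finishes leaves only
    the previous snapshot, taken one interval plus one transfer earlier.
    """
    return snapshot_interval_s + transfer_seconds(snapshot_bytes,
                                                  link_bits_per_second)
```

As a made-up example, a 1 GB snapshot (10^9 bytes) over a 10 Mbit/s link
takes about 800 seconds to transfer; with hourly snapshots, up to 4400
seconds of transactions could be lost in the worst case.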