DSX Access Systems, Inc.

 

 

Clustered SQL Servers

 

SQL Server 2000 Failover Clustering

 

Introduction

Continuous uptime in a production environment is becoming a common security requirement. This paper describes Microsoft® SQL Server™ 2000 failover clustering, one method of creating high availability. Failover clustering is available only in SQL Server 2000 Enterprise Edition.

 

Failover clustering is a process in which the operating system and SQL Server 2000 work together to provide availability in the event of an application failure, hardware failure, or operating-system error. Failover clustering provides hardware redundancy through a configuration in which mission critical resources are transferred from a failing machine to an equally configured server automatically. Failover clustering also allows system maintenance to be performed on one computer while another computer (node) does the work. This benefit can also ensure that system downtime due to normal maintenance is minimized.

 

The goal of failover clustering is to provide high availability for an overall solution that accommodates backups, redundancy, and performance. If software and/or hardware problems occur, failover clustering can enable a production environment to stay up and running.

Figure 1: SQL Server 2000 virtual server illustration. This example is comprised of two server nodes, and one SQL Server 2000 virtual server.

How Failover Clustering Works

The clustered nodes use a heartbeat (see above) to check whether each node is alive, at both the operating system and SQL Server level. At the operating system level, the nodes in the cluster compete for the resources of the cluster. The primary node reserves the resource every 3 seconds, and the competing node every 5 seconds. The process lasts for 25 seconds and then starts over again. For example, if the node owning the instance fails due to a problem (network, disk, and so on), at second 19, the competing node detects it at the 20-second mark, and if it is determined that the primary node no longer has control, the competing node takes over the resource.

 

From a SQL Server perspective, the node hosting the SQL Server resource does a looks-alive check every 5 seconds. This is a lightweight check to see whether the service is running and may succeed even if the instance of SQL Server is not operational. The IsAlive check is more thorough and involves running a SELECT @@SERVERNAME Transact SQL query against the server to determine whether the server itself is available to respond to requests; it does not guarantee that the user databases are up. If this query fails, the IsAlive check retries five times and then attempts to reconnect to the instance of SQL Server. If all five retries fail, the SQL Server resource fails. Depending on the failover threshold configuration of the SQL Server resource, Windows Clustering will attempt to either restart the resource on the same node or fail over to another available node. The execution of the query tolerates a few errors, such as licensing issues or having a paused instance of SQL Server, but ultimately fails if its threshold is exceeded.

 

During the fail over from one node to another, Windows clustering starts the SQL Server service for that instance on the new node, and goes through the recovery process to start the databases. The fail over of the SQL Server virtual server will take a short time (probably seconds). After the service is started and the master database is online, the SQL Server resource is considered to be up. Now the user databases will go through the normal recovery process, which means that any completed transactions in the transaction log are rolled forward, and any incomplete transactions are rolled back. The length of the recovery process depends on how much activity must be rolled forward or rolled back upon startup.

Client Connections and SQL Server 2000 Virtual Servers

WinDSX SQL connects to a SQL Server 2000 Virtual Server with the SQL Server Name or IP address of the SQL Server 2000 Virtual Server. The Cluster name, the Cluster IP address, or even the individual node names are not used by the connections. From a client or application perspective, it does not need to worry about which node owns the resources, because connecting to the SQL Server 2000 Virtual Server appears as a normal SQL Server. During the failover process, any active connections are broken. WinDSX SQL checks to see whether the connection exists, and if not, reconnects.

 

DSX Example of Clustered SQL Servers

DSX configured and tested Clustered SQL Servers in the following manner. Listed below is the equipment used and the configuration.

·        PCs - 3 GHZ with 1G of RAM

·        OS - Advanced Server 2000 and MS SQL Server 2000 Enterprise.

·        Dell Power Vault  220 shared drive array.  The Power Vault can handle up to 14 drives but you lose 1 slot when you cluster the array. The Power Vault has two drives that are mirrored for the quorum files that the Cluster administrator uses. It also has 3 drives that are configured Raid 5 for the SQL server data files. This raid drive also has our file share resource so that when the SQL server fails over so does our shared folder. You could also have another 3 drives for the SQL transaction logs if you wanted. There are several hotswap drives for failover redundancy.

·        Each PC has an LSI megaRaid card that talks to the PowerVault controller. You could also use a Perc Raid controller from Dell. The important thing with these cards is that they are Cluster aware and cluster capable. When you configure the cards they must have a unique LUN so one card is LUN 6 and the other card is LUN 7.

\\\ Both PC’s run the cluster administration program and are capable of taking control of the cluster. We are using SQL server in an Active/Passive setup. Since this is a stand alone system both PC’s are domain controllers and run Active Directory as well as DNS servers. The Heartbeat is facilitated by using a second Network card in both PCs connected together via a network HUB.