You may find it hard to believe, but it is possible to maintain SQL Server high availability and not spend a fortune. In fact, you can spend a fortune and still not get SQL Server high availability. Sound like a paradox? Not really. SQL Server high availability is not a direct function of how much you spend. Instead, high availability is more about what you do right and what you do wrong.
Before we get too far along in our discussion on how to achieve high availability at minimal cost, let’s first take a brief look at what high availability really is. Let’s assume we live in a perfect world where nothing ever goes wrong. In this case, our SQL Servers would be available 100% of the time they are needed by users. This might be 24/7, or 8/5, depending on your user base and how your SQL Servers are used. In other words, there is a time frame where SQL Server needs to be up and running efficiently, and as long as they are up and running efficiently during that time frame, then you have 100% availability.
But what does “up” mean? Does it mean that the SQL Server service is running, even if it is running so slowly that users can’t get their work done on a timely basis? Or does it mean that the SQL Server service is running and that users can access it on a timely basis? I think most DBAs would assume the second option. If users can’t access SQL Server on a timely basis, then SQL Server is not really “up”.
Now, let’s get back to reality. Stuff happens. Things break. People make mistakes. It is impossible to attain 100% SQL Server availability. Because of this, it is our goal, as a DBA, to attain as high as a level of availability as we can, given our limited resources. This may mean 99.999%, or it may mean 90%. Each organization is different and has its own standard of what high availability means to them. Sometimes those resources include redundant data centers, SQL Server clusters, or even disk mirroring. In other cases, our resources are very limited. In any event, we must make the best use of what we have available.
What Prevents SQL Server High Availability?
Before we can learn what steps we can take to help ensure high availability, we need to understand what can go wrong with our SQL Servers. Once we understand this, then we can prescribe what we need to do to prevent them from going down in the first place. While the following list is long, it is not comprehensive. It would take a book to cover every potential thing that can prevent high availability. The focus here is on key factors that can negatively affect high availability.
Outside Forces
Let’s take a look at each one of these potential problem areas, examining what can go wrong. Note that potential causes of SQL Server unavailability are numerous, and include much more than just hardware failure. Anything that prevents users from accessing SQL Server when they need to is a potential problem that needs to be addressed.
Outside Forces
When people think of major disasters, this is what most think of. Fortunately, most outside forces rarely affect us, but when they do, they usually affect us with great impact. Most of these are very difficult to predict. On the other hand, such outside forces as hackers or viruses are a constant threat, and must be assumed to be occurring all the time.
Server Hardware
When DBAs and managers think about high availability, this category seems to come out on top. This is probably because everyone knows that all physical hardware fails, eventually. While this is a very important category to keep in mind, it is only one small cause of SQL Server availability problems. There are others that are more common than these.
Network Infrastructure
In many ways, this category is just as important, if not more so, than server hardware. If your SQL Servers can’t communicate with your users, they aren’t of much use.
Electrical Power
This category is a no-brainer.
Scheduled Hardware/Software Upgrades/Patches
This category is an often forgotten source of SQL Server downtime. Hopefully, you can schedule downtime for this important area, unlike much downtime, which is unplanned.
Operating System
While Window has gotten better over time, it still is not perfect and is subject to occasional problems. On the other hand, I have had SQL Servers up for six months, and longer, without any reboots.
SQL Server
SQL Server itself can cause havoc, causing downtime. In fact, from personal experience, I have had more problems with SQL Server than with the operating system in regard to unexpected downtimes.
Human Errors
These are some of the largest causes of unexpected downtime, but most of us don’t want to talk about them. I wonder why?
Application Software
Here, I am referring to the software used to access SQL Server. When application software fails, SQL Server often gets the blame, although unfairly. As a DBA, you need to take into consideration application software when planning for high availability.
Poor Tuning
Here’s an area you may not have thought about. There are many performance tuning-related issues that can prevent users from accessing data when they want. While some DBAs don’t consider performance tuning a high availability issue, it really is.
Database Maintenance
Routine database maintenance often slows down SQL Server, even locking users out of tables. All of this can add up to reduced availability.
Database Jobs
Scheduled jobs, whatever they are, have the potential of reducing SQL Server’s performance. It is important to design and schedule jobs with the least impact on users’ need for availability.
Third-Party Software
Here, I am referring to software running on the same server as SQL Server. Each of these have the potential for bringing down the server, with SQL Server along with it. They can also reduce SQL Server’s performance.
Poor Documentation
You are not going to like reading this. And I don’t like writing it. But if you don’t have proper documentation, and when you have problem, you risk increasing the amount of time SQL Server is unavailable.
Now that we have a good understanding of what can go wrong, causing SQL Server to be unavailable, let’s take a look of how we can reduce the odds of the above problems from occurring, all without spending more than we need to.
What to Do to Help Ensure High Availability
In this section, we will take a look at a large number of ways to help ensure high SQL Server availability. While the emphasis here is on low cost options for providing high availability, I am still going to include a few higher cost options, just so that you are familiar with all of the available options. In addition, what is high cost for one company may be low cost for another company. Because of this, I am trying to be inclusive.
As I describe each option below, I will also categorize each in regards to importance and difficulty of implementation.
For example, the importance of each option will be rated as:
The difficulty of each option will also be rated as:
And more importantly, for each option discussed below, I will describe what it does to help ensure high availability.
I have done this to help you decide how to prioritize the options available to you. While we might like to be able to implement everything here at once, that is not possible. Because of resource limitations, we have to choose what to implement first, second, and so on. Generally, I would recommend implementing the critical/easy to implement options first, and so on. This way, you will get the great effect for the work you put into this effort. So here goes.
Failover Clustering
Importance: Important
Difficulty: Hard
Used to:
Yikes, I started with an expensive option. But again, if your company generates US $1 million dollars an hour over a web site, then clustering is not really that expensive of an option. Cost is relative from company to company and situation to situation.
There seems to be a myth about failover clustering, and that it is the ideal (or only) way to ensure SQL Server high availability. While it is an important aspect of helping ensure SQL Server high availability, it is not the most important. For example, failover clustering does not protect your data, which is the most important part of your SQL Server. While failover clustering can be an important step to helping ensure high availability, it is only one part of the solution. The rest of this article describes the total solution, as you will see.
Offsite Database Mirroring (SQL Server 2005)
Importance: Important
Difficulty: Intermediate
Used to:
Database mirroring is a new feature being introduced with SQL Sever 2005. Without using any special hardware, it allows you to duplicate (real time mirroring) a production database on another SQL Server at a different location. In addition, if the first server should become unavailable, the system will failover to the mirrored database and SQL Server, helping to ensure high availability. You can also combine clustering and database mirroring if you like. Because of its lower cost than failover clustering, its easier implementation, the ability to work over distances, and the ability to protect data, SQL Server 2005 database mirroring should become a very popular high availability option.
Database Snapshots (SQL Server 2005)
Importance: Important
Difficulty: Easy
Used to:
Another new feature of SQL Server 2005 is the ability to create snapshots (static database copies) of your production database onto another SQL Server. This feature can be used to recover from human errors, or more likely, be used to run reports from, helping to remove the reporting load from your production server to another server, helping to boost your production server’s performance and availability.
Replication
Importance: NOT recommended for HA
Difficulty: Intermediate
Used to (in theory):
I have listed this option because a lot of people consider replication as a means to help attain high availability. I vigorously disagree. Replication does serve some specific business needs, but HA is not one of them. More often than not, replication reduces high availability.
Hot Standby Server
Importance: Depends on other choices you have made
Difficulty: Easy
Used to:
If being down for awhile is not a problem, a hot standby server (SQL Server is installed, along with recent backups, ready to be used) is a viable option for may companies to help ensure SQL Server high availability. But again, if you have the money to purchase a hot standby server, you might as well use it for disk mirroring or log shipping (discussed later), and be ahead of the game.
Cold Standby Server
Importance: Depends on other choices you have made
Difficulty: Easy
Used to:
The only difference between a cold standby server and a hot standby server is that you literally have to build the SQL Server when you need it. This, of course, take time, but it does save some money if you don’t mind being down for awhile.
Log Shipping
Importance: Depends on other choices you have made. Can be used with clustering instead of disk mirroring
Difficulty: Easy
Used to:
Log shipping is one step above a hot standby server, but one step below disk mirroring or failover clustering. Because it is easy and inexpensive to implement, it has become very popular. But I am guessing that it will slowly fade away once SQL Server 2005 disk mirroring becomes available, as disk mirroring is much superior, and not much harder or much more expensive to implement than log shipping.
Daily Backups
Importance: Critical
Difficulty: Easy
Used to:
This may seem obvious, but you might be surprised how many SQL Servers don’t get backed up. On top of this, backups are mandatory, no matter what other type of high availability options you are using, including failover clustering or database mirroring. Notice that this is the first item I have listed with a critical importance, and it also one of the easiest to implement.
Transaction Log Backups
Importance: Critical
Difficulty: Easy
Used to:
As with daily backups, hourly (or more often) transaction log backups are critical to overall high availability efforts. Ideally, transaction log backups should be stored on a server other than the production server. For example, you might have a file server designated in your organization for the sole purpose of storing SQL Server log files, just in case you should loose the production server. The shorter the transaction log backup interval, the less data you potentially lose.
Backup Verification (BACKUP command)
Importance: Critical
Difficulty: Easy
Used to:
This high availability suggestion is also obvious, but again, not everyone uses this option of the BACKUP command. By using the verification option, you are adding one more way to help ensure your backup data is good.
Monitor Backup Success and Failure
Importance: Critical
Difficulty: Easy
Used to:
Just because you have set up a job to make database and log backups, doesn’t always mean that the jobs are successful. You need to have in place a system to regularly monitor backups to ensure that they are being made. You can do this manually, using SQL Server, or use some third-party monitoring software.
Store Backups Offsite
Importance: Critical
Difficulty: Easy
Used to:
Ensure backups and transaction logs are safe should outside forces cause a problem.
Should the worst occur, and you loose your entire building, and if you don’t have a backup data center, then offsite backups are the only way you can recover your data. The very existence of your business may be at stake if all your data is gone. The easiest way to prevent this from ever happening is by keeping backups offsite. You need a system in place to ensure that nightly backups are stored safely offsite, and to ensure old tapes are returned back into rotation.
Test Backups to Verify Them
Importance: Critical
Difficulty: Easy
Used to:
This is not the same as using BACKUP verification. Here, I am talking about implementing a systematic method to verify that the backups you make are good. For example, you may want to perform a test restore of a different database each week. This critical and easy, albeit, boring job is very important for ensuring high SQL Server availability.
Redundant Infrastructure
Importance: Important/Critical
Difficulty: Hard
Used to:
– Redundant routers/switches
– Redundant DNS/WINS
– Redundant active directory
– Redundant WAN/Internet connections
– Redundant facility (if affordable)
The most highly available SQL Server is of little use if it can’t talk to the users who need to access it. So to ensure that SQL Server is available, all the network and physical infrastructure must also have redundancy built in. If it doesn’t, there is little use in spending much on SQL Server high availability.
Physical Security
Importance: Important
Difficulty: Easy
Used to:
People, purposefully or accidentally, can quickly and unexpectedly bring down SQL Server. Whether its a hacker penetrating your firewall, or a janitor unplugging your server to plug in his vacuum cleaner, there are a lot of ways that simple physical security precautions can help keep SQL Server up and running. In other words, keep unauthorized people physically (and virtually) away from your SQL Servers.
Use Server-Class Hardware
Importance: Critical
Difficulty: Easy
Used to:
– Error-Correcting memory
– Multiple CPUs
– Multiple NICs
– RAID, multiple I/O cards
– Redundant power supplies, fans, etc.
Hot swappable capability
Purchasing the physical hardware to run your highly available SQL Server is not a place to skimp on a few dollars. You need server-class hardware that has redundant components so that a simple hardware failure won’t bring down your server. Sure, you can’t protect from every possible hardware issue, but you can protect yourself from the most common ones. In addition, by having hot swappable capability, you reduce the amount of potential down time for times when you need to change out failed parts.
Standardize on Your Server Hardware
Importance: Important
Difficulty: Easy
Used to:
Ideally, you want all of your SQL Servers to run the same hardware configuration, including driver levels. By taking this approach, you reduce how much to need to remember, helping to reduce human error-related problems. In addition, it is less expensive to maintain on hand spare parts if all the servers are identical. Maintaining identical hardware is not always possible, but try your best to do so.
Use Hardware Monitoring Software
Importance: Important
Difficulty: Easy
Used to:
Virtually all server-class hardware comes with (or available as an option) software that allows you to monitor your hardware’s function, including the ability to identify potential hardware issues before they actually occur. This simple, preventative step goes a long way to helping ensure high availability.
Use Battery Backup/Power Conditioning
Importance: Critical
Difficulty: Easy/Intermediate
Used to:
Power is actually a part of your infrastructure, but I mention it separately because if its key importance. If you have a data center, then the entire data center may be on backup power. But if you don’t have a formal data center, then if you want to keep your SQL Server’s up, not only do they need to be on backup power, but all of the network infrastructure also. And keep in mind that battery backup is for only very short power interruptions, and for power conditioning.
Backup Power/Generator
Importance: Important
Difficulty: Hard
Used to:
If the power goes out for a much longer than a few minutes, then you will need to have access to backup power through your own onsite generator. Of course, only the most critical SQL Servers and data centers need, or can afford, this capability.
Tight SQL Server/Database/Object Security
Importance: Critical
Difficulty: Easy
Used to:
One of the easiest, but often neglected ways of boosting SQL Server’s high availability is to incorporate tight security at the SQL Server, database, and object level. As a rule, prevent all access to everyone for any reason, and then only very selectively give the absolute minimum permissions to users to get their jobs done.
Don’t Use SA for All Your SQL Server Access
Importance: Critical
Difficulty: Easy
Used to:
As a SQL Server SA, it is very tempting to use the SA account, or an Active Directory account with SA rights, to do all of your work within SQL Server. And in some cases, you have to be an SA to perform some tasks. But in other cases, you don’t need to have SA rights to work with SQL Server. In those cases, it is safer if you use a different account with less than SA rights. This way, if you make a mindless mistake, there is less chance that the mistake will destroy data or bring down your SQL Server.
Periodic Security Audits
Importance: Important
Difficulty: Easy
Used to:
In many cases, the only way to find out if people are accessing data they shouldn’t be is to audit user activity. This includes security at the physical server, SQL Server, database, and object level. You want to ensure that all unused accounts are deleted and that users are not accessing data they shouldn’t.
Maintain Good Documentation
Importance: Critical
Difficulty: Easy
Includes:
As much as everyone hates to write and maintain documentation, it is critical for SQL Server high availability. If case of any problems, documentation will speed up the recovery process and reduce errors. It will also help others to resolve issues if you (the DBA) are not around.
Good Change Management
Importance: Critical
Difficulty: Easy
Used to:
Includes:
Another boring, but critical area is to keep good track of what is happening on your SQL Servers. This helps to provide an audit trail of what has happened to your SQL Servers, often helping you in identifying recently introduced problems. In addition to documenting changes, be sure you test them before you put them into production, and also come up with a backout plan should the change fail.
Regular Performance Tuning
Importance: Important
Difficulty: Easy
Used to:
Includes:
As mentioned earlier in this article, a poorly performing SQL Server can hinder high availability. Because of this, you must regularly monitor your SQL Server’s performance, ensuring that it is performing at its optimum capacity. If you identify hardware bottlenecks due to increasing load, then you need to plan to beef up the hardware as soon as you can.
Smart Job Management
Importance: Critical
Difficulty: Easy
Used to:
Includes:
SQL Server jobs are not only necessary for checking the integrity of your data, but they must also be scheduled in such a way as to reduce the impact on users. Not running necessary jobs, or running any job at the wrong time hurts high availability.
Check All Logs Daily
Importance: Critical
Difficulty: Easy
Used to:
Includes:
While most logs have a lot of unnecessary information, they often include useful nuggets of information that can help you prevent future problems. Because of this, you need to take a regular look at them. You can do this manually, or you can use a third-party tool to monitor the logs for you, only alerting you to critical problems.
Check Disk Space Daily
Importance: Critical
Difficulty: Easy
Used to:
When SQL Server runs out of disk space, it can fail. Because of this, you need to daily check on how much empty space you have. As a rule of thumb, you want to have at least 20% free space. If you don’t, performance will be hurt and you further risk running out of space due to unexpected factors, such as a suddenly growing tempdb or log file. You can check this manually, or use third-party software.
Use Well-Designed Applications
Importance: Important
Difficulty: Varies
If You Write the Apps:
If you don’t write the apps, you don’t have much control on how the affect SQL Server high availability. But if your company does write them, write them with high availability in mind.
Use Current Drivers, SPs, Patches
Importance: Critical
Difficulty: Easy
Used to:
As much as we all hate updating our servers, it’s a mandatory task. We just need to accept the inevitable, and at the same time, test, test, and test before you roll any of these changes into production. Use scheduled downtime for these.
Minimize Use of Third-Party Applications
Importance: Important
Difficulty: Easy
Used to:
Includes:
Avoid running any third-party software on your SQL Servers, as the more software you have running on it, the greater the chance it will negatively affect SQL Server’s availability.
Cross-Train Backup DBAs
Importance: Critical
Difficulty: Easy/Intermediate
While management might not feel this way, you as the DBA, are the most critical link in SQL Server high availability. Because a single DBA can’t be responsible for everything, all of the time, there must be multiple people who can act as DBA when necessary. Cross-training is critical, and there must be no single point of failure (no DBA with unshared knowledge).
Is That All?
It’s all of this article, but not all you need to know about SQL Server availability. The purpose of this article is to point out that SQL Server availability is critical, that it is made up of many different parts, and that is is not necessarily expensive. Most of the suggestions presented here are easy and inexpensive, but critical to high availability.
High availability does not happen by accident. It is a result of doing lots of things correctly and consistently.
*Originally published at SQL-Server-Performance.com
Brad M. McGehee is a full-time DBA with a large manufacturing company, and the publisher of http://www.SQL-Server-Performance.Com, a website specializing in SQL Server performance tuning and clustering.
He is an MVP, MCSE+I, MCSD, and MCT (former).
Brad also runs another website called http://www.WorldClassGear.com It provides independent gear reviews for backpackers, trekkers, and adventure travelers.