Disaster recovery is a topic on the minds of many DBAs. Learn some of the basics about SQL Server disaster recovery from a leading expert in the area, Greg Robidoux of Edgewood Solutions. Greg is currently the Vice Chair for the PASS DBA Special Interest Group (SIG). In addition, he recently gave two presentations at the PASS Summit in Seattle on Change Management and Project Management for DBAs.
Tell us about yourself, your background, your training, and your experience with SQL Server.
I have been working in the IT industry for almost 15 years now. I started out working in networking, but moved to database systems early in my career. Over my career I have worked with Sybase, Oracle and most recently with SQL Server. I have worked with small clients with a single SQL Server, to clients with over 100 SQL Servers. Some of the areas that I have worked in with SQL Server include upgrades, disaster recovery planning, change management controls, project management, SAN integrations, product selections, centralized management, and application development on SQL Server 6.5, 7.0 and 2000 platforms.
Tell us about your consulting company.
Edgewood Solutions was founded in January 2002 with the premise of elevating the Microsoft SQL Server platform, by implementing appropriate and necessary components for SQL Server that are sometimes overlooked. There are components, such as disaster recovery, that are often not addressed, but we feel are extremely important to a sound database implementation.
Edgewood is focused on developing key components for SQL Server that most DBAs know they should have, but don’t, because most of their time is consumed with handling immediate database issues. Instead of trying to provide solutions for all aspects of SQL Server, we address areas that should be in place, but are not always addressed. These areas include:
- Change Management
- Security Policies
- Disaster Recovery Planning
- Project Management for SQL Server projects
- SQL Server Upgrades
- Maintenance Planning
- Performance Analysis and Tuning
We also research and recommend software products that are developed specifically for SQL Server. We look for best of breed products and offer these solutions to our clients. So far we have partnered with SQL LiteSpeed, Lumigent and Precise Software Solutions. Each of these solutions provides a unique benefit to the SQL Server industry and economies of scale for DBAs addressing numerous projects with tight deadlines.
SQL Server disaster recovery has become a big topic. We all know what the obvious reasons are for preparing a disaster recovery plan. But, what are some of the less obvious reasons why a disaster recovery plan is critical for organizations?
When most people talk about disaster recovery they think about it from a total site disaster. Although this is possible, it is not very likely that an entire site will be wiped out unless the site is in an area prone to natural disasters such as floods, hurricanes or tornadoes.
I think of disaster recovery as eliminating unplanned downtime that severely impacts the business. This can be caused by power outages, inadvertent code that is moved into production, and hardware failure. Most companies that have SQL Server installations are relying on SQL Server to be available 24×7 in order to conduct business. As more and more companies begin to go global with their product offerings, thanks to the Internet, the available window for planned downtime is becoming smaller, and the window for unplanned downtime needs to become non-existent.
One of the items that must be addressed, and is typically overlooked, is actually determining the amount of acceptable downtime for a company, and in turn, developing a thorough plan to meet these needs. It is often difficult to get a realistic time that systems need to be available to users versus availability for maintenance windows. Because this window is not clearly defined, DBAs find it difficult to plan for something that might not be realistic.
What exactly does a disaster recovery plan cover?
As a disaster recovery plan is conceived, it is necessary to address items on both the business and technical sides of the coin. It is my recommendation that the plan covers all aspects of any type of system downtime, whether it is as small as one table or as large as an entire site. The plan should be appropriately based on the business need and not for the sake of only implementing technology that does not resolve the immediate and long-term needs.
Plans and procedures need to be written and implemented that take into consideration all possible threats of unplanned downtime. To address the items of virus, hackers, DBA mistakes, and disasters, the disaster recovery plan is only one component of a fully functional suite of processes that should be implemented. In addition, DBAs need to look at Change Management, Security and general best practices to avoid downtime.
What steps are there to creating a disaster recovery plan?
Based on my experience, it is best to start with a small plan and begin the process with future goals in mind that are tested and implemented on a predefined schedule. For example:
- Determine the business needs for availability as well as the corresponding budget for the solution.
- Determine disasters to prevent as well as recover from, including mistakes, hackers, hardware failure, and database corruption.
- Plan the data collection disaster recovery process.
- Document the environment with a standardized set of documents relating to the hardware, software, application, and personnel.
- Develop a key contact list and escalation levels.
- Develop a media kit that has all the necessary software versions and service packs.
- Standardize server configurations across servers.
- Have backups readily available on disk (Look at SQL LiteSpeed for smaller and faster backups).
- Have spare hardware or servers available.
- Communication plan for the recovery process.
- Testing the plan to ensure success.
- Implement the solution.
- Re-testing and revising the plan as needs change.
- Look at third party tools that can assist with collecting server information (BindView for SQL Server and NetIQ ConfigurationManager for SQL Server).
- Look at third party tools for restoring lost data (Lumigent Log Explorer).
- Lastly, keep documentation up to date.
How big a part does documentation play in a disaster recovery plan?
Documentation is one of the key components to having a successful disaster recovery process. Without documentation it is very difficult to perform a planned recovery. What happens in most instances is that the recovery process is handled in a fire-fighting mode. Several actions are taken to fix the problem at hand, without knowing what fixed the problem, or possibly creating subsequent problems. Too often systems documentation has not been a priority and is not in useable format for most DBAs. Most of the time DBAs rely on personal experience, past emails, and fast Internet searches in order to recover.
Parts of a comprehensive DR document include the following:
- Contacts and escalation list
- Software versions and service packs applied
- Hardware configurations
- Server names and IP addresses
- Impacted audience
- Impacted applications
- Priority order of servers and applications for recovery
- Recovery scenarios
- Table
- Database
- Server
- Data Center
How do you identify the biggest areas of vulnerability for a SQL Server?
SQL Server used to be for departmental applications where not much support was given to these groups by the IT department, and therefore downtime or re-keying of data was an acceptable process. As SQL Server has moved into the enterprise-wide application space, it is extremely important that companies implement a comprehensive recovery plan for their SQL Servers.
Most companies have spent a lot of money on redundant hardware, power and room conditioning, but that is where things usually stop. I guess the feeling is that if the hardware doesn’t go down there is nothing else to worry about. If companies don’t plan to implement a full disaster recovery process, the one area that I would stress for them to address is their Change Management process. Besides hardware, failures due to application code and human error make up the next biggest segment for downtime. With a sound Change Management process, this area of downtime could just about be eliminated.
How do you suggest DBAs test their disaster recovery plan?
The threat of a total disaster, which would require a hot site to be utilized, is not that common. The real threat people should think about is isolated disasters and recovery from these incidents. These consist of dropped tables, major updates that cannot be reversed, bad code that was moved into production, etc. Since these types of things occur more frequently, DBAs that do not have any type of DR plan in place should begin with things they can fix themselves. Once these items are in place and documented, DBAs can work with other IT and business departments to institute a more comprehensive solution that crosses the enterprise.
Testing should occur as much as possible. The more testing that occurs and the more rehearsals that are performed, the better the plan will be when it needs to be called into action. Scenarios should be identified based on past problems that may have occurred and plans should be written to address these issues. As for equipment, it would be nice to simulate the production environment, but where budgets are an issue, a desktop PC could be put in place as a test machine. It is pretty inexpensive to implement a small test server, and if a company is dependent on the availability of their database servers, then the DBA has to sell management on the importance of test servers.
What tips and suggestions can you offer for DBAs for backing up their data in order to ensure that they can recover their data?
Along with executing backups, DBAs need to perform database restorations to ensure the backups are valid. I recently published an article called “Backup and Restore – Back to Basics with SQL LiteSpeed”, that is available on our web site www.edgewoodsolutions.com, about backup and restore procedures, as well as features that SQL LiteSpeed offers in terms of speed, disk savings, verification and encryption. Here are some general tips from the article:
- Backups should occur to disk first for faster backups and restores
- Verify backups
- Test restores
- Use full recovery mode to eliminate data loss
- Use a combination of full, differential and transaction backups
- Periodically execute a full system restore to make sure the plan is successful
Do you recommend any specific hardware configuration (cluster, hot server, RAID, etc.) for SQL Servers in order to reduce potential vulnerabilities?
The more redundant the environment, the better the vulnerabilities are addressed. In most hardware configurations, technical staff has spent sufficient resources on redundant hard drives, controllers, power supplies, network cards, etc. In addition, they spend a lot of time configuring replication, clustering or standby servers, but the one aspect most companies fail to implement are the Change Management processes that should correspond to the hardware investment. Unfortunately, the greatest hardware cannot prevent someone with the appropriate privileges from changing a configuration or migrating new code into production. These processes are as important as the hardware environment from an availability perspective.
The ideal hardware configuration really depends on the type of environment and the need for availability. There are several of options that could be put in place for this ideal setup, but cost is usually the prohibiting factor. DBAs need to determine what is acceptable downtime, and the hardware configuration should be adapted to that need. The one thing I can say is ideal, is having a standardized hardware and SQL Server configuration across severs.
If the worse occurs, and the physical facility goes away. What is the best way to prepare for such an event? And how do you recover from such event?
In the scenario you have explained, it is necessary to address these items in the discovery and planning stages of the disaster recovery project. It is always best to plan for these types of failures rather than scrambling once the failure has occurred. Based on the criticality of the data, it may be necessary to have a hot site with employees in this different geographical location. Typically, if a site has been destroyed, then staff in the impacted area may not be able to address the business needs due to transportation issues or additional priorities. As such, having a separate facility with staff may be the best bet.
In a total site loss, all systems cannot be recovered simultaneously; therefore it is very important when planning your disaster recovery to have a priority list of servers and applications, so these can be addressed first. In addition, there are also other servers that may contend for the same resources, so for this type of scenario site wide planning and documentation are very important.
Can you provide any examples of disaster recoveries you have been involved in?
Some of the disasters that I have been involved in have included power outages, generator failure, hardware failure, data corruption, DBA mistakes, network changes and system overloads. Each area requires a different set of people and a different approach of how to handle the issues. The disaster recovery plan should address each of these items and the steps for a successful recovery.
What are the biggest mistakes people make when they create a disaster recovery plan?
They do not keep it current and they do not test. All things sound great in theory and look good on paper, but if you do not test your plan, how will you know if it works? Also, keeping documentation current is extremely important. Configurations constantly change and not maintaining your documentation will make your plan fail. Generally once a step in the plan does not work, people are prone to throwing aside the plan and trying to fix the problem based on their knowledge.
What are some myths about disaster recovery plans?
I think the biggest myth is that a disaster recovery plan is only needed for total site recovery. Since the odds of this are not great, most people look at putting a plan together as a futile exercise that is not going to reap any benefits. Disaster recovery should be redefined as “recovery from unplanned downtime”.
How can a DBA learn more about disaster recovery?
There are several great resources on the web that DBAs can consult. Just doing searches for keywords on search engines will reveal a lot of information specific to SQL Server as well as more general information. The areas that DBAs should continue to improve on are planning and documentation skills. These two areas are key to Disaster Recovery, but also to any other procedural process that needs to be developed for their database environment.
Besides what has already been covered, what else would you like to mention?
A disaster recovery plan is definitely a key process that all companies should invest the time in preparing and maintaining. It’s like an insurance policy, most of the time you think it’s a waste of money and you don’t need it, but that one time you really need it you’re sure glad you had a plan in place. When a plan does not exist, more time is wasted on communicating the issue and how it is going to be resolved instead of actually fixing the problem. If a plan exists, you can just tell your users and management that the details are in the plan, so let me get to resolving the problem!
Brad M. McGehee is a full-time DBA with a large manufacturing company, and the publisher of http://www.SQL-Server-Performance.Com, a website specializing in SQL Server performance tuning and clustering.
He is an MVP, MCSE+I, MCSD, and MCT (former).
Brad also runs another website called http://www.WorldClassGear.com It provides independent gear reviews for backpackers, trekkers, and adventure travelers.