Thursday, July 24, 2008

Upgrade live applications to SQL Server 2005 for high availability

Most companies have the luxury of upgrading their applications in place. But for many industries such as credit card, banking, shopping and gaming companies, the prospect of downtime is unacceptable. Imagine eBay/PayPal taking the system down for even one hour for an upgrade. The costs quickly escalate into the millions in just lost immediate revenue, not to mention the customers that would be lost to competitors.
The challenge
Your boss has told you that the ordering application has to be upgraded to version 10 with less than 15 minutes downtime. The database in question is 4+ TB and takes 16 hours to upgrade without any traffic running to it.
Reality check
The best laid ideas often fail because the scope of the project is not set early on and strictly adhered to. You will probably need to focus on a few key application areas that are critical to convert live. In eBay's case, it might say that, throughout the course of the upgrade, activities such as placing auctions, bids, payments and so on have to be live, but it's acceptable if you can't provide feedback or account merges. Try to avoid converting any data live, but keep each functional unit (block of tables) intact and transfer them live.

Hypothetical process
What process you pick, is heavily dependent on your budget. In this hypothetical scenario, we'll use two clusters with two nodes each. But you could easily simulate this scenario using two SQL Server instances on the same physical (or cluster) box or virtual servers running something like VMware's ESX. Let's examine the figure below.

Using two clusters and two nodes for live application upgrade.
Stage 1
In stage 1, we have the live data flowing into the current version of the database application. We created a second conversion database on Cluster 1 that contains any tables we want to convert over to the upgraded database – while the application is live. The most important part at this point is determining how to maintain consistency in the tables. The most common method is saving off of the current date/reference number for all conversion data at the time you want to start converting. Then continually you'd use a job that fires off stored procedures to bring over dates/reference numbers greater than the initial numbers you saved off. You could use other methods, such as triggers or Service Broker, but both would involve modifications to the original database schema and increase your risk and chance of making a mistake.
Stage 2
Stage 2 is a second physical cluster (or second instance/virtual server, depending on budget). As far as the schema goes, the conversion database on Cluster 2 exactly matches the conversion database on Cluster 1.
As the application pushes live data into the Order App DB (V9), the SQL Agent job on Cluster 1 fires off stored procedures that move the data into the conversion database on Cluster 1. Once the data reaches the conversion database, you can set up a replication publication on Cluster 1 for the conversion database that has a subscription running on Cluster 2. This causes any data pushed into the conversion database to transfer almost immediately into the conversion database on Cluster 2.
As the application is live pushing data into Cluster 1's version 9 database and is replicated over to the conversion database on Cluster 2, you can run the database upgrade process on Cluster 2's Order App DB – which you already restored from production at some point. Regardless of the time the database upgrade takes to run, the live data will be flowing into the conversion database on Cluster 2. Once the database upgrade is complete, you can fire off the SQL Server Agent job on Cluster 2; this fires off stored procedures that convert the version 9 data tables in the conversion database into the upgraded version 10 database sitting on Cluster 2. At this point, you'll have a chance to do any necessary configuration/maintenance to the upgraded version 10 database.
Stage 3
Once the conversion database on Cluster 2 is close to empty and the clients are upgraded to version 10, it's time to simply turn off Cluster 1 and point all the clients to Cluster 2, which already have all the critical live data transferred. A quick method of switching clients to version 10 would be SMS or SoftGrid (application virtualization), or some other method of automatically deploying a new version of a client en masse to a large client base. This would allow you to deploy version 10 applications within 15 minutes and they'd point to the freshly upgraded version 10 database on Cluster 2.
Summary
Upgrading an application while it is live requires that a lot of custom code be written to ensure the tables are consistent between the clusters (servers or instances). Test the upgrade scenario/scripts out several times, verifying reports at all the various stages to make sure the upgrade flows like clockwork. Verifying the various conversion database processes and jobs and documenting and ensuring consistency is fairly boring work, but it will pay off in the end when the upgrade flows smoothly. It beats working out issues within the user's sight.
Post a Comment