Wednesday, June 11, 2008

SLA - Service Level Agreement for SQL Server 2005

Do you know what an SLA is?

SLA = Service Level Agreement. SLAs are agreements between you and your customers. If you're a DBA, then your customer is typically the company for whom you work. Examples of SLAs are:

1. In the event of a corruption, or other disaster, the maximum amount of data loss is the last 15 minutes of transactions.
2. In the event of a corruption, or other disaster, the maximum amount of downtime the application can tolerate is 20 minutes.

Usually, it's a combination of SLAs such as those above.

Do you know why SLAs are important?

Here's the catch - an SLA is really more than just an agreement between you and your customers - it's more like a contract that you're obligated to meet. This means that if you're a DBA with zero-downtime and zero-data loss SLAs, you need to make sure that in the event of a corruption you can actually meet those SLAs. The obvious thing is that if the SLAs cannot be met then the business will suffer downtime and data loss. The not so obvious thing is that if you're the one who agreed to the SLAs in the first place, and when the disaster strikes, the capabilities of the system are far below the SLA's requirements, then you could lose your job - resume/CV time - I've heard of it happening...

Do you know your SLAs?

You have to know what your SLAs are so you can make sure the system can meet them. Several DBAs I discussed this with don't know what their business' SLAs are, even though they are responsible for making sure they are met. I find this astounding - how can you sign up for meeting an SLA when you don't know that the SLA is? Especially if failing to meet the SLA could lead to resume/CV time...

Do you think you can meet your SLAs?

The other reason to know your SLAs, of course, is so that you can correctly architect your system to meet them. There are a bunch of technologies you can use and strategies you can employ to work towards meeting your SLAs (well beyond the scope of this blog post but will be covered through the year). If you find that you can't meet your SLAs, you need to push-back on your management - otherwise you're setting yourself up for trouble when a disaster occurs and you can't meet the SLAs - you'll be held responsible.

Do you know you can meet your SLAs?

Your disaster recovery plan looks great on paper - but have you actually tried it? I know of one company that has a 15 minute downtime SLA for a 300+GB database but the DBA is relying on clusters to provide that for him. That won't work if the database is corrupt (remember a failover cluster has a single point of failure in its shared-nothing configuration - the disks) and needs to be restored from the last full backup... Another company I know of relies on database mirroring to failover in the event of a disaster but has never tried it to see if their application fails over gracefully... You have to make sure you've practiced recovering from a disaster before the first real disaster happens - you'll be amazed at the little things that are discovered (e.g. if the on-site backups are bad, how long will it take to get the offsite copies brought in-house from the off-site location 100 miles away? Can you still meet your 15 minute downtime SLA in that case?)

SQL Server Specific on SLA

Hours of Operation

1. Hours that the partition/table/database is available to users. The SLAs may be different for various parts of the database, depending on what applications need access to them. Why is it important to differentiate at various granularities? For example, depending on how the unit of data is used, it may require different maintenance than other data and so knowing the availability SLA allows maintenance downtime to be planned.
2. Hours reserved for planned downtime. Again, this may differ at various granularities of data.
3. Amount of advance notice for extended downtime or other changes that affect users. For instance, when my bank upgraded its computer system last year, they gave a series of warnings over the preceeding few months so that people weren't surprised.

Service Availability

1. Percentage of time SQL Server service is running and able to service connections.
2. Percentage of time a particular partition/table/database is available for use (i.e. not exclusively locked for maintenance or restore).

System Performance

1. Number of concurrent users the system supports.
2. Number of transactions supported per unit of time.
3. Acceptable level of performance, such as latency experienced by users for a variety of operations.
4. Minimum time for an update to be replicated to various remote sites.

Disaster Recovery

1. Time allowed for recovery from each type of failure (e.g. accidental data deletion, database corruption, SQL Server crash, OS crash, server failure, site failure).
2. Time it takes to bring critical data online (e.g. the read/write partitions of a sales database) such that operations can continue and less critical data can be recovered later.
3. Time taken to recover data to the point of failure.
4. Maximum acceptable data/transaction/work loss for various kinds of failures.
5. Maximum time for application failover to a remote server/site.


1. Methods available for application users to get help.
2. Maximum response/resolution time from a DBA to respond to various types of problems.


1. Maximum amount of space for user tables/databases.
2. Amount of users in specific roles.

Some other things to consider are how you define an SLA - for example, in transactions per second or in commit latency that users experience - and the interplay between SLAs - for example, the commit latency SLA may be affected if the acceptable data loss SLA is zero and a solution such as synchronous database mirroring or remote SAN mirroring is used.

Bottom line is that although it can be simple to quickly define and announce a set of SLAs for a given application, its very difficult to make sure that each is palatable to all involved, guarantee that each can be met, and allow easy diagnosis of the system to work out which component is failing when an SLA is not met. SLAs really need to be defined while a sytem is being designed as retro-fitting SLAs after-the-fact can be very time-consuming and costly.

1 comment:

Xeno said...