Monday, 16 February 2009

Sizing a database server continued

Sizing of a database server from Perfmon Measurements

In an early blog entry I described the performance monitoring excersize that was to be used to determine the hardware specifications of a database server. Here is a list of process counters and there recommended limits

Processor Counters :
Processor Time: The Percentage of elapsed time that the processor spends to execute a non-idle thread. It should not exceed 50% long-term.

Memory Counters :
Available Mbytes: The amount of physical memory, in Megabytes, immediately available for allocation to a process or for system use

Paging File % Usage: The amount of the Page File instance in use in percent. Should not exceed 10% -20%

Storage Counters :
Avg. Disk sec/Read. The average time, in seconds, of a read of data from the disk. Should not exceed 20 ms

Avg. Disk sec/Write. The average time, in seconds of a write of data to the disk. Should not exceed 20 ms

Avg. Disk Queue Length. The average numbers of both read and write requests that were queued for the selected disk during the sample interval Should not exceed 2 per spindle

Disk Read/sec. The rate of read operations on the disk

Disk Write/sec. The rate of write operations on the disk

Disk Transfer/sec. Indicates the number of reads and writes completed per second, regardless of how much data they involve. Should not exceed 50 per spindle

Current Disk Queue Length. The number of requests outstanding on the disk at the time the performance data is collected. It also includes requests in service at the time of the collection. This is an instantaneous snapshot, not an average over the time interval. Multi-spindle disk devices can have multiple requests that are active at one time, but other concurrent requests are awaiting service. Requests experience delays proportional to the length of this queue minus the number of spindles on the disks. Should not exceed 2 per spindle

Network Counters :
Output Queue Length. Is the length of the output packet queue (in packets). Should not exceed 2 packets

System Counters :
System Uptime. The elapsed time in seconds that the computer has been running since it was last started

Server Bytes Received/sec. The number of bytes the server has received from the network

Server Bytes Transmitted/sec. The number of bytes the server has sent over the network

Objects Processes. The number of processes in the system at the time of data collection

We plotted the performance counters and marked the recommended limits. From this we where able to determine an number of configuration issues such as software that had been installed and consumes resources but that does nothing, or anti virus software consuming CPU and memory.

The I/O queue grows over the allowed value; also, the latencies (read, write) was going over 20ms. More spindles can help to shorten the queue and latency.

To calculate the value of IOPS that the server can handle, we need to know how the percentage of Reads and Writes we have on the disk. In our case this was 20 percent of all IOPS are reads. With Raid 1, 20 percent Reads and 14 disks we get an IOPS of 1167. Raid 5 is much slower because of a double Write Penalty.

The performance monitor does not give us enough information to determine in which ratio we should split the hard disks but we will go for: 2 for the system. 4 for the TempDB, 4 for the MDF and 4 for the LDF. So the hardware for this looks like

1x Smart Array P800 with BBWC 512MB
16 x 72 GB 3G SAS 15K SFF DP ENT HDD

Part of the performance monitoring covered the database backup. On one of our servers we where backing up onto a NetApp share. It took 11 hours to complete this backup. From the performance counters we could determine this was due to a slow data io to the NetApp share. We where also able to determine that the performance of the NetApp share was not smooth and dropped off drastically in the middle of the night. This was probably when transferring data or rearranging the data storage within the Netapp. The conclusion was that the performance of NetApp is not sufficient for our database and that we would need to add some hard disks to our local EVA to make the backup window smaller and copy the backup files at a later date. This would mean a different Smart Array controller and an additional box to house the 2 additional hard disks.

I saw a study on the performance of NetApp with a simulated load over time. The study showed that over time the performance of the i/o to the NetApp share deteriorates over time. To start with NetApp is faster that the local storage but within a few days becomes 3.5 times slower. I think the reason for this is defragmentation tasks that happen in the background. The same study showed drops in performance resulting from reallocation of data within the device.

To determine the amount of memory look at the percent Pagefile usage. This should not exceed 10% -20%. In our case it brings the Memory requirement to 8 GB. Since this is more than 4 GB we would require a 64 bit operating system.

Finally the CPU can be determined from % Processor Time which should not exceed 50% over an extended period of time. At this time CPUs are cheap and so we will go for 2 quad core CPUs

So the final specification looks something like:

64 bit SQL 2008
2 Quad core CPUs
Smart array with 16 harddisks

Today this will cost around 14k CHF