xseries.org

system performance - rules of thumb             

please note, these are rules of thumb, not hard rules -- your mileage may vary

the info here is as is and is guaranteed to be neither 100% accurate nor 100% comprehensive

this is not a detailed analysis page. all you see here are the results of someone else's analysis.                 

Last updated: Jan 2005

 

gigE adapters                                                                                                                                                   

assumptions: gigE adapters, 64k blocks, streaming data

One gigE adapter can saturate an older P-III system

Two gigE adapters can saturate a 400mhz FSB XEON system

Three gigE adapters can saturate an 800mhz FSB XEON system

notes:   at smaller block sizes, you will max out CPU utilization long before you peak your throughput performance. 64k seems to be the sweet spot.

            current TOE products (as of May 2004) seem to have their biggest impact (reducing CPU utilization) with large block transfers -- don't count on TOE to solve your woes... yet
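
A quick sketch of why small blocks hurt: to keep one gigE link full, the host has to process far more blocks per second as the block size shrinks, and each block carries some fixed CPU cost. The .1 GB/s link figure is the one used on this page; the function name and the list of block sizes are only for illustration.

```python
# Rough sketch: blocks per second needed to keep one gigE link full.
# ASSUMPTIONS: ~0.1 GB/s usable in one direction (this page's gigE figure),
# 1 GB = 10**9 bytes, 1k = 1024 bytes.

LINK_BYTES_PER_SEC = 100_000_000  # 0.1 GB/s, one direction


def blocks_per_second(block_size_bytes: int) -> float:
    """Blocks the host must process each second to keep the link full."""
    return LINK_BYTES_PER_SEC / block_size_bytes


if __name__ == "__main__":
    for size_k in (4, 8, 16, 64):
        rate = blocks_per_second(size_k * 1024)
        print(f"{size_k:>2}k blocks: ~{rate:,.0f} blocks/sec")
```

At 64k that is roughly 1,500 blocks/sec; at 4k it is 16x that many, and that is where the CPU goes.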

 

 

Bus speeds

PCI-X 133mhz x 64 bit = .85 GB/s half duplex

PCI-Express 4x = 1GB/s full duplex

 

gigE = .1 GB/s full duplex

Using large frames (8k+), you can get up to 160 megabytes/sec of total throughput per NIC (that’s both directions combined, saturating wire speed)

            Note: that is an estimated 80 megabytes/sec (or .08 GB/s to keep with the earlier numbers) EACH DIRECTION after overhead, which adds up to the 160.

Dual NICs may not scale very well under Linux until you hit 2k frames (today). Windows currently delivers about 2x the throughput of Linux overall (single NIC). If a dual NIC doesn’t scale up in Linux, then a dual NIC Windows box could end up at 4x the performance of a dual NIC Linux box.

Bottlenecks may occur at much lower throughputs due to small packets, especially if they are all going the same direction.

Faster FSB = better IP performance

 

U320 SCSI = .32 GB/s half duplex

2g FC = .2 GB/s full duplex

SATA = .15 GB/s half duplex
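
To put the bus numbers next to the NIC numbers, here is a small sketch that works out how many gigE NICs it takes to fill a bus. The bus and NIC figures are the rules of thumb from this section; the names `BUS_GB_PER_SEC` and `nics_to_fill` are made up for the example, and the result is deliberately fractional because none of this is exact.

```python
# Rule-of-thumb figures from this page, in GB/s.
# "half" = capacity shared by both directions, "full" = that much each direction.
BUS_GB_PER_SEC = {
    "PCI-X 133mhz x 64 bit": (0.85, "half"),
    "PCI-Express 4x": (1.0, "full"),
}

GIGE_EACH_DIRECTION = 0.08  # GB/s per direction after overhead, large frames


def nics_to_fill(bus: str, bidirectional: bool = True) -> float:
    """gigE NICs needed to fill a bus (fractional on purpose)."""
    capacity, duplex = BUS_GB_PER_SEC[bus]
    if duplex == "half" and bidirectional:
        # Both directions of NIC traffic compete for the same bus capacity.
        per_nic = 2 * GIGE_EACH_DIRECTION
    else:
        # Full-duplex bus (or one-way traffic): only one direction counts.
        per_nic = GIGE_EACH_DIRECTION
    return capacity / per_nic


if __name__ == "__main__":
    for name in BUS_GB_PER_SEC:
        print(f"{name}: ~{nics_to_fill(name):.1f} NICs running flat out both ways")
```

By these numbers a PCI-X bus is full at a little over five NICs streaming both ways, which is well past the point where the CPUs above have already run out.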

 

 

CPUs                                                                                                                                                  

this is a VERY rough comparison of CPU performance

there is a LOT more that goes into actual system performance, especially memory controllers and capability of the supported OS

different architectures scale up differently as well, so quantities of CPUs may be even more inaccurate than this comparison

also, cache sizes vary, so some apps (like databases) MAY actually squeeze some bigger differences out          

 

reference CPU           roughly equal CPUs

Alpha 500mhz            P-III 500mhz
Sparc III 1ghz          P-III 1ghz
P-4 2ghz                P-III 1.4ghz
PPC970 1.6ghz           (P-4) Xeon DP 3.2ghz
Opteron 1.6ghz          (P-4) Xeon DP 3.2ghz

 

 

                   Memory latency    CPU core intensive          Memory bandwidth                 IP throughput
XEON vs Opteron    Close or equal    Xeon                        Opteron                          Identical
                                     Better clock speed helps    Extra memory controllers help    The IP stack isn’t NUMA aware yet

 

Disk

10krpm     ~8.1ms per I/O     ~123 IOPS     @ 8KB blocks, 123 IOPS = 984 KB/s per drive

15krpm     ~5.8ms per I/O     ~172 IOPS     @ 8KB blocks, 172 IOPS = ~1.4 MB/s per drive

You will almost certainly run out of IOPS long before you run out of bandwidth.

Forget about U160, U320, 2g fibre, etc. The bus speed is rarely the limit.

There ARE exceptions                                     
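
The IOPS arithmetic above can be sketched in a few lines. The service times, the 8KB block size, and the U320 figure are the ones from this page; the function names are made up for the example, and 1 GB is taken as 10**6 KB to match the round numbers used here.

```python
def iops_from_service_time(ms_per_io: float) -> float:
    """Random I/Os per second one drive can do, given average ms per I/O."""
    return 1000.0 / ms_per_io


def throughput_kb_per_sec(iops: float, block_kb: float = 8) -> float:
    """Throughput of one drive doing random I/O at the given block size."""
    return iops * block_kb


def drives_to_fill_bus(bus_gb_per_sec: float, iops: float, block_kb: float = 8) -> float:
    """Drives doing random I/O needed before the bus becomes the bottleneck."""
    bus_kb_per_sec = bus_gb_per_sec * 1_000_000  # 1 GB = 10**6 KB here
    return bus_kb_per_sec / throughput_kb_per_sec(iops, block_kb)


if __name__ == "__main__":
    for name, ms in (("10krpm", 8.1), ("15krpm", 5.8)):
        iops = round(iops_from_service_time(ms))
        print(f"{name}: ~{iops} IOPS, ~{throughput_kb_per_sec(iops):.0f} KB/s at 8KB, "
              f"~{drives_to_fill_bus(0.32, iops):.0f} drives to fill U320")
```

At 8KB random I/O it takes a couple of hundred 15krpm drives to fill a single U320 bus, which is why the bus rating almost never matters for this kind of load.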

Assuming typical 70% read/30% write spread:

Random -- RAID 10 is best (about 2x the performance of RAID 5)

Sequential -- RAID 5 and RAID 10 are equal (RAID 5 can be a hair faster in some write instances)

RAID 6 (or whatever you call N+2 dual checksum striping) gives about 80% of the performance of RAID 5
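
For a feel of where numbers like these come from, here is a sketch using the common textbook "write penalty" model. This model is NOT from this page and is not how the figures above were produced: it assumes each host write costs 2 back-end I/Os on RAID 10, 4 on RAID 5 and 6 on RAID 6, with no controller cache helping. The names are made up for the example.

```python
# Textbook write-penalty model (an assumption, not this page's method):
# back-end disk I/Os generated by one small random host write.
WRITE_PENALTY = {"raid10": 2, "raid5": 4, "raid6": 6}


def host_iops(level: str, drives: int, iops_per_drive: float,
              read_fraction: float = 0.7) -> float:
    """Random host IOPS an array can sustain under the write-penalty model."""
    write_fraction = 1.0 - read_fraction
    # Average back-end I/Os consumed per host I/O (reads cost 1).
    backend_per_host_io = read_fraction + write_fraction * WRITE_PENALTY[level]
    return drives * iops_per_drive / backend_per_host_io


def relative_to_raid5(level: str, read_fraction: float = 0.7) -> float:
    """Performance of `level` relative to RAID 5 on the same drives."""
    return (host_iops(level, 1, 1.0, read_fraction)
            / host_iops("raid5", 1, 1.0, read_fraction))


if __name__ == "__main__":
    for level in ("raid10", "raid5", "raid6"):
        print(f"{level}: {relative_to_raid5(level):.2f}x RAID 5 at 70/30 random")
```

With the 70/30 mix the model puts RAID 6 at about 0.76x RAID 5, close to the 80% rule, and RAID 10 at about 1.5x rather than 2x. The gap is a reminder that these are rules of thumb from real-world analysis, not the output of a formula.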

When you split buses, you add interrupts and you split up your available memory.

As a result, you can actually DROP performance by splitting up a small number of drives.

 

 

S-ATA works well for streaming data (if it’s single threaded, like most free benchmark tests) and does very well on cost/performance

SCSI and FC are much better for random, multi-threaded access.

SCSI disks spin faster

SCSI disks have shorter seek times

SCSI disks use more power

SCSI disks are built heavier for better vibration isolation

As you add more disks, vibration causes more read retries (check the soft error logs on the drives)

 

Because S-ATA disks are designed for desktop use, they ship with the local (on-drive) write cache turned on. When power is lost, so is any data in the cache that hasn’t been written out. Hope it wasn’t a critical db write that hadn’t made it to disk yet. In servers, having the local cache turned off is a good thing, and server class drives typically ship with it turned off.

 

SCSI drives will queue up a set of transactions and sort them prior to write and read functions, reducing wear and tear from disk thrashing.

 

Larger drives seek faster (tracks are closer together so the heads move less). Going from 73g to 146g drives provides about a 20% increase. Going from two 73g drives to one 146g drive provides about an 80% net degradation. The performance increase in throughput and latency comes from getting rid of seek overhead.
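
One way to read the 80% figure is sketched below. This is an assumption, since the page doesn't show its working: treat a 73g drive as 1.0 units of I/O, a 146g drive as 1.2 units (the 20% increase), and compare two spindles against one. The function name is made up for the example.

```python
def relative_io(drives: int, per_drive_factor: float) -> float:
    """Total I/O capability in units of one 73g drive (assumed = 1.0)."""
    return drives * per_drive_factor


two_small = relative_io(2, 1.0)   # two 73g drives
one_large = relative_io(1, 1.2)   # one 146g drive, ~20% faster per spindle

# Capability given up by consolidating, in units of one 73g drive.
lost = two_small - one_large

if __name__ == "__main__":
    print(f"two 73g: {two_small:.1f}  one 146g: {one_large:.1f}  lost: {lost:.1f}")
```

On this reading you give up about 0.8 of a 73g drive's worth of I/O, leaving 60% of what the two spindles had. Either way the point stands: capacity per spindle goes up a lot faster than performance per spindle.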

 

 

VMware

Yes, VMware takes overhead.

However, in many areas, it makes up for it. NIC drivers will switch from interrupt-per-packet mode to coalesce mode under high workloads. This reduces the number of interrupt requests the OS has to service, which is one of the worst bottlenecks of the IP stack. Thus, as we add more systems each doing a moderate IP load, the combined load can reduce total overhead per packet and improve overall performance enough to overcome VMware’s inherent overhead.
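
A rough sketch of what coalescing buys: count the packets per second on a busy gigE link, then divide by how many packets get handled per interrupt. The 38 bytes of per-frame Ethernet overhead are standard; the coalescing factor of 8 is purely hypothetical, and real drivers pick their own values.

```python
# Per-frame Ethernet overhead on the wire:
# preamble 8 + header 14 + FCS 4 + inter-frame gap 12 = 38 bytes.
ETHERNET_OVERHEAD_BYTES = 38


def packets_per_second(link_bits_per_sec: float, payload_bytes: int) -> float:
    """Frames per second at wire speed for a given payload (MTU) size."""
    wire_bytes = payload_bytes + ETHERNET_OVERHEAD_BYTES
    return link_bits_per_sec / (wire_bytes * 8)


def interrupts_per_second(pps: float, packets_per_interrupt: int) -> float:
    """Interrupt rate when several packets are serviced per interrupt."""
    return pps / packets_per_interrupt


if __name__ == "__main__":
    pps = packets_per_second(1_000_000_000, 1500)
    for factor in (1, 8):  # 8 is a hypothetical coalescing factor
        rate = interrupts_per_second(pps, factor)
        print(f"{factor} packet(s) per interrupt: ~{rate:,.0f} interrupts/sec")
```

A saturated gigE link with standard 1500 byte frames is about 81,000 packets/sec, so going from one interrupt per packet to one per eight packets takes the interrupt rate from about 81,000/sec to about 10,000/sec.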

 

Databases

MS SQL is the only db that has processor thread affinity. The others have node affinity.

Processor affinity is ideal for Opteron, which has memory attached to each CPU.

Node affinity is targeted at 4 CPU nodes as defined by an Intel FSB, and seems to see all of the Opterons as a group, as a single node, providing no affinity in that instance.

 

 

If you disagree with any rule of thumb, please email admin@xseries.org with your argument.

As always, if you know something else that SHOULD be up here, submit it.

Again, this info is as is, and is not guaranteed accurate, politically correct, or anything else. Use at your own risk. If the numbers here are a few percent off, that’s life -- there will be no stress by the contributor of this for minor inaccuracies. Gross inaccuracies, with adequate backup information and validation on the part of the contributor, will happily be adjusted.