xseries.org
system performance - rules of thumb
please note, these are rules of thumb, not hard rules -- your mileage may vary
the info here is as is and is guaranteed to be neither 100% accurate nor 100% comprehensive
this is not a detailed analysis page. all you see here are the results of someone else's analysis.
Last updated: Jan 2005
gigE adapters
assumptions: gigE adapters, 64k blocks, streaming data
One gigE adapter can saturate an older P-III system
Two gigE adapters can saturate a 400mhz FSB XEON system
Three gigE adapters can saturate an 800mhz FSB XEON system
notes: at smaller block sizes, you will max out CPU utilization long before you peak your throughput performance. 64k seems to be the sweet spot.
current TOE (TCP offload engine) products (as of May 2004) seem to have their biggest impact (reducing CPU utilization) with large block transfers -- don't count on TOE to solve your woes... yet
Bus speeds
PCI-X 133mhz x 64 bit = .85 GB/s half duplex
PCI-Express 4x = 1GB/s full duplex
gigE = .1 GB/s full duplex
Using large frames (8k+), you can get up to 160 megabytes/sec throughput per NIC (that’s full duplex, saturating wire speed).
Note: that total is consistent with an estimated 80 megabytes/sec (or .08 GB/s, to keep with the earlier numbers) in EACH DIRECTION after overhead.
Dual NIC may not scale very well under Linux until you hit 2k frames (today). Windows currently performs about 2x Linux overall (single NIC). If a dual NIC doesn’t scale up in Linux, then a dual NIC Windows box could end up with 4x the performance of a dual NIC Linux box.
Bottlenecks may occur at much lower throughputs due to small packets, especially if they are all going the same direction.
Faster FSB = better IP performance
U320 SCSI = .32 GB/s half duplex
2g FC = .2 GB/s full duplex
SATA = .15 GB/s full duplex
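As a rough sanity check on the figures above, the sketch below estimates how many fully loaded gigE NICs a bus could carry before the bus itself becomes the limit. It uses only numbers from this list. The function names are made up for illustration, and it assumes that "full duplex" means the quoted figure is available in each direction, while "half duplex" means both directions share it.

```python
# Figures copied from the list above, in GB/s.
PCI_X_133 = 0.85       # half duplex: both directions share this bandwidth
PCIE_X4 = 1.0          # full duplex: assumed here to mean 1 GB/s per direction
GIGE_EACH_WAY = 0.08   # estimated gigE throughput per direction after overhead


def nics_on_half_duplex_bus(bus_gbs, nic_each_way_gbs):
    """NICs a shared (half-duplex) bus can carry with every NIC busy both ways."""
    return bus_gbs / (2 * nic_each_way_gbs)


def nics_on_full_duplex_bus(bus_each_way_gbs, nic_each_way_gbs):
    """NICs a full-duplex bus can carry; each direction is budgeted separately."""
    return bus_each_way_gbs / nic_each_way_gbs


print(f"PCI-X 133: {nics_on_half_duplex_bus(PCI_X_133, GIGE_EACH_WAY):.1f} NICs")
print(f"PCIe 4x:   {nics_on_full_duplex_bus(PCIE_X4, GIGE_EACH_WAY):.1f} NICs")
```

Under those assumptions the bus has room for roughly 5 NICs on PCI-X and roughly 12 on PCI-Express 4x, which is more than the one to three adapters that saturate the systems in the gigE section. That suggests the CPU and FSB, not the I/O bus, are usually the first limit.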
CPUs
this is a VERY rough comparison of CPU performance
there is a LOT more that goes into actual system performance, especially memory controllers and capability of the supported OS
different architectures scale up differently as well, so quantities of CPUs may be even more inaccurate than this comparison
also, cache sizes vary, so some apps (like databases) MAY actually squeeze some bigger differences out
reference CPU   | roughly equal CPUs
Alpha 500mhz    | P-III 500mhz
Sparc III 1ghz  | P-III 1ghz
P-4 2ghz        | P-III 1.4ghz
PPC970 1.6ghz   | (P-4) Xeon DP 3.2ghz
Opteron 1.6ghz  | (P-4) Xeon DP 3.2ghz
XEON vs Opteron:
Memory latency: Close or equal
CPU Core intensive: Xeon (Better clock speed helps)
Memory Bandwidth: Opteron (Extra memory controllers help)
IP throughput: Identical (The IP stack isn’t NUMA aware yet)
Disk
10krpm | ~8.1ms per I/O | ~123 IOPS | @ 8KB blocks, 123 IOPS = 984 KB/s per drive
15krpm | ~5.8ms per I/O | ~172 IOPS | @ 8KB blocks, 172 IOPS = 1.4 MB/s per drive
You will almost certainly run out of IOPS long before you run out of bandwidth….
Forget about U160, U320, 2g fibre, etc
There ARE exceptions
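The arithmetic behind the per-drive figures can be sketched as follows. This is a minimal illustration, not a benchmark: the function names are hypothetical, IOPS is taken as one second divided by the time per I/O, and decimal units are assumed (1 GB = 1,000,000 KB). The U320 figure of .32 GB/s comes from the bus speeds section.

```python
def iops_from_service_time(ms_per_io):
    """Approximate random IOPS for one drive: 1000 ms divided by ms per I/O."""
    return 1000.0 / ms_per_io


def throughput_kb_per_s(iops, block_kb=8):
    """Throughput in KB/s for a given IOPS rate and block size in KB."""
    return iops * block_kb


def drives_to_fill_bus(bus_gb_per_s, per_drive_kb_per_s):
    """Drives doing random I/O needed to fill a bus (decimal units assumed)."""
    return bus_gb_per_s * 1_000_000 / per_drive_kb_per_s


for label, ms in (("10krpm", 8.1), ("15krpm", 5.8)):
    iops = int(iops_from_service_time(ms))   # round down to whole I/Os
    kb_s = throughput_kb_per_s(iops)         # 8 KB blocks
    drives = drives_to_fill_bus(0.32, kb_s)  # U320 SCSI = .32 GB/s
    print(f"{label}: ~{iops} IOPS, {kb_s} KB/s, ~{drives:.0f} drives to fill U320")
```

At 8 KB random I/O it would take hundreds of drives to fill a single U320 bus, which is why IOPS rather than bus bandwidth is normally the limit.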
Assuming typical 70% read/30% write spread:
Random -- Raid 10 is best (about 2x perf of raid-5)
Sequential -- Raid 5 and raid 10 are equal (raid-5 can be a hair faster in some write instances)
Raid 6 (or whatever you call N+2 dual checksum striping), about 80% of the perf of Raid-5
When you split buses, you add interrupts and you split up your avail memory.
As a result, you can actually DROP performance by splitting up a small number of drives.
S-ATA works well for streaming data (if it’s single threaded, like most free benchmark tests), or works very well for cost/performance
SCSI and FC are much better for random, multi-threaded access.
SCSI disks spin faster
SCSI disks have shorter seek times
SCSI disks use more power
SCSI disks are built heavier to handle greater vibration isolation
As you add more disks, vibration causes more read retries (check the soft error logs on the drives)
Because S-ATA disks are designed for desktop use, they have local cache turned on. When power is lost, so is any data in cache that hasn’t been written out. Hope it wasn’t a critical db write that hadn’t made it to disk yet. In servers, having local cache turned off is a good thing, and server class drives typically ship with it turned off.
SCSI drives will queue up a set of transactions and sort them prior to write and read functions, reducing wear and tear from disk thrashing.
Larger drives seek faster (tracks are closer together so the heads move less). 73g to 146g provides about 20% increase. Two 73g to one 146g provides about 80% net degradation. That per-drive gain in throughput and latency comes from getting rid of seek overhead.
VMware
Yes, VMware takes overhead.
However, in many areas, it makes up for it. NIC drivers will switch from polling mode to coalesce mode under high workloads. This will reduce the interrupt requests performed by the OS, reducing one of the worst bottlenecking issues of the IP stack. Thus, as we add more systems doing a moderate IP load, the overall load can reduce total overhead per packet and improve overall performance to overcome VMware’s inherent overhead.
Databases
MS SQL is the only db that has processor thread affinity. The others have node affinity.
Processor affinity is ideal for Opteron, which has memory attached to each CPU.
Node affinity is targeted at 4 CPU nodes as defined by an Intel FSB, and seems to see all of the Opterons as a group as a single node…. providing no affinity in that instance.
If you disagree with any rule of thumb, please email admin@xseries.org with your argument.
As always, if you know something else that SHOULD be up here, submit it.
Again, this info is as is, and is not guaranteed accurate, politically correct, or anything else. Use at your own risk. If the numbers here are a few percent off, that’s life -- there will be no stress by the contributor of this for minor inaccuracies. Gross inaccuracies, with adequate backup information and validation on the part of the contributor, will happily be adjusted.