SSD Performance has System-wide Effects

The impact of SSDs on I/O-related performance in a server is dramatic. Freed from the mechanical limitations of spinning disks, a single server can perform tens or hundreds of thousands of I/O operations per second (IOPS) instead of just a few thousand. Without long I/O waits, the processors, RAM and RAID controllers can all get much busier, but the impact on humbler pieces of the server design can also be significant. Servers based on all-SSD RAID start revealing limitations in many components that were once taken for granted.

Early in the development of the SR-71 SpeedServer, ION discovered that a server's power system was a critical part of the solution. “Power?” you ask. “I thought SSDs used much less power than spinning disks.” SSDs (solid-state drives) do use much less power than disks. Disks draw power on 12V for the motors and 5V for control and data. Until recently, power supplies for most server systems were optimized to provide plenty of 12V power, since it was mostly the serverboard that used significant 5V power. Most SSDs, in contrast, use only 5V power, and when the system gets busy writing to as many as 24 SSDs simultaneously, the 5V power requirements are much higher. Customizing a power supply for the higher 5V power requirements of an all-SSD server has been one of the challenges along the way.
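The 5V budget concern can be made concrete with a little arithmetic. The sketch below is illustrative only: the per-drive wattage and the 5V rail capacity are hypothetical placeholders, not measurements from the SR-71 SpeedServer or from any particular SSD or power supply. Substitute datasheet values for real hardware.

```python
def rail_current_amps(drive_count, watts_per_drive, rail_volts=5.0):
    """Total current on one rail when every drive is active at once (I = P / V)."""
    return drive_count * watts_per_drive / rail_volts


def headroom_amps(rail_capacity_amps, drive_count, watts_per_drive, rail_volts=5.0):
    """Remaining rail capacity; a negative result means the rail is oversubscribed."""
    return rail_capacity_amps - rail_current_amps(drive_count, watts_per_drive, rail_volts)


# Hypothetical figures, chosen only to show the calculation.
DRIVES = 24                    # drive count taken from the text
WATTS_PER_DRIVE_WRITING = 5.0  # assumed peak write power per SSD
RAIL_CAPACITY_AMPS = 20.0      # assumed 5V rating of a 12V-oriented supply

load = rail_current_amps(DRIVES, WATTS_PER_DRIVE_WRITING)
spare = headroom_amps(RAIL_CAPACITY_AMPS, DRIVES, WATTS_PER_DRIVE_WRITING)
print(f"5V load: {load:.1f} A, headroom: {spare:.1f} A")
```

With these assumed numbers the drives alone would oversubscribe the rail, before the serverboard's own 5V draw is counted. Real figures will differ, but the same calculation shows whether a given supply has enough 5V capacity.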

Another early assumption that was quickly set aside was the common belief that storage servers do not demand much from their processors. It may be true that it would take a large number of rotating disks to deliver enough input/output operations per second (IOPS) to keep a current Intel Xeon processor busy. With just 24 SSDs, however, it is quite possible to drive all cores of a pair of Xeon processors to full utilization. All-SSD servers demand the fastest processors available.

As the speed of SSDs has increased, RAID controllers have been improved to try to unlock the full potential of an all-SSD RAID array.  The result is that many more bits per second and transactions per second are moving in the system than was ever anticipated in systems based only on spinning disks.  The ION lab started seeing increasingly disparate performance between identical arrays of SSDs within the same system a couple of years ago.  The obvious assumption was that one of the arrays had one or more SSDs that were responding slowly, but it proved impossible to isolate any slow SSDs, individually or in array-sized sets.
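One way to put a number on "disparate performance between identical arrays" is to compare each array's measured IOPS against the median of its peers. The following is a minimal sketch, not ION's actual test procedure: the function name, the 10% tolerance and the sample figures are all hypothetical.

```python
from statistics import median


def find_slow_arrays(iops_by_array, tolerance=0.10):
    """Return arrays whose IOPS fall more than `tolerance` below the median.

    iops_by_array maps an array name to its measured IOPS under an identical
    workload. Identical arrays should land close to one another, so a large
    shortfall marks the array (or its controller, cable or expander path)
    as worth investigating.
    """
    if not iops_by_array:
        return {}
    threshold = median(iops_by_array.values()) * (1.0 - tolerance)
    return {name: iops for name, iops in iops_by_array.items() if iops < threshold}


# Hypothetical measurements for three identical arrays in one chassis.
measured = {"array0": 100_000, "array1": 98_000, "array2": 70_000}
print(find_slow_arrays(measured))
```

A check like this only says which array is slow, not why. As the investigation described here shows, the slow array may contain perfectly healthy SSDs.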

As the ION engineers worked through these symptoms, the next most obvious assumption was that it must be the RAID controller in question. Once again, it proved impossible to isolate one MegaRAID controller that was operating slower than the other two identical controllers in the system.

It was not until this point that the SAS multi-lane cables were suspected. There were no problems with the controllers seeing all of their attached SSDs and no problems configuring RAID arrays. In fact, in some circumstances, all of the arrays within the system were capable of phenomenal data rates and IOPS rates. The problem was, in fact, the SAS cables, and as the investigation proceeded, it turned out to affect every cable manufacturer whose products were tested.

The solution to the cable problem was the very high-quality flat Twin Axial SAS cables manufactured by 3M. In addition to eliminating the inconsistent performance of the SSD arrays, the flat, foldable 3M cabling made cable routing easier, improving maintainability and air flow within the system.

A new challenge to the design of all-SSD servers presented itself recently. The system in question paired the latest MegaRAID controller with 10 Intel DC S3700 800GB SSDs. The RAID controller failed an SSD out of the array. And its replacement. On closer inspection, the controller was logging frequent errors against all of the SSDs, with drive timeouts and drive resets. Performance was obviously well below expectations. As before, the SSDs, the RAID controller and the SAS cables were all eliminated as the cause of the problem. That left just one piece of the puzzle: the SAS expander. As with the cable problem, all drives were found, arrays could be created and some fairly significant I/O could be done. In this case, a simple firmware update to the SAS expander eliminated the issue; performance leapt up and maximum latencies returned to standard, disk-impossible levels.
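The reasoning in this case, that errors spread across every drive point away from any single SSD and toward a shared component, can be sketched in a few lines. This is an illustration under stated assumptions: the event tuples are a hypothetical, already-parsed form (it does not reflect any real MegaRAID log format), and the 80% cutoff is arbitrary.

```python
def errors_point_to_shared_component(events, drive_count, min_fraction=0.8):
    """Return True when error events are spread across most of the drives.

    events is an iterable of (drive_id, event_kind) tuples, for example
    (3, "timeout") or (7, "reset"). If nearly every drive has logged
    errors, a shared part of the path (controller, cable or expander)
    is a likelier culprit than any individual SSD.
    """
    if drive_count <= 0:
        return False
    affected = {drive_id for drive_id, _kind in events}
    return len(affected) >= min_fraction * drive_count


# Hypothetical events: every one of 10 drives shows a timeout or a reset.
widespread = [(d, "timeout" if d % 2 else "reset") for d in range(10)]
# Hypothetical events: one drive misbehaving repeatedly.
isolated = [(4, "timeout"), (4, "reset"), (4, "timeout")]

print(errors_point_to_shared_component(widespread, drive_count=10))
print(errors_point_to_shared_component(isolated, drive_count=10))
```

The first case resembles the situation described above, where replacing the failed SSD could not help because the fault lay elsewhere in the data path.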

Experience has shown that the very high data rates and transaction rates possible in servers using all-SSD RAID expose quality issues and performance sensitivities that would never be discovered in all-disk servers. As with most challenges, a painstaking engineering approach that re-evaluates all system interactions reveals the weak link and points to the solution.
