Or, How Many Eggs in One Basket?
Huge storage systems are available today, supporting large numbers of disks and allowing the creation of massive storage resources. Storage servers and storage enclosures supporting up to (60) 3.5″ disk drives are now common. Filling those bays with 6TB disks yields a system with 360TB of raw capacity – one third of a petabyte! Such a system is now easy to build, offers amazing density in 8 rack units or less, and is quite affordable. But does it make sense?
One of the factors to consider in sizing the component RAID 5 arrays is the Unrecoverable Read Error Rate (URE) of the disks. A URE is one class of error that causes a RAID controller to fail a disk, and a similar failure on another member of the array during the rebuild would lead to data loss. The enterprise SATA drives used by ION in its storage servers for capacity are the Seagate Enterprise Capacity drives, with a URE rate of 1 sector per 10^15 bits read. That works out to about one error per 125TB (roughly 114TiB) read, which suggests that individual RAID 5 arrays should probably be limited to about 8 to 12 disks. RAID 50 and RAID 60 both create RAID 0 stripes across smaller, matching RAID 5 or RAID 6 arrays. These approaches decrease the risk of multiple URE failures, increase redundancy, reduce rebuild time and add performance.
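The suggested 8-to-12-disk limit can be sanity-checked with a little arithmetic. Here is a minimal sketch, assuming the 1-per-10^15-bits URE spec quoted above and treating read errors as independent events (the array sizes shown are illustrative):

```python
import math

URE_RATE = 1e-15   # unrecoverable read errors per bit read (drive spec)
TB = 1e12          # decimal terabyte, in bytes

def rebuild_ure_probability(disks: int, disk_tb: float) -> float:
    """Chance of at least one URE while rebuilding an N-disk RAID 5.

    A rebuild must read every surviving disk in full: (disks - 1) *
    disk_tb terabytes, times 8 bits per byte. Treating UREs as
    independent per-bit events gives a Poisson approximation.
    """
    bits_read = (disks - 1) * disk_tb * TB * 8
    return 1.0 - math.exp(-bits_read * URE_RATE)

# 10^15 bits is 125 decimal TB, so a rebuild that reads a substantial
# fraction of that has a substantial chance of tripping a URE:
for n in (8, 12, 24):
    print(f"{n:2d} x 6TB RAID 5: {rebuild_ure_probability(n, 6):.0%} "
          "chance of a URE during rebuild")
```

The curve rises quickly with array width, which is why stopping at 8 to 12 members, or splitting wider sets into RAID 50/60 legs, keeps each rebuild's exposure modest.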
RASUM (Reliability, Availability, Serviceability, Usability and Manageability) covers the factors most often used to evaluate the suitability of a particular system as a server.
How reliable is the system, or what are the risks of failure? RAID 50 significantly reduces the susceptibility of the data to failure of more than one disk. Several reliability factors remain in this large, single resource, however. In particular, there are a number of single points of failure, including a single serverboard, a single RAID controller, and a single power system. The power supply in this class of system is certainly redundant, with 3 or more redundant power supply modules; the liability is that all of those modules plug into a single power distribution frame, a single point of failure. Cooling – all of those fans the system depends on to keep from overheating – is another concern. The bigger and more complex a system is, the more factors there are reducing its Mean Time Between Failures (MTBF).
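The MTBF point can be made concrete: for components in series, where any one failure takes the resource down, the failure rates add, so the combined MTBF is always worse than that of the weakest part. A small sketch with purely hypothetical per-component MTBF figures (not vendor data):

```python
# Components in series: the system is down if ANY of them fails.
# Assuming constant failure rates, rates add and MTBFs combine
# as the reciprocal of the sum of reciprocals.

def combined_mtbf(mtbf_hours: list[float]) -> float:
    """MTBF of a chain of single points of failure, in hours."""
    return 1.0 / sum(1.0 / m for m in mtbf_hours)

# Hypothetical per-component MTBFs, for illustration only:
parts = {
    "serverboard":              300_000,
    "RAID controller":          500_000,
    "power distribution frame": 400_000,
    "fan tray":                 200_000,
}
print(f"system MTBF: {combined_mtbf(list(parts.values())):,.0f} hours")
```

With those made-up numbers the combined MTBF lands near 78,000 hours – far below any individual component – and every additional single point of failure pushes it lower still.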
How much downtime should be expected? A fault in any of the single points of failure listed above will certainly lead to downtime for repairs. Other failures that can affect uptime are rare, but when amplified by over 300TB of data becoming unavailable, they are a matter for concern. Some of those issues are failures of RAM or processors, but of greater concern is the finite number of power connections and network connections to that one large resource.
Several strategies are being used for creating very dense storage systems. Some systems have hot-swap drive trays mounted on both the front and rear of a big chassis. Others put two drives in a single tray, or even combine both of those approaches. Still others have a very big pull-out tray with all 60 drives loaded vertically into bays in that tray. How serviceable are those approaches? For some of them, it sounds like the safest course is to shut down the system before performing routine maintenance like swapping a disk or a fan. The other thing to consider is how difficult it will be to access, remove, replace and re-test core components like the power distribution unit or the motherboard in that big system. Difficult repairs equate to a long Mean Time To Repair (MTTR), and that means even less availability.
There are probably some usability advantages to creating one big monolithic storage entity, but there are disadvantages as well. The serviceability issues above make a big system harder to use. If the requirement is really a single volume that is as big as possible, that is one thing. But often, storage systems are configured as one giant pool, and many relatively small volumes are then carved out of it and assigned to needs. If great care is not taken to size and align these volumes so that the seek and access requirements of one volume do not compete with those of the others, the performance impact can be devastating. It does not take long for low performance to become the ultimate usability barrier.
On the surface, one big pool of storage that can be carved up and assigned to various requirements sounds like the easiest way to manage storage. If assigning capacity to those requirements is the only consideration, that may be true. But if performance matters, if latency matters, the complications of dividing that storage into volumes that do not compete with each other rapidly become a manageability nightmare.
All of the concerns above argue against the deployment of giant storage systems with 60 or more disks. Here are some alternatives.
A 3U storage server can support 16 disks plus mirrored SSDs for boot and, optionally, mirrored SSDs for read/write caching. Depending on storage requirements, this could be configured as three 5-disk RAID 5 volumes of 16TB each (using 4TB disks), leaving a hot spare disk in the system. A 3×5 RAID 50 instead configures that storage server with a single 48TB volume with great reliability and high performance. For cases where requirements cannot be architected into 48TB chunks, these 3U storage nodes can be provisioned, for example, as iSCSI targets, and a number of them can be managed and provisioned by a front-end system using RAID 0 or RAID 5 across a wall of these storage bricks to offer a much larger single volume. ION’s S5 StorageServer followed this kind of approach.
A better approach is a 2U server with up to 10 external SAS RAID ports connecting up to (20) SAS JBOD units. This architecture scales from a 4U server with (12) drives to a full 42U rack with up to (240) 3.5″ disks or (480) 2.5″ disks. A reasonable approach would be to provision a separate RAID 5 array in each enclosure, with a hot spare disk in that enclosure and, optionally, one or two SSDs for accelerated caching by the RAID controller. Assuming a hot spare and two SSDs, each 2U capacity module then holds a 9-drive RAID 5 array for 48TB. Fully scaled with 20 storage modules, this approach delivers up to 960TB that can be accessed via 4 ports of 10Gb Ethernet. With no SSD caching, that climbs to 1200TB, or about 1090TiB. The ION C2 StorageServer has been designed around this approach and also offers shelves of 10,000 RPM Seagate Enterprise Performance disks and shelves fully populated with Intel Data Center series SSDs.
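The shelf and rack capacities quoted above follow from straightforward arithmetic. A short sketch reproducing them, using the 6TB drive size and the spare/SSD counts from the text:

```python
# Recompute the per-shelf and per-rack capacities for the 12-bay
# JBOD architecture described above.

TB_PER_DISK = 6  # 6TB capacity drives, as in the text

def raid5_usable_tb(drives: int) -> int:
    """Usable capacity of an N-drive RAID 5 (one drive's worth of parity)."""
    return (drives - 1) * TB_PER_DISK

# With a hot spare and two caching SSDs, a 12-bay shelf leaves a
# 9-drive RAID 5; without the SSDs it leaves an 11-drive RAID 5.
shelf_cached = raid5_usable_tb(12 - 1 - 2)   # 48 TB per shelf
shelf_plain  = raid5_usable_tb(12 - 1)       # 60 TB per shelf

print(f"20 cached shelves: {shelf_cached * 20} TB")
print(f"20 plain shelves:  {shelf_plain * 20} TB "
      f"({shelf_plain * 20 * 1e12 / 2**40:.0f} TiB)")
```

Twenty cached shelves give the 960TB figure, and dropping the SSDs raises each shelf to 60TB, for 1200TB (about 1090TiB) across the rack.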
Both of these approaches – with separate modules that each have redundant power and redundant cooling, provisioned as separate RAID 5 or RAID 50 arrays – provide solid technology platforms for delivering huge single volumes while addressing most of the RASUM concerns. Will performance measured in megabytes per second (MBps), input/output operations per second (IOPS) or latency be a concern for the applications and users accessing a storage pool? If so, it is usually easier to understand the requirements well enough to divide them into chunks of less than 50TB than to provision non-competing volumes from one giant capacity pool. The huge resource ONLY makes sense if allocation of capacity is the only concern. As soon as the performance of one or more volumes becomes an issue, it is time to divide the problem into smaller pieces. ION’s C2 StorageServer is designed to enable this approach, and with up to 60TB in a single shelf / single volume, the storage administrator still has a lot of flexibility.