Storage, RAID, and Intel’s ICHxR
If you look around the Internet, in forums such as storagereview, hardocp, and anandtech, you will find hundreds of people talking about RAID. I’d say almost all of those threads involve someone thinking about using the onboard raid solution that came with their computer, and a bunch of other more “experienced” forum members posting replies to the effect of: “don’t even bother with it, buy a ‘real’ raid controller and go from there”.
Let’s take a deep look at storage, and while we are there we can stop and see if Intel’s ubiquitous ich9r is any good at raid.
The performance analysis of storage is generally partitioned in two: sequential read/write transfer rate (STR), and random read/write performance, usually measured in I/Os per second or as a latency.
STR>
Sequential Transfer Rate (STR) is the data throughput a storage device can achieve without having to seek to various parts of the disc.
Hard disc primer for STR:
SCSI, SAS, ATA-66/100/133, and SATA1/2 all had impressive throughput rates for their time, but the interface was never the bottleneck.
The discs themselves have sustained transfer rates (STR) limited by:
1. Linear Speed L of the Head which is a function of:
- Rotational velocity V (revolutions per unit time); on most hard drives this is fixed.
- Location of the data r (radial distance of the head from the spindle where the data is being operated on).
so L = V*2*Pi*r. In a normal 7200 revolution per minute, 3.5″ hard disc (radius near 1.5″), the linear velocity L = 7200*2*Pi*1.5 = ~67858 inches per minute at the outer edge, less as you get closer to the center.
This linear velocity falls off linearly as you approach the spindle of the disc.
2. Linear data density d (bits per inch), which is usually proportional to the square root of the areal density (higher density means the head can traverse and r/w more sectors for a given linear velocity).
so
Sequential Throughput ~ L * d ~ 2 * Pi * V * r * d
So there you have it: on a given disc, sequential data throughput is linearly related to distance from the center.
This is why 3.5″ drives are generally faster in sequential operations: much of the data is a lot further away from the center than on any 2.5″ disc.
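The relationship above is easy to check numerically. The sketch below is illustrative only: the function names are mine, the inner radius of 0.75″ is a made-up value rather than a measurement, and density is left as a relative constant.

```python
import math

def linear_velocity(rpm, radius_in):
    """Linear speed under the head in inches per minute: L = V * 2 * Pi * r."""
    return rpm * 2 * math.pi * radius_in

def relative_throughput(rpm, radius_in, density=1.0):
    """Sequential throughput is proportional to L * d (arbitrary units)."""
    return linear_velocity(rpm, radius_in) * density

outer = linear_velocity(7200, 1.5)   # outer-edge radius used in the text
inner = linear_velocity(7200, 0.75)  # hypothetical inner radius (assumption)

print(f"outer edge: {outer:.0f} inches per minute")
print(f"inner/outer throughput ratio: {inner / outer:.2f}")
```

With the same rotational speed and density, halving the radius halves the throughput, which is the linear relationship described above.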
————————————————————————–
It may not be called a disc problem, but effective disc large-file transfer rate can be throttled if the data has to be fragmented on various spots on the disc, since that requires head seeks for something that could be a sequential operation. Seeks cost time and transfer no data.
Carefully chosen modern 7200rpm SATA2 high-areal-density discs like this or this can perform sustained sequential reads or writes close to 1Gb per second at the outer edge. The discs I have been messing with ( Western Digital
These graphs are decreasing because the program calls the outermost part of the disc 0% and the innermost 100%. You may also notice the graphs are not linear as I suggested; this is because the horizontal axis is “%” of data, not % of radial distance.
I will not bore anyone with the math/logic to understand why this makes sense, but it does: the amount of data stored outside a given radius is quadratic in that radius, so graphing it like HD-Tune did here should theoretically yield a piece of a sideways parabola, which it seems to by the pictures.
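For the curious, the shape can be sketched in a few lines. This assumes constant areal density (so capacity grows with platter area) and a hypothetical inner data radius of 0.75″; neither figure comes from the drives tested here. Under those assumptions the amount of data outside a given radius is quadratic in that radius, so throughput plotted against % of data falls off as a square root: gently at first, then faster toward the inner tracks.

```python
import math

R_OUTER = 1.5   # inches, outer radius used earlier in the text
R_INNER = 0.75  # inches, hypothetical inner data radius (assumption)

def radius_at_fraction(x):
    """Radius holding the data at capacity fraction x (0 = outer edge, 1 = inner edge)."""
    return math.sqrt(R_OUTER**2 - x * (R_OUTER**2 - R_INNER**2))

def relative_str(x):
    """Throughput relative to the outer edge; throughput is proportional to radius."""
    return radius_at_fraction(x) / R_OUTER

for pct in (0, 25, 50, 75, 100):
    print(f"{pct:3d}% of data -> {relative_str(pct / 100):.2f}x outer-edge STR")
```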
————————————————————————–
Latency>
The other major performance specification of any storage system is the latency, or time required to begin a read or write at a random location on the disc.
In a traditional rotating disc drive, the latency is not so straightforward to predict based on the geometry.
Clearly the latency cannot be predicted simply by the location of the data; one must also know where the head will be before needing to access the data. But let’s talk of average random access times…
Certainly faster rotation will minimize rotational delays, while higher areal density can effectively minimize real world latency if it means you can cram more data at the outer edge, minimizing or eliminating the need for the head to go towards the spindle… but on a disc full of data, the argument is irrelevant.
Let me save a more thorough analysis of seek times for another day, and let’s think of it as a latency (penalty) for each read/write that is not adjacent to the previous i/o.
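As a rough yardstick for the rotational part of that penalty: on average the platter has to turn half a revolution before the target sector arrives under the head. The sketch below computes only that component; seek time is not modeled, and the function name is just illustrative.

```python
def avg_rotational_latency_ms(rpm):
    """Average rotational delay in milliseconds: half of one revolution."""
    seconds_per_rev = 60.0 / rpm
    return 0.5 * seconds_per_rev * 1000.0

for rpm in (5400, 7200, 10000, 15000):
    print(f"{rpm:>5} rpm: {avg_rotational_latency_ms(rpm):.2f} ms")
```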
————————————————————————–
question: does the revision and/or firmware of a disc drive matter?
answer: I like to use graphs to communicate, so here goes:
(these are all WD6400AAKS Western Digital 640GB SE16 drives, but different versions)



So, clearly the revision/version of a specific disc drive matters significantly. The 00A7B0 was quicker in latency, but slower in STR than the other two. The 65A7B0 was the best in STR.
ok, everything so far was for a specific type of storage device consisting of 1 head and 1 rotating disc (yes there are multiple platters , each face getting a head, but they do not seek independently, so effectively we can consider it as 1 face of 1 platter with 1 head.)
Redundant Array of Inexpensive Discs?>
Although the actual words which make the acronym are not all that applicable today (since many raid volumes lack redundancy and few are cheap), RAID has taken off as an admirably simple solution to achieve what many situations require:
Primer:
Why would anyone want to use raid?
+a given hard disk has around a 5% chance of failing per year (this depends on age, temperature, and utilization), so it’s nice if a failure does not result in data loss or even any downtime at all.
+having a larger pool, rather than several smaller pools to store data is preferable as it eases file management. (no shuffling around data to various discs to make room)
+the hope that n discs could be n times faster than 1…
RAID 1
If your only concern is maintaining data integrity and availability in the event of a disc failure, RAID 1 “mirroring” is the obvious solution. The idea is that all writes are done to 2 discs instead of one. If one disc fails, the controller simply reads from the remaining one, since all the data on both drives is identical.

- write throughput depends on implementation, but could be slower than a single disc
- write latency should be a bit worse, since the data must be written on both drives; the write is not complete until both seek, so it’s the slower of the two seeks, every time.
- read latency could be slightly better than a single disc, as a smart controller could request the data only from the disc whose head is closer to the data, or request it from both and take whichever returns it first.
- read speed throughput depends on implementation, but could be faster than a single disc, as the controller could have each disc read a different half of the file, halving the time it takes to read the whole file… this is not generally implemented.
- capacity is halved, as the 2 discs have identical data, usable storage is only the data capacity of one disc, so it costs 1/2 capacity for the redundancy of 2-disc raid 1.
- If each disc has a f % annual failure rate, the raid1 array will have a 100*( (f/100)^2 ) % failure rate. (so if the disc failure rate is 5.0%, the raid1 volume failure rate will be 0.25% )
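The mirror failure estimate in the last bullet can be written as a one-liner. This is a minimal sketch that assumes the two discs fail independently and that a failed disc is never replaced during the year; the function name is mine.

```python
def raid1_afr(f_percent):
    """Annual failure rate (%) of a 2-disc mirror: data is lost only if both discs fail."""
    p = f_percent / 100.0
    return 100.0 * p ** 2

print(f"2-disc raid1 at 5% per disc: {raid1_afr(5.0):.2f}%")
```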
RAID 0
If you want a larger pool of data than a single drive can provide, or you want faster file reads and writes, and you aren’t worried about disc failure, then RAID 0 is the solution.

- read and write throughput should linearly improve with the number of drives in the array.
- Latency should be a bit worse than a single disc, as each disc needs to seek to the file, so the longest any of the heads have to seek is your access time for that read or write action.
- volume capacity is simply the number of discs times the capacity of each disc (if they are not the same size, then the capacity of the smallest disc times the number of discs), no or very little wasted space.
- there is absolutely no data redundancy; in fact, if each drive has an f% chance to die in a year, then an n-disc raid0 volume has a 100*(1-(1-f/100)^n)% chance of failing, resulting in data loss of the entire volume. (So a 4-disc array in raid0 where each drive has a 5% chance of failing annually means the raid0 volume has an 18.55% chance of failing each year.)
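To see how quickly the striping risk grows with the number of discs, here is a small sketch of the formula from the last bullet. It assumes independent disc failures; the function name is illustrative.

```python
def raid0_afr(f_percent, n):
    """Annual failure rate (%) of an n-disc stripe: any single disc failure loses the volume."""
    p = f_percent / 100.0
    return 100.0 * (1.0 - (1.0 - p) ** n)

for n in (2, 4, 6):
    print(f"{n}-disc raid0 at 5% per disc: {raid0_afr(5.0, n):.2f}%")
```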
So, raid1 gives you redundancy but no speed gain, while eating up 1/2 of your disc space, and raid0 can yield a lot of streaming speed gains, but will actually make your reliability much worse.
RAID1+0 or RAID 10
This is an even-number-of-discs >= 4 solution where the controller uses raid1 for each pair of discs, and RAID 0 to stripe the pairs together into a larger, faster volume.

6-disc raid 10 :
                        RAID 0
        .-----------------------------------.
        |                 |                 |
      RAID 1            RAID 1            RAID 1
    .--------.        .--------.        .--------.
    |        |        |        |        |        |
  120 GB   120 GB   120 GB   120 GB   120 GB   120 GB
    A1       A1       A2       A2       A3       A3
    A4       A4       A5       A5       A6       A6
    A7       A7       A8       A8       A9       A9
    A10      A10      A11      A11      A12      A12
if there are n-discs in the array:
- Read throughput should be about n/2 times single disc speeds.
- Read latency could be close to a single disc, as raid0’s latency is the worst of all member volumes, but here the member volumes are raid1, where each could take only the fastest seek of its member discs.
- Write throughput could be close to n/2 times single disc speeds, but is dependent on the implementation, as each write must be duplicated on both drives of each of the mirrored volumes.
- Write latency should be in the same ballpark as a single disc, but somewhat worse, as both raid1 and raid0 must complete the seek on all member drives for the operation to complete.
- Total volume size of a RAID10 array is n/2 times the single disc capacity, so the cost of redundancy here is 50% of the disc space.
- If the disc failure rate is f%, then the raid10 volume failure rate will be 100*(1-(1-(f/100)^2)^(n/2)) %. If each drive has a 5.0% annual failure rate, a 4-disc RAID10 array will have a failure rate of 0.499%, while a 6-disc RAID10 array will have a failure rate of 0.748%.
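The raid10 numbers can be reproduced with a short sketch: the volume is lost when both discs of any one mirrored pair fail. As before this assumes independent failures and no disc replacement during the year, and the function name is mine.

```python
def raid10_afr(f_percent, n):
    """Annual failure rate (%) of an n-disc raid10: lost if any mirrored pair loses both discs."""
    if n < 4 or n % 2:
        raise ValueError("raid10 needs an even number of discs, at least 4")
    p_pair = (f_percent / 100.0) ** 2          # chance one mirrored pair dies
    return 100.0 * (1.0 - (1.0 - p_pair) ** (n // 2))

for n in (4, 6):
    print(f"{n}-disc raid10 at 5% per disc: {raid10_afr(5.0, n):.3f}%")
```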
So raid10 might seem like a decent balance of performance and reliability, but it is at the cost of 50% of the hard drives involved.
RAID 5
RAID5 arrays consist of 3 or more discs where the data is striped across all discs, but for each stripe of data, a parity block is written on one of the discs, so if one drive fails, the data can be recreated or recovered for each stripe.

- Read throughput should be n-1 times as fast as a single disc.
- Read latency should be slightly worse than a single disc, as all but one of the discs must complete the seek, for the operation to complete.
- Write throughput could be n-1 times as fast as a single disc, but there are some important notes on write performance. While data is being written, the controller must calculate the parity for each stripe, which has the potential to slow down writing significantly. Some may call this parity calculation a latency, but since it is a constant tax on all writes, big or small, I’ll consider it to effectively throttle write throughput. Cache/buffer can buy time for the system to calculate this parity.
- Write latency in raid 5 is inherently troublesome. Not only does the system need to calculate parity for each stripe that is written and write the corresponding parity, but if the data being written is smaller than the stripe itself, the controller must first read the stripe, then modify the data (adding/changing bits as needed), then calculate parity on the entire stripe, and write it back entirely. I will refer to this as the partial-stripe-write issue/penalty.
- Total usable volume size of a RAID 5 array is about (n-1) * smallest-disc-size, so a 4 disc array will have a usable storage area of 3*disc-size.
- An n-disc RAID 5 volume failure rate will be the odds that 2 (or more) discs fail in the array. If each disc has a failure rate of f%, the RAID 5 volume failure rate is 100*[1- ((1-f/100)^n * (f/100)^0 + nC1*(1-f/100)^(n-1) * (f/100)^1)] %. (For a 4-disc array where the annual failure rate per disc is 5%, the failure rate of the RAID 5 volume is 1.40%; it is 3.28% for a 6-disc array.) In production environments the effective failure rate will be closer to zero for the volume, since a failed disc will be replaced in a matter of hours or days, the array will be rebuilt, and redundancy restored.
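The RAID 5 estimate is one minus the probability that zero or exactly one disc fails. A minimal sketch under the same assumptions as the formula in the last bullet (independent failures, no replacement or rebuild during the year); the function name is illustrative.

```python
from math import comb

def raid5_afr(f_percent, n):
    """Annual failure rate (%) of an n-disc RAID 5: lost when 2 or more discs fail."""
    p = f_percent / 100.0
    # Probability that zero or exactly one disc fails (the volume survives).
    survive = sum(comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in (0, 1))
    return 100.0 * (1.0 - survive)

for n in (4, 6):
    print(f"{n}-disc raid5 at 5% per disc: {raid5_afr(5.0, n):.2f}%")
```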
RAID 5 clearly looks like the best and most scalable raid technology listed above, providing streaming speed boosts and protection from disc failure, while only “wasting” 1 disc of the array for parity…. but do we need an expensive or obscure RAID controller to get decent write performance in the real world?
enter the Intel ICH9r.
This is Intel’s current raid-enabled southbridge, paired with most 3-series northbridges; many motherboard manufacturers have also paired it with the x48 northbridge.
Obviously the southbridge does many things, but here let’s look at it solely as a raid controller.
During the write of a raid 5 stripe, parity must be calculated from the data to be written before the write occurs. Some high-end RAID controllers perform this calculation with the aid of an XOR processor. Intel’s ICH9r does not have an XOR processor; it leverages the CPU to do the calculation. In the past this may have been an enormous concern, but today we have dual and quad core CPUs on the cheap, yielding an excessive amount of unused processing power in most small business machines.
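To make the XOR step concrete, here is a toy sketch of parity on one stripe of a hypothetical 4-disc RAID 5 (three data blocks plus parity). The block contents and the helper name are made up for illustration, and the read-modify-write shortcut shown at the end is one common way to update parity on a partial-stripe write; I am not claiming it is what Intel’s driver does internally.

```python
from functools import reduce

def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe: three data blocks and the parity computed from them.
d0, d1, d2 = b"ABCD", b"EFGH", b"IJKL"
parity = xor_blocks(d0, d1, d2)

# If the disc holding d1 dies, XOR of the survivors and the parity rebuilds it.
rebuilt = xor_blocks(d0, d2, parity)

# Partial-stripe write: changing only d1 requires reading the old data and
# old parity first, then new_parity = old_parity XOR old_d1 XOR new_d1.
new_d1 = b"wxyz"
new_parity = xor_blocks(parity, d1, new_d1)

print("rebuilt block matches original:", rebuilt == d1)
print("shortcut parity matches full recompute:",
      new_parity == xor_blocks(d0, new_d1, d2))
```

Either way, a small write turns into extra reads plus a parity calculation before anything reaches the platters, which is where the write penalty comes from.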
Another trick expensive RAID controllers use is onboard cache for aiding in small writes. This is critically important because the system must calculate the parity of a stripe before it is written, so having a buffer to buy time w/o holding up other IOs is key. Intel’s ICH9r has no cache of its own, but it can simply take some of the very cheap and abundant system memory. It may be more nowadays, but back in the ich7r days it allocated 4MB of system memory on boot for caching raid arrays. This is enough to handle small random writes gracefully, dealing with the partial-stripe-write issue.
So with that out of the way, the ich9r is a hardware raid controller in that the raid arrays are transparent to the OS, but it does utilize system resources to do the XOR for the parity calculation and for volume write caching.
That’s all well and good, but text is cheap, so let’s look at graphs:
Read Throughput:
Read Latency:

Write Throughput:

Here we see the ICH9R doing a pretty good job in Raid10 and Raid5. In raid10, performance is almost double a single disc, which is what you would expect/hope for. In raid5 you might hope for triple the performance of a single drive, but it looks like we will have to settle for double. Clearly, if you are not going to enable disc and raid cache, you should not use raid5 for anything where write performance matters.
Write Latency:

Here we see impressive numbers from the ICH9R in Raid10: latency is significantly less than a single disc if disc cache is enabled. I’m not sure why this is, but possibly it is the larger pool of cache from both volumes in the stripe. Raid5 is, as expected, worse than a single disc in write latency; with disc and raid cache enabled the latency is comparable, yet still worse than a single disc.
is it really this simple?
(Can the performance of these drives really be characterized by only these metrics?)
for a single disc:


Overall for a single disc, with or without cache the 2 metric model seems to work pretty well, although around 1MiB request sizes we see actual performance fail to match predicted, especially for writes.
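For reference, the two-metric model boils down to: time per random request = latency + request size / STR, and I/Os per second is the reciprocal of that. A minimal sketch follows, plugging in the single-disc read figures from the summary table at the end of the article (12.9ms latency, 78MiB/sec); the function name and the chosen request sizes are illustrative.

```python
def predicted_iops(size_mib, latency_s, str_mib_s):
    """Random I/Os per second for requests of size_mib under the latency/STR model."""
    return 1.0 / (latency_s + size_mib / str_mib_s)

LATENCY_S = 0.0129   # single-disc random read latency from the summary table
STR_MIB_S = 78.0     # single-disc average sequential read from the summary table

for size in (0.0625, 1, 64, 512):   # request sizes in MiB
    print(f"{size:>8} MiB requests: {predicted_iops(size, LATENCY_S, STR_MIB_S):8.2f} IOPS")
```

Small requests are dominated by the latency term and large ones by the STR term, which is why the model can only go wrong in the middle, where both matter.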
What about for Raid1:

It’s almost as if it only does striped reads when RAID cache is off and the read data request is larger than 64MiB… strange.
What about writes?:

Here we see a discrepancy across the board at 1MiB transfers…Clearly HDD cache on, RAID cache off is ideal for Raid 1 on the Intel ich9r.
What about Raid10:


Let’s see what’s going on at small write request sizes:

Here we see the 1st order latency/str model really failing to model the mid-size transfer random write performance of the drive…
Although it doesn’t have the highest write STR or latency, it seems RAID cache should be disabled, with hdd cache on, because the read STR is so much larger. Intel’s raid cache seems to have an adverse effect in many scenarios.
And finally, Raid 5:

Raid5 random read performance is decently approximated by the 1st order latency/str model, although there are significantly under-performing real results in the mid-range as seen above from 1MiB to 64MiB data request size.
what about writes:

at least in the mid range this simple model of latency/str seems to fail to really characterize performance here… Let’s look at the log scale:

Cacheless and disc cache only raid5 volume configurations seem to act as expected by the simplistic 1st order latency/STR model…
But that is because the model doesn’t really deal with the cache. Cache can greatly improve effective random write latency, since the system does not have to wait for the controller to calculate the parity and then physically write the entire stripe to the disc… eventually, as the transfer request size gets large, this cache becomes irrelevant.
I am not one to be content to leave things w/o a full understanding, so let’s dig deeper:
OK, so we know w/o cache the raid5 volume has a latency of 62.5ms on average..
that includes a seek and a partial-stripe-write penalty…
we know from a single cacheless disc, the seek time is about 17.5ms,
so the partial-stripe-write penalty = 62.5-17.5 = 45ms
As the write request gets large relative to the stripe size (here 64KiB), the write will probably incur 1 seek and a partial-stripe-write penalty at the start, and another partial-stripe-write penalty at the end of the write, for a total of 1 seek and 2 partial-stripe-write penalties. This means the effective random write latency is 17.5+45+45 = 107.5ms. Knowing this, I can go back and re-calculate the STR on buffered writes. For example, with HDD&RAID cache enabled the 4-disc r5 volume got 0.31 iops in random writes of 512MiB, so
.1075 + 512/STR = 1/0.31 ==> STR = 164.2MiB/sec write with disc and ich9r cache enabled.
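The arithmetic above, spelled out as a small sketch (variable and function names are mine; the input numbers are the measured values quoted in the text):

```python
def str_from_iops(iops, latency_s, size_mib):
    """Solve latency + size/STR = 1/iops for STR in MiB/sec."""
    return size_mib / (1.0 / iops - latency_s)

seek_ms = 17.5            # single cacheless disc
raid5_latency_ms = 62.5   # cacheless raid5 volume
penalty_ms = raid5_latency_ms - seek_ms             # partial-stripe-write penalty
large_write_latency_ms = seek_ms + 2 * penalty_ms   # one seek, two penalties

write_str = str_from_iops(0.31, large_write_latency_ms / 1000.0, 512)

print(f"partial-stripe-write penalty: {penalty_ms} ms")
print(f"large-write latency: {large_write_latency_ms} ms")
print(f"write STR: {write_str:.1f} MiB/sec")
```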
OK, so let’s model a volume with 107.5ms random write latency and 164.2MiB/sec write STR:

This predicts it dead on after 1MiB data request sizes…
Now this makes some sense… with small data sizes, the caching can buffer the writes, buying some time for the system to read the stripe, make the write in the buffer, then calculate the parity for that modified stripe without holding up the next operation (the data is reported as written to the storage device, even though it’s not actually written to the physical platters yet).
The combined cache does seem to adequately buffer the sequential writes by giving the system time to calculate the parity before it’s actually written to the platters. It is not 3x the STR of a single disc, as it theoretically could be, but it’s still admirable.
Note: performance was a bit better with OS caching (the “advanced performance” option under the volume properties in the device manager), and while I don’t feel scared when using the disc or raid controller’s cache, I think the slight gain (~5% write STR) isn’t worth the bother of having yet another place data is written to without actually going to the disc.
Brief Cache Talk:
I focused on various caching schemes throughout the article, and for good reason: caching has massive effects on real-world performance.
There is something worth mentioning here:
In OS-level caching, writes can be buffered in system memory, so it’s possible something could interfere before this data makes it to the hard disc. While this is probably not a big concern, I’ve seen little-to-zero advantage in my testing in Windows, and sometimes only added overhead.
Raid controller cache, on many fancy controllers, lives on the controller card itself, so it’s a bit independent of the OS, which is good. But if power goes out before this data is written to the discs, it is lost. High-end controllers have cache batteries which are meant to keep that data alive until it’s written to the disc in the event of power loss. This Intel controller dedicates some of the system RAM to itself on boot, so it’s fairly isolated from the system. The battery backup, I think, is like using a hammer to fix a computer… Yes, it will be able to put the cache on the disc, but who’s to say the cache doesn’t hold partial information of files, resulting in corruption…
Hard disc cache is basically a bit of RAM on the drive itself. It is used for buffering writes and buffering IOs to allow for re-ordering (NCQ); anything in this buffer will not be saved to disc in the event of power loss to the system.
Disabling cache altogether is also not a solution: if the head of the disc is halfway through a write operation, or a file write, partial writes will occur, possibly resulting in corruption/unusable data.
Bottom line: if you care about the integrity of your data, your system simply cannot lose power. Spend the money on redundant PSUs and UPSs rather than the (imho) useless cache battery system.
Conclusions>
Well, this was more a lesson in storage than anything else, but let’s see:
The ICH9R does a very admirable job. We aren’t talking about breaking any records here, but it performs respectably as long as you configure the caching properly. It also has price (nearly free) going for it, as well as an industry giant supporting it… not to mention its ginormous userbase and forward/backward compatibility of raid volumes.
In summary, raid on the ICH9R:
| | 1-disc AHCI | 2-drive raid1 | 4-drive raid10 | 4-drive raid10 | 4-drive raid5 |
| optimal configuration | hdd cache on | hdd cache on | hdd cache on | hdd&raid cache on | hdd&raid cache on |
| cost per usable GB | 13.3 cents | 26.6 cents | 26.6 cents | 26.6 cents | 17.7 cents |
| annual failure rate (assuming no disc replace/volume rebuild) | 5% | 0.25% | 0.50% | 0.50% | 1.4% |
| Sequential Reads (average) | 78MiB/sec | 144MiB/sec | 237MiB/sec | 175MiB/sec | 232MiB/sec |
| Random Read Latency (average) | 12.9ms | 13.2ms | 15.9ms | 14.3ms | 14.34ms |
| Sequential Writes (average) | 84.6MiB/sec | 85.1MiB/sec | 148.3MiB/sec | 161.5MiB/sec | 164.2MiB/sec |
| Random Latency (average) on small writes | 12.9ms | 7.35ms | 4.3ms | 4.1ms | 9.65ms |
| Random Latency (average) on large writes | 17.5ms | 18.1ms | 18.9ms | 18.9ms | 107.5ms |
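The cost row follows directly from how much of the raw capacity each level leaves usable. A small sketch, taking the 13.3 cents per raw GB from the single-disc column; the function names and the level labels are just illustrative.

```python
RAW_CENTS_PER_GB = 13.3  # single-disc figure from the table above

def usable_fraction(level, n):
    """Fraction of raw capacity left usable with n equal-size discs."""
    if level in ("single", "raid0"):
        return 1.0
    if level in ("raid1", "raid10"):
        return 0.5
    if level == "raid5":
        return (n - 1) / n
    raise ValueError(f"unknown level: {level}")

def cents_per_usable_gb(level, n):
    """Cost per usable GB given the raw cost per GB."""
    return RAW_CENTS_PER_GB / usable_fraction(level, n)

for level, n in (("single", 1), ("raid1", 2), ("raid10", 4), ("raid5", 4)):
    print(f"{n}-disc {level}: {cents_per_usable_gb(level, n):.1f} cents per usable GB")
```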
There is more to do, but my time is finite.
to do: present my data and analyze how allowing multiple random IOs to accumulate before a read/write affects performance.
to do: in the future compare the ICH9r with a fancy/expensive raid controller
to do: confirm that Intel’s new ICH10R performance is almost identical to the ICH9R
to do: RAID-Z <– this is big, but seemingly not ready for prime-time.
to do: clean up this article and fix some slight imperfections here and there.
This entry was posted on Wednesday, August 13th, 2008 at 04:24 and is filed under it. Tags: ich9r, it, raid, storage.


on 19.08.2008 at 09:34 Will Murnane wrote:
3.5 inch disks don’t have radius 3.5 inches—more like 1.2. But don’t take my word for it—break out the torx bits and the caliper.
What chunk size did you use for your raid 5 array?
Why don’t your graphs have the data points you actually sampled labeled?
on 19.08.2008 at 13:36 stephen wrote:
yes you are very right about the radius, I saw that some time ago, but I thought i fixed it… apparently not.. thanks for the catch.
my labeling scheme in the graphs was meant to reduce clutter and visual noise.
on 30.08.2008 at 03:16 Chris wrote:
Nice write up!
You mention OS cache a few times. How do you enable this cache w/ intel onboard raid?
Thanks!
on 14.09.2008 at 18:12 L wrote:
Fascinating and well-produced article. I look forward to a comparison with a high-end controler.