In our case, since every server for a given purpose is identical, we can snapshot the OS image and in the case of a failure, just install a brand new one, getting us right back to the start, much quicker than running a vendor's recovery process.
Good for you, you have lots of servers and tunnel vision to boot. Since they all have the same purpose, you are supervising the (same exact basic server) x (LOTS). You left out that you are running some form of VM or OS image type solution. Bravo, you're doing what everyone in the industry who has the funds is doing, real Star Trek type stuff there, spotlight.
If you have a snapshot of the OS, you are placing it back in the same state that it was in when it crashed. That is not a new re-install, and you have a solution to mitigate the loss. That is very different from suggesting someone run with no RAID or fault tolerance. Even so, from my perspective, preventing the outage from ever happening is better than having to re-image, even in 15 minutes. But I'm sure your company did the ROI for that extra hard drive x (LOTS).
We spend our money and effort protecting the data.
Agreed, for some systems the data is crucial. You suffer from typical IT guy tunnel vision; for some systems uptime is crucial. Your business has established an SLA that you are protected by. XST's SLA will be "the server is down".
Fault tolerance has many, many definitions, and the key word in the phrase is "tolerance", not "fault". Assess the risk...when you scale up to where I spend my days, there are so many redundant systems and processes in place that the hard cost of having the OS on a RAID partition is outweighed by the cost savings and admin headache (soft costs) of managing it.
I'm already scaled up and I still managed to assess XST's risk. 2 disks, why would he not mirror the OS? I know how to do things in a small business solution because in a large enterprise, sometimes you need to consider cost and reality. Sometimes you're not in a data center. Don't trivialize someone's business because it is a small setup, and if you're not comfortable advising on a small setup, don't give bad advice. That's why IT has such a bad rep to begin with.
If you want to show off your tech savvy, don't do it while offering a consultation. "Fault tolerant" means that the solution can stand up to reasonable failure. You're implying "acceptable loss", which means "we tried, but $h!t happens and that server can be down for x amount of time." That's determined on a box by box basis.
Or offer your solution to XST: advise on a snapshot solution, the space required to hold that snapshot, a backup solution for the data, the required connectivity and licensing, and the amount of recovery time (15-45 minutes?) on failure.
I'd rather he take advantage of RAID1. Since he will mirror the data partitions, why not mirror the OS partition? The result:
0 minutes of downtime because the server more than likely stayed online. XST finds the dead drive and schedules 20 minutes after hours to replace it and let it rebuild in the background. His boss learns that he saved them from a major outage, he looks like a hero, builds cred, and applies that cred toward his next recommendation.
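To put rough numbers on that, here is a back-of-envelope sketch in Python. Everything in it is an assumption for illustration: the 3% annual disk failure rate is a placeholder, not a measured figure, the function names are made up, and disk failures are treated as independent. The 15 and 45 minute outages are the re-image estimate from above.

```python
# Back-of-envelope downtime comparison. All inputs are illustrative assumptions.

AFR = 0.03  # assumed annual failure rate per disk (placeholder, not measured)


def single_disk_downtime(afr, outage_minutes):
    """Expected outage minutes per year when the OS sits on one unmirrored disk.

    Any failure of that disk takes the server down until it is re-imaged.
    """
    return afr * outage_minutes


def mirrored_downtime(afr, outage_minutes):
    """Pessimistic expected outage minutes per year with the OS on RAID1.

    Assumes the server only goes down if both disks fail in the same year,
    which overstates the risk because a dead drive is normally replaced
    and rebuilt long before the second one fails.
    """
    return (afr ** 2) * outage_minutes


if __name__ == "__main__":
    for minutes in (15, 45):
        plain = single_disk_downtime(AFR, minutes)
        raid1 = mirrored_downtime(AFR, minutes)
        print(f"{minutes} min re-image: no mirror {plain:.3f} min/yr, "
              f"RAID1 {raid1:.4f} min/yr")
```

The absolute numbers are tiny either way; the point is the ratio, and that the unmirrored outage lands during business hours while the RAID1 drive swap gets scheduled after hours.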
I know, I know, when you have all those LOTS of servers, it's hard to hear Joanne in accounting yelling "Oh fu$k, the fu$king server is down again."