Creating Reliable and Redundant IT Infrastructure with Stretched Clusters
A business' IT infrastructure should be more than an exercise in "technology for the sake of it." It should represent the rock-solid foundation upon which that company is built. When executed properly, your IT and business strategies should be one and the same. The former should become an organic part of the latter, making it easier than ever for your people to do their jobs in a way that pushes the entire enterprise forward.
When building your IT foundation, it doesn't always have to be the flashiest deployment, the most advanced, or the most expensive, but it needs to be reliable above all else. At USS-POSCO, reliability was a major focal point guiding recent IT initiatives.
USS-POSCO is a finishing facility in the steel manufacturing industry. We bring in raw steel and put a finish on it, adding elements like tin or a galvanized coating to bring the product in line with each customer's expectations. While we might not be a household name, you've undoubtedly used a product we've worked on. To give you a frame of reference, Silgan Containers, which makes cans for Campbell's Soup, is one of our clients.
We're an older organization and, because of that, some of the limitations in our own IT infrastructure were beginning to show themselves. I've been with the company for nearly 20 years and I saw this slow transition happen firsthand. For an organization like ours, IT downtime is unacceptable because of the implications for our manufacturing facilities. If someone's desktop system goes down, it's an inconvenience. But if the larger system goes down, we quite literally can’t operate.
Sadly, we experienced several failures in the last couple of years from our aging infrastructure. For example, we had a database server that went down—taking our facilities with it—for about 30 hours. At that point, you lose more than productivity. You're paying employees and utilities for people to be on-site, with nothing getting made.
We had two or three major instances like that in the last few years that caused a massive ripple effect across the entire facility. At that point, we decided it was long overdue to sit down and solve this problem once and for all.
The Best Laid Plans
Originally, our plan was a straightforward one: We would add a secondary data center site for our automation group to leverage the benefits of a VM environment. This would enable their group to virtualize a large number of the automation servers and replicate them between two distinct sides of the business. In turn, they’d achieve redundancy, easy recoverability, and more.
This plan got as far as looking for a partner, but things began to fall apart pretty quickly. We received a quote from Dell, and it was absurdly expensive. Worse, we wouldn't even have had the backup functionality we needed, which was a crucial part of our requirements. It wasn't an acceptable solution by any stretch of the imagination, so we realized we had to keep looking.
Thankfully, we didn't stay at square one for very long.
Right around that time, we were shown HPE SimpliVity (which was called SimpliVity at the time). The recommendation came from one of our vendors who knew we were considering an upgrade to our VM environment. Not only was their solution more than capable of handling all of our backup and redundancy requirements, but the total cost of ownership (TCO) was far lower than what Dell had quoted us.
A lower TCO was a massive benefit, but what really excited me was learning that HPE SimpliVity supported two-node clusters as one single entity. Based on that, I could deploy a stretched cluster, something with major implications for an organization like ours. We may not have the budget for an off-site disaster recovery option, but what we do have is real estate. Our facility has 87 acres under our roof.
I could put nodes in two different locations, separated by about a mile. That would mean if I lost Site A, then Site B would be our failover solution. Better still, everything would be automatic. There's no programming and no VMware Site Recovery Manager (SRM). It's all native. In essence, I would take VMware's high availability and combine it with HPE SimpliVity's data technology to get disaster recovery with high availability in a far easier, more scalable way.
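The failover behavior described here can be pictured with a toy model. The Python sketch below is not VMware HA or HPE SimpliVity code. The class and method names (`StretchedCluster`, `fail_site`) and the VM names are hypothetical, and it assumes every VM's data is already mirrored at both sites, so a restart on the surviving site needs no restore step.

```python
from dataclasses import dataclass, field


@dataclass
class StretchedCluster:
    """Toy model of a two-site cluster; all names here are illustrative."""
    sites: tuple = ("Site A", "Site B")
    # Maps VM name -> site where it is currently running.
    placement: dict = field(default_factory=dict)
    online: set = field(default_factory=set)

    def __post_init__(self):
        # Both sites start out healthy.
        self.online = set(self.sites)

    def start_vm(self, vm: str, site: str) -> None:
        if site not in self.online:
            raise ValueError(f"{site} is offline")
        self.placement[vm] = site

    def fail_site(self, failed: str) -> list:
        """Mark a site offline and restart its VMs on a surviving site."""
        self.online.discard(failed)
        if not self.online:
            raise RuntimeError("both sites are down; nothing to fail over to")
        survivor = sorted(self.online)[0]
        moved = [vm for vm, site in self.placement.items() if site == failed]
        for vm in moved:
            # Assumption: data is already mirrored, so this is a restart only.
            self.placement[vm] = survivor
        return moved


cluster = StretchedCluster()
cluster.start_vm("sql-01", "Site A")
cluster.start_vm("exchange-01", "Site B")
moved = cluster.fail_site("Site A")
print(moved, cluster.placement)
```

The model also makes the one remaining weak point visible: if both sites go offline at once, there is nowhere left to restart anything.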
But it's easy to say all of this will work in theory. It's another thing entirely to get everything on its feet in a way you can depend on. We recognized that HPE SimpliVity would be a great solution, but I also knew comprehensive testing would be in order.
Because You Can't Afford Half Measures
To help make sure this plan was as sound as I thought it was, I ran a test of my own. I took a system and I pulled the plug on one side of the facility. In less than ten minutes, it was running at the other side. I never touched a thing, apart from the power cord that I unplugged.
Granted, once the systems are fully loaded, recovering from a failure would take more than ten minutes. You're probably looking at something like 20 minutes, especially as VMware starts bringing everything back up.
But the reality is I didn't have to make a phone call. I didn't have to program anything, and I didn't have to pay for an additional product. On top of all of that, I'm getting backups built right into the product.
Prior to this, we had a similar situation that wasn't quite as simple. In fact, it took 26 hours on the phone to get the system up and running again. As I started to build out our new solution, I continued to do a bunch of testing, including tests more sophisticated than pulling out a power cord.
No matter what I did, no matter what I tried, I couldn't kill it. We weren't going to get that with Dell.
The Best Kind of Transition
While I would still consider ourselves a company in transition, we're already seeing major improvements with our HPE SimpliVity and stretched cluster solution—particularly in terms of redundancy and reliability.
We had our first test of redundancy and built-in backup/restore last March when a Windows update came out. The update caused problems and I had to go back four weeks in HPE SimpliVity to restore the VM. The entire process took less than five minutes to complete. That's restoring from the VM environment, booting, logging in, and rejoining the domain—all in less than five minutes.
That 300-gigabyte restore would have taken hours before. Five minutes is less time than I would have even spent waiting on hold for a vendor to pick up the phone.
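To put those numbers in perspective, here is a quick back-of-the-envelope calculation in Python. It uses only the figures above (300 gigabytes, five minutes) and assumes decimal units. The function name `effective_rate_gb_per_s` is made up for illustration. The result is an effective rate based on wall-clock time, not a measurement of how much data was physically copied, since nothing here describes how the restore works internally.

```python
def effective_rate_gb_per_s(size_gb: float, minutes: float) -> float:
    """Size restored divided by elapsed wall-clock time in seconds."""
    return size_gb / (minutes * 60)


rate = effective_rate_gb_per_s(300, 5)
print(f"{rate:.1f} GB/s")
```

Because the restore finished in less than five minutes, this figure is a lower bound on the effective rate.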
In June or July of this year, we ran into a similar situation where we had to replace our core switches on both sides of our facility. Again, they were part of our legacy infrastructure. During this process, we failed the system over to one side of our facility, leaving all the VMs running there while we worked on the other side.
We changed the core switch and then brought everything back up. The process couldn't have been simpler. With our stretched cluster setup, we didn't even have to work with backups or do any storage modifications at all. I just put a host in maintenance mode, failed over to the other side of my facility, and removed our assets from maintenance mode. Before we knew it, the system was running.
In terms of reliability, we haven't had any major outages since this process began, but even if we did, we would be all set. The only thing that would put us in a tough spot would be something large enough to impact the entire facility and take both systems down at the same time. That would require something massive, like an earthquake. Thankfully, we don't have major power outages because we're fed directly off a power plant. The only systems that ever caused us downtime were physical systems that have been put out to pasture since going to HPE SimpliVity.
Our new setup definitely helps me sleep better at night.
Another big aspect of reliability is the diverse set of apps we use on a daily basis. In addition to SQL and an on-premises Exchange deployment, we also have applications like industrial manufacturing control systems, Nagios, FactoryTalk View, Motorola MNIS, Qlik, and vCenter Server. We run the gamut in applications we need to support.
Our new HPE SimpliVity setup helps all of these applications run faster and more stably than ever before. Our SQL servers in particular saw a big improvement in backup times and performance.
We're also seeing incredible results with applications like QlikView. There, jobs are running 100% faster than before. Since a big part of IT's job is to enable our employees, it's amazing when we get to see applications working twice as fast. And because greater reliability means the jobs on QlikView no longer fail, employees don't need to waste their time troubleshooting. They can simply get on with their day. That also means fewer people coming to our department with complaints.
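The step from "100% faster" to "twice as fast" is simple arithmetic, sketched below in Python. The function name `runtime_after_speedup` and the 60-minute job are hypothetical and chosen only for illustration. No actual job runtimes are given here.

```python
def runtime_after_speedup(old_minutes: float, percent_faster: float) -> float:
    """New runtime when a job's speed rises by the given percentage."""
    return old_minutes / (1 + percent_faster / 100)


# A made-up 60-minute job running 100% faster finishes in half the time.
new_runtime = runtime_after_speedup(60, 100)
print(new_runtime)
```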
Simplification Is the Name of the Game
HPE SimpliVity is an extraordinary solution for an environment like ours because it lets us leverage stretched clusters. Organizations with similar physical setups, like school campuses, would be wise to consider this architecture. When you're spread out, you can often leverage IT solutions that don't make sense for someone who's in, say, an industrial complex. You get backup and disaster recovery, all in the same package and the same facility.
If nothing else, this transition has allowed me to finally build the type of framework that our entire organization needs to simplify IT and reduce complexity. Our new setup is more powerful and future-proof, but it’s also easier to support.
Our solution with HPE SimpliVity simply has fewer pieces of the puzzle to work with. We don’t have as many products that we have to maintain, program, or configure. We have a less complex hardware infrastructure so I don't have to work with multiple management interfaces.
We're just in the early days, but it's clear we've built the right solution, not just the one that seemed the fanciest. We focused on our needs and our unique environment to create redundancy and reliability, all with less complexity. Oftentimes, less is more.