After reading this article, I had a few dissenting thoughts, maybe someone will provide their perspective?
The article suggests not running critical workloads virtually based on a failure scenario of the hosting environment (such as ransomware on hypervisor).
That does allow using the ‘all your eggs in one basket’ phrase, so I agree that running at least one instance of a service physically could be justified, but threat actors will be trying to time execution of attacks against both if possible. Adding complexity works both ways here.
I don’t really agree with the comments about not patching however. The premise that the physical workload or instance would be patched or updated more than the virtual one seems unrelated. A hesitance to patch systems is more about up time vs downtime vs breaking vs risk in my opinion.
Is your organization running critical workloads virtual like anything else, combination physical and virtual, or combination of all previous plus cloud solutions (off prem)?
That article is SO wrong. You don’t run one instance of a tier1 application. And they are on separate DCs, on separate networks, and the firewall rules allow only for application traffic. Management (rdp/ssh) is from another network, through bastion servers. At the very least you have daily/monthly/yearly (yes, yearly) backups. And you take snapshots before patching/app upgrades. Or you even move to containers, with bare hypervisors deployed in minutes via netinstall, configured via ansible. You got infected? Too bad, reinstall and redeploy. There will be downtime but not horrible. The DBs/storage are another matter of course, but that’s why you have synchronous and asynchronous replicas, read only replicas, offsites, etc. But for the love of what you have dear, don’t run stuff on bare metal because “what if the hypervisor gets infected”. Consider the attack vector and work around that.
You can prevent downtime by mirroring your container repository and keeping a cold stack in a different cloud service. We wrote an loe, decided the extra maintenance wasn’t worth the effort to plan for provider failures. But then providers only sign contracts if you are in their cloud and you end up doing it anyways.
Unfortunately most victims aren’t using best practices let alone industry standards. The author definitely learned the wrong lesson though.
Good comments.
Do you think there’s still a lot of traditional or legacy thinking in IT departments?
Containers aren’t new, neither is the idea of infrastructure as code, but the ability to redeploy a major application stack or even significant chunks of the enterprise with automation and the restoration of data is newer.
There is so much old and creaky stuff lying around and people have no idea what it does. Beige boxes in a cabinet that when we had to decommission it the only way to understand what it does was doing the scream test: turn it off and see who screams!
Or even stuff that was deployed as IaC by an engineer but then they left and so was managed “clickOps”, but documentation never updated.
When people talk about the Tier1 systems they often forget the peripheral stuff required to make them work. Sure the super mega shiny ERP system is clustered, with FT and DR, backups off site etc. But it talks to the rest of the world through an internal smtp server running on a Linux box under the stairs connected to a single consumer grade switch (I’ve seen this. Dust bunnies were almost sentient lol).
Everyone wants the new shiny stuff but nobody wants to take care of the old stuff.
Or they say “oh we need a new VM quickly, we’ll install the old way and then migrate to a container in the cloud”. And guess what, it never happens.