As people, we do love to assume that things just work. It’s very rare that we test things. Some things we do: we get our car serviced, we go to the dentist, we may even get a health check… but on the whole we are a bit rubbish at those kinds of things and just assume things will be fine… only to be surprised when things go wrong!
Working in IT, one of the things we come across a lot is people not testing their IT, and especially not testing their IT resilience and recovery options. Things are changing, of course. Increasingly we see organisations taking this more seriously and starting to build IT tests into their plans, and technology, from smart storage through to the flexibility of cloud, makes it simpler to test full IT resilience and recovery.
At the minute, one of the companies I do quite a bit of work with has been going through that very process, looking at the resilience and recovery of their entire IT infrastructure. I’ve spent a bit of time with them over this period, looking at how we can build our tests and develop a run book of things we should consider in our recovery plans.
Out of these tests there came a set of considerations that I thought may be useful to share, and that you may be able to use in your own IT test plans.
Oh and if you haven’t got any IT test plans… you really do need to get some!!!
Did I just say, if you don’t have any IT test plans…get some… hope so…just in case I didn’t, thought best I mentioned it again!!
Now, for the rest of this article, we are going to discuss building a test environment from which we can define our full IT recovery plan. This is not a definitive guide by any stretch and there are lots of alternatives, but hopefully this will give you some things to consider so you can start to develop your IT recovery plans.
If you’re going to do tests, the important thing is to make sure you have an environment that you can actually do your testing in. A few things you should consider with your tests:
There are a couple of types of test environment you want to consider. There is the environment for a full DR test – this is when you fully simulate a failure, that is, we pretty much pull the plug in production and do the full-on failover. However, these tests require a lot of planning and are not without risk.
What’s more usable is an environment built for testing, one that can be brought online to test DR alongside the full production environment – so what do we need to consider?
Sandbox it – When a test environment replicates production, we have to ensure it can’t be seen by our production systems. If we are bringing up copies of line-of-business systems, then we can’t have those copies running on the same network that our production systems sit on.
For this customer, it was something we could easily do. We had all of our systems replicated to an alternate location and a separate infrastructure on which we could mount them, and in their virtual infrastructure we could use private networks to ensure the virtual machines we recovered could only see each other.
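The sandbox rule above can be checked before anything is powered on. The following is a minimal sketch in Python, assuming you can export a list of recovered virtual machines and the networks each one is attached to. The function name, machine names and network names are all hypothetical and are not taken from this customer's environment.

```python
def find_leaks(vm_networks, isolated_networks):
    """Return {vm_name: [networks outside the sandbox]} for any VM that is
    attached to a network not on the isolated list."""
    allowed = set(isolated_networks)
    leaks = {}
    for vm, networks in vm_networks.items():
        outside = sorted(set(networks) - allowed)
        if outside:
            leaks[vm] = outside
    return leaks


# Hypothetical inventory of recovered test VMs and their network attachments.
recovered = {
    "dc01-test": ["dr-private"],
    "sql01-test": ["dr-private", "prod-lan"],
}

for vm, networks in find_leaks(recovered, ["dr-private"]).items():
    print(f"{vm} is attached outside the sandbox: {', '.join(networks)}")
```

An empty result means every recovered machine sits only on the private networks. Anything reported should be fixed before the copies are started.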
What “infrastructure” do we need? – The next challenge we had to address was how much infrastructure we need to mount a “production” replica. As with many organisations, this company runs a Windows domain, so we needed to make sure we had replicas of domain controllers and DNS servers. A number of SAN-based LUNs needed presenting as well, so it was important we had a way to re-present LUNs from a different storage array to these servers.
So what infrastructure do you need in your environment to ensure that test systems start appropriately?
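One way to answer that question is to write down what each system needs running before it can start, and let a script work out a start-up order. This is only a sketch: the dependency map below is a hypothetical example for a small Windows domain, not this customer's actual estate, and it uses `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def start_order(dependencies):
    """dependencies maps each system to the set of systems it needs first.
    Returns one valid start-up order. Raises graphlib.CycleError if the
    dependencies are circular."""
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical dependency map, for illustration only.
dependencies = {
    "domain-controller": set(),
    "dns": {"domain-controller"},
    "storage-luns": set(),
    "sql-server": {"dns", "storage-luns"},
    "line-of-business-app": {"sql-server"},
}

print(start_order(dependencies))
```

More than one order can be valid, so the exact output may vary. What matters is that every system appears after the things it depends on, which makes the list a reasonable starting point for a run book.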
Populate our system – So how do we get a system and its data there to test in the first place? Today, one of the things you should certainly consider in your production systems is not only how you protect them, but how easily your protection technology will allow you to test recovery.
This customer had two key components in their production system that gave them a flexible DR environment in which we could test their DR plans, importantly without breaking any of their live DR environment while we tested.
In their case, the system they had was a mixed hypervisor estate (Hyper-V and vSphere) with an underlying NetApp storage infrastructure, which meant we could use application-integrated snapshots to give consistent systems and then mirror and clone technology to present our test infrastructure.
Review what you are running. There is plenty of really good tech out there at the minute that can help ensure data is well protected and can also quickly present DR test environments: solutions like Catalogic DPX, Veeam and Actifio, as well as a range of smart “as-a-service” offerings.
Look at what you have and how it allows you to populate your test environment. Does it help you meet your DR needs?
How do we test? – One thing that often trips us up at this stage is thinking about how we actually test our environment. If you think about it, we have presented an isolated virtual DR environment, so it’s important that we have a mechanism to test it. Things like a virtual PC image can be useful here, so we can reproduce our production desktop environment.
Again this comes down to what the test environment looks like, but we do need to ensure we have a way of connecting to our applications and data.
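As a first check that we can connect to our applications, a script run from a test desktop inside the sandbox can confirm that each recovered service is at least accepting connections. This is a rough sketch using only the Python standard library. The function name is made up for illustration, and a successful TCP connection only shows that something is listening, not that the application is healthy.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the
    timeout, and False on refusal, timeout or a name lookup failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `is_reachable("dc01-test", 389)` would check LDAP on a recovered domain controller and `is_reachable("sql01-test", 1433)` would check a SQL Server instance on its default port. Both host names are hypothetical.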
Things we needed to test? – This may seem obvious, but it’s very important to think about, as it will dictate the kind of test you do. In the case of this test, we had highlighted a set of key servers that we needed to test initially. However, because of the way the infrastructure was developed, we could actually test pretty much everything.
In your case, it may not be full environments; maybe it’s just discrete datasets. Define what’s important, then look at how you test it.
Was it worth it? – That’s a strange but good question: was it indeed worth all the work (and there was quite a lot of work required to develop the test environment)? Well, really it’s not a question we can answer for you. For this customer it absolutely was, for a number of reasons, not least that as a business they realise they can’t operate without their key IT in the event of some kind of outage.
The testing means that in a few weeks’ time, when we do a full DR test, we can be very confident that our key infrastructure servers will run, and we can spend our time on developing our DR “run books”. Then, in the event of an actual incident, we will be confident we can recover.
Also, a strong test environment means we can test more regularly than we could if our only option was full and disruptive DR tests.
Another useful side effect of the testing we did was that we identified some issues in the production environment that the organisation was not aware of, which has allowed them to address those issues as part of a separate project.
Will it be worth it for you? Well, I would always say yes. If you can prove you can recover your key systems under test, you know you have a robust recovery option that hopefully you’ll never need. Ask yourself how long your organisation would function if it lost its key IT systems, and then decide if testing your recovery plans is worth it.
What do we do next?
Well after some successful tests, the next stage will be to do a full DR test, so we can see if there are any areas we didn’t identify in the test environment that may cause us issues in the event of a real outage.
But alongside this, some of the areas that were highlighted in the test environment have led to a new project where we are reviewing some of the critical systems that were identified, to ensure we are protecting them appropriately to meet the organisation’s needs.
How do we define appropriately? This comes back to identifying our recovery point and recovery time objectives for our key applications and data. As I’m sure I’ve mentioned before, there is no point in having a system that backs up your data once a day if, in the event of a system outage, you couldn’t afford to lose more than 4 hours of data. If you are going back to a 24-hour-old backup, to be honest, you may as well not have bothered.
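The arithmetic behind that example is simple enough to sketch. The snippet below assumes, as a simplification, that the worst-case data loss equals the gap between backups, and ignores how long a backup takes to run or replicate. The function names are illustrative only.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval):
    """If a failure hits just before the next backup runs, everything written
    since the last backup is lost, so the worst case equals the interval."""
    return backup_interval


def meets_rpo(backup_interval, rpo):
    """True when the worst-case loss is within the recovery point objective."""
    return worst_case_data_loss(backup_interval) <= rpo


rpo = timedelta(hours=4)
print(meets_rpo(timedelta(hours=24), rpo))  # False: a daily backup can lose up to 24 hours
print(meets_rpo(timedelta(hours=1), rpo))   # True: hourly protection fits a 4-hour RPO
```

Running the same check against each key application's objective is a quick way to spot systems whose protection doesn't match what the business says it can afford to lose.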
Designing DR solutions and processes to carry out in the event of a continuity incident is not a straightforward task. However, a flexible data protection mechanism can allow you to regularly test and develop your plans.
In this post I’ve shared with you some of the considerations we had in designing and testing the continuity plans within this specific organisation. Hopefully they are something you can take into your own organisation to build a DR plan, both to test and, of course, to actually be able to use in the event of a real issue.
- What do we need to test?
- How do we test it?
- What do we need to test it?
- Build a “sandbox” environment
- Populate our environment
- Prioritise your IT recovery and continuity plans
Hope you find this useful… Happy testing!