IBM VIOS Configuration Backups Disaster Recovery Advice
I recently had a discussion with our Managed Services Team about the above topic. I have customers who back up the IBM i operating system and data, as I would expect, but have no visible backup of the VIOS partitions that underpin the whole infrastructure. Rebuilding partitions after a disaster can only begin once VIOS is recovered, so I wanted some assurance that this could be handled in the best possible way. We all appreciate the importance of regular backups to ensure there is no data loss, but backing up the underlying OS and the configuration that holds the system together is equally important. I have no hesitation in suggesting that the majority of customers do not perform a regular backup of their VIOS environments and would look to scratch install should they experience an outage. If you have everything documented and up to date, the install should not unduly delay the restore of partition data. If not, read on, as there is an alternative way to recover this data.
As most customers have a dual redundant VIOS configuration, it is often not deemed necessary to perform OS backups for them. This is because, in the event of a VIOS server failure, all the logical partitions it serves with I/O will continue to run on the remaining active VIOS server.
The lack of an OS backup is normally not an issue. However, should a VIOS server ever require rebuilding from scratch, all its virtual device configurations would need to be recreated. This is a time-consuming task requiring manual reconfiguration of all the virtual devices from the initial build documentation. While entirely possible, this leaves customers exposed in terms of redundancy for the period of the VIOS rebuild.
As part of the initial VIOS build by the Chilli-IT Professional Services Team, a schedule is created on each VIOS to make a local backup of the configuration files. The schedule runs daily and uses the viosbr command to create a rolling 10 day backup of all configurations required for a rebuild. The command produces an .xml file and compresses it into a *.tar.gz archive. This file can easily run to tens of thousands of lines for a system with many logical partitions, which gives some idea of the potential complexity of manually reconfiguring the VIOS. While a local copy is useful both for reference and for restoring configurations on a running VIOS server, it doesn’t help if a failure leaves a VIOS that will not boot.
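As a rough illustration of the daily schedule described above, the sketch below builds the viosbr command line that asks the VIOS itself to keep a rolling set of backups. It assumes the schedule uses viosbr's own -frequency and -numfiles options; the actual Chilli-IT schedule may be set up differently (for example via cron), and the file prefix vios1_cfg is a hypothetical name.

```python
import shlex

# Values accepted by `viosbr -backup -frequency`.
VALID_FREQUENCIES = ("daily", "weekly", "monthly")


def viosbr_schedule_command(file_prefix, frequency="daily", numfiles=10):
    """Build the padmin command that schedules a rolling viosbr backup."""
    if frequency not in VALID_FREQUENCIES:
        raise ValueError(f"unsupported frequency: {frequency!r}")
    if numfiles < 1:
        raise ValueError("numfiles must be at least 1")
    return [
        "viosbr", "-backup",
        "-file", file_prefix,         # prefix for the generated archives
        "-frequency", frequency,      # how often the VIOS takes the backup
        "-numfiles", str(numfiles),   # how many archives are kept before rotation
    ]


# Hypothetical file prefix; the printed command would be run as padmin on the VIOS.
print(shlex.join(viosbr_schedule_command("vios1_cfg")))
```

Building the command as a list rather than a single string keeps the arguments unambiguous if it is later passed to an SSH wrapper or automation tool.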
Chilli-IT have seen a recent instance where a customer had a complete VIOS failure, so although it’s rare, it can happen. So is it possible to speed up the rebuild process and reduce the exposure window caused by running on a single VIOS server?
To help reduce our Managed Service customers’ exposure in the event of a VIOS failure, Chilli-IT have implemented an automated configuration file backup regime. This is also done for their Storwize and Fibre Switch configurations, so the same shortened rebuild time applies to these devices too.
Every morning a schedule runs a shell script on the Procare servers at each customer site, which connects to each VIOS and downloads all the configuration files. These are then pulled using Ansible from the customer sites to our central Procare servers so they would be available in the event of a site loss. The daily backup also drops these files via an Ansible push onto our customer Wiki page for easy retrieval by our consultants.
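The collection step could look something like the sketch below. This is not the Procare script itself, only an illustration under stated assumptions: that the backups sit in the default viosbr directory /home/padmin/cfgbackups, that the padmin user can be reached over scp with key-based authentication, and that the host names vios1 and vios2 and the destination path are hypothetical.

```python
import os
import subprocess

# Default directory viosbr writes to for the padmin user.
REMOTE_BACKUP_DIR = "/home/padmin/cfgbackups"


def build_fetch_command(host, dest_root, user="padmin"):
    """Return the scp command that copies one VIOS's backups into dest_root/host/."""
    source = f"{user}@{host}:{REMOTE_BACKUP_DIR}/*"
    dest = f"{dest_root.rstrip('/')}/{host}/"
    return ["scp", "-p", source, dest]  # -p preserves modification times


def fetch_all(hosts, dest_root, dry_run=True):
    """Fetch backups from every host and return the hosts that failed."""
    failed = []
    for host in hosts:
        cmd = build_fetch_command(host, dest_root)
        if dry_run:
            print(" ".join(cmd))  # show what would run without touching the network
            continue
        os.makedirs(cmd[-1], exist_ok=True)  # one local directory per VIOS
        if subprocess.run(cmd, check=False).returncode != 0:
            failed.append(host)
    return failed


# Hypothetical host names and path; dry_run only prints the commands.
fetch_all(["vios1", "vios2"], "/tmp/vios-backups")
```

Returning the list of failed hosts, rather than stopping at the first error, means one unreachable VIOS does not prevent the other from being backed up.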
Using these configuration backup files, the VIOS rebuild process is greatly shortened. Once the base build is complete, the latest configuration backup can be uploaded to the VIOS and the virtual device configurations restored using the viosbr command with the -restore option. This process would take seconds, whereas inputting the configuration manually from the initial build documentation is likely to take hours; the more partitions, the longer it will take.
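To make the restore step concrete, here is a minimal sketch that picks the newest archive from a local copy of the backups and prints the viosbr commands to run as padmin once that archive has been uploaded to the rebuilt VIOS. The archive names are hypothetical placeholders, and the "view first, then restore" order is a suggestion rather than a documented Chilli-IT procedure.

```python
import os
import tempfile
from pathlib import Path


def latest_backup(backup_dir, pattern="*.tar.gz"):
    """Return the most recently modified backup archive in backup_dir."""
    candidates = list(Path(backup_dir).glob(pattern))
    if not candidates:
        raise FileNotFoundError(f"no backup archives in {backup_dir}")
    return max(candidates, key=lambda p: p.stat().st_mtime)


def restore_commands(archive_name):
    """Commands to run as padmin once the archive is on the rebuilt VIOS."""
    return [
        ["viosbr", "-view", "-file", archive_name],     # inspect the contents first
        ["viosbr", "-restore", "-file", archive_name],  # then recreate the virtual devices
    ]


# Demo with two empty placeholder archives (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    older = Path(tmp) / "vios1_cfg.01.tar.gz"
    newer = Path(tmp) / "vios1_cfg.02.tar.gz"
    older.touch()
    newer.touch()
    os.utime(older, (1_000_000, 1_000_000))  # force an older modification time
    os.utime(newer, (2_000_000, 2_000_000))
    for cmd in restore_commands(latest_backup(tmp).name):
        print(" ".join(cmd))
```

Selecting by modification time relies on the copy step preserving timestamps; if it does not, sorting on a date embedded in the file name would be a safer choice.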