During one of my shifts working on an outsourced helpdesk (one of many help desks I have had the pleasure to work with), I noticed that I was getting alerts from a remote Oracle server.
We had a batch utility that basically pinged the servers, and if there was a timeout we'd get an alert on our desktop to open a helpdesk case as a P1. As the helpdesk agent I would retain the case but engage the various technical resources available.
As we ran a 24/7 operation, we had tight deadlines for things like backups. Every on-site server (we supported a nationwide financial organisation) needed to be backed up before business opened. Preferably, the backups would have completed about an hour before that time.
When I was alerted it was, needless to say, about an hour before the business started for the day. As the case developed, we realised that the server was completely gone. Without the Oracle server, this office was out of action and losing large amounts of money.
We contacted the on-site engineer, who was pretty sheepish. It seems that he was building a server and had come in early to start. Of course, he started the install correctly: insert a bootable floppy so he could format the drives, then create the file system and partitions, all via the install disk.
The only problem was that he put the disk in the wrong machine and had forgotten that he had not turned the machine on. So the Oracle server got shut down ungracefully and then auto-booted straight into formatting the drives. Lovely!
The engineer learned from the mistake, and I doubt he will ever do something like that again. Mistakes happen. Sure, there was a chewing out, but last I heard this engineer is very dedicated and working on some very nice projects. Lesson learned.
I learned much from this incident as well. I learned that you cannot remove all risk, but you can put processes in place to mitigate as much of it as possible.
・ Do not have staging areas in the server room
・ Use a worksheet detailing not only the work to be done but also all the other info needed (asset tags, user group owner, configuration info)
・ Follow a defined release management process, including install sign-off, testing sign-off and placing-into-production sign-off
・ Any automated system that requires a major change in the system state needs human approval before continuing (No booting up and straight into formatting the drives)
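The last point can be sketched in a few lines of Python. This is only an illustration, not the tooling we had at the time: the function name `guarded`, its parameters, and the idea of making the operator type the machine's hostname are all my own assumptions. The thinking is that a destructive step should stop and name the machine it is about to change, so a disk in the wrong box gets caught before anything is formatted.

```python
import socket


def guarded(description, action, hostname=None, ask=input):
    """Run `action` only after a human confirms the target machine.

    Hypothetical helper: the operator must type the hostname of the
    machine that is about to be changed. Anything else aborts.
    """
    # Default to the machine this script is actually running on.
    hostname = hostname or socket.gethostname()
    prompt = f"{description} on '{hostname}'. Type the hostname to proceed: "

    if ask(prompt).strip() != hostname:
        print("No matching approval; nothing was changed.")
        return False

    action()
    return True


if __name__ == "__main__":
    # Stand-in for the real destructive step (formatting, partitioning).
    guarded("Format all drives", lambda: print("Formatting..."))
```

An engineer who believes he is sitting at the new build, but is prompted with the name of the production Oracle server, has one last chance to walk away.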
It's all about mitigation!