Thursday 13 August 2009

Ten Thousand Fatal Alerts

Sometimes, even with the best will and intention, things just go wrong.

One night, when I was working in what was in essence a NOC, our Tivoli tool (which I do really rather like!) started to generate some fatal traps. Normally we would delete these, as they were false positives.

So after deleting these traps (about five, I seem to remember) I went off to make a coffee... well, it was about 1am!

I got back to my workstation ten minutes or so later and my console was full of these alerts, and the count was rising! Within half an hour or so I had something like ten thousand of them. And not all were false positives.

You know that sinking feeling you get when you know that things are really not going well at all? Well I had that feeling in spades. I really had little idea of how to tackle this (not being a Tivoli expert) so I called and woke up my team lead.

Fifteen minutes later he was in the office and got cracking on solving the problem. By the end of the night the system had generated tens of thousands of these alerts. Amazingly enough, my team lead managed to resolve the issue (bad config somewhere in the infrastructure) by about half two. He then hunkered down under his desk just in case it came back.

I know this is not a particularly amusing incident, but there is a lesson in it, and a positive one: getting the right kind of team lead is important. They need to be able not only to lead the team but also to be its technical lead.

Top marks for this team lead!
