Monday, 24 August 2009

Dammit Jim! I'm a Hardware Engineer not a miracle worker!

I really enjoy working with people from different cultures and countries. Not only does it widen my horizons but it also affords the opportunity to meet new people with very different points of view.

However there are some truths that are universal.

I was dealing with a desktop machine that had lost its networking capability. A pretty easy fix...just need to install a new step card and install the drivers. This card was not on the NT4 HCL (hardware compatibility list) and therefore the device drivers had to be installed by hand.

As I was several thousand miles distant from the machine I needed to engage with the local hardware engineers. So I called them and stated what the issue was and what I needed to have done, which was really just to install the step card and put a floppy (the install disk) in the disk drive.

They were more than happy to do the install. No problems. Until I came to the part about the floppy.

'Sorry, but we are hardware engineers. We don't "do" software.'

Picking myself up off the ground after hearing this, I replied that that was a very interesting answer and that while yes, they were indeed hardware engineers, driver installs are part and parcel of the job.

Much to-ing and fro-ing ensued regarding this. The engineers were resolute that they should not be installing any software or middleware or any other kind of ware.

Eventually the end user herself installed the entire thing including drivers in about ten minutes.

Last I heard there were two hardware engineers looking for work. They did not realise that the machine they needed to fix was actually their boss's. Oops!

Lessons learnt – well for the hardware engineers I'd say the top one is to find out who has a problem before playing games.

Tuesday, 18 August 2009

Plug the Hole! The Birds! The Birds! The Horror!

When you are looking to build out a Data Centre there are usually a few options one should consider from the start.

Are you looking to build a brand new facility or looking for an existing building? This is a good start and as with most things one needs to understand the impact of that decision. In this post I am looking more at pre-existing facilities.

There are many factors to consider when looking at a possible Data Centre facility. Location, security, size, access, utilities, wildlife...all must be considered.

Some very brief examples (I know there is much more but for the sake of my fingers I will jump to the chase)
  • Location - this comes with a price tag. Central London is expensive but siting your DC out in the middle of nowhere also has costs.
  • Security - is it located near an airport? What is the surrounding area like? You don't really want to place your DC in a high crime area.
  • Size - This is a good one. Can your Data Centre withstand growth if needed? Is there space?
  • Access - No bloody stairs!* How easy is it for kit to be dropped off, staged and rolled out into the actual machine room?
  • Utilities - Does the area suffer from brownouts or blackouts? Who do you share your mains power with? How old and how well maintained are local water services?
  • Wildlife - When buildout has completed make sure there are no creatures left behind.
Ok yes the last one is slightly odd I admit. However there is a story about this. At the time not amusing for all concerned but as with many harshly learnt lessons, time lends a hand to soften the blow.

I was visiting a DC when there was a commotion in the server room. I noticed a couple of engineers running past the window into the server room. I looked at my hosts, the DC Manager and the DC Director, who looked back at me with looks of 'no...we are not crazy here!'. Obviously I don't have much of a poker face!

So the DC Manager rushes into the server room while the Director and I make idle chat. After about 5 minutes of watching the activity in the server room and making small talk, the Director goes into the server room as well.

The next sight was one I really would never have thought I would see. There was the DC Director using their jacket to catch something. So I amble closer to the window and peer in.

Birds. Well pigeons. Flying rats. Two of them had somehow managed to get in this room. That is supposed to be sealed. So I was given a top class show on how geeks catch birds and how managers deal with the supervising aspect of such a project. I was pretty floored and counted my lucky stars that I was not involved. Mainly as I am sure I'd have flapped around like a headless chicken as well. Still there was no way I was going to leave!

I think eventually the birds got tired and the good Data Centre folks caught them and released them outside. Then they spent a goodly time trying to figure out what happened. While they were poking around the server room one of the engineers spotted a pigeon's head poking through from the outside wall.

Seems that there was a smallish hole (you would not have seen it unless you looked straight at it) that these birds were using to get into the server room. Yes...they eventually found the nest as well.

Lessons learned here are to really make sure that your site is actually secure. That includes the facility's integrity. This means that while the buildout of the server room is being done there are checks to ensure that there are no holes.

* <--- I hate stairs. No really.

Sunday, 16 August 2009

You Gotta Be Strong In IT

Bingo. A national pastime for many people. I was tasked to help roll out a bunch of Domino servers across the UK. In reality I was there as muscle.

It was a technology refresh meaning all brand new kit. Including server racks. Which tend to be quite weighty. Make that very weighty. Now one would think that that ought not to be a problem as they do have wheels/casters. And you would be quite correct.

However many of this organisation's Bingo halls were located in Victorian era buildings. No elevators but lots and lots of stairs. Sometimes narrow but always steep. Not fun with only two people. The Domino server install engineer and myself. Luckily my partner was built like a gorilla (and a top notch techie to boot!) and I'm not exactly a lightweight either. Still these things were a real struggle and a real pain to carry.

So one day we arrive at a new site and do a quick very cursory survey and realise with heavy sinking hearts that the server room is not only on the top fifth floor but also that it may as well have been the attic. We didn't look into the room as the manager was not on site yet - as usual the cleaners had let us in.

It took us an entire day just to get the rack up there. The real killer was the last spiral staircase. At least three times I faced a squishing and my colleague a hernia or a snapped back.

So we finally get the thing onto the top floor next to the room. We left then and there for a well deserved beer or two and dinner at our hotel.

We arrive back on site and carry up the rest of the kit - hub, switch, two servers and a tape unit plus cable. On the first trip up the manager is waiting for us by the rack we'd carried up.

He quizzically asks "what is this thing?"

I reply that it's the replacement rack. For some reason I had a not so good feeling about this.

He turns to the door to the server room and says "Well I've never seen one of those! Must have been fun carrying up those stairs though..."

The door opens and yes. There was the old server, switch and hub...sitting on a table.

Thursday, 13 August 2009

Ten Thousand Fatal Alerts

Sometimes, even with the best will and intention, things just go wrong.

One night when I was working in what in essence was a NOC, our Tivoli (which I do really rather like!) tool started to generate some fatal traps. Normally we would delete these as they were false positives.

So after deleting these traps (about 5 I seem to remember) I went off to make a coffee...well it was about 1am!

I get back to my workstation some ten minutes later or so and my console was full of these alerts and it was rising! Within a half hour or so I had something like ten thousand of these alerts. And not all were false positives.

You know that sinking feeling you get when you know that things are really not going well at all? Well I had that feeling in spades. I really had little idea of how to tackle this (not being a Tivoli expert) so I called and woke up my team lead.

15 minutes later he's in the office and gets cracking on solving the problem. By the end of the night the system had generated multiple tens of thousands of these alerts. Amazingly enough my team lead managed to resolve the issue (bad config somewhere in the infrastructure) by about half two. He then hunkered down under his desk just in case it came back.

I know this is not a particularly amusing incident but there is a lesson and a positive one - getting the right kind of team lead is important. Not only for being able to lead the team but also for being the technical lead.

Top marks for this team lead!

Wednesday, 12 August 2009

You Need A Larger Mouse Mat!

I once worked on a project that quite possibly was one of the best run change/transformation projects I have seen and had the pleasure to work on.

I'll not go into the massively well designed infrastructure and the other technical stuff; suffice it to say it brought tears to this ICT support veteran's eyes.

One of the more interesting areas was the engagement of the user community and how the change would impact them. This is one of the areas that can make or break a project and frankly is not an easy part of a project to manage. However with planning as well as understanding the needs of the user community you can do much to limit any user type issues. In this case that meant an intensive 3 week IT training course for the users.

However there are always some...

A colleague had a call from a user who complained that the mouse would not move across the screen properly. So the usual questions followed: is the mouse cable plugged all the way in, is there any slack or give in the cable...all questions we have either asked or been asked.

In this case everything was fine so we requested a reboot of the machine. Of course this did not solve the problem. So we engaged an on site engineer to have a look with the user.

We paid little heed to the case until it was closed and we received the resolution details from the onsite engineer.

"User requires larger mouse mat".

In other words the user just moved the mouse to the edge of the mouse mat and then stopped and somehow expected the on screen pointer to move over the required icon.

Lessons learnt - even with the best will in the world you will encounter issues like this.

Monday, 10 August 2009

When Production Gets Re-formatted

During one of my shifts working on an outsourced helpdesk (one of many help desks I have had the pleasure to work with) I noticed that I was getting alerts from a remote Oracle Server.

We had a batch utility that basically pinged the servers and if there was a time out we'd get an alert to our desktop to open a helpdesk case as a P1. As the helpdesk agent I would retain the case but engage the various technical resources available.
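
For the curious, a check like this can be sketched in a few lines of Python. This is only an illustration and not the utility we actually ran: the names `Alert`, `tcp_probe` and `check_hosts` are made up for this post, and instead of a true ICMP ping it tries to open a TCP connection to port 1521 (the usual Oracle listener port, assumed here) and treats a timeout or refusal as "server down".

```python
import socket
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Alert:
    """A hypothetical alert record handed to the helpdesk agent."""
    host: str
    priority: str
    message: str


def tcp_probe(host: str, port: int = 1521, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout.

    Stand-in for a real ping; port 1521 is an assumption for this sketch.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and name resolution failures.
        return False


def check_hosts(hosts: Iterable[str],
                probe: Callable[[str], bool] = tcp_probe) -> List[Alert]:
    """Probe every host and return one P1 alert per host that did not answer."""
    alerts = []
    for host in hosts:
        if not probe(host):
            alerts.append(
                Alert(host, "P1", f"{host} did not respond; open a helpdesk case")
            )
    return alerts
```

Passing the probe in as a parameter keeps the alerting logic separate from how "reachable" is decided, so the same loop would work with a real ping or an application-level check.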

As we ran a 24/7 operation we had tight deadlines with things like back ups. Every on site (we supported a nationwide financial organisation) server needed to be backed up before business opened. It was preferable that the backups had completed about an hour before that time.

When I was alerted it was, needless to say, about an hour before the business started for the day. As the case developed we realised that the server was completely gone. Without the Oracle server this office was out of action and losing large amounts of money.

We had contacted the on site engineer who was pretty sheepish. Seems that he was building a server and had come in early to start. Of course he started the install correctly: insert a bootable floppy so he could format the drives and then create the file system and partitions, all via the install disk.

Only problem is that he put the disk in the wrong machine and had forgotten that he had not turned the machine on. So Oracle server gets shut down ungracefully and then auto boots into formatting the drives. Lovely!

The engineer learned from the mistake...I doubt he will ever do something like that again. Mistakes happen. Sure there was a chewing out but last I heard this engineer is very dedicated and working on some very nice projects. Lesson learned.

I learned much from this incident as well. I learned that you cannot remove all risk but you can put processes in place to mitigate as much risk as possible.

・ Do not have staging areas in the server room
・ Use a worksheet detailing not only the work to be done but also all the other info needed (asset tags, user group owner, configuration info)
・ Follow a defined release management process including install sign off, testing sign off and placing into production sign off
・ Any automated system that requires a major change in the system state needs human approval before continuing (No booting up and straight into formatting the drives)
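
The last point can be sketched roughly as follows. This is a minimal illustration under my own assumptions, not a tool we had at the time: the function name `run_destructive_step` and the idea of typing the machine's asset tag as confirmation are hypothetical, chosen because the tag ties back to the worksheet mentioned above.

```python
from typing import Callable


def run_destructive_step(description: str,
                         action: Callable[[], None],
                         expected_asset_tag: str,
                         confirm: Callable[[str], str] = input) -> bool:
    """Run `action` only if a human types the asset tag of the target machine.

    Returns True if the action ran, False if it was refused.
    """
    typed = confirm(
        f"About to {description}. Type the asset tag of this machine to continue: "
    )
    if typed.strip() != expected_asset_tag:
        # Wrong or empty answer: do nothing rather than guess.
        print("Asset tag does not match. Nothing has been changed.")
        return False
    action()
    return True
```

Asking for the asset tag rather than a plain "y" forces the engineer to look at the box in front of them, which is exactly the check that was missing when the floppy went into the wrong machine.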

It's all about mitigation!

Friday, 7 August 2009

Uses For A Car Jack In A Server Room

When I take on new ICT management roles I usually have a look at the facility's IT infrastructure with an emphasis on the server room or data centre.

This happened on my first day.

One design scheme in data centres uses what we call a raised floor. This means that there is a space of about a foot between the concrete floor and the tiles, which rest on a metal framework. Of course the design has an intrinsic flaw and that is how much weight it can handle before collapsing, which could be a bit of a problem. Especially when the weight is from one of those big Airedale* air con units leaning forwards by about 7 to 10 degrees off centre.

I turn to the facilities guy and quietly say 'ok...move back slowly...not a sound'. He looks at me and says 'oh that's ok, look' and proceeds to walk in front of this beast of a device. I swear I could see the thing swaying. I scanned the trajectory and realised that the Symmetrix storage array (several hundreds of thousands of UK pounds worth of kit) was just plumb in the way. More a disaster than a problem.

So I go to my car, get my car jack as well as his, and we jack the unit upright. It was a bit iffy to start with I have to admit.

I never did get my jack back!

* not really sure of the exact weight but certainly in the region of a ton. And I have to say they really did the air con job well.

Lessons Learnt?

While reading posts on another website I realised that we all have our horror stories of ICT support. So I thought as I have quite a few accounts to relate that I may as well give it a go by using a blog.

Yes. I am a blog virgin. So this should be a wild ride. I will take you from the dizzying heights of bizarre user requests to the lows of some pretty wild asks from businesses. Such as the company that insisted that all mice were asset tagged and logged on an asset register. The fact that the tag actually made the mouse unusable did not seem to be a concern.

You...yes you! You too are a part of this and I would enjoy hearing from you. Please feel free to send me any stories - nfmueller at yahoo. co. uk

Please note - this is not to be used to publicise any grievance towards any organisation or persons by myself or any other participant. This blog is to relate amusing stories that also can be learned from. And as an added bonus technical subjects will crop up here and there :)

Welcome and let's record those lessons!