Where Everybody's Crazy

I'm a missionary in Japan. The name of my mission agency is WEC International. That's supposedly Worldwide Evangelisation for Christ, but I think I have a better idea about what it stands for...

2008-07-02

We're back: Reason for Outage

It's been a horrible three days. Both my servers have been down, which meant no mail for me, none of my web sites up, no mail for WEC Japan, no WEC Japan web sites up, no lists up, no mail or web for my other users. Sorry, guys. We would have got things together much faster were it not for the hosting company, RapidSwitch. They were worse than unhelpful to begin with, but redeemed themselves towards the end. To keep things fair, I've interspersed the story with their excuses.

On Sunday afternoon my time, there was a power failure at the hosting company's facility. They say:

At 4.43am on Sunday morning the building lost mains power. The building suffered a power failure which caused the automatic systems to start the generator, which ran as expected. The system is then design to switch off the Air Circuit Breaker (ACB) to the mains feed, and close the ACB to the generator, thus supplying the UPS with generator power. This worked as expected and the generator took the load. Approximately 2 minutes later, the power cut ended, and power was restored switching down the generator and operating the ACB's to switch back to mains, which all worked as planned.

Shortly after this there was a further power cut, which re-started the above sequence, in that the generator started (successfully), the mains ACB opened (successfully) and the signal was sent to close the generator ACB. This signal was sent to the ACB, however the ACB failed to close, thus meaning that the generator could not supply the UPS with power during the power cut. The UPS worked as expected and took the load. During this time the mains came back on. The ACBs have a physical and electrical interlocking system, which prevents both ACBs from being operated at the same time, thus preventing the possibility of both mains and generators feeding the load, which would result in a severe failure. Because the signals were sent to the generator ACB to close, but it never did, the interlocking systems got into a state of deadlock, where they were both stuck in an 'open' position, thus leaving the UPS with no feed, resulting in the batteries draining down after 15 minutes, and the system loosing the critical load.
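For the programmers reading, here is a rough sketch of how I picture that deadlock. It is only a toy model under my own assumptions: the `Breaker` class, the `close_pending` flag and the `interlocked_close` function are names I made up for illustration, and RapidSwitch's real control logic is certainly more involved than this.

```python
from dataclasses import dataclass


@dataclass
class Breaker:
    name: str
    closed: bool = False
    close_pending: bool = False  # close signal sent, not yet confirmed
    faulty: bool = False         # hypothetical fault: breaker ignores close signals

    def signal_close(self) -> None:
        self.close_pending = True
        if not self.faulty:
            self.closed = True
            self.close_pending = False

    def open(self) -> None:
        self.closed = False


def interlocked_close(target: Breaker, other: Breaker) -> bool:
    """Close target only if the other breaker is open and has no close in flight."""
    if other.closed or other.close_pending:
        return False  # interlock refuses: both feeds must never be live at once
    target.signal_close()
    return target.closed


def second_power_cut(generator_faulty: bool) -> tuple:
    """Return (mains_closed, generator_closed, ups_has_feed) after mains returns."""
    mains = Breaker("mains", closed=True)
    generator = Breaker("generator", faulty=generator_faulty)

    # Mains fails: open the mains breaker, then try to close the generator one.
    mains.open()
    interlocked_close(generator, mains)

    # Mains returns: release the generator breaker if it closed, then re-close mains.
    if generator.closed:
        generator.open()
    interlocked_close(mains, generator)

    ups_has_feed = mains.closed or generator.closed
    return mains.closed, generator.closed, ups_has_feed


print(second_power_cut(generator_faulty=False))  # healthy breaker
print(second_power_cut(generator_faulty=True))   # breaker that never closes
```

With a healthy generator breaker the model ends up back on mains. With a breaker that never closes, the unconfirmed close signal blocks the mains breaker too, both stay open, and the UPS is left running on batteries alone.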

So far, annoying, but not their fault. As a result of the power bouncing up and down, our server suffered a hardware fault affecting the IDE controller. (Also not RapidSwitch's fault, really, as much as I'd like to blame it on them.)

I brought the server back up, but within a few minutes it had become unresponsive even on the serial console. RapidSwitch have a facility for connecting up a keyboard, video and mouse to the server and making these available through a VNC session over the network. I got them to connect this up for me so I could see what was happening on the screen. Despite repeatedly rebooting the server, not a thing happened on the screen. I therefore presumed that the hardware was completely dead. In reality, however, the RapidSwitch technician had managed to connect up the KVM without actually noticing the server was powered off. Not great.

I have checked our logs for this and found a KVM session was processed at 11:38:27 on 29th June. The session was activated by one of our newest technicians and unfortunately he has clearly made a mistake. I will never condone rushing a job, but given the circumstances during the day, I think a slight mistake by such a junior member of staff is at least partially understandable. I will speak to him and highlight the effect that not checking his work has had. I am confident this is not something that happens except in extreme circumstances.

Because we thought the machine was dead, we thought the best thing to do was to order a new dedicated server from RS. And since a server is kinda useless without data, we asked them to help us transfer the disks from the old hardware to the new one. This was, apparently, anathema to them, so they very kindly cancelled the order and left us to start again.

I am sorry this is not part of the process we can do in our standard order processing. We have cancelled your orders because we cannot fulfil them to your requirements.

So we started again. They built the machine, it arrived in the rack and I started to set up the infrastructure; at this point I still held the vain hope that they would help us to transfer the data later, since, you know, a server without data is useless, and, well, we'd told them several times we needed to do this. While I was setting things up, the machine failed, twice. I guess they had given us a new computer with dead IDE hardware as an exact replacement of our old computer with dead IDE hardware.

I can assure that providing defective hardware is a very rare occurrence and usually only happens with new hardware that is faulty. If a piece of hardware is faulty it is highly unusual for the installation and update process to complete successfully without it being noticed. We certainly taking [sic] the testing of our hardware very seriously.

I complained about this, and was told that if I wanted them to investigate the hardware I would have to agree to potentially being charged thirty pounds a half hour if they didn't find a problem. Faced with the fact that I was being asked if I was willing to accept a new, non-working server and a bill for the privilege, I stepped away from the keyboard and went to bed, leaving Jamie to respond before I did something I might regret.

Before doing so, I asked them to try rebooting the old server for me so I could try to get the data off that. Roughly six hours later, someone went and pressed the power button.

Unfortunately the technicians who were on had no idea how quick a fix your server might be. In that situation they have to deal with problems in order of the oldest ticket first. However, I completely agree, this was too long a period of time for the work to happen. Unfortunately one of the technicians for the night shift called in sick, which was particularly bad timing. I'm sorry that they had so much work on and that we were a technician short. Again, under normal circumstances the problems you had would not have impacted you the way they did.

While I was asleep, Jamie managed to sweet-talk (actually it probably wasn't very sweet) someone into building a second new machine. That one actually seemed to work, and I set up the infrastructure again but still we had no data. I asked them to connect up the old disks to this new, surprisingly-working server, but was told:

I'm sorry, but fitting the drives from a colocated server into a dedicated one is just not feasible. For starters, we obviously don't know what drives are in the existing chassis, not to mention that these are old drives going into a new chassis. We are (believe it or not!) quite particular about our components, as standardisation is a very effective tool for providing constantly high levels of support.

"Constantly high levels of support." At that point the red mist came down and I had to step away from the keyboard again. And we still had no data, and it was now Tuesday.

Things started to improve at this point, though. RS, to their credit, offered to send up a technician to the old box with a USB drive so we could get the data off. They get points for the thought, but of course the old box still had failing IDE hardware, so it was just going to crash again. Which it did.

Oh, did I mention that throughout all this, I'm in Japan and Jamie's on holiday in Cornwall?

He eventually came up with the plan of getting his parents to go into his house, pick up a spare server chassis, take it to the IT guy at one of our clients, and have him drive down to Maidenhead, get the old box out the rack, swap the disks into the spare server, and put it back in the rack. He used the work room at RS to do this, for which we were charged 30 pounds per half an hour, this time for the privilege of supplying our own technician. Do any other hosting services charge for build room time? I know Redbus doesn't.

With the old server resurrected, I started transferring the data onto the new server, finishing around 4:30am on Wednesday morning. My time. Again while I slept, Jamie persuaded RS to give us another KVM session (they were going to charge us for that as well) so we could reboot the new server safely into Xen, at which point we were cooking with gas. Well, there were a few little niggles, one with kernel drivers and one with networking - RS had put the new server on a different subnet to the old one, so we had to do clever forwarding tricks - but by midday on Wednesday, everything was back up and running.

I don't know what to think about RS. They were great once the dust had settled, but when we needed them, they were atrocious, and that's what makes the customer service experience. It's a bit like the Army. An army which is great in peacetime but completely pathetic in the fog of war is going to get routed. And not in the networking sense.


Posted at 05:25:55 in whats-going-on technology

