[lug] server spec
nate at natetech.com
Thu Jun 14 15:17:49 MDT 2007
Sean Reifschneider wrote:
> On Wed, Jun 13, 2007 at 06:59:50PM -0600, Lee Woodworth wrote:
>>> cooling WILL fail. Never been in any data center where it didn't,
>> Even when there are backups for the AC?
> Yeah, this is weird. I've never seen a data center cooling system fail. I
> mean, yes, one of a pair of redundant units may fail, but overall the
> system has continued working. Not in a class A facility. I've only heard
> about it once, the data center for a national warehouse related to one of
> the railroads, IIRC. This was back in the '80s.
Speaking of "class A" facilities, that's another problem for most folks
-- finding those. Which facilities in Colorado would you consider
"class A" that are open to the public?
Level 3? SunGard? Ummm... ViaWest (their downtown site is only what,
about 10 feet above the Platte River? Ever seen the Platte flood? It
will. Eventually.) Qwest's various co-los?
The failures I've seen... (all of these are in-person or via phone where
I was directly involved in recovery activities)...
- Shared cooling-tower coolant failure. Multiple AC units, but all were
using the same cooling tower. Whoops. Glycol everywhere; quite a mess.
- Rainwater infiltration to AC control panel area, AC had to be shut down.
- Extended power failure to a site that had enough generator to run the
equipment but not the AC.
Other "fun" failures... non-AC...
- Both diesel generators failed to start. Always entertaining.
(Battery plant usually doesn't run AC, depending on the site.)
- The inverter generating AC power off the telco-style battery plant literally
blew up after failure of a high-voltage filter-capacitor array. The damage
to the array had been caused during a Florida hurricane two weeks prior to
the failure.
No one hurt, replacement inverter not available/on backorder,
load-shedding of all non-essential equipment done on its "twin" that was
powering the other AC equipment, and emergency electrical work to attach
the rest of the AC load (carefully) to the remaining inverter.
Replacement arrived via flatbed truck the next week and took a crack
crew five days to install.
- Diesel generator blew an oil hose, lost all oil pressure and shut
down. UPS system tripped off-line completely during surges from the
input side caused by the hiccuping generator. Site down.
- Electrical contractor took EPO switch off the wall and was digging
around behind it without authorization when his escort wasn't looking,
shorted terminals on EPO when putting it back on the wall. (Yeah, we
couldn't believe it either, and I got to watch some VERY pissed off
managers throw him into the hallway and ask him never EVER to return.)
And the best one of all...
- New hire technician couldn't get conduit cover at floor level (it came
up out of the raised floor and ran up the wall) to go back on, after
adding multiple -48 VDC electrical runs through it. Decided kicking it
was the best solution he could think of. Rubbed the insulation off two of
the 6-gauge wires already active in the conduit, shorting that -48 VDC circuit
to ground. Breaker on -48 VDC master distribution blew before breaker
on that circuit (shouldn't happen, but it did, and was investigated
later), and all -48 VDC equipment in the facility went down.
[This one was the one where I was the data center "engineer" and man was
I pissed. We had a "come to Jesus meeting" that afternoon where I
described how far up someone's ass my foot was going to become lodged if
I ever heard about anyone kicking a power conduit in my data center ever
again... after we'd recovered all the customer gear that had dropped
off-line. Also had a little "chat" with the person who was supposed to
be supervising the new-hire. That was probably the maddest I've ever
been in my professional career, because not only had they caused the
outage - they did it while I was on a lunch break and didn't call me. I
came back to chaos. Oh man, I was HOT.]
Two other "fun and almost deadly" -48 VDC stories, both were techs doing
the STUPIDEST thing of all, pulling LIVE cables. NEVER EVER EVER do this.
In one, I was on the phone with the 2nd tech at the site, when loud
popping and sparking was heard just before we, 25 miles away, saw all of
our data terminals drop dead (and they were Wyse 50's and 60's, and our
DACS/Mux equipment we were using at the time would spew garbage when it
lost DS1 synchronization... including CTRL-G (bell) characters. The
entire call center started beeping, and I was the only person who knew
why... because I'd heard the zapping through the phone and the cussing).
The tech had pulled a live -48 VDC cable (in a hurry) and had let it
hit the frame of the DACS/Mux equipment we were using. Two cards blown,
the rest of the system survived after a complete power-cycle. Dumb.
In the other, I came in one morning to find various techs excited about
an early morning incident where the overnight tech had started pulling
cables not realizing that he'd accidentally made them live when he threw
the wrong breaker. The cable "zapped" over to the top of a grounded
cabinet while he was standing on a ladder and he had the foresight to
pull the cable AWAY from the cabinet. But then he was stuck. Up a
ladder, alone, with a cable on the end of a wooden broomstick held up in
the air. He was there a while until someone found him and safetied the
circuit properly. We had a number of conversations afterward about the
wisdom of having people work on power alone, and the need for anyone
doing so to carry a phone or other communications device.
> Of course, there have been class A power outages in the news recently, but
> those seem to have been caused by someone intentionally pressing the EPO.
> That happened to Live Journal twice in the last several years.
EPO power downs are boring compared to what happens out there in the
real world. Data center companies will go to GREAT lengths not to have
these stories get out.
It's very useful to have your OWN temperature monitoring in your cabinet
and have it alarm when it crosses thresholds you set... I've seen telco
co-los lose all AC for a day or two and never tell any customers who
didn't ask why the temp went up.
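A minimal sketch of that kind of do-it-yourself cabinet alarm, assuming a
Linux box in the rack reading a sysfs thermal sensor (the sensor path and
the threshold values below are assumptions; substitute your own hardware,
lm-sensors output, an SNMP probe, whatever, and wire the print up to a
pager or email):

```python
# Minimal cabinet-temperature alarm sketch. The sensor path and the
# WARN/CRIT thresholds are assumptions for illustration -- adjust for
# your own cabinet and alerting setup.

WARN_C = 30.0   # hypothetical warning threshold, degrees Celsius
CRIT_C = 38.0   # hypothetical critical threshold

def classify(temp_c, warn=WARN_C, crit=CRIT_C):
    """Return 'ok', 'warn', or 'crit' for one temperature reading."""
    if temp_c >= crit:
        return 'crit'
    if temp_c >= warn:
        return 'warn'
    return 'ok'

def read_temp(path='/sys/class/thermal/thermal_zone0/temp'):
    """Read a Linux sysfs thermal zone (value is in millidegrees C)."""
    with open(path) as f:
        return int(f.read().strip()) / 1000.0

if __name__ == '__main__':
    try:
        t = read_temp()
    except OSError:
        t = None  # no readable sensor on this host
    if t is not None and classify(t) != 'ok':
        # Replace print with email/pager/SNMP trap for real alerting.
        print('ALARM (%s): cabinet at %.1f C' % (classify(t), t))
```

Run it out of cron every few minutes and you'll know about an AC failure
before the co-lo decides whether to tell you.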
In general, it's hard to get across the undercurrent of levity that goes
with these stories in e-mail. They're the "war stories" of telco, and
as long as you weren't the idiot that caused the outage, they're usually
chuckled about, and a good laugh is had by all when things are restored
and the meetings are long over, and the customers aren't yelling anymore.
But if you work in either CO or datacenter environments for long enough
(and in enough of them, they come in all shapes, sizes, and capability
levels) -- you WILL see these failures. As the old guys retire, one of
the things that's lost in telco is the level-headedness and cool
response to big outages at CO's. The new guys panic, and/or do things
that put themselves or others in danger while trying to effect the repair.
Sometimes, when the electrical transfer autoswitch hasn't thrown, and
there's two feet of standing water that just came in through the roof into
the electrical distribution room... you don't play hero and try to turn
things back on. It takes some experience to know when you're risking a
life (yours or anyone else's) versus a couple of million in lost
revenue, and to know the life is worth more than the job.