Tuesday, February 7, 2012

Pain; or, the saga so far

Last week was the YouTube ski trip.  I elected not to go, and to catch up on work instead while most of my coworkers were away.  Since the office was going to be largely deserted on Tuesday and Wednesday, I planned to work from home.

Meanwhile, we've been doing a long overdue major purge at home, with the home office as the current focus.  It has doubled as a storage space for years.  Working in it while hemmed in by mountains of unsorted belongings has been a drag, but over the past couple of weeks we got it emptied out, looking a lot better, and feeling a lot more comfortable.  But we still had to get my big ugly desk out of there and replace it with the nicer, smaller one we had standing by.

Getting the desk out of the office meant powering down my computer and uncabling it.  My desktop computer serves as my mail and web server too: it's the home of emphatic.com, my e-mail domain and family website, and of geebobg.com, my blog.  Managing my own services is a point of pride for an old-school hacker like myself.  I also insist on having control over my own data.  What followed is a crystal clear demonstration of the folly of that desire, and of the fact that the control I believed I had was just an illusion.

So: on Tuesday morning I shut down my computer, taking my domains temporarily offline.  Bit by bit I undid the rat's nest of cables hidden behind and underneath my desk, carefully labeling each one with some masking tape and a Sharpie.  One by one the components moved to other spots around the house: the keyboard and mouse; the USB hub; the external drives; the monitor; a KVM switch; the Ethernet hub; the wifi access point; and finally the computer tower itself, a Dell model.

We moved the desks around, disassembling my old ugly one and removing it from the room.  I reassembled my computing rig, taking care to manage the cables more neatly, and ended up with everything reconnected and no rat's nest!  I switched the computer on... and it wouldn't boot.

I switched it off, waited thirty seconds, and tried again.  It still wouldn't boot.  It got as far as drawing the BIOS welcome screen, but the chunky progress bar stopped halfway.  It wouldn't respond to presses of F2 (the BIOS setup key) or any other key.

I started to sweat.  All the important data on my computer is backed up in "the cloud," but not in a form that's easily reconstituted into a working system.  The prospect of downloading it all from the cloud (once I sorted out whatever hardware failure was occurring) and recreating my emphatic.com and geebobg.com services was daunting, to say the least.

Plus, as long as my computer was down, emphatic.com mail wouldn't be arriving.  That's the domain where friends, family, and businesses all know to reach me.  Luckily I'd had the foresight, years earlier, to have another set of servers (at zanshin.com) serve as a "secondary MX," so any messages to emphatic.com that couldn't get through would get queued up there until emphatic.com was ready again.  But decades on the Internet, much of that time as an e-mail technologist, have caused my peace of mind to depend crucially on the healthy flow of mail.  So I had a lot of constant agitation to look forward to until this issue could be resolved.
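For readers unfamiliar with the idea, here is a rough Python sketch of how a sending mail server chooses between a primary and a secondary MX.  It is an illustration only, not the actual emphatic.com configuration: the hostnames, the preference numbers, and the is_reachable callback are all hypothetical stand-ins.

```python
from typing import Callable, Iterable, Optional, Tuple

# An MX record pairs a preference number with a mail host.
# Lower preference numbers are tried first.
MxRecord = Tuple[int, str]

# Hypothetical records for illustration; real hostnames and values will differ.
EXAMPLE_RECORDS = [
    (20, "secondary.example"),  # backup host that queues mail for later delivery
    (10, "primary.example"),    # the real destination server
]


def pick_mail_host(
    records: Iterable[MxRecord],
    is_reachable: Callable[[str], bool],
) -> Optional[str]:
    """Return the first reachable host in MX preference order, or None."""
    for _preference, host in sorted(records):
        if is_reachable(host):
            return host
    return None


def primary_is_down(host: str) -> bool:
    """Simulated reachability check: only the primary is offline."""
    return host != "primary.example"
```

With the primary offline, pick_mail_host(EXAMPLE_RECORDS, primary_is_down) falls through to "secondary.example", which is the behavior that kept mail from bouncing while the server was dark.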

With the computer refusing to start, I wondered whether a connection was loose or something.  I turned the computer on its side and opened it up.  I carefully vacuumed out a lot of dust, reseated some connectors, and turned the computer on again.  This time it came to life!  Relieved, I powered it off and closed it back up.

That was a mistake.  After that I couldn't get it to boot again, no matter what I did.  I hypothesized either that the hard drive was dying -- it would only boot once every N tries, for some large value of N -- or that the computer's PATA bus was flaky.  (The PATA bus is the motherboard component to which the internal disk drive connects.)

Knowing the unreliability of consumer hard drives, I figured the odds were good that the problem lay with the drive and not the motherboard.  I placed an order online for a new internal drive, with next-day delivery.  Meanwhile I was desperate to get the existing drive running again.  If I could do that for just a few hours, then I could copy all the files on it to an external hard drive temporarily -- I happened to have one sitting around -- and could then recopy them to the new internal drive when I installed it the following day.

Using my wife's computer, I sent out e-mail and Facebook alerts about emphatic.com being down, asking friends and relatives to use my Gmail address instead.  I asked my friend Bart, another e-mail guru who had admin access to my secondary MX, to have it forward to my Gmail account while emphatic.com couldn't receive.  So I wasn't totally offline, but things still sucked.  For the rest of the day, while I should have been doing YouTube work, I fretted and muttered and swore, and several times per hour I switched my computer on to see if it would boot this time.  It never did.

In the evening I began to prepare for installing a new system from scratch the next day.  I started by making a Fedora install CD on my wife's computer.  I had an appointment to get to that night, but the CD finished burning before I had to leave, so I quickly switched on my computer and inserted the CD to see if it would at least boot up that way.  It did.  So I switched it off again and went to my appointment.

Turns out, that was a mistake too.  It booted up with the CD only because the hard drive happened to respond that time!  When I came home later that evening I couldn't get it to start again, with or without the CD.  And I tried over and over, late into the night and beginning again early Wednesday morning.

Despairing, I tried to get some YouTube work done while awaiting delivery of my new internal drive.  Then, unexpectedly, at 9:40 my occasional attempts to boot my machine finally worked!  I know the time precisely because archived in my Gmail account is this chat message that I immediately sent to my wife:

I unpacked the external hard drive that I had standing by.  (I bought it a few months ago, knowing I'd need it sooner or later, when I heard that flooding in Thailand was going to cause hard drive prices to spike.)  I plugged it in and tried to partition it with fdisk but encountered errors I couldn't figure out.  I didn't want to waste any time -- who knew how long the computer would stay up? -- so I rushed to the Best Buy down the street and bought another external hard drive.  I brought it home, opened it up, plugged it in -- and then realized that the problem had been with the first external drive.  It had been automounted as soon as I plugged it in, and fdisk won't work on a mounted partition.  I put the newer hard drive aside, partitioned the old standby hard drive, formatted a filesystem on it, and began copying half a terabyte of data onto it.
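The automount trap is easy to check for ahead of time.  Below is a minimal Python sketch, assuming a Linux system whose /proc/mounts lists one mount per line with the device in the first field.  The function names and the sample text are invented for illustration, and the matching only handles sdX-style names such as /dev/sdb1.

```python
from typing import List

# Sample text in the /proc/mounts layout: device, mount point, type, options...
# These entries are made up for illustration.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw,relatime 0 0
/dev/sdb1 /media/backup vfat rw,nosuid,nodev 0 0
proc /proc proc rw 0 0
"""


def mounted_partitions(device: str, mounts_text: str) -> List[str]:
    """List mounted sources that are `device` itself or a numbered partition of it."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        source = fields[0]
        suffix = source[len(device):]
        if source == device or (source.startswith(device) and suffix.isdigit()):
            found.append(source)
    return found


def read_proc_mounts() -> str:
    """Read the live mount table (Linux only)."""
    with open("/proc/mounts") as handle:
        return handle.read()
```

On a live system, mounted_partitions("/dev/sdb", read_proc_mounts()) returning a non-empty list is the cue to unmount those partitions before repartitioning.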

I watched with a growing sense of relief as the copying proceeded.  I carefully repackaged the newer hard drive and took it back to Best Buy, who, to their credit, accepted it as a return despite it being opened and gave me a full refund.  Everything was looking up.  I got back home, checked in on the copying -- still going -- and settled in to get some YouTube work done, finally.

An hour later, and only a fraction of the way into the half-terabyte copy operation, something began to go wrong.  The Linux kernel was randomly losing track of the plugged-in drive, and then re-detecting it and assigning it a new device name.  Of course each time this happened, the copying was interrupted.  I could get things going again by remounting the drive under its new device name, but the same problem would happen just a minute or two later.  Over and over this happened.
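When a copy keeps getting interrupted like this, it helps if restarting it doesn't mean starting over.  Here is a small Python sketch of a resumable copy, offered as an illustration rather than a record of the commands actually used.  It treats a file as done when the destination already exists with the same size, which is an assumption that suits interrupted copies but would miss same-size corruption.  It handles regular files only.

```python
import os
import shutil


def resume_copy(src_root: str, dst_root: str) -> int:
    """Copy files from src_root to dst_root, skipping files already copied.

    A file counts as already copied when the destination exists and has the
    same size as the source.  Returns how many files were copied this pass.
    """
    copied = 0
    for dirpath, _dirnames, filenames in os.walk(src_root):
        relative = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, relative))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            if os.path.isfile(dst) and os.path.getsize(dst) == os.path.getsize(src):
                continue  # finished on an earlier pass
            shutil.copy2(src, dst)  # copies contents plus timestamps
            copied += 1
    return copied
```

After each remount, calling resume_copy again with the same two directories picks up where the last pass stopped, recopying only files that are missing or were cut short.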

Believing at first that the problem this time was in the kernel's USB driver software, I thought a reboot might fix it.  Here's the relevant part of the chat record with my wife:

me: i think i understand the problem with the external drive. if i'm right, then the same problem would happen with any new drive that i buy. (which i haven't done yet)
me: however, fixing the problem requires restarting linux. which risks the disk-won't-start problem again
me: not sure whether disk-won't-start happens on reboot, or only on power-off/power-on
wife: oh no!
wife: is there anything that you can do to backup to the cloud everything that you need?
wife: and then if it doesn't come back at all then you have everything you need somewhere else?
me: everything important is already backed up. it's just that it's in a form that will make it painful to rebuild the system. what i'm after now is a mirror of the disk i have, everything in the right place and ready to swap right in when the new disk is ready
me: that will make a big difference when it comes to rebuilding the machine
me: i'm nervous about restarting though
wife: you haven't tried a reboot, correct?
wife: only on/off
me: right, i haven't tried
wife: and there's no way to make a mirror image in the cloud?
me: no, it would take days, and mucho dinero
wife: so what's the problem with the other disk?
wife: and what are the chances that you are right/wrong about that?
me: the problem isn't with the disk but with the usb system in the kernel. the chances are good that i'm right about it, given other symptoms that i've seen
me: i just have to do it. wish me luck
wife: good luck
[I do it]
me: success! whew

But the problem returned quickly.  I decided that either the motherboard was bad somehow after all -- although it was hard to imagine a fault that would affect both internal drives on the PATA bus and external USB drives -- or the external hard drive was.  The only way to tell was to return to Best Buy, get that newer external drive back, and begin a new half-terabyte copy onto that.  (When I got there, I offered to buy my opened item back again to save them the trouble of restocking it.  But the paperwork turned out to be too hard, so I just got another off the shelf.)  I got the new copy going and it ran for hours, finally finishing with no further problems late Wednesday evening.  (I later tried the original external drive on a different computer, and it had the same problem there, so that part of the mystery was solved.)

By that time my new internal drive had arrived, so after my files were safely stored on the even-newer external drive, I shut down my computer -- not without a lot of hesitation, wondering whether it would ever come back up! -- and swapped out the old internal drive for the new.

It booted right up with the Linux install CD.  Installing the OS took only twenty or thirty minutes, and then it was time to copy everything from the external drive into a "/old" tree on the new internal drive.  That would take all night; I'd begin the process of installing software and configuring my server (from files and settings I could now pull from /old) the next day.

The next day, Thursday, feeling confident about how things were now going, I decided not to configure the system yet, but instead to reinstall the OS from scratch.  I wasn't happy with some of the choices I'd made when setting up the new disk, such as accepting the pointless default of partitioning the disk into separate root and /home volumes (ensuring one or the other would run up against an arbitrary limit long before the disk itself was full) and overriding the normal system account numbering for the sake of preserving my old numeric user id.  My prior server config was the result of many years of accumulated cruft.  I had the opportunity to start fresh with a clean, perfect config, and wanted to make the most of it.  That night I repartitioned the internal drive, reinstalled the OS, and recopied everything from the external drive to /old.

Friday night was a frenzy of server configuration.  In addition to recreating user accounts, setting up sendmail and httpd, restoring MySQL and WordPress, and so on, I also had to grapple with changes that the current version of Fedora abruptly made to its system administration tools: "systemctl" instead of the time-honored SysV init scripts; "ip" in place of tried-and-true "ifconfig"; SELinux hovering over it all and fascistically preventing anything from quite working; and so on.  (Fedora wasn't broke, but they went and fixed it anyway.)

Finally, by 2am Saturday morning, everything was working perfectly.  Better than before!  The backlog of mail from my secondary MX was arriving and processing apace.  I went to bed feeling relaxed for the first time in days, deeply satisfied with my heroic rescue of emphatic.com.  Still, if I needed more convincing of the value of cloud computing, this was it.  There had been sysadmin emergencies like this before over the years, but this time, even with my computer down, I wasn't isolated, and was correspondingly less frantic.  I still had e-mail and chat and all my important files -- through my smartphone if need be.  Maybe it was time to consider turning over administration of emphatic.com and geebobg.com to a cloud provider.  Google Apps, for instance.  Let the pros handle the emergencies and even the routine stuff.  Let me have my scant free time back.  I'm getting too old for this shit.

When I woke up, things sucked again.  Overnight, my machine had had a "kernel panic."  Now it wouldn't boot again, with the same behavior as earlier in the week: freezing in the BIOS startup screen, the internal disk refusing to spin up.  I tried several times.  I couldn't understand it: the new disk had so clearly appeared to solve the problem.  The computer started right up as soon as I installed it, and had started flawlessly several times since.  Was there really a motherboard problem after all, and not a disk problem?  Had I managed to avoid it only by random chance several times in a row beginning exactly when I installed the new drive?  Or did the problem reside on the old drive, and now somehow the new drive had developed the same problem?

I resigned myself to buying a new computer.  Something was unreliable.  I didn't know what, exactly, but replacing everything seemed like the best course.  I chose a model from the Dell website and chatted online with one of their sales drones about the possibility of avoiding the Microsoft tax: I didn't need Windows installed, and I didn't even need a main disk.  I wanted just to remove the internal drive from my current machine, where all my data and my perfectly restored OS already lived, and plop it into the new machine.  The sales drone consulted his manager and came back with the news that Dell wouldn't sell me a computer that way.

Late in the day I got my machine to boot again.  I placed an order for a new Dell anyway (using a big Dell discount that we had).  It arrives in a week.  I'm a little worried that when I plop the current disk into it, the new one will develop the same problem, but I guess we'll see.  This is where things have been since Saturday night: my server has been running fine since then, but who knows for how long.