[personal profile] pseydtonne
Ever wake up without the alarm clock's prompting but still ask yourself: sheesh, have I slept?

The last three days at work have been exhausting. I'm getting more proficient at my job (translation: I suck less! I suck even less right now!). I'm being forged in fire: throw weird stuff at me and I figure it out. I have two specializations at work: Unix (which suits me to a tee) and database (which, as you will read if you click the lj-cut, puts my triage and funeral director skills to the test).

Still, I was given a very daunting task this week: be the only person on the Database team while the other guy (the one training me) is on vacation. I know jack about databases -- well, slightly more than jack but I couldn't program one. Our team handles the worst-case scenarios: corrupted database containers, resurrection questions. When someone is at the point where they need us, it's already an emergency situation. We have the meat (sorry, "customer") run a test. If the test comes back with anything other than zero, we have the meat tarball or zip up the database directory and send it in. We run the test again, look at the kinds of errors involved and run a program that rewrites the files that determine database locations. We test it again. If it comes back zero, we tell the customer to do what we did and they can get on with their work.
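For the code-minded, that loop can be sketched in a few lines of Python. This is only an illustration: `check` and `rewrite` are made-up stand-ins for the real (proprietary) tools, not their actual names or interfaces. The only thing taken from the description above is that the test returns zero when the database is clean.

```python
from typing import Callable


def triage(check: Callable[[], int], rewrite: Callable[[], None]) -> str:
    """Walk one container through the test / rewrite / retest loop.

    check   -- hypothetical stand-in for the test; returns 0 when clean
    rewrite -- hypothetical stand-in for the program that rewrites the
               files that determine database locations
    """
    if check() == 0:
        return "healthy"   # nothing to fix
    rewrite()              # run on our copy of the shipped database
    if check() == 0:
        return "fixed"     # tell the customer to repeat what we did
    return "escalate"      # the rewrite did not help; dig deeper


# Example: the first test reports 3 errors, the retest comes back zero.
results = iter([3, 0])
print(triage(lambda: next(results), lambda: None))
```

The point of the ordering is that the rewrite is tried on a shipped copy first, and the customer only repeats it once the retest comes back zero.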

If the rewrite (it has another name but that name is proprietary so I'll keep calling it 'rewrite') does not lead to a fixed database, this means serious horrors could await the customer. This is when a customer stops being meat because tenderizing will not do the trick. Rewrite failure means the data inside the database (not just the directions on where to find data) is corrupted. Usually this happens when a hard drive crashes magnificently. Sometimes this happens because a hard drive array or something more elaborate (insert geek terminology for massive storage here) rebooted itself without warning the server which was writing to it. Imagine you had a hard drive so elaborate that it needed a full computer inside it. Keep in mind that it's still only storage, so it can't live on its own -- it needs to be a server's bitch. However, tricks can turn nasty. This is what happened to one of my customers last week.

At first the meat (the customer is still meat at this point) ran a check on the drive itself. Then the meat assumed "okay, I can let the hundreds of users go back to work". Then the users saw massive errors. The meat is one of five meats that look at this stuff, so this meat and the others start calling in tickets to my office. They don't talk to each other first: they just slip individually into panic mode.

We start fielding the calls. Eventually I get three of them and merge them into one call because they are the same thing: data is hosed. This is when the meat becomes customer and I need to switch from Happy Analogy Boy mode to "Sorry, Delores, your toitle is dead" mode. I used to fear this but I'm getting damn good at it.

Before we ask a meat with a db problem to tarball the db, we lay on the important questions: "Do you have a backup of this container? How recent is it? Do you have replicas of this container in other locations?" (Having replicas is a feature of our product but it costs a little extra so not everyone uses it.) All of this leads to the First Scary Thing I Must Say: "Your container is corrupted. It would be wise of you to back up what you have of it now, tell everyone to stop working on it, and prepare to restore from backup or replica. If our basic level of fixing your product fails, it could take us weeks of manual labor to cull out your data. We cannot, I repeat, CANNOT guarantee restoration. During all of this time you will be down. If you have a good backup, it may be a lot easier to restore from it because then you could be back to work in a couple hours."

This is a major reality check for some system administrators. Some admins live for urgent situations and say "restore? I got the 2 a.m. backup here and it's good. I'll get right onto that, tell my devs to save their present stuff to their local drives and I'll call you in two hours. No prob, dude."

Other admins need hand-holding at this point. They have a backup policy but they've never tested it or even restored from it. Maybe the regular admin is sick and this caller is the filler guy. Maybe the caller is unfamiliar with our product -- s/he is completely hip to system administration but has no clue what individual applications do (the person could be a hardware god but just tends the farm). This is when I show the customer how to restore, what it all means, and the weird little steps that get everything back up.

Some admins panic and are wise to do this. They explode. They get abusive. I have not had one cry yet but I have only been on the job a few months. They are going through the Five Stages of Accepting Death. The hard drive didn't just die: so did their day job. The admin failed to develop a tenable backup policy for the company's assets. This means millions of dollars in company property is gone in a flash. By "tenable" I mean the backup is frequent enough to be useful and has been regularly tested for restoration at a moment's notice. It's not good enough to shove a copy in a locker: that copy has to be usable and you have to know how to use it.

One other thing: if you make a backup, don't keep it in the same place as the working copy. Keep a copy offline. Keep a copy in an office miles away on a separate power grid. Make a rotation of copies. Make certain that there is not a single point of failure in the resurrection process.
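As a rough sketch, that checklist can be written down as a Python function that flags the obvious holes in a backup setup. Everything here is an assumption made for illustration: the `BackupCopy` fields, the location names, and the default thresholds (24 hours, 90 days) are invented, not taken from any real product or policy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BackupCopy:
    location: str                             # array or site holding the copy
    age_hours: float                          # hours since the copy was taken
    days_since_restore_test: Optional[float]  # None = never test-restored


def policy_problems(copies: List[BackupCopy], production_location: str,
                    max_age_hours: float = 24,
                    max_test_age_days: float = 90) -> List[str]:
    """Return a list of holes in the backup setup (empty list = none found)."""
    if not copies:
        return ["no backups at all"]
    problems = []
    # Single point of failure: every copy dies along with production.
    if all(c.location == production_location for c in copies):
        problems.append("every copy shares storage with production")
    # Not frequent enough to be useful.
    if all(c.age_hours > max_age_hours for c in copies):
        problems.append("no copy is recent enough")
    # Never (or not recently) proven to be restorable.
    if all(c.days_since_restore_test is None
           or c.days_since_restore_test > max_test_age_days
           for c in copies):
        problems.append("no copy has a recent restore test")
    return problems


# A replica kept on the same array as production and never test-restored:
print(policy_problems([BackupCopy("array-1", 2, None)], "array-1"))
```

This simplified check looks at each rule across the whole set of copies; a stricter version would insist that a single copy be off-site, recent, and tested all at once.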

My customer did not pay attention to this. My customer made unreliable copies. My customer is reluctant to use my product anyway (they got forced into it because they are a division of our parent company and had a solution before we came along). My customer kept the replica on the exact same storage array as the production system. Thus, the copy and the original died together.

I have spent several hours of the past three days on the phone with this customer. I have listened as he started with a bulldog front, bluffing by asking intricate and unnecessary questions about the nature of our product. I had to use my salescritter skills to get him and his circle jerk of coworkers listening to the call to rein in their arguments and understand that the real issue is getting his users back to work.

Isn't it? I mean, every hour his developers cannot write code is at least tens of thousands of dollars lost. Every minute he argues with me is a minute when he is not yet on the road to restoring from backup. Can you imagine that every syllable someone spoke cost you your monthly salary? You'd have your hand on the guy's mouth and wrestle him to the ground. You'd take action. You'd fire the guy.

A couple weeks ago I had a case where the people argued with me hardcore about "why can't we just keep working? Why do we have to lock the containers and stop everything?" They went on and on. Locking the containers means developers cannot write code but it also means that any possible problem does not get compounded while we look for a solution. I had to say "you're grinding gears and you want to know why you should hit the brakes." In their case, I was able to fix the problem and they looked at me like I was a god. That's nice, but I'm just glad it only took 75 minutes of absorbing their blusters.

Yesterday we pulled together a solution for the customer: two-thirds of his containers can be fixed. Since many of those are backups of ones that cannot be fixed, he should be able to get back the majority of his work. He was still acting up, but only slightly. You could hear the bruises in his voice. He'd been battered by this event.

Our company (well, division of a company) is centered on the idea that many eyes are better than two. We set up our work environment for ease of discussion: we talk about problems and hand each other solutions. When we all started talking about the same problem while venting about our individual problems, we realized the scale of the customer's failure. We got a composite of a customer having a major seizure.

Our company's response has been to escalate the customer's problem and eventually create an on-site expert. This person will have to have brass balls. This person will need to know our product and hardware configuration inside and out. Frankly, one person reading this is perfectly equipped for the task but I would not wish the task on you because I respect you far too much to make you suffer these fools and their compounded problems.

I am grateful that I may have been the only level 2 database guy but I was not really alone. I had a level 3 guy with me much of the past couple days. I had many other people making sure I stayed sane in the face of the customer's thrashing. I even got free food, which always helps.

It takes a lot of courage to explain to a customer "this is the position you are in and it is not good." As a result, I am drained. I am still adjusting to this new part of my task and its emotional effects. I slept a lot but I am harried. I woke up this morning unable to go back to sleep because my body had rested eight hours. My mind still wanted some kind of peace, so I started writing this.

I am also horny, which I can't explain. I'm tapped, right? Why should I need to get off? I guess these states are not related.

I feel better now that I have told this story. Please let me know whether it made sense.

Date: 2005-07-16 03:12 pm (UTC)
From: [identity profile] moominmolly.livejournal.com
It made sense to me, but I'm a special case. :) You just described the things I love about your job (and mine, but in a different way). The level 3 guy you must have been working with is one of my favorites. Being able to get a customer into a calm state while simultaneously whipping them into shape is a fun skill, but it is very emotional. In my job, your relationship with the customer genuinely is like a Relationship. I'm babbling.

I'll be down in your neck of the woods (in that corner there) by August.

Date: 2005-07-16 06:12 pm (UTC)
From: [identity profile] pseydtonne.livejournal.com
You will? Yayyyyyy! (flailing of arms and hands in the air a la Kermit the Frog)

-back to cleaning the apartment but I'd earned a break, Dante

Date: 2005-07-16 03:27 pm (UTC)
From: [identity profile] intuition-ist.livejournal.com
you have put your finger on why i will never ever become a system (or database) administrator. not willing to be johnny-on-the-spot 24x7, not willing to bear that kind of responsibility for a company's vital assets. not my dog.

you've also put your finger on why I will never do tech support as a primary job function. i don't have 1% of the patience necessary to deal with "meat".

Date: 2005-07-16 06:14 pm (UTC)
From: [identity profile] pseydtonne.livejournal.com
In contrast, I enjoy being the shepherd of machinery and would like to move in either that direction or learn more and more about the consultative method in the on-site approach. I need to get more versed in several things so I see that as a five-year goal. I dig machinery.

Date: 2005-07-16 05:18 pm (UTC)
From: [identity profile] teddywolf.livejournal.com
Welcome to one of the less enjoyable parts of tech support. You may be doing it on a far higher scale than I ever did but the human part of the equation doesn't change that much.

Date: 2005-07-16 09:55 pm (UTC)
From: [identity profile] fuzzplugjones.livejournal.com
God bless ya if you can take it, much less enjoy it. I get to critical mass easily when I'm surrounded by idiots. And I guess for a long time my only use to a woman was to be able to fix her computer, which ended up grating on me after a while. Anyway, it's true what they say, there are two kinds of computer users: Those who make backups, and those who have never lost data. Like so many things in life, you don't learn until you've fallen flat on your face.

Date: 2005-07-17 04:44 am (UTC)
From: [identity profile] tkitch.livejournal.com
What's a database? Who's Unix?

Date: 2005-07-17 07:41 am (UTC)
From: [identity profile] fuzzplugjones.livejournal.com
Oh stop it. It's bad enough Dante gets so much cool computer shit from you... I'm always so jealous :-)

Date: 2005-07-17 11:54 am (UTC)
From: [identity profile] pseydtonne.livejournal.com
Now now, don't make me call timeout on both of you.

By the way, I'm heading to the Flea Market. I'll keep you in mind... oh wait, I do that anyway. I'll just say "f.p.j. could always use the magic of..." and extrapolate.

Date: 2005-07-17 11:57 am (UTC)
From: [identity profile] fuzzplugjones.livejournal.com
Actually, the motherboard in Scruples blew up yesterday (smoke and everything!) so if you're feelin' generous, an older Athlon (like 1.3ghz athlon) board would be nice to just get in the mail... one that can fit in that Aria case...

:-) Like you need to buy me shit. But I just thought I'd mention.

Date: 2005-07-17 03:01 pm (UTC)
From: [identity profile] cris6848.livejournal.com
reformatvob is proprietary, eh?

The company I'm contracting with had a hard disk crash a few months ago, and we went through all the steps you listed. Fortunately, this was our off-site replica for disaster recovery purposes, and we'll probably blow away the entire multisite and spend a month or two replicating 70 GB of data.

I've been a release engineer using ClearCase since it was DSEE, and currently am a contractor. I'm usually hired by the VP of R&D after what I call a "CM catastrophe." The three biggies are:

1. Customer reports an incredibly bad P1 bug, gets bug fixed within a day, sails along happily for a few weeks. Customer installs next software release, bug comes back, customer goes ballistic.

2. R&D ships software out into the world, customer reports bug, Customer Service asks R&D to fix bug, R&D realizes they have no idea what source code was used to build the release, and thus cannot fix customer's problem.

3. The source repository gets corrupted as you describe above, and all hell breaks loose.

What these three scenarios have in common is that R&D's messy software processes are made visible to the entire corporation, and not only is there a financial impact and schedule slippage, but also the R&D vice president feels serious heat. And someone like me comes in as a hired gun to talk about multi-level backup strategies and replication and "we shall ship no software without a label attached".

As for the personal shit you're taking... Good support people walk the difficult line between being callous and rude and unfeeling, and getting emotionally involved with the customer's issues. Venting in forums like this will help. Good luck!

Date: 2005-07-17 07:52 pm (UTC)
From: [identity profile] pseydtonne.livejournal.com
Not reformatvob, actually. I shan't say outside work which one it is, but it has to do with the db directory of a container and our product need not be started for it to work. You've been doing this long enough for the light bulb to go off.

We don't do the intrusive stuff first. We read all the useful data we can, get the meat to ship what is necessary, do a dry run and then tell the meat which steps to follow.

Date: 2005-07-17 11:42 pm (UTC)
From: [identity profile] cris6848.livejournal.com
Ah, one of the commands in the etc directory that has comments reading "Upon Pain of Direst Death, do not run this script unless directed to do so by Technical Support."

Date: 2005-07-24 06:58 am (UTC)
From: [identity profile] michigansundog.livejournal.com
Great Post!
