This is an excerpt from Chapter 5 of Everyday Oracle DBA by April Wells, copyright 2006, from Oracle Press, a division of McGraw-Hill. Click here to read the full chapter.
Whenever I hear "database down" I think of the 1978 movie
Gray Lady Down where the nuclear submarine Neptune
sinks after hitting a freighter and the folks up top have 48
hours to rescue the crew. It's at those times that I can hear
the sound effects associated with submarines in my head,
as well as the background music, and a sense of fear grips my heart. Okay,
so I should get a life, not just rent one from Blockbuster. But the analogy is
still close.
You rarely have the luxury of 48 hours when it comes to rescuing your
database, however. What usually happens is you have a dozen or so
managers in suits standing in your cube doorway at six in the morning
wondering how long it will be before the database is back up and running.
Rather nerve-wracking. So in this chapter, we'll look at some of the best
ways to get those pesky managers out of your cube as gently and quickly
as possible.
Database down
While it really doesn't happen often, there are times when your database
does crash and burn and you find yourself looking at a SQL prompt that says
a shared memory realm doesn't exist or that Oracle is unavailable. Of course,
this is when you're lucky enough to find out that the database is down before
your users do. When they find out first, you find yourself scrambling to answer
questions while furiously typing and misspelling words that you know you
know how to spell (like sqlplus, or sysdba). I'm a pro at consistently misspelling
"select" any time I find myself under pressure, either from me trying to get
things back under control as quickly as possible or due to those dozen pairs
of manager eyes boring into the back of my head.
The most important thing when confronted with a down database is to
get it back up and running. Then afterward, you need to figure out how, if
possible, to keep it from happening again.
Restarting
The first thing to do is check the alert logs. See if anything jumps out at you
as a reason for the database being down. For example, did one of the DBAs
in your shop do maintenance last night and forget to bring the database
back up? Not that this would ever happen, but just for grins check and see if
the end of the alert logs might show this is the case. Of course, if someone
with just the wrong access decided to go out and kill a bunch of background
processes, maybe because they seemed to be taking a lot of the resources on
the box, and it was too late at night to bother the DBA with stuff like that,
there won't be anything in the alert log or, at best, not much of one.
If there's nothing glaring in the alert log that tells you something horrible
happened (like someone deleted a bunch of data files or maybe all of the
control files are gone) simply try restarting the database. You might be
surprised. Whatever caused the database to crash might turn out to be a
simple and transient thing, and your database will be revived simply by
using the startup command.
The most important thing to users and to management is getting the database
restarted. However, sometimes a restart will wipe out important information,
including evidence of what happened, and you won't be able to find out
what happened. Try to at least dump out the contents (if possible) of some of
the v$ views to help in your analysis before restarting. Once the database is
accessible, you can worry about getting to the root of the problem (sometimes
referred to by cranky upper management as Root Cause Analysis, or RCA).
If it doesn't start
Okay, you've tried the simplest and most straightforward solution -- simply
restarting the database -- but it didn't start. Now what?
Well, you start what could arguably be seen as the fun part of being a DBA
(if you're a truly warped individual, which I am, and if you enjoy a real
challenge). You have to try to figure out why it isn't starting (hopefully as
quickly as possible) and get it back up and running.
If it doesn't stop
Yeah, sometimes the database gets stuck … up. Not only stuck in the up
position, but since "stuck up" implies that you were trying to shut down the
database, no one can now connect to it because a database shutdown is in
process.
It's normal for shutdown to sometimes get stuck, that's a given. Since
"normal" implies you're willing to wait for all connections to disconnect, there
have to be connections out there waiting for someone to do something. The
solution? Kill off all sessions connected to shutdown immediate or (gulp) do
a shutdown abort.
Okay, so that's no big deal, right? Sure, but what happens when shutdown
immediate gets stuck? The emn0 background process sometimes forgets it's
running, goes to sleep and just won't wake up. Sometimes Oracle weirds
out and refuses to be cooperative for some other reason.
If shutdown immediate gets stuck, there are only two ways to bring down
the database. One is to kill –9 on Unix, or kill Task Manager in Windows.
This is usually used only as a last resort, or by overly anxious operators with
just a little too much knowledge. The other is shutdown abort. Yes, this is a
valid way of shutting down the database. Of course, so is pulling the power
plug or pressing the reset button, but Oracle will assure you that it's a valid
shutdown method. It still makes my stomach knot, but I've actually done it.
Of course, I start up as soon as the database is down and then perform
shutdown immediate again so the shutdown is in as stable a state as possible.
Finding out why
Okay, so we haven't gotten it started, and we need to find out not only why
it won't start, but what brought it down to begin with. Chances are that,
when you tried to bring up the database, if there was an issue that caused it
to not come up, it was either sent to the screen or the alert logs and you
have some idea of what didn't work and maybe what the error was that
meant it didn't work. This is an excellent place to start. You can look in the
manuals to find out what the error means, or you can make use of some of
Oracle's valuable resources to help you figure out what happened and
where to go from here.
The oerr utility
The Oracle Error (oerr) utility is provided to you, free of charge on Unix and
Linux, as a way to help you quickly find information about the errors you're
seeing in your database without having to resort to searching with your
favorite search engine or running to the manuals to find out what exactly an
error message means. By using the following command at the operating
system prompt, you can quickly see what Oracle is trying to tell you.
oerr <prefix> <number>
The prefix is usually the three characters preceding the hyphen in the
error that's displayed (perhaps ORA, MSG, PLS, or something else) while the
number part of the command is the number to the right of the hyphen. For
example, did the dreaded ORA-00600 appear in your alert log? If so, you
might be able to get more information by running the following command:
oerr ORA 00600
In this case, as you can see next, the information you get back will
probably be less than useful, as ORA-00600 can cover many different kinds
of errors, but you can still get some idea of how the command works and
the format of the output, although to allow it to fit within the confines of the
book, I had to take liberties with a couple of the line breaks.
$ oerr ora 00600
00600, 00000, "internal error code, arguments:
[%s], [%s], [%s], [%s], [%s], [%s], [%s], [%s]"
// *Cause: This is the generic internal error number for
// Oracle program
// exceptions. This indicates that a process has encountered an
// exceptional condition.
// *Action: Report as a bug - the first argument is
// the internal error number
While the original Oracle-provided version of this utility runs on Unix
but not Windows (because it makes use of the AWK command, and not
every Oracle installation assumes you've installed MKS Toolkit or CYGWIN
to emulate Unix on Windows), there are ways to make a similar utility run
in a Windows environment. This will make you very happy if you happen to
be one of those who have Windows environments without access to a Unix
alternative on which to run the command.
I like to tell SAs who want to debug errors how to use oerr. They can
then look up errors themselves, which is a big help if they can field even a
few user problems.
Click here to download the rest of this chapter.
About the author
April Wells is a database administrator currently working as an Oracle
Apps DBA for TEK Systems in Austin, Texas, where she lives with
her husband and two children. She has over seven years of IT experience
to her credit, including as a COBOL and Visual Basic programmer,
SQLServer DBA, webmaster, Oracle DBA, Oracle Apps DBA,
disaster recovery team member and Linux System Administrator.