Whenever I hear "database down" I think of the 1978 movie Gray Lady Down where the nuclear submarine Neptune sinks after hitting a freighter and the folks up top have 48 hours to rescue the crew. It's at those times that I can hear the sound effects associated with submarines in my head, as well as the background music, and a sense of fear grips my heart. Okay, so I should get a life, not just rent one from Blockbuster. But the analogy is still close.
You rarely have the luxury of 48 hours when it comes to rescuing your database, however. What usually happens is you have a dozen or so managers in suits standing in your cube doorway at six in the morning wondering how long it will be before the database is back up and running. Rather nerve-wracking. So in this chapter, we'll look at some of the best ways to get those pesky managers out of your cube as gently and quickly as possible.
While it really doesn't happen often, there are times when your database does crash and burn and you find yourself looking at a SQL prompt that says a shared memory realm doesn't exist or that Oracle is unavailable. Of course, this is when you're lucky enough to find out that the database is down before your users do. When they find out first, you find yourself scrambling to answer questions while furiously typing and misspelling words that you know you know how to spell (like sqlplus, or sysdba). I'm a pro at consistently misspelling "select" any time I find myself under pressure, either from me trying to get things back under control as quickly as possible or due to those dozen pairs of manager eyes boring into the back of my head.
The most important thing when confronted with a down database is to get it back up and running. Then afterward, you need to figure out how, if possible, to keep it from happening again.
The first thing to do is check the alert logs. See if anything jumps out at you as a reason for the database being down. For example, did one of the DBAs in your shop do maintenance last night and forget to bring the database back up? Not that this would ever happen, but just for grins check and see if the end of the alert logs might show this is the case. Of course, if someone with just the wrong access decided to go out and kill a bunch of background processes, maybe because they seemed to be taking a lot of the resources on the box, and it was too late at night to bother the DBA with stuff like that, there won't be anything in the alert log or, at best, not much of one.
If there's nothing glaring in the alert log that tells you something horrible happened (like someone deleted a bunch of data files or maybe all of the control files are gone) simply try restarting the database. You might be surprised. Whatever caused the database to crash might turn out to be a simple and transient thing, and your database will be revived simply by using the startup command.
The most important thing to users and to management is getting the database restarted. However, sometimes a restart will wipe out important information, including evidence of what happened, and you won't be able to find out what happened. Try to at least dump out the contents (if possible) of some of the v$ views to help in your analysis before restarting. Once the database is accessible, you can worry about getting to the root of the problem (sometimes referred to by cranky upper management as Root Cause Analysis, or RCA).
If it doesn't start
Okay, you've tried the simplest and most straightforward solution -- simply restarting the database -- but it didn't start. Now what?
Well, you start what could arguably be seen as the fun part of being a DBA (if you're a truly warped individual, which I am, and if you enjoy a real challenge). You have to try to figure out why it isn't starting (hopefully as quickly as possible) and get it back up and running.
If it doesn't stop
Yeah, sometimes the database gets stuck … up. Not only stuck in the up position, but since "stuck up" implies that you were trying to shut down the database, no one can now connect to it because a database shutdown is in process.
It's normal for shutdown to sometimes get stuck, that's a given. Since "normal" implies you're willing to wait for all connections to disconnect, there have to be connections out there waiting for someone to do something. The solution? Kill off all sessions connected to shutdown immediate or (gulp) do a shutdown abort.
Okay, so that's no big deal, right? Sure, but what happens when shutdown immediate gets stuck? The emn0 background process sometimes forgets it's running, goes to sleep and just won't wake up. Sometimes Oracle weirds out and refuses to be cooperative for some other reason.
If shutdown immediate gets stuck, there are only two ways to bring down the database. One is to kill –9 on Unix, or kill Task Manager in Windows. This is usually used only as a last resort, or by overly anxious operators with just a little too much knowledge. The other is shutdown abort. Yes, this is a valid way of shutting down the database. Of course, so is pulling the power plug or pressing the reset button, but Oracle will assure you that it's a valid shutdown method. It still makes my stomach knot, but I've actually done it. Of course, I start up as soon as the database is down and then perform shutdown immediate again so the shutdown is in as stable a state as possible.
Finding out why
Okay, so we haven't gotten it started, and we need to find out not only why it won't start, but what brought it down to begin with. Chances are that, when you tried to bring up the database, if there was an issue that caused it to not come up, it was either sent to the screen or the alert logs and you have some idea of what didn't work and maybe what the error was that meant it didn't work. This is an excellent place to start. You can look in the manuals to find out what the error means, or you can make use of some of Oracle's valuable resources to help you figure out what happened and where to go from here.
The oerr utility
The Oracle Error (oerr) utility is provided to you, free of charge on Unix and Linux, as a way to help you quickly find information about the errors you're seeing in your database without having to resort to searching with your favorite search engine or running to the manuals to find out what exactly an error message means. By using the following command at the operating system prompt, you can quickly see what Oracle is trying to tell you.
oerr <prefix> <number>
The prefix is usually the three characters preceding the hyphen in the error that's displayed (perhaps ORA, MSG, PLS, or something else) while the number part of the command is the number to the right of the hyphen. For example, did the dreaded ORA-00600 appear in your alert log? If so, you might be able to get more information by running the following command:
oerr ORA 00600
In this case, as you can see next, the information you get back will probably be less than useful, as ORA-00600 can cover many different kinds of errors, but you can still get some idea of how the command works and the format of the output, although to allow it to fit within the confines of the book, I had to take liberties with a couple of the line breaks.
$ oerr ora 00600 00600, 00000, "internal error code, arguments: [%s], [%s], [%s], [%s], [%s], [%s], [%s], [%s]" // *Cause: This is the generic internal error number for // Oracle program // exceptions. This indicates that a process has encountered an // exceptional condition. // *Action: Report as a bug - the first argument is // the internal error number
While the original Oracle-provided version of this utility runs on Unix but not Windows (because it makes use of the AWK command, and not every Oracle installation assumes you've installed MKS Toolkit or CYGWIN to emulate Unix on Windows), there are ways to make a similar utility run in a Windows environment. This will make you very happy if you happen to be one of those who have Windows environments without access to a Unix alternative on which to run the command.
I like to tell SAs who want to debug errors how to use oerr. They can then look up errors themselves, which is a big help if they can field even a few user problems.
Click here to download the rest of this chapter.
About the author
April Wells is a database administrator currently working as an Oracle Apps DBA for TEK Systems in Austin, Texas, where she lives with her husband and two children. She has over seven years of IT experience to her credit, including as a COBOL and Visual Basic programmer, SQLServer DBA, webmaster, Oracle DBA, Oracle Apps DBA, disaster recovery team member and Linux System Administrator.