In this chapter you will learn how to
- Recover from loss of a controlfile
- Recover from loss of a redo log file
- Recover from loss of a system-critical datafile
- Recover from loss of a nonsystem-critical datafile
It is impossible to corrupt an Oracle database. The mechanism of instance recovery, where redo and undo are used to return the database to a consistent state after an instance failure, guarantees that. It is, however, possible to lose data following media failure—if the DBA has not taken appropriate precautions. The precautions are simple: to run the database in archivelog mode; to multiplex the controlfile, the online logfiles, and the archive log files; and to back up datafiles and archive log files. Following media failure, the backups and the archive logs can be used to recover the database up to the point of the failure, without loss of even one committed row of data. But whereas instance recovery is automatic—indeed, unavoidable—media recovery is a manual process. This chapter will go through elementary recovery techniques. More advanced techniques, applicable to more complex problems, will be covered in later chapters.
Recovery structures and processes
Following media failure, there are different techniques for recovery, depending on which files were damaged. The database consists of three file types: the controlfile, the online redo log files, and the datafiles. Recovery from damage to the controlfile or the online redo log files is a trivial exercise, provided that they were multiplexed. Recovery from damage to one or more datafiles is more complex, but still straightforward.
A damaged controlfile can be replaced with a multiplexed copy or re-created with a CREATE CONTROLFILE command. In extreme circumstances, it can be restored from a backup, but this should never be necessary following media failure, if you have followed a suitable multiplexing strategy.
A damaged online redo log file can be regenerated. Oracle provides an ALTER DATABASE CLEAR LOGFILE GROUP # command (where # is the number of the group with a damaged member), which will delete and re-create the members of a logfile group. If the database is running in archivelog mode (and it should be), the logfile group must have been archived before Oracle will permit execution of the clear logfile command. This is because clearing an unarchived log file group would mean that the archive log stream would be missing one logfile, and therefore that recovery would not be possible. There is a variation on the command, ALTER DATABASE CLEAR UNARCHIVED LOGFILE GROUP #, which will delete and re-create a logfile even if it has not been successfully archived, but after executing this command it is absolutely vital to perform a whole database backup.
A damaged datafile requires use of backups and archive logs. Following media failure resulting in damage to a datafile, there are two options for recovery: complete recovery, meaning no loss of data, and incomplete recovery, where you deliberately lose work by stopping the recovery process before it has completed. Incomplete recovery is an advanced procedure dealt with in Chapter 27. Complete recovery is a two-stage process. First, the damaged file must be restored from a backup. Second, the restored file must be recovered, by using redo information in the archive logs to bring it forward in time until it is synchronized with the rest of the database.
EXAM TIP In the Oracle environment, "restore" means to replace a damaged or missing file with a backup; "recover" means to synchronize the file with the rest of the database by use of archive logs.
Since online redo logs are never backed up by RMAN, RMAN cannot be used to recover from damage to them; repairing online logfiles damaged by media failure can by done only with SQL*Plus, or through Database Control. The controlfile and datafiles can be restored and recovered by RMAN; indeed, if you backed them up into backup sets, RMAN is your only option.
To open a database, all the controlfile copies, at least one member of each online logfile group, and all the online datafiles must be present and synchronized. If, during a startup, SMON finds that this is not the case, the startup will not complete. If a controlfile copy is damaged or missing, the startup will stop in NOMOUNT mode. A message is written out to the alert log detailing which copy (or copies) of the controlfile is damaged. Assuming that the controlfiles are fine, SMON proceeds to open the database. During this phase, it checks the headers of all the online datafiles. If any are missing or damaged, appropriate error messages are written out to the alert log, and the database remains in mount mode. If all the online files are present and not damaged, but one or more of them are not synchronized, SMON will attempt to synchronize them by using the online redo logs. This is the process of instance recovery, detailed in Chapter 18, and will happen automatically. If the online logs required are not available, then the database cannot be opened. If one or more datafiles have been restored from a backup, they will almost certainly be so far out-of-date that the online redo logs will not go far enough back in time to recover them: this is when you must use archive log files for the recovery, which is a procedure that must be initiated manually—from SQL*Plus if you are backing up with operating system commands, or with RMAN if (as Oracle strongly advises) you have committed to using RMAN for your backups.
If the media damage occurs while the database is open, the effect will depend on which files were affected. Damage to any controlfile copy will result in the instance terminating immediately. Damage to a datafile that is part of the SYSTEM tablespace or the active undo tablespace will have the same effect. But damage to an online log will not terminate the instance, as long as there is a surviving member of the logfile group. In fact, the instance will continue to function, and your end users will not even notice. But error messages will be written out to the alert log, and the situation should be corrected without delay; such corrections can and should be done online, while people continue to work. Damage to a datafile that is part of a tablespace other than SYSTEM or the active undo tablespace will also not result in an instance failure, but clearly the end users may have problems, because a part of the database will be missing. How your application will react to this is unpredictable—it will depend completely on how the application is structured. The restore and recovery of damaged datafiles can be done online, provided that they are not datafiles belonging to SYSTEM or the undo tablespace. Finally, damage to the tempfiles that make up your temporary tablespaces may not be noticed by the end users at all. Oracle does not validate the existence of tempfiles until they are needed, and a well-tuned database may never need them. This means that tempfiles can be missing for some time before there is any noticeable effect. It also means that a damaged tempfile can be dropped and re-created at any time, unless it happens to be in use at that moment.
As with backups, a restore can be done with RMAN or with operating system utilities. But if your RMAN backups were to backup sets, rather than as image copies, the restore can be done only with RMAN: there is no other way to extract datafiles from a backup set. Recovery after a restore can be carried out with SQL*Plus commands or with RMAN, but the same restriction applies: only RMAN can extract archive logs from a backup set.
Recovery from media failure
Restore and recovery following media failure is covered in much greater detail in later chapters and the second OCP examination, but it is necessary to know the rudiments of recovery from simple problems for the first examination too. These simple problems are loss of one copy of a multiplexed controlfile and an online redo log file, and complete recovery following loss of critical and noncritical datafiles.
Recovery from loss of a multiplexed controlfile
As long as a surviving multiplexed copy of the controlfile exists, recovery from loss of a controlfile is simple. Just replace it with a surviving copy of the controlfile. To restore the damaged or missing controlfile copy from a backup would be useless in these circumstances, because all copies of the controlfile must be identical; clearly, a restored copy would not be synchronized with the surviving copies, nor with the rest of the database.
Virtually the moment the damage occurs, the instance will terminate. As ever, the DBA's first reaction to a crashed instance should be to attempt a startup. This will fail, in NOMOUNT mode, with an appropriate error message. The alert log will state which controlfile copy is missing, and also—in the section listing the nondefault initialization parameters—how many controlfiles there actually are, and where they are. At this point, you have three options. First, you could edit the parameter file to remove the reference to the missing or damaged controlfile.
This is fine, but your database will now be running on one fewer multiplexed copies, which will presumably be in breach of your security guidelines. A better option is therefore to replace the damaged file with a copy made from a surviving copy or indeed to change the CONTROL_FILES initialization parameter to replace the reference to the damaged file with a reference to a brand new file, and copy the surviving controlfile to that.
EXAM TIP Recovering from loss of a controlfile will entail downtime. It cannot be done online.
Exercise 20-1: Recovering from loss of a controlfile
In this exercise, you will simulate the loss of a multiplexed controlfile and replace it with a copy.
- Connect to your database with SQL*Plus, and ensure that your controlfile
is multiplexed with this query:
SQL> select * from v$controlfile;This query must return at least two rows. If it does not, multiplex your controlfile.
- Simulate damage to a controlfile by crashing the database and renaming one of your controlfiles. Note that on Windows you may have to stop the Windows service before Windows will let you rename the file, and start it again afterward.
- Issue a startup command. The startup will stop in nomount mode, with an "ORA-00205: error in identifying controlfile, check alert log for more info" error message.
- Copy your surviving controlfile to the name and location of the file you renamed.
- Issue another startup command, which will be successful.
Loss of which of these files will cause an open database to crash? (Choose three answers.)
- A multiplexed controlfile
- A multiplexed online logfile
- A multiplexed archive log file
- An active undo tablespace datafile
- An active temporary tablespace tempfile
- A datafile from the SYSAUX tablespace
- A datafile from the SYSTEM tablespace
- A datafile containing critical user data
A copy of a multiplexed controlfile is damaged. What should you do? (Choose the best answer.)
- Replace it with a surviving copy.
- Restore it with RMAN.
- Restore it with operating system commands.
- Re-create it with the CREATE CONTROLFILE command.
Click here to read the answers and the rest of this chapter.
About the authors
Damir Bersinic, OCP, MCSE, MCSDBA, MCT, is an infrastructure consultant with Trecata Corporation, a system integration consultancy in Toronto, Canada. He has more than 20 years of industry experience and has authored a number of titles on Oracle, SQL Server, Active Directory and Windows. His credits include multiple exam preparation titles for Oracle and Microsoft certification programs.
John Watson, OCP DBA 8.0, 8.i, and 9i, Internet Application Server Administrator, and certified on Managing Oracle on Unix, is a senior consultant with BLP Management Consultants in South Africa. He has taught for Oracle University in South Africa for four years. John taught the DB10g New Features course, the DB10g Manageability and Performance seminars, validated hundreds of OCP 10g New Features exam questions and contributed to the course material for all the 10g courses. John is the author of numerous articles on technology.
This was first published in October 2006