9i RAC: Manual Backup and Recovery
Shankar Govindan

Introduction

When we plan to move a large database or a heavily used OLTP database to a cluster setup, to get real mileage from the horizontal scaling of 9i RAC, we usually have many questions: how will we handle the database maintenance tasks that are traditionally done on a single instance, how do we set them up, and what changes or tools need to be in place to handle RAC? In this paper we look at one of the important administration jobs, backing up a 9i RAC database. We will also look at how we can recover from a simple data file loss or, at the other extreme, from a disaster.

Backup Method

We can use RMAN to back up and recover a database, but if the database is huge or backup times run more than 5 hours, precious time is lost. Sites that have large databases can look at shadow copy. We use a Hitachi RAID system and Hitachi shadow copy. (For more information on the Hitachi shadow copy options, you can visit their website.) The backup is done in the traditional manner:

• the tablespace is put in begin backup mode,
• the database files are copied, and
• the tablespace is put back in end backup mode.

The way the shadow backup works is by having a media server with its own set of disks. The disks are attached to the production server and synced (syncing a terabyte takes about 6 hours); the syncing can happen at any time. Once the media server disks are in sync with the production server, we fire a script which puts all the tablespaces in backup mode. The synced mirror disk is then sliced off the production server; this takes less than 3 minutes. The tablespaces are then put back in end backup mode. The backup is now complete, with the database in backup mode for only about 3 minutes. The mirror disk is then copied over to tape offline using the media server. We can also sync the backup archive log directories and copy them over to tape.
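The ordering of these steps is what makes the split backup consistent: the mirror must be sliced off while every tablespace is still in backup mode. The driver can be sketched as below; `begin_backup_all`, `split_mirror`, `end_backup_all`, and `copy_archives` are hypothetical stubs standing in for the real sqlplus calls and the storage vendor's mirror-split utility.

```shell
# Sketch of the shadow-copy backup sequence described above.
# The four functions are placeholders, not real commands.

begin_backup_all() { echo "begin backup all tablespaces"; }
split_mirror()     { echo "split mirror from production"; }
end_backup_all()   { echo "end backup all tablespaces"; }
copy_archives()    { echo "copy archive logs to media server"; }

run_shadow_backup() {
    begin_backup_all    # datafiles are now safe to copy
    split_mirror        # under 3 minutes, mirror is already synced
    end_backup_all      # take tablespaces out of backup mode
    copy_archives       # archives complete the backup set
}

run_shadow_backup
```

The point of the sketch is only the ordering: begin backup, split, end backup, then the archive logs.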
Setting up the environment for Manual Backup

Let's see how we can do a traditional backup in a 9i RAC environment. There are multiple nodes in a RAC setup, all pointing to a single database, but the backup of the database should be initiated from a single node, usually the primary node. The primary node is the one you set up first and from which you do most of the maintenance. The first thing to do is set up your profile in such a way that when you log in as the owner of the database, in our case usually 'oracle', your environment is set correctly. We have to remember that the database name, let's say RPTQ in our case, is only used by clients for connecting to the database (this can also be masked/wrapped by using an alias in tnsnames). When we log in locally to one of the nodes, we have to set ORACLE_SID to the instance name and not to the database name. Let's say we have two nodes (LJCQS034 and LJCQS035) running instances rptq1 and rptq2, pointing to a database RPTQ (remember that SIDs are case sensitive in Unix, and it is better to set them all up in lower case to avoid confusion), as shown in the figure below. When we log in to server LJCQS034 hosting instance rptq1, we set ORACLE_SID to rptq1 (presuming our ORACLE_HOME is shared, or we point ORACLE_HOME to instance rptq1's Oracle home). We execute all maintenance jobs as a DBA locally, and we set up the environment on a single node, the primary node, to run all our automatic maintenance and monitoring scripts. Your cron for the database will run on the primary node, although you can copy the same cron to the second node for failover and activate it manually.
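That primary-node convention can be enforced inside the cron job itself, so that an identical crontab can sit on both nodes while the backup fires only on the primary. This is a sketch under assumptions: `run_backup` is a hypothetical placeholder for the real backup driver, and the primary hostname is taken from our example.

```shell
# Hypothetical cron wrapper: install the same crontab on both nodes;
# the backup runs only where the hostname matches PRIMARY.
PRIMARY=${PRIMARY:-LJCQS034}

run_backup() {
    # placeholder for the real backup driver script
    echo "backup started on `hostname`"
}

if [ "`hostname`" = "$PRIMARY" ]; then
    run_backup
else
    echo "not the primary node, skipping backup"
fi
```

On failover you would only need to change PRIMARY (or the hostname test) on the surviving node, rather than editing crontabs.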
Setting up your .profile for a single node will be as shown below:

#---------------------------------------------------------------------------
# If you are oracle and logging in to one of the instances, set that
# specific environment
#---------------------------------------------------------------------------
if [ "`/usr/ucb/whoami`" = "oracle" ] && [ "`hostname`" = "LJCQS034" ]; then
    . ./rptq1.env
else
    . ./rptq2.env
fi

You can include the same lines in your second node's .profile. The env script rptq1.env should exist and will look like:

#!/bin/ksh
#|--------------------------------------------------------------------------|
#| filename   : /dba/etc/rptq1.env                                          |
#| Created by : Shankar Govindan                                            |
#| Dated      : 23-July-2003                                                |
#| History    :                                                             |
#|--------------------------------------------------------------------------|

# If TWO_TASK is set, unset it
if [ -n "$TWO_TASK" ]; then
    unset TWO_TASK
fi

export LD_LIBRARY_PATH=/usr/lib:/usr/ccs/lib:/usr/ucblib:/usr/dt/lib:/lib:/sv03/sw/oracle/rptqdb/9.2.0/lib:/sv03/sw/oracle/rptqdb/9.2.0/lib32:/sv03/sw/oracle/rptqdb/9.2.0/jdbc/lib:.
export NLS_DATE_FORMAT=DD-MON-RR
export NLS_DATE_LANGUAGE=AMERICAN
export NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P1
export NLS_NUMERIC_CHARACTERS=.,
export NLS_SORT=BINARY
export ORACLE_BASE=/sv03
export ORACLE_HOME=/sv03/sw/oracle/rptqdb/9.2.0
export ORACLE_SID=rptq1
export ORACLE_TERM=vt100
unset ORAENV_ASK; export ORAENV_ASK
export ORA_NLS32=/sv03/sw/oracle/rptqdb/9.2.0/ocommon/nls/admin/data
export ORA_NLS33=/sv03/sw/oracle/rptqdb/9.2.0/ocommon/nls/admin/data
export ORA_NLS=/sv03/sw/oracle/rptqdb/9.2.0/ocommon/nls/admin/data
export PATH=/sv03/sw/oracle/rptqdb/9.2.0/bin:/bin:/usr/bin:/usr/sbin:/usr/ccs/bin:/usr/ucb:/opt/local/bin:/opt/hpnp/bin:/usr/local/bin:/usr/openwin/bin:/dba/bin:/dba/sbin:.
export TNS_ADMIN=/sv03/sw/oracle/rptqdb/9.2.0/network/admin
export UDUMP=/sv03/oracle/admin/rptq/udump
export PFILE=/sv03/oracle/admin/rptq/pfile
export BDUMP=/sv03/oracle/admin/rptq/bdump

If you need to automate the backup and the scripts run from cron, then you need to source the ORACLE_SID from somewhere, either from a file or from the oratab. The oratab entry for a 9i RAC instance will have the instance name and not the database name. For example, if we are initiating the backup from the primary node, then the oratab of the primary node will have:

rptq1:/sv03/sw/oracle/rptqdb/9.2.0:N

The secondary (second) node will have rptq2 in its oratab.

Manual Backup of data files

The backup of the database is initiated from the primary node or primary instance. We don't have to set up or initiate anything from the secondary node or any other nodes pointing to the database. Log in to the primary instance as a user who has the privilege to back up the datafiles and initiate the backup commands:

• Alter tablespace tablespace_name begin backup;
• Copy the datafiles to the backup directory/server.
• Alter tablespace tablespace_name end backup;

We need to do this for all the tablespaces. If you are using the shadow copy concept as we do for large databases, then we put all the tablespaces into backup mode at once and then break the mirror disk. We then put all the tablespaces into end backup mode. The mirror copy goes to tape backup offline.

Manual Backup of archive logs

The archive logs that are generated at the time of backup should get into the backup set for a meaningful recovery. In case of disaster, when we need to recover to the last backup, the last few archive logs are required to bring the database to a consistent state for a cancel-based recovery. In a RAC environment we set up the archive format to include a %t.
The %t identifies which thread, i.e. which instance, generated the archive log:

log_archive_format = arch%s_%t.arc

Traditionally we force a log switch before and after a backup to get the timestamps onto the data files and to push the last few logs to be archived, so they get onto tape as part of the backup set. We normally execute:

Alter system checkpoint;
Alter system switch logfile;

In a RAC environment, the switch logfile command only switches the logfile of a single instance. We have to remember that there are archive logs generated by the other instances too, and those archives need to be part of the backup set for any meaningful recovery. To force a log switch and push the current logs to archive for all the instances, we execute:

Alter system archive log current;

Once these archive logs are pushed and visible, they are compressed/moved to the backup archive log directory and become part of the backup. If you have set up shadow copy, the backup archive log directory can also be synced and mirrored; it then becomes part of the backup set when the mirror is broken after the data file backup.

Manual backup of server config file

There is a server config file that stores all the database and instance information when the RAC setup is created. The name of the file is srvm.dbf. There is also a file in /var/opt/oracle called srvConfig.loc:

oracle ljcqs034:=> pwd
/var/opt/oracle
oracle ljcqs034:=> ls -ltr
total 8
-rw-r--r--   1 oracle   dba           47 Aug 18 16:47 srvConfig.loc
-rw-r--r--   1 oracle   dba          123 Aug 26 13:18 oraInst.loc
-rw-rw-r--   1 oracle   other        812 Aug 29 18:35 oratab

The srvConfig.loc file contains a pointer to the location of srvm.dbf:

srvconfig_loc=/sv00/db00/oradata/rptq/srvm.dbf

We have to make sure that this file is in the Oracle dbf file directory location, so that it gets backed up periodically. In case you are not shadow copying your data files, you need to back this file up as part of your backup procedure.
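The begin backup / copy / end backup steps, together with the forced archiving of every instance's current log, can be generated as a SQL script by a small shell function. This is only a sketch: in practice the tablespace list would be pulled from dba_tablespaces via sqlplus, and the emitted script fed back into sqlplus; here the names are passed in as arguments so the generator itself can be exercised standalone.

```shell
# Emit the SQL for a manual hot backup of the named tablespaces.
# In a real run the list would come from something like:
#   select tablespace_name from dba_tablespaces;

gen_backup_sql() {
    for ts in "$@"; do
        echo "alter tablespace $ts begin backup;"
    done
    echo "-- copy the datafiles (or split the mirror) at this point"
    for ts in "$@"; do
        echo "alter tablespace $ts end backup;"
    done
    # force every instance's current log out to the archive destination
    echo "alter system archive log current;"
}

gen_backup_sql SYSTEM USERS TESTRAC > hot_backup.sql
```

Because all the begin backup statements come first, this matches the shadow-copy approach of putting every tablespace into backup mode at once before breaking the mirror.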
(You can always work around the loss of this file by recreating the RAC setup once again using the srvctl commands.)

Recovery from a lost data file

Let's simulate the loss of a data file and try to recover it in a RAC environment.

• Create a tablespace

SQL> create tablespace testrac
     datafile '/Ora_base/db11/oradata/drac/testrac_01.dbf' size 100M
     extent management local uniform size 4M
     segment space management auto;

Tablespace created.

SQL> alter tablespace testrac add datafile
     '/Ora_base/db11/oradata/drac/testrac_02.dbf' size 100M;

Tablespace altered.

SQL> select file_name,bytes from dba_data_files
     where tablespace_name like 'TESTRAC';

FILE_NAME                                         BYTES
-------------------------------------------- ----------
/Ora_base/db11/oradata/drac/testrac_01.dbf    104857600
/Ora_base/db11/oradata/drac/testrac_02.dbf    104857600

SQL> alter user sxgovind default tablespace testrac;

User altered.

SQL> connect sxgovind

• Create some tables and load data

SQL> create table ruby as select * from dba_objects;

Table created.

SQL> create table hammerhead as select * from dba_tables;

Table created.

SQL> select segment_name,segment_type,tablespace_name from dba_segments
     where tablespace_name like 'TESTRAC';

SEGMENT_NAME    SEGMENT_TYPE    TABLESPACE_NAME
--------------- --------------- ---------------
RUBY            TABLE           TESTRAC
HAMMERHEAD      TABLE           TESTRAC

• Backup the tablespace

SQL> alter tablespace TESTRAC begin backup;

SQL> select d.name,b.status,b.time from v$datafile d,v$backup b
     where d.file#=b.file# and b.status = 'ACTIVE';

NAME                                         STATUS    TIME
-------------------------------------------- --------- ---------
/Ora_base/db11/oradata/drac/testrac_01.dbf   ACTIVE    22-MAY-03
/Ora_base/db11/oradata/drac/testrac_02.dbf   ACTIVE    22-MAY-03

oracle ljcqs097:=> cp testrac_01.dbf $HOME

SQL> alter tablespace testrac end backup;

Tablespace altered.
• Remove the data file associated with the tablespace

oracle ljcqs097:=> rm testrac_01.dbf

SQL> select file_name,bytes from dba_data_files where tablespace_name like 'TESTRAC';
select file_name,bytes from dba_data_files where tablespace_name like 'TESTRAC'
*
ERROR at line 1:
ORA-01116: error in opening database file 64
ORA-01110: data file 64: '/Ora_base/db11/oradata/drac/testrac_01.dbf'
ORA-27041: unable to open file
SVR4 Error: 2: No such file or directory
Additional information: 3

• Recover the data file associated with the tablespace

SQL> alter database datafile '/Ora_base/db11/oradata/drac/testrac_01.dbf' offline;

Database altered.

oracle ljcqs098:=> cp testrac_01.dbf /Ora_base/db11/oradata/drac

SQL> alter database recover datafile '/Ora_base/db11/oradata/drac/testrac_01.dbf';
alter database recover datafile '/Ora_base/db11/oradata/drac/testrac_01.dbf'
*
ERROR at line 1:
ORA-00279: change 1936203896230 generated at 05/22/2003 14:01:09 needed for thread 2
ORA-00289: suggestion : /shared/arch/oradata/drac/arch/arch_2_12.arc
ORA-00280: change 1936203896230 for thread 2 is in sequence #12

SQL> ALTER DATABASE RECOVER CANCEL;

SQL> recover datafile '/Ora_base/db11/oradata/drac/testrac_01.dbf';
ORA-00279: change 1936203896230 generated at 05/22/2003 14:01:09 needed for thread 2
ORA-00289: suggestion : /shared/arch/oradata/drac/arch/arch_2_12.arc
ORA-00280: change 1936203896230 for thread 2 is in sequence #12

Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
AUTO
ORA-00279: change 1936203896361 generated at 05/22/2003 14:06:13 needed for thread 2
ORA-00289: suggestion : /shared/arch/oradata/drac/arch/arch_2_13.arc
ORA-00280: change 1936203896361 for thread 2 is in sequence #13
ORA-00278: log file '/shared/arch/oradata/drac/arch/arch_2_12.arc' no longer needed for this recovery

Log applied.
Media recovery complete.

SQL> alter database datafile '/Ora_base/db11/oradata/drac/testrac_01.dbf' online;

Database altered.
SQL> select file_name,bytes from dba_data_files where tablespace_name like 'TESTRAC';

FILE_NAME                                         BYTES
-------------------------------------------- ----------
/Ora_base/db11/oradata/drac/testrac_01.dbf    104857600
/Ora_base/db11/oradata/drac/testrac_02.dbf    104857600

Done.

Recovery from a Disaster

When we recover a lost data file, we recover it online, with both instances of the RAC environment up and running. We only take the data file offline and recover it from a backup, the same way we would in a non-RAC setup. Let's see what happens at the extreme end, when we have a disaster and need to bring the last backup from tape and recover to a point in time, or apply the available archive logs and do a cancel-based recovery. We need to remember that we cannot bring both instances of the database up: we mount the database from a single instance and then initiate the recovery. Once the recovery is complete, we can bring all the other nodes up. We don't have to make any changes to the environment or the parameter files. Verify that the srvm.dbf file exists in its location and was not overwritten by the copy-over of data files. Check node visibility and start the GSD daemon to check that the RAC config is okay:

oracle ljcqs034:=> lsnodes
ljcqs034
ljcqs035

oracle ljcqs034:=> gsdctl stat
GSD is not running on the local node
oracle ljcqs035:=> gsdctl stat
GSD is not running on the local node
oracle ljcqs034:=> gsdctl start
Failed to start GSD on local node

We have to make sure that the server is configured correctly and the GSD daemon is up and running. In case the srvm.dbf file is lost and GSD does not come up, recreate the RAC configuration as explained below in the "Recovery from loss of srvConfig information" section of this note. Once the GSD daemon is up and running, we start up the database in single instance mode.
Log in to the primary node and verify the ORACLE_SID:

oracle ljcqs034:=> echo $ORACLE_SID
rptq1
oracle ljcqs032:=> sqlplus /nolog

SQL*Plus: Release 9.2.0.3.0 - Production on Wed Oct 8 09:24:28 2003
Copyright (c) 1982, 2002, Oracle Corporation.  All rights reserved.

SQL> connect / as sysdba
Connected to an idle instance.
SQL> startup mount;
SQL> recover database using backup controlfile until cancel;
ORA-00279: change 1937810322614 generated at 08/16/2003 12:29:16 needed for thread 1
ORA-00289: suggestion : /sv04/data/arch/rptq/arch703203_1.arc
ORA-00280: change 1937810322614 for thread 1 is in sequence #703203

Specify log: {<RET>=suggested | filename | AUTO | CANCEL}

ORA-00279: change 1937810322614 generated at 08/16/2003 12:29:16 needed for thread 2

It does not suggest which archive log file to apply for thread 2. You need to choose the latest one that matches the timestamp of the one applied for thread 1 and start the apply from there.

Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
/sv04/data/arch/rptq/arch3636_2.arc
ORA-00279: change 1937810325033 generated at 08/16/2003 12:35:50 needed for thread 1
ORA-00289: suggestion : /sv04/data/arch/rptq/arch703204_1.arc
ORA-00280: change 1937810325033 for thread 1 is in sequence #703204
ORA-00278: log file '/sv04/data/arch/rptq/arch703203_1.arc' no longer needed for this recovery

Specify log: {<RET>=suggested | filename | AUTO | CANCEL}

This time it suggests the file required for thread 2. Only the first prompt seems to be an issue; once the first correct archive log for thread 2 has been applied, it prompts for the rest.
ORA-00279: change 1937810464267 generated at 08/16/2003 13:07:38 needed for thread 2
ORA-00289: suggestion : /sv04/data/arch/rptq/arch3637_2.arc
ORA-00280: change 1937810464267 for thread 2 is in sequence #3637
ORA-00278: log file '/sv04/data/arch/rptq/arch3636_2.arc' no longer needed for this recovery

Specify log: {<RET>=suggested | filename | AUTO | CANCEL}

ORA-00279: change 1937810500912 generated at 08/16/2003 13:12:27 needed for thread 2
ORA-00289: suggestion : /sv04/data/arch/rptq/arch3638_2.arc
ORA-00280: change 1937810500912 for thread 2 is in sequence #3638
ORA-00278: log file '/sv04/data/arch/rptq/arch3637_2.arc' no longer needed for this recovery

Specify log: {<RET>=suggested | filename | AUTO | CANCEL}

ORA-00279: change 1937810531734 generated at 08/16/2003 13:15:44 needed for thread 2
ORA-00289: suggestion : /sv04/data/arch/rptq/arch3639_2.arc
ORA-00280: change 1937810531734 for thread 2 is in sequence #3639
ORA-00278: log file '/sv04/data/arch/rptq/arch3638_2.arc' no longer needed for this recovery

Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
CANCEL
Media recovery cancelled.

SQL> alter database open resetlogs;

Database altered.

If your database is set up to use tempfile temporary tablespaces, you need to recreate them:

ALTER TABLESPACE TEMP ADD TEMPFILE '/sv00/db13/oradata/rptq/temp_01.dbf'
  SIZE 2044M REUSE AUTOEXTEND OFF;
ALTER TABLESPACE TEMP ADD TEMPFILE '/sv00/db13/oradata/rptq/temp_02.dbf'
  SIZE 2044M REUSE AUTOEXTEND OFF;
ALTER TABLESPACE TEMP ADD TEMPFILE '/sv00/db13/oradata/rptq/temp_03.dbf'
  SIZE 2044M REUSE AUTOEXTEND OFF;
ALTER TABLESPACE TEMP ADD TEMPFILE '/sv00/db13/oradata/rptq/temp_04.dbf'
  SIZE 2044M REUSE AUTOEXTEND OFF;
ALTER TABLESPACE TEMP ADD TEMPFILE '/sv00/db13/oradata/rptq/temp_05.dbf'
  SIZE 2044M REUSE AUTOEXTEND OFF;

Recovery from loss of srvConfig information

If you lose the srvm.dbf file and the GSD daemon does not come up, it is time to recreate srvm.dbf by recreating the server configuration.
oracle ljcqs098:=> which gsd
/shared/oracle/product/9.2.0/bin/gsd
oracle ljcqs098:=> cd $ORACLE_HOME
oracle ljcqs032:=> gsdctl start
Failed to start GSD on local node

The following command wipes out all the previous information that existed in the srvm.dbf file. Once you have set up the environment, you should not execute it again.

oracle ljcqs032:=> srvconfig -init -f
oracle.ops.mgmt.rawdevice.RawDeviceException: PRKR-1025 : file /var/opt/oracle/srvConfig.loc does not contain property srvconfig_loc
        at java.lang.Throwable.<init>(Compiled Code)
        at java.lang.Exception.<init>(Compiled Code)
        at oracle.ops.mgmt.rawdevice.RawDeviceException.<init>(Compiled Code)
        at oracle.ops.mgmt.rawdevice.RawDeviceUtil.getDeviceName(Compiled Code)
        at oracle.ops.mgmt.rawdevice.RawDeviceUtil.<init>(Compiled Code)
        at oracle.ops.mgmt.rawdevice.RawDeviceUtil.main(Compiled Code)

The srvm.dbf file was not part of the backup and hence was not recovered. The file does not exist, and the srvconfig command points that out. The workaround is to create a new file and update /var/opt/oracle/srvConfig.loc with the new location of the srvm.dbf file:

oracle ljcqs032:=> touch /sv00/db00/oradata/rpt1/srvm.dbf
oracle ljcqs032:=> chmod 755 srvm.dbf
oracle ljcqs032:=> cd /var/opt/oracle
oracle ljcqs032:=> vi srvConfig.loc

and add this line and save:

srvconfig_loc=/sv00/db00/oradata/rpt1/srvm.dbf

Now start the GSD daemon and then add the database and instance information to the srvm.dbf file:

oracle ljcqs098:=> gsdctl start
Successfully started GSD on local node
oracle ljcqs097:=> srvctl add database -d rptq -o /shared/oracle/product/9.2.0
oracle ljcqs098:=> srvctl add instance -d rptq -i rptq1 -n ljcqs097
oracle ljcqs098:=> srvctl add instance -d rptq -i rptq2 -n ljcqs098

Check that the configuration has been set up correctly.
oracle ljcqs097:=> srvctl config
rptq
oracle ljcqs097:=> srvctl config database -d rptq
ljcqs097 rptq1 /shared/oracle/product/9.2.0
ljcqs098 rptq2 /shared/oracle/product/9.2.0

Shankar Govindan works as a Sr. Oracle DBA at CNF Inc., Portland, Oregon. He is Oracle Certified in 7, 8, and 8i; you can contact him at [email protected].

Note: The above information, as usual, reflects my individual tests and opinions and has nothing to do with the company I work for or represent.