My VIP Cards v2.0

16/12/2012

My VIP Cards

Using MyVIPCards is one of the easiest ways to gather all your loyalty cards, driver's licence, library card, health insurance cards, medical care cards, business cards, and any other important paper cards, store them in one place, and show any of them instantly with a single tap.

Main features include:
1) Easily add a card, including its name, a comment and a phone number
2) Quickly run a full-text search across all cards by name, comment or phone number in one go
3) Instantly preview card details
4) Instantly make a phone call from a card
5) Instantly email a card to your partner or friend, especially when they forget theirs
6) Easily delete cards that you no longer need
7) Easily add an existing card photo from your album, or take a new photo
8) Safe storage on your phone: nothing is exposed or synchronised to the public internet

Navigation Tips:
1) Press and hold a card on the main screen to quickly preview its details
2) Swipe left or right in the card detail view to move to the previous or next card
3) Swipe up or down in the card detail view to return to the main screen

View in App Store

Some Screenshots:

MyVIPCards_01

MyVIPCards_02

MyVIPCards_03

MyVIPCards_04

MyVIPCards_05

Instahist v2.2

23/07/2012

Instahist is a fun educational encyclopedia that lets you learn about and enjoy memorable events from world history.

From wars and revolutions to famous people, Instahist covers it all. Read about major events that happened on this day in history; Instahist gives you a handy reference for your knowledge and study.

Main features include:

  1. User-friendly navigation interface, like reading a book
  2. Historical events covering 5 continents, with over 100,000 entries across every day of the year, including famous birthdays and deaths
  3. Integration with the built-in map helps you explore the world of history from different angles
  4. OFFLINE knowledge base: NO internet access is required, so every drop of historical knowledge is instantly at your fingertips
  5. NO registration required; start your world history journey in no time
  6. Full-text search: anyone, anywhere, any time in history is just around the corner
  7. Share history with your friends, and be the first among them to discover the history of the day
  8. Add-to-calendar function keeps you the first to know about the history you care about

Instructions:

  1. Swipe UP or DOWN on the Day to change the day
  2. Swipe UP or DOWN on the Month to change the month
  3. TAP the map to open the map view
  4. Swipe from RIGHT to LEFT to go back to the previous page

View in App Store

Here are some screenshots for your reference:

Autobiography of Alvin – Preface

21/11/2011

Here is the hello from Alvin:

I will let him draw the rest of his own future. To be continued…

A lovely day to all of you!

Oracle Database Patching Blocked by AIX Library File Locking Error

21/10/2011

Recently I encountered a case of Oracle Database 11g patching being blocked by a locked file on AIX.

Here is the story.

Background:

The issue: files were locked and opatch was unable to copy them, even though the databases, listeners and all other Oracle processes associated with the ORACLE_HOME being patched had been stopped:

Patching component oracle.ldap.rsf.ic, 11.1.0.7.0…

Copying file to "/ora01/app/oracle/product/11107xxxxxx/lib/libnnz11.so"

Copy failed from '/ora01/app/oracle/product/11107XXXXXX/OPatch/12419384/files/lib/libnnz11.so' to '/ora01/app/oracle/product/11107XXXXXX/lib/libnnz11.so'…

Please verify all applications associated with the Oracle Home '/ora01/app/oracle/product/11107XXXXXX' are shut down.

Issue:

OPatch was unable to copy the library file libnnz11.so.

 

Solution:

1) Tried many other ways to fix the issue, including rolling back and re-applying the patch, but all failed except the method below.

2) Checked the AIX system cache: the file still appeared in memory even though no related process existed; Oracle (or AIX) had not released it from the OS cache.

3) Realized that this server is an IBM 64-bit POWER6 machine, which matches the symptoms in Oracle Note ID 739963.1.

4) Manually renamed the locked file in the target path, then copied the patched file into the target path.

5) Entered 'Y' to instruct OPatch to continue the rest of the process.

6) All remaining steps completed successfully.

Comment:

1) According to Oracle's note, this error applies to IBM AIX 64-bit POWER systems.

2) Another way to overcome this error, suggested by Oracle, is to use slibclean to clean the OS system cache; this requires root permission, which DBAs usually do not have.
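The manual workaround in steps 4 and 5 can be sketched in a few lines (a minimal illustration only; the function name and paths are mine, not from the actual patch session):

```python
import os
import shutil

def replace_locked_library(patched_file: str, target_file: str) -> None:
    """Sketch of the rename-then-copy workaround for a library file that
    the OS cache still holds. Illustrative only, not the actual opatch fix."""
    # Rename the locked file aside; the rename succeeds even while the
    # old inode is still mapped in the OS cache.
    os.rename(target_file, target_file + ".locked")
    # Copy the patched library into the now-free target path.
    shutil.copy2(patched_file, target_file)
```

After the copy, answering 'Y' lets OPatch continue with the remaining steps.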

NLS_LENGTH_SEMANTICS Affects BYTE CHAR Length Semantics

14/09/2011

Recently I worked on a case involving BYTE versus CHAR length semantics. After performing root cause analysis, we found that the victim was Oracle SQL Developer, and the murderer was an Oracle parameter: NLS_LENGTH_SEMANTICS.

Here is the detail of the story.

Background:

1) The customer requested the creation of some partitioned tables, as usual.

2) The DBA created the tables according to the customer's request.

3) The customer came back with application error logs indicating that the data in those tables was incorrect: values had different lengths than expected.

4) The DBA checked the tables; the data displayed on screen was correct, yet the length function returned abnormal results on the application side.

Analysis:

1) Confirmed with the DBA who deployed the partitioned tables that they were deployed with plain SQL scripts in SQL Developer.

2) The customer and the DBA are in different countries, with different NLS settings in their tools and applications.

3) Re-deployed the tables in SQL*Plus: everything worked perfectly.

4) For testing, re-deployed the tables in Oracle SQL Developer: the application raised the same error.

5) Checked all NLS settings in the database and the support tools, and observed that the parameter NLS_LENGTH_SEMANTICS had different values: one side used the Oracle default, BYTE, while the other used the non-default value, CHAR.

6) The DBA adjusted the NLS settings in Oracle SQL Developer, in particular setting NLS_LENGTH_SEMANTICS to CHAR to match the customer's environment.

7) Recreated the tables via SQL*Plus and SQL Developer; everything worked perfectly.

 

Pains and Gains

1) The Oracle database documentation has clear information on the parameter NLS_LENGTH_SEMANTICS, whose default value is 'BYTE'.

2) From the documentation:

Parameter type: String
Syntax: NLS_LENGTH_SEMANTICS = string (example: NLS_LENGTH_SEMANTICS = 'CHAR')
Default value: BYTE
Modifiable: ALTER SESSION
Range of values: BYTE | CHAR

NLS_LENGTH_SEMANTICS enables you to create CHAR and VARCHAR2 columns using either byte or character length semantics. Existing columns are not affected.

NCHAR, NVARCHAR2, CLOB, and NCLOB columns are always character-based. You may be required to use byte semantics in order to maintain compatibility with existing applications.

NLS_LENGTH_SEMANTICS does not apply to tables in the SYS schema.
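Why the two settings behave differently is easy to see outside the database: with BYTE semantics a VARCHAR2 limit counts encoded bytes, with CHAR semantics it counts characters. A rough Python analogy (not Oracle code; the function is mine):

```python
def fits_varchar2(value: str, limit: int, semantics: str = "BYTE") -> bool:
    """Rough analogy of a VARCHAR2(limit) length check under the two
    NLS_LENGTH_SEMANTICS settings. Illustrative only."""
    if semantics == "BYTE":
        # BYTE semantics: the declared limit counts encoded bytes.
        return len(value.encode("utf-8")) <= limit
    # CHAR semantics: the declared limit counts characters.
    return len(value) <= limit

# A 3-character multibyte string occupies 9 bytes in UTF-8, so it fits
# a limit of 5 under CHAR semantics but not under BYTE semantics.
```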

In general, as many DBAs say, Oracle SQL Developer and other third-party tools are flexible and convenient helpers for developers and DBAs in everyday work. However, if we keep in mind the differences between the database side and the application side, our lives will be easier and more relaxed.

Case Sensitive in Oracle Agent 10g Command Line

13/04/2011

The Oracle Agent 10g command line is case sensitive.

The command 'emctl upload agenT' fails, whereas 'emctl upload agent' succeeds.

Make sure you use the right case for the right thing.
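The behavior amounts to an exact, case-sensitive match on the command word; a toy sketch (the set of valid targets is illustrative, not emctl's real parser):

```python
VALID_UPLOAD_TARGETS = {"agent"}  # illustrative; not emctl's actual list

def upload_accepted(target: str) -> bool:
    """Case-sensitive exact match, as the emctl behavior above suggests."""
    return target in VALID_UPLOAD_TARGETS
```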

Thinking in ORA-494, ORA-239 Instance Crashes and Hangs

04/02/2011

I am sharing some thoughts on troubleshooting Oracle database or instance crashes caused by ORA-00494 or ORA-00239 errors in different scenarios. These crashes can occur on any OS platform running Oracle Server Enterprise Edition from version 10.2.0.4 to 11.1.0.7.

Note:
• The term INCIDENT in the following sections refers to an incident caused by an ORA-494 or ORA-239 error.

1.1 Root Cause Analysis

This kind of database or instance crash can occur in different services running on the Oracle RDBMS under different scenarios. It terminates Oracle database processes and leaves the database in an outage state, which disrupts customers' activities and negatively impacts the SLA for database support.

Root cause analysis is necessary to prevent similar incidents from re-occurring in the future.

1.1.1 Error Messages

Two Oracle error messages could be found for this kind of incident:

  • ORA-00239

The blocker process is killed by ITSELF; for example, the LGWR process terminates LGWR itself.

  • ORA-00494

The blocker process is killed by ANOTHER process; for example, the LGWR process terminates the CKPT process.

Note:

  • In some database hang scenarios, no Oracle error message can be found in the ALERT log file.

 

1.1.2 Symptoms

Two kinds of symptoms can be found on the problematic Oracle database server when this incident occurs:

  • Instance crashes
  • Database hangs

 

1) Instance crashes

The instance crashes because a background or foreground process is terminated by the Oracle database server.

Error messages indicating this symptom can be found in the corresponding ALERT log file. If any of the key Oracle background processes is terminated, the instance crashes and future client/server connections to the database are refused.

Here is an ORA-494 sample from a cited ALERT log file:

Sat Sep 25 08:06:57 2010

LGWR started with pid=13, OS id=1464

Sat Sep 25 08:06:57 2010

CKPT started with pid=14, OS id=2548

Wed Oct 06 16:52:01 2010

Archived Log entry 17453 added for thread 1 sequence 34734 ID 0xb691ccee dest 1:

Wed Oct 06 16:52:50 2010

Thread 1 advanced to log sequence 34736 (LGWR switch)

  Current log# 2 seq# 34736 mem# 0: L:\ORACLE\REDOLOG\DWHPROD\REDO02A.RDO

  Current log# 2 seq# 34736 mem# 1: D:\ORACLE\REDOLOG\DWHPROD\REDO02B.RDO

Wed Oct 06 16:53:10 2010

Archived Log entry 17454 added for thread 1 sequence 34735 ID 0xb691ccee dest 1:

Wed Oct 06 16:53:56 2010

Thread 1 advanced to log sequence 34737 (LGWR switch)

  Current log# 3 seq# 34737 mem# 0: L:\ORACLE\REDOLOG\DWHPROD\REDO03A.RDO

  Current log# 3 seq# 34737 mem# 1: D:\ORACLE\REDOLOG\DWHPROD\REDO03B.RDO

Wed Oct 06 16:54:17 2010

Archived Log entry 17455 added for thread 1 sequence 34736 ID 0xb691ccee dest 1:

Wed Oct 06 16:55:02 2010

Thread 1 advanced to log sequence 34738 (LGWR switch)

  Current log# 4 seq# 34738 mem# 0: L:\ORACLE\REDOLOG\DWHPROD\REDO04A.RDO

  Current log# 4 seq# 34738 mem# 1: D:\ORACLE\REDOLOG\DWHPROD\REDO04B.RDO

Wed Oct 06 16:55:22 2010

Archived Log entry 17456 added for thread 1 sequence 34737 ID 0xb691ccee dest 1:

Wed Oct 06 16:56:07 2010

Thread 1 advanced to log sequence 34739 (LGWR switch)

  Current log# 5 seq# 34739 mem# 0: L:\ORACLE\REDOLOG\DWHPROD\REDO05A.RDO

  Current log# 5 seq# 34739 mem# 1: D:\ORACLE\REDOLOG\DWHPROD\REDO05B.RDO

Wed Oct 06 16:56:27 2010

Archived Log entry 17457 added for thread 1 sequence 34738 ID 0xb691ccee dest 1:

Wed Oct 06 16:56:45 2010

Errors in file f:\oracle\diag\rdbms\DWHPROD\DWHPROD\trace\DWHPROD_lgwr_1464.trc (incident=427828):

ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by ‘inst 1, osid 2548′

Incident details in: f:\oracle\diag\rdbms\DWHPROD\DWHPROD\incident\incdir_427828\DWHPROD_lgwr_1464_i427828.trc

Killing enqueue blocker (pid=2548) on resource CF-00000000-00000000

 by killing session 555.1

by terminating the process

LGWR (ospid: 1464): terminating the instance due to error 2103 

Here is an ORA-239 sample from a cited ALERT log file:

Errors in file f:\oracle\diag\rdbms\DWHPROD\DWHPROD\trace\DWHPROD_lgwr_1004.trc  (incident=448308):

ORA-00239: timeout waiting for control file enqueue: held by ‘inst 1, osid 4604′ for more than 900 seconds

Incident details in: f:\oracle\diag\rdbms\DWHPROD\DWHPROD\incident\incdir_448308\DWHPROD_lgwr_1004_i448308.trc

opidrv aborting process LGWR ospid (2628_1004) due to error ORA-603

Wed Nov 03 00:18:52 2010

Errors in file f:\oracle\diag\rdbms\DWHPROD\DWHPROD\trace\DWHPROD_pmon_844.trc:

ORA-00470: LGWR process terminated with error

PMON (ospid: 844): terminating the instance due to error 470

Wed Nov 03 00:18:52 2010

Errors in file f:\oracle\diag\rdbms\DWHPROD\DWHPROD\trace\DWHPROD_j000_9792.trc:

ORA-00470: LGWR process terminated with error

Wed Nov 03 00:18:53 2010

Errors in file f:\oracle\diag\rdbms\DWHPROD\DWHPROD\trace\DWHPROD_q001_5764.trc:

ORA-00470: LGWR process terminated with error

Instance terminated by PMON, pid = 844

2) Database hangs

The database hangs because system resources are busy, and it does not respond for a long period of time.

No explicit error message for this symptom appears in the corresponding ALERT log file, but the log shows that the database has not performed its expected redo log switching for a long period. The database remains in a degraded state, and future client/server connections may be accepted but will not be established successfully.

Here is a sample from a cited ALERT log file:

Tue Oct 26 03:45:03 2010

Thread 1 cannot allocate new log, sequence 35778

Private strand flush not complete

  Current log# 5 seq# 35777 mem# 0: L:\ORACLE\REDOLOG\DWHPROD\REDO05A.RDO

  Current log# 5 seq# 35777 mem# 1: D:\ORACLE\REDOLOG\DWHPROD\REDO05B.RDO

Thread 1 advanced to log sequence 35778 (LGWR switch)

  Current log# 6 seq# 35778 mem# 0: L:\ORACLE\REDOLOG\DWHPROD\REDO06A.RDO

  Current log# 6 seq# 35778 mem# 1: D:\ORACLE\REDOLOG\DWHPROD\REDO06B.RDO

Tue Oct 26 03:45:06 2010

Archived Log entry 18496 added for thread 1 sequence 35777 ID 0xb691ccee dest 1:

Tue Oct 26 10:02:38 2010

Starting ORACLE instance (normal)

1.1.3 OEM Monitor

OEM Grid Control infrastructure monitors Oracle database and instance availability and other system health statistics.

Due to the different symptoms mentioned above, OEM may not be able to detect or report the underlying incident in time.

1)      Instance crashes

OEM is able to detect this incident via the rejected-connection return code from the crashed instance. An OEM alarm is generated once the OEM Agent fails to connect to the instance after the crash, and an alarm-related Support Request is assigned to the database support queue according to the level of impact.

Here is a sample of the OEM alarm:

OEM: DWHPROD: Failed to connect to database instance: ORA-01034: ORACLE not available

2)      Database hangs

OEM may not be able to detect this incident in time, because the OEM Agent's connection to the database may itself hang for a long period. One or more OEM alarms are generated after the client/server connection timeout mechanism kicks in, and one or more alarm-related Support Requests are assigned to the database support queue according to the level of impact.

Here is a sample of the OEM alarm:

OEM: DWHPROD: Failed to connect to database instance: ORA-01034: ORACLE not available

Alarms for the listener or other targets monitored by OEM may also be generated, depending on the level of degradation:

OEM: pmichlaudwh28.PMINTL.NET: The listener is down: TNS-12571: TNS: packet write failure

1.1.4 Root Cause

The incident caused by the ORA-00494 error is identified as an Oracle database server bug published on the Oracle Support website:

Bug 7692631 – DATABASE CRASHES WITH ORA-494 AFTER UPGRADE TO 10.2.0.4

The root cause of this bug is the Kill Blocker Interface feature introduced in Oracle Database 10.2.0.4. As this is a proactive mechanism to prevent an instance from entering a cluster-wide hang state, Oracle Support ultimately relates this incident to the unpublished Bug 7914003 – 'KILL BLOCKER AFTER ORA-494 LEADS TO FATAL BG PROCESS BEING KILLED'.

The following root cause analysis explains the full picture of the incident for each of the two symptoms above.

1.1.4.1 Instance crashes

The first entry point for diagnosing this incident is located clearly in the ALERT log file, with further trace file and dump information.

Here is the related trace file for background process LGWR:

*** 2010-10-06 16:31:47.336

Warning: log write time 550ms, size 1KB

*** 2010-10-06 16:32:13.398

Warning: log write time 1260ms, size 4KB

*** 2010-10-06 16:32:38.883

Warning: log write time 1550ms, size 1KB

*** 2010-10-06 16:51:25.329

Warning: log write time 500ms, size 848KB

*** 2010-10-06 16:56:45.686

Unable to get enqueue on resource CF-00000000-00000000 (ges mode req=4 held=6)

Possible local blocker ospid=2548 sid=555 sser=1 time_held=1286377005 secs (ges mode req=6 held=4)

DUMP LOCAL BLOCKER: initiate state dump for KILL BLOCKER

  possible owner[14.2548] on resource CF-00000000-00000000

Dumping process info of pid[14.2548] requested by pid[13.1464]

Incident 427828 created, dump file: f:\oracle\diag\rdbms\DWHPROD\DWHPROD\incident\incdir_427828\DWHPROD_lgwr_1464_i427828.trc

ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by ‘inst 1, osid 2548′

Killing enqueue blocker (pid=2548) on resource CF-00000000-00000000

 by killing session 555.1

Kill session 555.1 failed with status 29

Killing enqueue blocker (pid=2548) on resource CF-00000000-00000000

 by terminating the process

Killing fatal process ospid 1752161928

Issue instance termination

Combining the ALERT log file, the trace file and the dumped object states, the Five Ws (WHO, WHAT, WHEN, WHY, HOW) related to this incident can be addressed step by step as follows.

  • WHO

The key information is the Resource Owner and the Resource Requestor, whose behavior eventually leads the system to a crash.

 

1) Resource Owner

This shows who is holding the contended resource before the resource request is sent by another process. In this case, the Oracle CKPT process is the resource owner; its Oracle PID is 000E in hexadecimal (14 in decimal), as identified in the additional trace file.

Here is the valuable information in the additional trace file:

SO: 0x0000000474C98690, type: 7, owner: 0x0000000474AC49A0, flag: INIT/-/-/0x00 if: 0x1c: 0x1

               proc=0x00000004686FD9E0, name=enqueue, file=ksq1.h LINE:234, pg=0

              (enqueue) CF-00000000-00000000 DID: 0000-000E-00000003

              lv: 35 b2 a5 00 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x2

              mode: SS, lock_flag: 0x10, lock: 0x0000000474C986E8, res: 0x0000000474E08448

              own: 0x0000000474AAD770, sess: 0x0000000474AAD770, proc: 0x00000004686FD9E0, prv: 0x0000000474E08458

2) Resource Requestor

This shows who is requesting the contended resource while being blocked by another process. In this case, the Oracle LGWR process is the resource requestor, as identified in the ALERT log file.

Note: Depending on system activity, the resource owner and requestor could be other Oracle background or foreground processes, e.g. an ARCn process, an RMAN backup job, an application process, etc.

  • WHAT

The key information is the Contended Resource, which is the 'bottleneck' in the resource allocation flow.

1) Contended Resource

This shows which database resource is being acquired and blocked by another process, eventually causing the contention. In this case, CF-00000000-00000000 is the contended resource: an enqueue on the Oracle control file resource, identified in the additional trace file.

  • WHEN

The key information is the date and time the incident happened, which can be used to measure the SLA afterwards. This information can be found on the Oracle database side (ALERT log file and trace file) or in the OEM alarm; in some cases it can also be found in backup reports or application logs. In this case, the instance crashed at the following time, shown in the ALERT log file:

Wed Oct 06 16:56:45 2010

Errors in file f:\oracle\diag\rdbms\DWHPROD\DWHPROD\trace\DWHPROD_lgwr_1464.trc

  • WHY

Different symptoms can be caused by the same root cause across different scenarios. Finding out the reason is significant for relating different Support Requests to the same underlying instability and for preventing the same incident from re-occurring in the future.

In this incident, during the problematic period, the Oracle CKPT process was updating all the control files and all 34 data files (excluding temporary files), while LGWR was initiating a log switch operation. The database was clearly in the middle of a typical checkpoint, updating control files and data file headers for consistency and durability:

  • CKPT process updates database files header except temporary files
  • CKPT process updates control files
  • LGWR process performs log switch operation
  • LGWR process waits for control file enqueue

 

In this case, LGWR showed slow performance, possibly caused by the I/O subsystem in the storage layer or by resource contention from other background or foreground processes; this is also recorded in the LGWR trace file above, with every single redo log write taking longer than 500ms.

The online redo log switch took place while the checkpoint operation had been holding the dedicated enqueue resource for more than 900 seconds. This triggered the Oracle Kill Blocker mechanism and eventually forced the database server to kill the blocking background process, CKPT. Because this incident happened on Microsoft Windows, where Oracle runs in a multi-threaded architecture rather than the multi-process architecture used on Unix-like platforms, the kill-session behavior escalated to killing the ORACLE.EXE process, and the Oracle instance was terminated unexpectedly, crashing in an inconsistent state.

In short, the interleaved activities of these two processes caused the incident.
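Conceptually, the mechanism behaves like a lock acquisition with a timeout that escalates to killing the holder. The sketch below mimics only that control flow with a plain threading lock; it is in no way Oracle's implementation:

```python
import threading

def acquire_or_kill_blocker(enqueue: threading.Lock, timeout_s: float,
                            kill_holder) -> bool:
    """Mimic of the Kill Blocker control flow: if the enqueue cannot be
    acquired within the timeout (900s for the CF enqueue in the logs
    above), escalate by killing the holder and retrying."""
    if enqueue.acquire(timeout=timeout_s):
        return True            # acquired normally, no escalation
    kill_holder()              # the ORA-00494 path: kill the blocker
    return enqueue.acquire(timeout=timeout_s)
```

In the incident, the requester was LGWR, the holder was CKPT, and on Windows the "kill" escalated to terminating the whole ORACLE.EXE process.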

 

Here is the related trace file for dump information of different object states:

Dump of memory from 0x000000046B1A07B8 to 0x000000046B1A0810

        46B1A07B0                   00000101 00000000          [........]

        46B1A07C0 00000109 00000000 686FD9E0 00000004  [..........oh....]

        46B1A07D0 686FD9E0 00000004 6B1A06A8 00000004  [..oh.......k....]

        46B1A07E0 686FDA40 00000004 00000000 00000000  [@.oh............]

        46B1A07F0 00000000 00000000 00000000 00000000  [................]

          Repeat 1 times

            (FOB) flags=2050 fib=000000047747DE88 incno=0 pending i/o cnt=0

             fname=H:\DWHPROD_DATA2\ORADATA\INDEX_SMB2010_2.DBF

             fno=34 lblksz=16384 fsiz=50808 

        Dump of memory from 0x000000046B1A0688 to 0x000000046B1A06E0

        46B1A0680                   00000101 00000000          [........]

        46B1A0690 00000109 00000000 686FD9E0 00000004  [..........oh....]

        46B1A06A0 686FD9E0 00000004 6B1A0578 00000004  [..oh....x..k....]

        46B1A06B0 6B1A07D8 00000004 00000000 00000000  [...k............]

        46B1A06C0 00000000 00000000 00000000 00000000  [................]

          Repeat 1 times

            (FOB) flags=2050 fib=000000047747DA38 incno=0 pending i/o cnt=0

             fname=H:\DWHPROD_DATA1\ORADATA\INDEX_SMB2010_1.DBF

             fno=33 lblksz=16384 fsiz=50808

        Dump of memory from 0x000000046B1A0558 to 0x000000046B1A05B0

        46B1A0550                   00000101 00000000          [........]

        46B1A0560 00000109 00000000 686FD9E0 00000004  [..........oh....]

        46B1A0570 686FD9E0 00000004 6B1A0448 00000004  [..oh....H..k....]

        46B1A0580 6B1A06A8 00000004 00000000 00000000  [...k............]

        46B1A0590 00000000 00000000 00000000 00000000  [................]

          Repeat 1 times

            (FOB) flags=2050 fib=000000047747D600 incno=0 pending i/o cnt=0

             fname=H:\DWHPROD_DATA2\ORADATA\DATA_SMB2010_2.DBF

             fno=32 lblksz=16384 fsiz=126144

  • HOW

The key information is how the instance crashed. In this incident it is clearly recorded by the Oracle database server in the ALERT log file: the 900-second enqueue resource timeout triggered the KILL BLOCKER interface.

1.1.4.2 Database hangs

The other symptom of this kind of incident is that the database hangs for a long period of time. During the problematic period, connections to the Oracle database server appear to be accepted as normal, and no further log or trace information is recorded on the database server side. The whole database server is left in a degraded state, with all the key processes apparently running normally.

Note:

  • OEM may not be able to detect this potential problem in time.
  • Two or more alarms may be generated for OEM-monitored targets other than the database.

 

  • WHO

In this scenario, this information is not available on Oracle database side. No trace file will directly show this information.

  • WHAT

In this scenario, this information is not available on Oracle database side. No trace file will directly show this information.

  • WHEN

In this scenario, this information is not available on the Oracle database side, and no trace file will directly show it. The downtime can, however, be estimated from the normal log switch interval and the OEM alarms.

  • WHY

There are several approaches to finding the reason for the hang and the related database activities; Statspack, AWR, ASH or particular trace events can be used to reveal the real picture behind the scenes.

In this incident, an AWR report was generated to identify the top wait events. Log file sync and control file sequential read were the two main contributors, together consuming more than 96% of database time.


Further confirming the above cause, the 99.17% I/O WAIT also proves that the system was under high disk I/O pressure, possibly due to physical file system contention or slow hardware in the storage layer.


Furthermore, the ALERT log from before the incident was recognized shows an extremely high redo log switch frequency, with switches every minute. This causes one process to hold the control file enqueue (in one mode or another) while another process requests the same enqueue resource. In this scenario, we can be confident the control file is the contended resource, because both of the above events synchronize at least the Redo Byte Address and SCN information.

This could be caused by a heavy client transaction rate, slow disk I/O, slow network I/O or a storage problem.

  • HOW

The instance is not terminated but remains in a degraded state, and the foreground processes and PGA hold inconsistent information. The instance must be terminated manually and started up again in order to return to normal.

1.2 Solution

There are two categories of solution for this incident, according to the criticality of the circumstances:

  • Solve Incident
  • Prevent Incident

 

1.2.1 Solve Incident

When alarms are generated or support requests are assigned, it is critical to bring the database system back to normal for service access as soon as possible.

The following steps can be followed to troubleshoot this kind of incident and bring the database service back online:

Step 1: Check the ALERT log file and trace files
Step 2: Shut down the Oracle instance cleanly
Step 3: Check the database status in OEM

 

1.2.2 Prevent Incident

According to official Oracle support notes, the ORA-00494 or ORA-00239 error is triggered by the newly introduced Oracle Kill Blocker interface and can happen on any OS platform running Oracle database server Enterprise Edition from 10.2.0.4 to 11g.

1.2.2.1 Risk Matrix

A Risk Matrix should be categorized according to the impact level of the Support Requests and Alarms, and the frequency of this incident, in order to better understand the whole-system impact. A proactive action, for example Change coordination, should then be taken to prevent the incident from re-occurring in the future.

  

1.2.2.2 Change Coordination

To determine whether any other potential issue accompanies the incident, the following points should be confirmed on the incident server before raising a Change to implement the solution:

  • Small redo log member size
  • Few redo log groups
  • Frequent redo log switches
  • Disk I/O statistics via Operation Manager
  • Health status of shared storage device
  • Health status of TSM backup device
  • Whether the server runs in a virtual machine
  • More detailed information can be obtained with assistance from the TSER and Automation teams

 

More related information can be obtained with Oracle utilities on the incident server. The following are two of the most useful utilities for drilling further into the root cause:

  • ADRCI

Instructions for using ADRCI can be obtained from the Oracle Support website.

Here is a sample of invoking ADRCI for further investigation of the incident:

C:\Documents and Settings\user>adrci

ADRCI: Release 11.1.0.7.0 – Production on Sat Oct 30 13:44:54 2010

Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved

ADR base = “f:\oracle”

adrci> show trace -i 427828

Output the results to file: c:\docume~1\user\locals~1\temp\1\utsout_2564_6916_4.ado

  • RDA

Instructions for using RDA can be obtained from the Oracle Support website.

Here is a sample of invoking RDA for further investigation of the incident:

C:\Documents and Settings\user>rda.cmd

———————————————————————–

RDA Data Collection Started 01-Nov-2010 07:17:40

———————————————————————–

Processing Initialization module …

1.2.2.3 Oracle Database Interface

The following Oracle database hidden parameters can be reconfigured to prevent further incidents from the same cause, the Oracle Kill Blocker Interface mechanism:

1)      _kill_controlfile_enqueue_blocker = { TRUE | FALSE }

  • TRUE.   Default value. Enables this mechanism and kills blocker process in CF enqueue.
  • FALSE. Disables this mechanism and no blocker process in CF enqueue will be killed.

 

2)      _kill_enqueue_blocker = { 0 | 1 | 2 | 3 }

  • 0. Disables this mechanism and no foreground or background blocker process in enqueue will be killed.
  • 1. Enables this mechanism and only kills foreground blocker process in enqueue while background process is not affected.
  • 2.  Enables this mechanism and only kills background blocker process in enqueue.
  • 3.  Default value. Enables this mechanism and kills blocker processes in enqueue.

3)      _controlfile_enqueue_timeout = { INTEGER }

  • 900. Default value.
  • 1800. Optimum value to prevent enqueue timeout.

 

Note:

  • The SPFILE should be backed up for rollback purposes before implementing the change.
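As a sketch only, the three parameters above might be set in an init.ora/SPFILE as follows (values taken from the descriptions above; underscore parameters should only ever be changed under Oracle Support guidance):

```
# Disable killing the CF enqueue blocker
_kill_controlfile_enqueue_blocker = FALSE
# Only kill foreground blocker processes; background processes are not affected
_kill_enqueue_blocker = 1
# Raise the control file enqueue timeout from 900 to 1800 seconds
_controlfile_enqueue_timeout = 1800
```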

 

A slight change was made to this Oracle error message between 10g and 11g, as shown below: the keyword 'Potential' was removed in 11g. The better you understand your database system, the more robust your system and service will be.

10g

===

GES: Potential blocker (pid=4840) on resource CF-00000000-00000000;

Killing enqueue blocker (pid=4840) on resource CF-00000000-00000000

11g

===

ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by ‘inst 1, osid 2548′

Killing enqueue blocker (pid=2548) on resource CF-00000000-00000000

2 Reference documents

 

  • Oracle Support Note – Database Crashes With ORA-00494 [ID 753290.1]
  • Oracle Support Note – Disk IO Contention Slow Can Lead to ORA-239 and Instance Crash [ID 1068799.1]
  • Oracle Support Note – ORA-00494 During High Load After 10.2.0.4 Upgrade [ID 779552.1]

Thinking in Resizing Redo Log

14/01/2011

Due to an incident in a critical environment, the following plans were brought to the table for open discussion. Thorough investigation showed that these plans are the only workaround options.

I have extracted the core idea from the real plans so it can be understood in a minute.

1)      Plan A – To ADD more online Redo Log Groups

Currently there are three groups, each with two 50 MB members. After adding three more groups of the same size as the existing ones, there will be six groups, each with two 50 MB members.

Pros:

  • Easy to implement, and no log switching during the configuration is needed.

Cons:

  • There are too many physical files on disk.
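Plan A can be sketched as follows; the group numbers and file paths are examples only:

```sql
-- Add three more groups of the same size as the existing ones (2 members x 50 MB)
ALTER DATABASE ADD LOGFILE GROUP 4
  ('/u01/oradata/ORCL/redo04a.log', '/u02/oradata/ORCL/redo04b.log') SIZE 50M;
ALTER DATABASE ADD LOGFILE GROUP 5
  ('/u01/oradata/ORCL/redo05a.log', '/u02/oradata/ORCL/redo05b.log') SIZE 50M;
ALTER DATABASE ADD LOGFILE GROUP 6
  ('/u01/oradata/ORCL/redo06a.log', '/u02/oradata/ORCL/redo06b.log') SIZE 50M;
```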

 

2)      Plan B – To RESIZE current online Redo Log Members 

Add three new groups with a larger member size, for instance two members of 100 MB each. Then perform log switches and drop the three existing old groups in turn.

Hypothetically, log_checkpoint_* and other settings are modified to the best suited values.

Pros:

  • The architecture (i.e. the 2*3 redo log structure) will be the same as on other systems and compliant with the organization's policy.
  • Reduces the frequency of log switching.

Cons:

  • Negative performance impact during the configuration due to log switching and subsequent checkpoint activity.
  • Instance recovery time increases.
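Plan B can be sketched as follows; the paths and group numbers are examples, and each old group must be INACTIVE before it can be dropped:

```sql
-- Add three replacement groups with larger members (2 members x 100 MB)
ALTER DATABASE ADD LOGFILE GROUP 4
  ('/u01/oradata/ORCL/redo04a.log', '/u02/oradata/ORCL/redo04b.log') SIZE 100M;
ALTER DATABASE ADD LOGFILE GROUP 5
  ('/u01/oradata/ORCL/redo05a.log', '/u02/oradata/ORCL/redo05b.log') SIZE 100M;
ALTER DATABASE ADD LOGFILE GROUP 6
  ('/u01/oradata/ORCL/redo06a.log', '/u02/oradata/ORCL/redo06b.log') SIZE 100M;

-- Cycle the log and force a checkpoint until the old groups go INACTIVE
ALTER SYSTEM SWITCH LOGFILE;
ALTER SYSTEM CHECKPOINT;
SELECT group#, status FROM v$log;

-- Drop each old group only once its status is INACTIVE
ALTER DATABASE DROP LOGFILE GROUP 1;
ALTER DATABASE DROP LOGFILE GROUP 2;
ALTER DATABASE DROP LOGFILE GROUP 3;
```

After dropping a group, its member files remain on disk and can be removed at the operating system level.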

 

3)      Plan C – Plan B + Plan A

Increase the redo member size and add new groups.

Pros:

  • Better performance from the larger members plus three additional new groups.

Cons:

  • See Cons of Plan B.

 

Thinking in a Nutshell:

1)      At time point T1, transaction log entries (e.g. a completed and committed transaction XACT1) have been written to the redo log file.

2)      At time point T2, in the current environment, this batch of redo information has been archived to an archived log file after a log switch. In Plan B, by contrast, it may still sit only in the redo log file, meaning it has not yet been archived.

3)      At time point T3, redo log file corrupted unexpectedly.

4)      At time point T4, in current environment, transaction XACT1 can be recovered from the archived log file. However, in Plan B, transaction XACT1 cannot be recovered because the redo log file containing this information was corrupted before archiving.

5)      There is a compromise between high performance and high availability.

6)      A reasonable redo log member size and checkpoint settings should be discussed and decided to achieve the best balance of performance and availability.
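Before settling on a member size, it is worth measuring the current log switch frequency; a small sketch, assuming access to v$log_history:

```sql
-- Log switches per hour: a common rule of thumb aims for a switch
-- roughly every 15-30 minutes under normal load
SELECT TO_CHAR(first_time, 'YYYY-MM-DD HH24') AS hour,
       COUNT(*)                               AS switches
FROM   v$log_history
GROUP  BY TO_CHAR(first_time, 'YYYY-MM-DD HH24')
ORDER  BY hour;
```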

A simple thing is not always an easy thing. The more you think, the more you learn, so why not share your thoughts here? :)

How’s tricks

14/01/2011

1) Background

Morning, in the small office kitchen, which is usually the first place we all see each other each day.

2) Time

Breakfast time.

3) Place

The office kitchen.

4) People

Me, my British boss, and another British colleague who grew up in Australia.

5) Event

With a humorous look, my boss smiled and said to me: "How's tricks?"

Although I knew how to answer the question, or how to carry on the conversation he had started, this was the first time I had run into the phrase in person, so I asked him, somewhat confused, what "tricks" actually meant here. He explained at length; in short, it is similar to "How are you doing" or "What's up".

A little later I ran into the colleague who grew up in Australia, and showed off my newly learned phrase to greet him. He froze for a second at first, then replied: "I am good. It is interesting English." Curious, I pressed on: is it really not in common use? As his major was English, he gave me a brief account of the history and origin of the phrase, and said it is in fact rarely used.

If you strike up a conversation with this phrase, where the other person is mainly a Briton and perhaps an Irish person, they may suspect you have travelled here from ancient times.

6) Key phrase

How's tricks

7) Summary

How's tricks: it means "How are you", "How are you doing" or "What's up". It is British English, but these days not many people still use it.

Google Dictionary explanation

I could not find a satisfactory entry in Google Dictionary; the link below briefly explains the meaning of this phrase.

How’s tricks:

http://en.wiktionary.org/wiki/how’s_tricks

Clone Oracle Physical Standby Database to Read Write Mode

24/02/2010

1.   Purpose

Provide a workaround for cloning an Oracle physical standby database to a new database and opening the new database in read-write mode, while keeping the existing physical standby running in the same role in the existing Data Guard configuration.

You can also use this method to take a HOT backup of your physical standby database.

2.   Scope & Application

DBAs and application developers may benefit from this document when they need to reproduce production scenarios in a separate database to diagnose production problems or generate staging reports.

3.   Symptoms

As there is no official Oracle solution for this kind of operation, you should think through the following points for your particular situation beforehand; otherwise the CLONE operation may not work properly for your business.

3.1   Database Unique Name

In theory, the database name and the database unique name should be unique within the same enterprise environment. Observing this eliminates unnecessary exceptions and errors in your new system and helps the development and deployment of your application systems.

Set db_name and db_unique_name to values different from the source physical standby database, the primary database, and any other database names in your enterprise database environment.

For example, set the new database name to 'NEWDB' in the PFILE or SPFILE rather than leaving it the same as any existing database:

*.db_name='NEWDB'

*.db_unique_name='NEWDB'

3.2   Ongoing Transaction Frequency

When cloning the new database from the physical standby, some database operations may still be running on the primary. If your physical standby uses the real-time apply mechanism, keep in mind that while you are cloning, the data files may be at different SCNs, which means their file headers may be in an inconsistent state.

Decide in what order you will back up the individual data files, or whether to back them up in a single RMAN RUN block.
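If you choose the batch style, the individual copies can be grouped into a single RMAN RUN block so they execute back to back; a sketch, using example NEWDB paths:

```sql
RUN {
  BACKUP AS COPY DATAFILE 2 FORMAT '/u01/app/oracle/oradata/NEWDB/undotbs01.dbf';
  BACKUP AS COPY DATAFILE 3 FORMAT '/u01/app/oracle/oradata/NEWDB/sysaux01.dbf';
  BACKUP AS COPY DATAFILE 4 FORMAT '/u01/app/oracle/oradata/NEWDB/users01.dbf';
  BACKUP AS COPY DATAFILE 1 FORMAT '/u01/app/oracle/oradata/NEWDB/system01.dbf';
}
```

Note that a single RUN block does not make the copies consistent with each other; media recovery is still needed, as discussed below.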

3.3   Media Recovery

In many OLTP systems you need to perform media recovery when cloning the new database, because each data file may contain changes up to a different SCN.

3.4   Archived Log Destination

You may need to apply archived log files from the existing physical standby database. Verify the accessibility of these files: they are crucial for the consistency of both the existing physical standby and the new database, as they implement the durability part of a transaction's ACID properties. A missing archived log means a missing transaction, which means lost data.
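A quick sketch for checking which archived logs the standby still has on disk, assuming access to v$archived_log:

```sql
-- List archived logs the database knows about and has not deleted,
-- with the SCN range each one covers
SELECT name, sequence#, first_change#, next_change#
FROM   v$archived_log
WHERE  deleted = 'NO'
ORDER  BY sequence#;
```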

4.   Solution

To make the solution clearer, I have summarised the steps across the three sessions in the following table.

Notes:

1)      Primary database in Data Guard  - Primary

2)      Physical standby database in Data Guard – Standby

3)      New database being created – NEWDB

4)      Backup Standby database with RMAN

Timeline (Session 1 = Primary, Session 2 = Standby, Session 3 = NEWDB):

T1  (NEWDB):   Create pfile/spfile, set new database name
T2  (Standby): BACKUP AS COPY DATAFILE 2 FORMAT '/u01/app/oracle/oradata/NEWDB/undotbs01.dbf';
T3  (Primary): alter system switch logfile;
T4  (Primary): alter system checkpoint;
T5  (Standby): BACKUP AS COPY DATAFILE 3 FORMAT '/u01/app/oracle/oradata/NEWDB/sysaux01.dbf';
T6  (Primary): alter system switch logfile;
T7  (Primary): alter system checkpoint;
T8-T9   (Standby): BACKUP AS COPY DATAFILE 4 FORMAT '/u01/app/oracle/oradata/NEWDB/users01.dbf';
T10 (Primary): alter system switch logfile;
T11 (Primary): alter system checkpoint;
T12-T13 (Standby): BACKUP AS COPY DATAFILE 1 FORMAT '/u01/app/oracle/oradata/NEWDB/system01.dbf';
T14 (Standby): alter database backup controlfile to trace;
T15 (NEWDB):   STARTUP NOMOUNT
T16 (NEWDB):
CREATE CONTROLFILE SET DATABASE "NEWDB" RESETLOGS ARCHIVELOG
  MAXLOGFILES 16
  MAXLOGMEMBERS 3
  MAXDATAFILES 100
  MAXINSTANCES 8
  MAXLOGHISTORY 292
  LOGFILE
    GROUP 1 '/u01/app/oracle/oradata/NEWDB/redo01.log' SIZE 50M,
    GROUP 2 '/u01/app/oracle/oradata/NEWDB/redo02.log' SIZE 50M,
    GROUP 3 '/u01/app/oracle/oradata/NEWDB/redo03.log' SIZE 50M
  DATAFILE
    '/u01/app/oracle/oradata/NEWDB/system01.dbf',
    '/u01/app/oracle/oradata/NEWDB/undotbs01.dbf',
    '/u01/app/oracle/oradata/NEWDB/sysaux01.dbf',
    '/u01/app/oracle/oradata/NEWDB/users01.dbf'
  CHARACTER SET AL32UTF8;
T17 (NEWDB):   select * from v$recover_file;
T18 (NEWDB):   select file#, status, recover, fuzzy from v$datafile_header;
T19 (NEWDB):   recover database using backup controlfile until cancel;
T20 (NEWDB):   Manually apply archived logs from the standby archived log destination
T21 (NEWDB):   alter database open resetlogs;
