Problem (Abstract)
Processes or threads may consume unexpected CPU cycles. Output from the AIX tprof command can provide information about CPU utilization by process ID, thread, library, and function. This technote explores how to interpret tprof data to aid in debugging such issues in a WebSphere Application Server environment.
Resolving the problem
Data from the AIX tprof command is indispensable in debugging 100% CPU, high CPU usage, and performance degradation issues on WebSphere JVMs executing on the AIX operating system. In order to map threads in the tprof data to the Java™ stacks in Java cores, analysts will also need DBX data taken from the JVM process ID (PID). If the issue is related to CPU usage, the 100% CPU Mustgather technote will usually provide all the documentation needed to debug the problem.
The 100% CPU Mustgather technote contains a script that executes the tprof command with the following options:
tprof -skex sleep 60
The -skex parameters configure the following profiling:
-s          Enables shared library profiling
-k          Enables kernel profiling
-e          Enables kernel extension profiling
-x Program  Specifies the program to be executed by tprof. Data collection stops when Program completes or when the trace is stopped manually with either trcoff or trcstop.
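Because the tprof interval and the javacores need to cover the same window, the profiling run is typically paired with javacore requests against the suspect JVM. The following is a minimal sketch of driving the collection by hand; the PID 20386 is taken from the sample output below and is illustrative only, and the Mustgather script automates the tprof step:
# Rough manual equivalent of the collection (illustrative only).
tprof -skex sleep 60 &    # profile all CPU activity for 60 seconds in the background
kill -3 20386             # request a javacore from the suspect IBM JVM (PID 20386 here)
sleep 30
kill -3 20386             # take a second javacore part-way through for comparison
wait                      # let tprof finish writing its report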
The tprof utility gathers CPU statistics while the sleep 60 command executes for 60 seconds. On an AIX 5.2 machine, the output from the script is stored in the sleep.prof file and will have the following format:
Process Freq Total Kernel User Shared Other
======= ==== ===== ====== ==== ====== =====
wait 4 70.48 70.48 0.00 0.00 0.00
/usr/java131/jre/bin/java 7 25.04 0.01 0.00 25.02 0.01
/usr/bin/tprof 2 0.15 0.02 0.00 0.13 0.00
/home/db2inst1/sqllib/bin/db2fm 7 0.11 0.08 0.00 0.03 0.00
/usr/bin/sh 11 0.04 0.04 0.00 0.01 0.00
...
Process PID TID Total Kernel User Shared Other
======= === === ===== ====== ==== ====== =====
java131/jre/bin/java 20386 78913 25.00 0.00 0.00 25.00 0.00
wait 1290 1291 23.79 23.79 0.00 0.00 0.00
wait 516 517 23.60 23.60 0.00 0.00 0.00
wait 1032 1033 21.64 21.64 0.00 0.00 0.00
wait 774 775 1.47 1.47 0.00 0.00 0.00
/usr/bin/tprof 19884 112213 0.15 0.02 0.00 0.13 0.00
java131/jre/bin/java 12336 55499 0.01 0.00 0.00 0.00 0.00
java131/jre/bin/java 12336 117689 0.01 0.00 0.00 0.00 0.00
The first section presents a summation of CPU usage by process name and breaks the usage down into the Kernel, User, Shared, and Other categories. In this example, the java131/jre/bin/java process is consuming 25.04% of the CPU cycles during this period, most of it in shared libraries. The “wait” process is using a total of 70.48% of the CPU, all of it in the kernel.
The second section breaks down CPU usage by PID. The “java131/jre/bin/java” PID 20386 is consuming 25% of the CPU in shared libraries. The Java PID 12336 is using very little CPU.
When debugging WebSphere Application Server issues, CPU usage by a Java process in the Shared category generally indicates garbage collection (GC) activity. CPU usage in the Other category usually indicates a problem in Java code. If tprof shows a suspicious (non-wait) process using CPU in the Kernel category, you can usually assume the usage is due to an AIX operating system issue.
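As a quick triage aid, the per-process summary can be scanned automatically for Java rows and the dominant category reported. The following awk sketch assumes the seven-column summary layout shown above and is illustrative only:
# For each java row in the per-process summary of sleep.prof, report
# which category (Kernel/User/Shared/Other) accounts for most of the CPU.
awk '/java/ && NF == 7 {
    max = "Kernel"; val = $4
    if ($5 > val) { max = "User";   val = $5 }
    if ($6 > val) { max = "Shared"; val = $6 }
    if ($7 > val) { max = "Other";  val = $7 }
    printf "%-40s total %6s%%  mostly %s (%s%%)\n", $1, $3, max, val
}' sleep.prof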
The next section of data compiled by tprof provides a breakdown of CPU usage in each major category by Shared Object and Subroutine:
Total % For All Processes (SH-LIBs) = 25.22
Shared Object %
============= ======
/usr/WebSphere/AppServer/java/jre/bin/classic/libjvm.a 25.00
/usr/lib/libc.a[shr.o] 0.10
/usr/lib/libpthreads.a[shr_xpg5.o] 0.08
/usr/lib/libtrace.a[shr.o] 0.03
/usr/WebSphere/AppServer/java/jre/bin/libjitc.a 0.00
/usr/lib/libC.a[shr.o] 0.00
/usr/WebSphere/AppServer/java/bin/../jre/bin/libzip.a 0.00
Profile: /usr/WebSphere/AppServer/java/jre/bin/classic/libjvm.a
Total % For All Processes (/usr/WebSphere/AppServer/java/jre/bin/classic/libjvm.a) = 25.00
Subroutine % Source
========== ====== ======
.verifyHeapAfterCompaction0 25.00 ov/st/msc/gc_trace.c
.xeCreateNativeFrame 0.00 ov/xe/common/stack.c
Profile: /usr/lib/libc.a[shr.o]
Total % For All Processes (/usr/lib/libc.a[shr.o]) = 0.10
Subroutine % Source
========== ====== ======
._rec_mutex_lock 0.02 lib/libc/rec_mutex.c
.strpbrk 0.01 s/lib/libc/strpbrk.c
.strncpy 0.01 strncpy.s
.__divi64 0.01 divi64.s
From this data, you can determine the WebSphere Application Server library and the code being executed. In the above example, from the shared libraries category (SH-LIBs), you can see that /usr/WebSphere/AppServer/java/jre/bin/classic/libjvm.a is using 25% of the CPU. The verifyHeapAfterCompaction0 subroutine from the ov/st/msc/gc_trace.c source is responsible for all of that 25%. In this case, it appears that GC is responsible for 25% of the CPU usage during the period the tprof data was taken. This data can be invaluable in pinpointing the source of CPU usage.
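When the report is long, the hottest subroutines across all profiled libraries can be pulled out directly. This sketch assumes the Subroutine rows keep the leading-dot layout shown above:
# List the ten hottest subroutines in sleep.prof, highest CPU% first.
grep '^ *\.' sleep.prof | sort -k2,2nr | head -10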
Also note that the process name from the Summary section of the tprof output shows the Java process as originating from the directory path /usr/java131/jre/bin/java instead of the /usr/WebSphere/AppServer/java/jre/bin shown in the Shared Objects and Subroutine sections:
Process Freq Total Kernel User Shared Other
======= ==== ===== ====== ==== ====== =====
wait 4 70.48 70.48 0.00 0.00 0.00
/usr/java131/jre/bin/java 7 29.04 0.01 0.00 29.02 0.01
vs.
Total % For All Processes (SH-LIBs) = 25.22
Shared Object %
============= ======
/usr/WebSphere/AppServer/java/jre/bin/classic/libjvm.a 25.00
This is a defect in the tprof utility addressed by the 5.2.0.51 release of the bos.perf.tools library.
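To verify whether the installed fileset already contains the fix, check its level with the standard lslpp command:
# Display the installed level of the fileset that provides tprof;
# 5.2.0.51 or later includes the fix for the path mismatch described above.
lslpp -l bos.perf.tools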
On AIX 5.1 and earlier, tprof data is captured in a _prof.all file, and CPU usage is quantified not as a percentage of the total but as the number of ticks each process uses. The first section of the _prof.all file contains the CPU usage of each thread in the processes on the system:
Process PID TID Total Kernel User Shared Other
======= === === ===== ====== ==== ====== =====
java 18932 49219 5940 117 0 4667 1156
wait 1290 1291 4405 4405 0 0 0
wait 774 775 3991 3991 0 0 0
wait 516 517 3472 3472 0 0 0
...
/usr/bin/grep 30300 100087 1 1 0 0 0
======= === === ===== ====== ==== ====== =====
Total 24005 15755 26 6083 2141
In this example, thread ID 49219 used 5940 of the total 24005 ticks during the sample period (about 25%).
The next section of data in the _prof.all file contains the summation of CPU usage by process name. Once again, instead of percentage CPU, we see the number of ticks. In this case, the Java process used 8551 of the 24005 ticks:
Process FREQ Total Kernel User Shared Other
======= === ===== ====== ==== ====== =====
wait 4 15237 15237 0 0 0
java 42 8551 376 0 6034 2141
PatrolAgent 6 48 17 16 15 0
...
/usr/bin/grep 1 1 1 0 0 0
======= === ===== ====== ==== ====== =====
Total 110 24005 15755 26 6083 2141
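To express these tick counts as percentages, divide by the total number of ticks for the interval (24005 in this example). A small awk sketch, assuming the seven-column process summary layout above:
# Append an approximate CPU% to each row of the process summary in _prof.all.
awk -v total=24005 'NF == 7 && $2 ~ /^[0-9]+$/ {
    printf "%-20s %6d ticks  %5.1f%%\n", $1, $3, 100 * $3 / total
}' _prof.all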
Finally, the _prof.all file also contains the breakdown of the CPU usage grouped by category (Shared, Other, etc.), Shared Object, and Subroutine. The format is similar to the one discussed above.
When debugging CPU usage issues, you should first inspect the tprof output to determine which processes and threads are consuming the CPU, usually concentrating on Java processes. Then, determine the libraries from which the process is using the CPU. If it is a Java process using CPU from the shared libraries, you can quickly check the Shared Object and Subroutine sections to determine whether GC is the culprit. This can then be verified by analyzing the verbose GC data, if it is provided.
If the CPU is consumed from the Other libraries, you should attempt to map the problem thread ID back to the javacore file to determine the source of the problem. You will need dbxtrace data taken from the JVM PID in order to do so. Detailed instructions for mapping the tprof data back to the javacores can be found in Correlating TPROF Output to a Specific Thread in the JAVACORE using DBXTrace -a Output.
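As a rough illustration of the cross-referencing involved (the technote referenced above documents the exact procedure), note that tprof reports thread IDs in decimal while javacores list thread IDs in hexadecimal, so it helps to have both forms on hand when searching the dbx and javacore output. The file names below are hypothetical:
# Illustrative only -- see the referenced technote for the exact mapping steps.
tid=78913                                      # hot thread from the tprof output above
printf "TID %d = 0x%X\n" $tid $tid             # prints: TID 78913 = 0x13441
grep -i "$tid" dbxtrace.out                    # locate the thread in the dbx output (decimal)
grep -il "$(printf '%x' $tid)" javacore*.txt   # search the javacores for the hex form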
Related information
Mustgather: 100% CPU Usage on AIX Platforms
Correlating TPROF output to a Javacore Thread