HP OpenVMS Systems Documentation
OpenVMS VAX System Dump Analyzer Utility Manual
8.2.2 Illegal Page Faults
A PGFIPLHI bugcheck occurs when a page fault occurs while the interrupt priority level (IPL) is greater than 2 (IPL$_ASTDEL). When the system fails because of an illegal page fault, the following message appears on the console terminal:
When an illegal page fault occurs, the stack appears as shown in Figure SDA-4.
Figure SDA-4 Stack Following an Illegal Page-Fault Error
Six longwords describe the error:
If the operating system detects a page fault while the IPL is higher
than IPL$_ASTDEL, you can obtain the address of the instruction that
caused the fault by examining the PC pushed onto the current operating
stack. Follow the steps outlined in Section 9.3 to determine which
module issued the instruction.
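The condition behind a PGFIPLHI bugcheck can be sketched in a few lines of Python. This is an illustration only, not SDA or executive code. It assumes the standard VAX processor status longword (PSL) layout, in which the IPL occupies bits 20 through 16; the function name and the sample PSL values are hypothetical.

```python
IPL_ASTDEL = 2  # IPL$_ASTDEL, the highest IPL at which a page fault is legal


def is_illegal_page_fault(psl: int) -> bool:
    """Return True if a page fault taken with this PSL is above IPL$_ASTDEL."""
    ipl = (psl >> 16) & 0x1F  # PSL<20:16> holds the IPL (assumed VAX layout)
    return ipl > IPL_ASTDEL


# Hypothetical PSL values: IPL 2 is still legal, IPL 3 is not.
print(is_illegal_page_fault(0x00020000))
print(is_illegal_page_fault(0x00030000))
```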
This section steps through the analysis of a system failure using, as an example, a printer driver. Three events lead up to this failure:
The following sections describe the actions to take in investigating
the causes of this system crash.
First, invoke SDA to analyze the system dump file. The initialization message indicates the type of bugcheck that occurred as follows:
An exception occurred that caused the system to signal a bugcheck, and
signal and mechanism arrays have been created on the current operating
stack.
Use the SHOW STACK command to display the current operating stack. In this case, it is the interrupt stack. The following example shows the interrupt stack and the signal and mechanism arrays. See the SHOW STACK command for a complete description of the format of the stack display.
The mechanism array begins at address 8006A3A8₁₆ and ends at address 8006A3B8₁₆. Its first longword contains 00000004₁₆. The signal array begins at address 8006A3C0₁₆ and ends at 8006A3D4₁₆. Its first longword contains 00000005₁₆ and its second longword contains 0000000C₁₆. Examination of the signal array shows the following:
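As a cross-check on the array boundaries quoted above, the following Python sketch counts the longwords each array spans and compares that with the count stored in its first longword. It assumes the usual VAX convention that the first longword of a signal or mechanism array holds the number of longwords that follow it; the helper name is hypothetical.

```python
LONGWORD = 4  # bytes per VAX longword


def longwords_spanned(first_addr: int, last_addr: int) -> int:
    """Number of longwords from first_addr through last_addr, inclusive."""
    return (last_addr - first_addr) // LONGWORD + 1


# Addresses and counts are the ones quoted in the text.
mechanism_len = longwords_spanned(0x8006A3A8, 0x8006A3B8)
signal_len = longwords_spanned(0x8006A3C0, 0x8006A3D4)

# Assumed convention: first longword = number of longwords that follow it.
print(mechanism_len, mechanism_len - 1 == 0x00000004)
print(signal_len, signal_len - 1 == 0x00000005)
```

If either comparison fails for a dump you are examining, the addresses you picked probably do not bracket a complete array.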
Issuing the SDA command EVALUATE/PSL 04080000 makes the following information apparent:
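The decoding that EVALUATE/PSL performs can be approximated by hand. The Python sketch below is not SDA's implementation; it assumes the standard VAX PSL layout (interrupt-stack flag in bit 26, current mode in bits 25:24, previous mode in bits 23:22, IPL in bits 20:16), and the function and key names are hypothetical.

```python
MODES = ("KERNEL", "EXECUTIVE", "SUPERVISOR", "USER")


def decode_psl(psl: int) -> dict:
    """Extract the fields of a VAX PSL that matter for crash analysis."""
    return {
        "interrupt_stack": bool((psl >> 26) & 1),  # PSL<26>
        "current_mode": MODES[(psl >> 24) & 3],    # PSL<25:24>
        "previous_mode": MODES[(psl >> 22) & 3],   # PSL<23:22>
        "ipl": (psl >> 16) & 0x1F,                 # PSL<20:16>
    }


for field, value in decode_psl(0x04080000).items():
    print(f"{field}: {value}")
```

For the PSL in this example, the sketch reports execution on the interrupt stack, in kernel mode, at IPL 8.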
Use the SHOW PAGE_TABLE command to display the system page table, as shown in the following example. The page containing location 80069E00₁₆ is not available to any access mode (a null page); thus, the virtual address is not valid.
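To find the right entry in the page table display, it helps to split the virtual address into its parts. The following Python sketch assumes the standard VAX virtual address layout (region in bits 31:30, virtual page number in bits 29:9, byte offset in bits 8:0, with 512-byte pages); the function name is hypothetical.

```python
def split_vax_va(va: int) -> tuple:
    """Split a VAX virtual address into (region, virtual page number, byte offset)."""
    region = (va >> 30) & 0x3    # 2 means system (S0) space
    vpn = (va >> 9) & 0x1FFFFF   # 21-bit virtual page number
    offset = va & 0x1FF          # offset within the 512-byte page
    return region, vpn, offset


region, vpn, offset = split_vax_va(0x80069E00)
print(region, hex(vpn), offset)
```

For the address in this example, the region is 2 (system space), the virtual page number is 34F (hexadecimal), and the byte offset is 0, so the faulting reference was to the first byte of that page.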
9.3 Locating the Source of the Exception
Because the printer went off line and then came back on line, as shown
on the console listing in Section 9.2, the problem might exist in the
driver code. SDA can help you determine which driver might contain the
faulty code.
The first step in determining whether the failing instruction is within a driver is to examine the PC in the signal array using the EXAMINE/INSTRUCTION command. This has two results:
In the following example, the instruction that caused the exception is located within the printer driver.
If SDA is unable to find a symbol within FFF₁₆ bytes of the memory location you specify, it displays the location as an absolute address. This often, but not always, means the instruction that caused the exception is not part of a device driver. To determine whether an instruction is part of a driver, use the SHOW DEVICE command to display the starting addresses and lengths of all the drivers in the system. If the address of the failing instruction falls within the range of addresses shown for a given driver, the failing instruction is a part of that driver. The following example shows a partial list of the drivers in the display generated by the SHOW DEVICE command.
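The range check described above is simple enough to sketch in Python. The driver names, start addresses, and lengths below are hypothetical placeholders, not values from a real SHOW DEVICE display; substitute the figures from your own dump.

```python
def find_driver(address: int, drivers: dict):
    """Return the name of the driver whose address range contains address, or None."""
    for name, (start, length) in drivers.items():
        if start <= address < start + length:
            return name
    return None


# Hypothetical start addresses and lengths, as read off a SHOW DEVICE display.
drivers = {
    "AADRIVER": (0x80010000, 0x0800),
    "BBDRIVER": (0x80020000, 0x1000),
}

print(find_driver(0x80020340, drivers))  # inside BBDRIVER's range
print(find_driver(0x80030000, drivers))  # outside every listed driver
```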
9.3.2 Calculating the Offset into the Driver's Program Section
The offsets that SDA displays from nnDRIVER are actually offsets from the DPT. As such, these offsets do not exactly correspond to the offsets shown in driver listings, which represent offsets from the beginning of the program section (PSECT) in which a given instruction appears. Because a driver usually contains more than one PSECT, you must use the driver's map to determine the location of the failing instruction within the driver listing. To calculate the location of the instruction within the driver listing, refer to the "Program Section Synopsis" section of the driver's map. Determine in which PSECT the offset given by SDA occurs and subtract the base of the PSECT from the offset. You can then use the resulting figure as an index into the driver listing.
If SDA does not display the address as an offset from
nnDRIVER, but the address is within the address range
of a driver in the SHOW DEVICE display, you must first subtract the
address of the DPT from the failing address. Using the result as the
offset, you can then follow the steps previously outlined for
determining the index of the instruction into a driver listing.
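The arithmetic in this section can be summarized in a short Python sketch. All addresses, PSECT names, bases, and lengths below are hypothetical; in practice they come from the SHOW DEVICE display and from the "Program Section Synopsis" of the driver's map.

```python
def locate_in_listing(failing_address: int, dpt_address: int, psects: list):
    """Return (psect_name, offset_within_psect) for a failing address.

    psects is a list of (name, base, length) tuples, where base is the
    PSECT's offset from the DPT as given in the driver's map.
    """
    offset = failing_address - dpt_address  # offset from the DPT
    for name, base, length in psects:
        if base <= offset < base + length:
            return name, offset - base      # index into the driver listing
    raise ValueError("offset is outside every PSECT in the map")


# Hypothetical map: two PSECTs laid out back to back after the DPT.
psects = [("PSECT_A", 0x000, 0x100), ("PSECT_B", 0x100, 0x400)]

name, index = locate_in_listing(0x80020340, 0x80020000, psects)
print(name, hex(index))
```

When SDA already displays the location as nnDRIVER plus an offset, the subtraction of the DPT address has been done for you and only the PSECT lookup remains.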
To find the problem within the routine, examine the printer's driver code. In the system failure discussed in this example, the instruction that caused the exception is MOVB (R3)+,(R0). To check the contents of R3, use the EXAMINE command as follows:
The invalid virtual address, as recorded in the signal array, is stored in R3. In the following driver code excerpt, the instruction in question appears at line 599. It is likely that the contents of R3 have been incremented too many times.
Explanations of the circled numbers in the example are in Section 9.4.1.
The MOVB instruction is part of a routine that reads characters from a buffer and writes them to the printer. The routine contains the loop of instructions that starts at the label 20$ and ends at 25$. This loop executes once for each character in the buffer, performing these steps:
Steps 1 and 2 are repeated until the contents of R1 are 0 or the printer signals that it is not ready. If the printer signals that it is not ready, the driver transfers control to 30$ (line 598), the beginning of a routine that waits for an interrupt from the printer. When the printer becomes ready, it interrupts the driver and execution of the loop resumes.
Examine the code to determine which variables control the loop. The byte count (BCNT) is the number of characters in the buffer. Note that BCNT is set by a function decision table (FDT) routine and that this routine sets the value of BCNT to the number of characters in the buffer. In line 586, the starting address of a buffer that is BCNT bytes in size is moved into R3. Note also that the number of characters left to be printed is represented by the byte offset (BOFF), the offset into the buffer at which the driver finds the next character to be printed. This value controls the number of times the loop is executed.
Because the exception is an access violation, either R3 or R0 must contain an incorrect value. You can determine that R0 is probably valid by the following logic:
Thus, the contents of R3 seem to be the cause of the failure.
The most likely reason that the contents of R3 are wrong is that the
MOVB instruction at line 599 executes too many times. You can check
this by comparing the contents of UCB$W_BOFF and UCB$W_BCNT. If
UCB$W_BOFF contains a larger value than that in UCB$W_BCNT, then R3
contains a value that is too large, indicating that the MOVB
instruction has incremented the contents of R3 too many times.
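The comparison itself is trivial, but it is worth remembering that both fields are words. The Python sketch below treats them as unsigned 16-bit quantities, which is an assumption about how to read the values from the UCB display; the function name and the sample values are hypothetical.

```python
def boff_exceeds_bcnt(boff: int, bcnt: int) -> bool:
    """Return True if UCB$W_BOFF holds a larger value than UCB$W_BCNT."""
    boff &= 0xFFFF  # word field, read here as unsigned (assumption)
    bcnt &= 0xFFFF
    return boff > bcnt


# Hypothetical values read from a UCB display.
print(boff_exceeds_bcnt(0x0010, 0x0050))  # plausible: fewer left than total
print(boff_exceeds_bcnt(0xFFF0, 0x0050))  # suspicious: BOFF larger than BCNT
```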
Because the start-I/O routine requires that R5 contain the address of the printer's UCB, and because several other instructions reference R5 without error before any instruction in the loop does, you can assume that R5 contains the address of the right UCB. To compare BOFF and BCNT, use the command FORMAT @R5 to display the contents of the UCB, as shown in the following session.
If you have only one printer in your system configuration, you do not need to use the FORMAT command. Instead, you can use the command SHOW DEVICE LP. Because only one printer is connected to the processor, only one UCB is associated with a printer for SDA to display.
The output produced by the FORMAT @R5 command shows that UCB$W_BOFF contains a value greater than that in UCB$W_BCNT; it should be smaller. Therefore, the value stored in BOFF is incorrect.
Thus, the value of BOFF is not the number of characters that remain in
the buffer. This value is used in calculating an address that is
referenced at an elevated IPL. When this address is within a null page
(unreadable in all access modes), an attempt to reference it causes the
system to fail.
Examine the printer driver code to locate all instructions that modify UCB$W_BOFF. The value changes in two circumstances:
When the printer times out, the driver should not modify UCB$W_BOFF. It does so, however, in line 631. The driver should modify the contents of UCB$W_BOFF only when it is certain that the printer printed the character. When the printer times out, this is not the case. Furthermore, the wait-for-interrupt routine preserves only registers R3, R4, and R5, so that only those registers can be used unmodified after the execution of the wait-for-interrupt routine. Thus, the use of R1 in line 631 is an error. To correct the problem, change the WFIKPCH argument (line 616) so that, when the printer times out, the WFIKPCH macro transfers control to 50$ rather than to 40$.
10 Inducing a System Failure
If the operating system is not performing well and you want to create a dump you can examine, you must induce a system failure. Occasionally, a device driver or other user-written, kernel-mode code can cause the system to execute a loop of code at a high priority, interfering with normal system operation. This can occur even though you have set a breakpoint in the code if the loop is encountered before the breakpoint. To gain control of the system in such circumstances, you must cause the system to fail and then reboot it. If the system has suspended all noticeable activity (if it is "hung"), see the examples of causing system failures in Section 10.2.
If you are generating a system crash in response to a system hang, be
sure to record the PC at the time of the system halt as well as the
contents of the general registers. Submit this information to Compaq,
along with the Software Performance Report (SPR) and a copy of the
generated system dump file.
The following requirements must be met before the system can write a complete crash dump: