HP OpenVMS Systems Documentation

OpenVMS VAX System Dump Analyzer Utility Manual


8.2.2 Illegal Page Faults

A PGFIPLHI bugcheck is signaled when a page fault occurs while the interrupt priority level (IPL) is greater than 2 (IPL$_ASTDEL). When the system fails because of an illegal page fault, the following message appears on the console terminal:


PGFIPLHI, page fault with IPL too high

When an illegal page fault occurs, the stack appears as shown in Figure SDA-4.

Figure SDA-4 Stack Following an Illegal Page-Fault Error


Six longwords describe the error:

Longword         Contents
R4               Contents of R4 at the time of the bugcheck.
R5               Contents of R5 at the time of the bugcheck.
Reason mask      Longword mask. If bit 0 of this longword is set, the failing instruction (at the PC saved below) caused a length violation. If bit 1 is set, it referred to a location whose page table entry is in a "no access" page. Bit 2 indicates the type of access used by the failing instruction: it is set for write and modify operations and clear for read operations.
Virtual address  Virtual address being referenced by the instruction that caused the page fault.
PC               PC containing the address of the instruction that caused the page fault.
PSL              PSL at the time of the page fault.
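
The bit tests described for the reason mask can be sketched in a few lines of Python. This is an illustration only, not part of SDA: the function name, the dictionary keys, and the sample mask value are assumptions made for the example.

```python
def decode_reason_mask(mask: int) -> dict:
    """Decode bits 0 through 2 of the reason mask longword described above."""
    return {
        "length_violation": bool(mask & 0x1),  # bit 0: length violation
        "pte_inaccessible": bool(mask & 0x2),  # bit 1: page table entry in an inaccessible page
        "write_or_modify": bool(mask & 0x4),   # bit 2: set for write/modify, clear for read
    }


# Hypothetical mask value chosen for illustration: bits 0 and 2 set.
for name, is_set in decode_reason_mask(0b101).items():
    print(f"{name}: {is_set}")
```

A mask of zero, as in the sample failure analyzed later in this manual, decodes to all three flags clear.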

If the operating system detects a page fault while the IPL is higher than IPL$_ASTDEL, you can obtain the address of the instruction that caused the fault by examining the PC pushed onto the current operating stack. Follow the steps outlined in Section 9.3 to determine which module issued the instruction.

9 A Sample System Failure

This section steps through the analysis of a system failure using, as an example, a printer driver. Three events lead up to this failure:

  1. The line printer goes off line for 3 hours.
  2. The line printer comes back on line.
  3. The operating system signals a bugcheck, writes information to the system dump file, and shuts itself down.

The following sections describe the actions to take in investigating the causes of this system crash.

9.1 Identifying the Bugcheck

First, invoke SDA to analyze the system dump file. The initialization message indicates the type of bugcheck that occurred as follows:



Dump taken on 31-JAN-1993 16:34:31.23
INVEXCEPTN, Exception while above ASTDEL or on interrupt stack

SDA>

An exception occurred that caused the system to signal a bugcheck, and signal and mechanism arrays have been created on the current operating stack.

9.2 Identifying the Exception

Use the SHOW STACK command to display the current operating stack. In this case, it is the interrupt stack. The following example shows the interrupt stack and the signal and mechanism arrays. See the SHOW STACK command for a complete description of the format of the stack display.


CPU 01 Processor stack
----------------------
Current operating stack (INTERRUPT)

        8006A378    8000844B    ACP$WRITEBLK+0A0
   .
   .
   .
  SP => 8006A398    7FFDC340
        8006A39C    8006A3A0
        8006A3A0    80004E7D    EXE$REFLECT+0D4
        8006A3A4    04080009
        8006A3A8    00000004
        8006A3AC    7FFDC368
        8006A3B0    FFFFFFFD
        8006A3B4    8001774E
        8006A3B8    0000074F
        8006A3BC    00000001
        8006A3C0    00000005
        8006A3C4    0000000C
        8006A3C8    00000000
        8006A3CC    80069E00
        8006A3D0    8005D003
        8006A3D4    04080000
        8006A3D8    80009604    EXE$FORKDSPTH+01C
   .
   .
   .

The mechanism array begins at address 8006A3A8₁₆ and ends at address 8006A3B8₁₆. Its first longword contains 00000004₁₆. The signal array begins at address 8006A3C0₁₆ and ends at 8006A3D4₁₆. Its first longword contains 00000005₁₆ and its second longword contains 0000000C₁₆. Examination of the signal array shows the following:

  • The exception code is 0C₁₆, indicating an access violation.
  • The reason mask is zero, indicating that the instruction caused a protection violation (instead of a length violation) when it tried to read the location (rather than write to it).
  • The virtual address that the instruction attempted to reference was 80069E00₁₆.
  • The PC of the instruction that referred to the bad virtual address was 8005D003₁₆.
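
As a cross-check, the six longwords of the signal array can be copied from the SHOW STACK display and labeled programmatically. The following Python sketch is an illustration only; the class name, field names, and function name are assumptions, and the layout it expects is the access-violation layout described above (argument count, exception code, reason mask, virtual address, PC, PSL).

```python
from typing import NamedTuple, Sequence

ACCESS_VIOLATION = 0x0C  # exception code reported in the signal array above


class AccessViolationSignal(NamedTuple):
    arg_count: int
    exception_code: int
    reason_mask: int
    virtual_address: int
    pc: int
    psl: int


# Longwords at 8006A3C0 through 8006A3D4, copied from the stack display.
SIGNAL_ARRAY = [0x00000005, 0x0000000C, 0x00000000,
                0x80069E00, 0x8005D003, 0x04080000]


def parse_signal_array(longwords: Sequence[int]) -> AccessViolationSignal:
    """Label the longwords of an access-violation signal array."""
    if len(longwords) != 6 or longwords[0] != 5:
        raise ValueError("expected an argument count of 5 followed by 5 longwords")
    signal = AccessViolationSignal(*longwords)
    if signal.exception_code != ACCESS_VIOLATION:
        raise ValueError("not an access-violation signal array")
    return signal


signal = parse_signal_array(SIGNAL_ARRAY)
print(f"VA={signal.virtual_address:08X} PC={signal.pc:08X} PSL={signal.psl:08X}")
```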

Issuing the SDA command EVALUATE/PSL 04080000 makes the following information apparent:

  • The IPL was 8 at the time of the exception (shown by bits 16 through 20 of the PSL).
  • The current operating stack was the interrupt stack (bit 26 of the PSL is set to 1).
  • The process was executing in kernel mode at the time of the exception (shown by bits 24 and 25 of the PSL).
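
The three PSL fields listed above can also be extracted by hand with shifts and masks. The sketch below is a Python illustration of that arithmetic, not a replacement for EVALUATE/PSL; the function name and dictionary keys are assumptions, and only the three fields discussed here are decoded.

```python
# Access modes in order of their 2-bit encoding (0 = kernel ... 3 = user).
MODES = ("kernel", "executive", "supervisor", "user")


def decode_psl(psl: int) -> dict:
    """Extract the IPL, current mode, and interrupt-stack flag from a PSL."""
    return {
        "ipl": (psl >> 16) & 0x1F,                  # bits 16 through 20
        "current_mode": MODES[(psl >> 24) & 0x3],   # bits 24 and 25
        "interrupt_stack": bool((psl >> 26) & 0x1), # bit 26
    }


print(decode_psl(0x04080000))
```

For the PSL in this example, the sketch reports an IPL of 8, kernel mode, and the interrupt stack in use, which agrees with the list above.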

Use the SHOW PAGE_TABLE command to display the system page table, as shown in the following example. The page containing location 80069E00₁₆ is not available to any access mode (a null page); thus, the virtual address is not valid.


SDA> SHOW PAGE_TABLE


System page table
-----------------

ADDRESS   SVAPTE   PTE       TYPE  PROT  BITS PAGTYP  LOC STATE TYPE REFCNT   BAK       SVAPTE  FLINK  BLINK
   .
   .
   .
80068400  80777B08 7C40FFC8  STX   UR       K
80068600  80777B0C 7C40FFC8  STX   UR       K
80068800  80777B10 7C40FFC8  STX   UR       K
80068A00  80777B14 7C40FFC8  STX   UR       K
80068C00  80777B18 7C40FFC8  STX   UR       K
80068E00  80777B1C 7C40FFC8  STX   UR       K
80069000  80777B20 7C40FFC8  STX   UR       K
80069200  80777B24 7C40FFC8  STX   UR       K
80069400  80777B28 7C40FFC8  STX   UR       K
80069600  80777B2C 7C40FFC8  STX   UR       K
80069800  80777B30 7C40FFC8  STX   UR       K
80069A00  80777B34 780016C9  TRANS UR       K SYSTEM FREELST 00   01    0   0040FFC8   80777B34  03AF  0E15
80069C00  80777B38 78000E15  TRANS UR       K SYSTEM FREELST 00   01    0   0040FFC8   80777B38  16C9  2592
-------- 40 NULL PAGES
   .
   .
   .
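
To see why the failing address falls in the null region, note that the ADDRESS column advances by 200 (hexadecimal) per entry, that is, one 512-byte page per line. The Python sketch below rounds the failing address down to its page base and compares it with the last mapped page in the display. The function and constant names are assumptions made for the example.

```python
PAGE_SIZE = 0x200  # bytes per page; matches the step in the ADDRESS column

LAST_MAPPED_PAGE = 0x80069C00  # last entry shown before the null pages
FAULT_ADDRESS = 0x80069E00     # virtual address from the signal array


def page_base(virtual_address: int) -> int:
    """Round a virtual address down to the start of its page."""
    return virtual_address & ~(PAGE_SIZE - 1)


def follows_last_mapped_page(virtual_address: int) -> bool:
    """True if the address lies in the page immediately after the last mapped one."""
    return page_base(virtual_address) == LAST_MAPPED_PAGE + PAGE_SIZE


print(f"page base: {page_base(FAULT_ADDRESS):08X}")
print("first null page:", follows_last_mapped_page(FAULT_ADDRESS))
```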

9.3 Locating the Source of the Exception

Because the printer went off line and then came back on line, as described at the beginning of Section 9, the problem might exist in the driver code. SDA can help you determine which driver might contain the faulty code.

9.3.1 Finding the Driver by Using the Program Counter

The first step in determining whether the failing instruction is within a driver is to examine the PC in the signal array using the EXAMINE/INSTRUCTION command. This has two results:

  • If possible, it displays the contents of the address as a MACRO instruction.
  • It identifies the address as an offset from the symbol, nnDRIVER, if the address lies within the first FFF₁₆ bytes of such a symbol. SDA defines a symbol in the form of nnDRIVER for each device driver connected to the system. This symbol represents the base of the driver prologue table (DPT). Each DPT is part of the device driver it describes and is immediately followed by that driver's code.

In the following example, the instruction that caused the exception is located within the printer driver.


SDA> EXAMINE/INSTRUCTION 8005D003
LPDRIVER+2B3   MOVB    (R3)+,(R0)

If SDA is unable to find a symbol within FFF₁₆ bytes of the memory location you specify, it displays the location as an absolute address. This often, but not always, means the instruction that caused the exception is not part of a device driver.
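
The symbolization rule can be modeled roughly as follows. This Python sketch is an approximation for illustration, not SDA's actual algorithm: the function name is an assumption, the choice of the nearest symbol at or below the address is an assumption, and the LPDRIVER base is not shown in any display but is derived here by subtracting the offset 2B3 from the failing PC.

```python
MAX_OFFSET = 0xFFF  # largest offset at which the symbol is still shown

# Assumed symbol table; the base is derived as 8005D003 - 2B3.
SYMBOLS = {"LPDRIVER": 0x8005D003 - 0x2B3}


def symbolize(address: int, symbols: dict) -> str:
    """Return 'symbol+offset' if a symbol lies within MAX_OFFSET below the address."""
    best = None
    for name, base in symbols.items():
        if 0 <= address - base <= MAX_OFFSET:
            if best is None or base > best[1]:
                best = (name, base)  # prefer the closest symbol below the address
    if best is None:
        return f"{address:08X}"      # fall back to an absolute address
    name, base = best
    return f"{name}+{address - base:X}"


print(symbolize(0x8005D003, SYMBOLS))
print(symbolize(0x80069E00, SYMBOLS))
```

The first call reproduces the LPDRIVER+2B3 form shown in the example; the second address is more than FFF bytes past the symbol, so it is returned as an absolute address.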

To determine whether an instruction is part of a driver, use the SHOW DEVICE command to display the starting addresses and lengths of all the drivers in the system. If the address of the failing instruction falls within the range of addresses shown for a given driver, the failing instruction is a part of that driver. The following example shows a partial list of the drivers in the display generated by the SHOW DEVICE command.


I/O data structures

                           DDB list
                           --------

    Address    Controller     ACP       Driver      DPT   DPT size
    -------    ----------     ---       ------      ---   --------

    80000ECC    HELIUM$DBA    F11XQP    DBDRIVER   800F7AD0  08FD
    80001040    OPA                     OPERATOR   80001622  0061
    8000126C    MBA                     MBDRIVER   800015B0  0578
    80001460    NLA                     NLDRIVER   800015E9  05A3
    801E2800    HELIUM$DMA    F11XQP    DMDRIVER   800B5CB0  0AA0
    801E2980    HELIUM$DLA    F11XQP    DLDRIVER   800B6A50  08D0
   .
   .
   .
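
The range check described above amounts to testing whether the failing address lies between a driver's DPT address and the DPT address plus the DPT size. The Python sketch below applies that test to the rows of the partial listing; the function name is an assumption. Because the listing is partial, the printer driver is not among the rows, so the failing PC from this example matches none of them.

```python
# (driver, DPT address, DPT size) copied from the partial SHOW DEVICE listing.
DRIVERS = [
    ("DBDRIVER", 0x800F7AD0, 0x08FD),
    ("OPERATOR", 0x80001622, 0x0061),
    ("MBDRIVER", 0x800015B0, 0x0578),
    ("NLDRIVER", 0x800015E9, 0x05A3),
    ("DMDRIVER", 0x800B5CB0, 0x0AA0),
    ("DLDRIVER", 0x800B6A50, 0x08D0),
]


def drivers_containing(address: int, drivers=DRIVERS) -> list:
    """Return the names of all listed drivers whose range covers the address."""
    return [name for name, dpt, size in drivers if dpt <= address < dpt + size]


print(drivers_containing(0x800B5D00))  # an address inside DMDRIVER
print(drivers_containing(0x8005D003))  # the failing PC; not in the partial list
```

The function returns a list rather than a single name because some ranges in the listing overlap (OPERATOR, MBDRIVER, and NLDRIVER).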

9.3.2 Calculating the Offset into the Driver's Program Section

The offsets that SDA displays from nnDRIVER are actually offsets from the DPT. As such, these offsets do not exactly correspond to the offsets shown in driver listings, which represent offsets from the beginning of the program section (PSECT) in which a given instruction appears. Because a driver usually contains more than one PSECT, you must use the driver's map to determine the location of the failing instruction within the driver listing.

To calculate the location of the instruction within the driver listing, refer to the "Program Section Synopsis" section of the driver's map. Determine in which PSECT the offset given by SDA occurs and subtract the base of the PSECT from the offset. You can then use the resulting figure as an index into the driver listing.

If SDA does not display the address as an offset from nnDRIVER, but the address is within the address range of a driver in the SHOW DEVICE display, you must first subtract the address of the DPT from the failing address. Using the result as the offset, you can then follow the steps previously outlined for determining the index of the instruction into a driver listing.
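
The two subtractions can be sketched as follows. This Python example is illustrative only: the PSECT names, bases, and lengths are hypothetical stand-ins for the values you would read from the "Program Section Synopsis" of the driver's map, and the DPT address is derived from the LPDRIVER+2B3 display rather than taken from a listing.

```python
# Hypothetical PSECT layout: (name, base offset from the DPT, length).
# Real values come from the driver's map, which is not shown in this manual.
PSECTS = [
    ("PSECT_A", 0x000, 0x040),
    ("PSECT_B", 0x040, 0x400),
]

DPT_ADDRESS = 0x8005D003 - 0x2B3  # derived from LPDRIVER+2B3
FAILING_PC = 0x8005D003


def listing_offset(failing_address: int, dpt_address: int, psects) -> tuple:
    """Return (PSECT name, offset within that PSECT) for a failing address."""
    offset = failing_address - dpt_address         # offset from the DPT
    for name, base, length in psects:
        if base <= offset < base + length:
            return name, offset - base             # index into the driver listing
    raise ValueError(f"offset {offset:X} is outside every listed PSECT")


name, index = listing_offset(FAILING_PC, DPT_ADDRESS, PSECTS)
print(f"{name}+{index:X}")
```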

9.4 Finding the Problem Within the Routine

To find the problem within the routine, examine the printer's driver code. In the system failure discussed in this example, the instruction that caused the exception is MOVB (R3)+,(R0). To check the contents of R3, use the EXAMINE command as follows:


SDA> EXAMINE R3
R3: 80069E00 "...."

The invalid virtual address, as recorded in the signal array, is stored in R3. In the following driver code excerpt, the instruction in question appears at line 599. It is likely that the contents of R3 have been incremented too many times.


581 STARTIO:
582      MOVL    UCB$L_IRP(R5),R3     ;Retrieve address of I/O packet
583      MOVW    IRP$L_MEDIA+2(R3),-
584              UCB$W_BOFF(R5)       ;Set number of characters to print
585      MOVL    UCB$L_SVAPTE(R5),R3  ;Get address of system buffer
586      MOVAB   12(R3),R3            ;Get address of data area
587      MOVL    UCB$L_CRB(R5),R4     ;Get address of CRB
588      MOVL    @CRB$L_INTD+VEC$L_IDB(R4),R4 ;Get device CSR address
589 ;
590 ; START NEXT OUTPUT SEQUENCE
591 ;
592
593 10$: ADDL3   #LP_DBR,R4,R0        ;Calculate address of data buffer register
594      MOVZWL  UCB$W_BOFF(R5),R1    ;Get number of characters remaining
595      MOVW    #^X8080,R2           ;Get control register test mask
596      BRB     25$                  ;Start output
597 20$: BITW    R2,(R4) (1)           ;Printer ready or have paper problem?
598      BLEQ    30$                  ;If LEQ not ready or paper problem
599      MOVB    (R3)+,(R0) (2)        ;Output next character
600      ASHL    #1,G^EXE$GL_UBDELAY,-(SP)    ;Delay 3*2 u-seconds
601 24$: SOBGEQ  (SP),24$             ;Delay loop calibrated to machine speed
602      ADDL    #4,SP                ;Pop extra longword off stack
603 25$: SOBGEQ  R1,20$ (3)            ;Any more characters to output?
604      BRW     70$                  ;All done, BRW to set return status

Explanations of the numbered callouts (1), (2), and (3) in the example are in Section 9.4.1.

9.4.1 Examining the Routine

The MOVB instruction is part of a routine that reads characters from a buffer and writes them to the printer. The routine contains the loop of instructions that starts at the label 20$ and ends at 25$. This loop executes once for each character in the buffer, performing these steps:

  1. The driver checks the printer's status register to see if the printer is ready.
  2. If the printer is ready, the driver gets a character from the buffer and moves it to the printer's data register, to which R0 points.
  3. It then decrements R1, which contains the count of characters left to print. If the result is greater than or equal to 0 (the SOBGEQ condition at line 603), control is passed back to the instruction at 20$, and the loop begins again.

These steps are repeated until R1 is decremented below 0 or the printer signals that it is not ready.
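
The effect of the loop count on the buffer pointer can be modeled in a few lines. The Python sketch below is a simplified model, not the driver: it assumes the printer is always ready, represents R3 as an index into the buffer, and raises an error as soon as the index passes the end of the buffer, whereas the real system fails only when the address reaches an inaccessible page. SOBGEQ subtracts one from R1 and branches while the result is greater than or equal to zero.

```python
def run_output_loop(buffer: bytes, count: int) -> tuple:
    """Model lines 596 through 603; return (bytes sent, final R3 index)."""
    r1 = count            # line 594: number of characters remaining
    r3 = 0                # index standing in for the buffer address in R3
    sent = bytearray()
    while True:
        r1 -= 1           # line 603: SOBGEQ R1,20$ decrements first
        if r1 < 0:
            break         # count exhausted; fall through to BRW 70$
        if r3 >= len(buffer):
            raise IndexError("R3 has been incremented past the buffer")
        sent.append(buffer[r3])  # line 599: MOVB (R3)+,(R0)
        r3 += 1
    return bytes(sent), r3


print(run_output_loop(b"HELLO", 5))
```

With a count equal to the buffer length, every character is sent exactly once. With a larger count, the model raises an error, which corresponds to R3 advancing beyond the data the driver was given.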

If the printer signals that it is not ready, the BLEQ instruction at line 598 transfers control to 30$, the beginning of a routine that waits for an interrupt from the printer. When the printer becomes ready, it interrupts the driver and execution of the loop resumes.

Examine the code to determine which variables control the loop.

The byte count (BCNT) is the number of characters in the buffer. A function decision table (FDT) routine sets this value. In line 586, the starting address of a buffer that is BCNT bytes in size is moved into R3.

Note also that the number of characters left to be printed is represented by the byte offset (BOFF), the offset into the buffer at which the driver finds the next character to be printed. This value controls the number of times the loop is executed.

Because the exception is an access violation, either R3 or R0 must contain an incorrect value. You can determine that R0 is probably valid by the following logic:

  • The instruction at 10$ (ADDL3 #LP_DBR,R4,R0) places an address in R0 and R0 is not modified again until the failing instruction (line 599).
  • The value in R4 at the time that the instruction at 10$ is executed was derived from the addresses of the device's unit control block (UCB) (line 587) and channel request block (CRB) (line 588). Although it is possible that these data structures might contain wrong information, it is unlikely.

Thus, the contents of R3 seem to be the cause of the failure.

The most likely reason that the contents of R3 are wrong is that the MOVB instruction at line 599 executes too many times. You can check this by comparing the contents of UCB$W_BOFF and UCB$W_BCNT. If UCB$W_BOFF contains a larger value than that in UCB$W_BCNT, then R3 contains a value that is too large, indicating that the MOVB instruction has incremented the contents of R3 too many times.

9.4.2 Checking the Values of Key Variables

Because the start-I/O routine requires that R5 contain the address of the printer's UCB, and because several other instructions reference R5 without error before any instruction in the loop does, you can assume that R5 contains the address of the right UCB. To compare BOFF and BCNT, use the command FORMAT @R5 to display the contents of the UCB, as shown in the following session.


SDA> READ SYS$SYSTEM:SYSDEF.STB
SDA> FORMAT @R5


8005D160    UCB$L_FQFL      800039A8
            UCB$L_RQFL
            UCB$W_MB_SEED
            UCB$W_UNIT_SEED
8005D164    UCB$L_FQBL      800039A8
            UCB$L_RQBL
8005D168    UCB$W_SIZE          0122
8005D16A    UCB$B_TYPE        10
8005D16B    UCB$B_FIPL      34
            UCB$B_FLCK
   .
   .
   .
8005D1C8    UCB$L_SVAPTE    80062720
8005D1CC    UCB$W_BOFF          0795
8005D1CE    UCB$W_BCNT      006D
8005D1D0    UCB$B_ERTCNT          00
8005D1D1    UCB$B_ERTMAX        00
8005D1D2    UCB$W_ERRCNT    0000
   .
   .
   .
SDA>

If you have only one printer in your system configuration, you do not need to use the FORMAT command. Instead, you can use the command SHOW DEVICE LP. Because only one printer is connected to the processor, only one UCB is associated with a printer for SDA to display.

The output produced by the FORMAT @R5 command shows that UCB$W_BOFF contains a value greater than that in UCB$W_BCNT; it should be smaller. Therefore, the value stored in BOFF is incorrect.

Thus, the value of BOFF is not the number of characters that remain in the buffer. This value is used in calculating an address that is referenced at an elevated IPL. When this address is within a null page (unreadable in all access modes), an attempt to reference it causes the system to fail.
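
The comparison can be checked numerically with the values from the FORMAT @R5 and EXAMINE R3 displays. The Python sketch below is illustrative; the function and key names are assumptions, and it takes the data area to begin 12 bytes past the address in UCB$L_SVAPTE, as lines 585 and 586 of the driver excerpt indicate.

```python
# Values copied from the FORMAT @R5 and EXAMINE R3 displays.
UCB_L_SVAPTE = 0x80062720
UCB_W_BOFF = 0x0795
UCB_W_BCNT = 0x006D
R3 = 0x80069E00

DATA_AREA_OFFSET = 12  # line 586: MOVAB 12(R3),R3


def buffer_report(svapte: int, boff: int, bcnt: int, r3: int) -> dict:
    """Compare BOFF with BCNT and locate R3 relative to the data area."""
    data_start = svapte + DATA_AREA_OFFSET
    data_end = data_start + bcnt          # first byte past the data area
    return {
        "boff_exceeds_bcnt": boff > bcnt,
        "data_start": data_start,
        "data_end": data_end,
        "r3_past_buffer": r3 >= data_end,
        "bytes_past_end": max(0, r3 - data_end),
    }


report = buffer_report(UCB_L_SVAPTE, UCB_W_BOFF, UCB_W_BCNT, R3)
for key, value in report.items():
    print(key, hex(value) if isinstance(value, int) and not isinstance(value, bool) else value)
```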

9.4.3 Identifying and Correcting the Defective Code

Examine the printer driver code to locate all instructions that modify UCB$W_BOFF. The value changes in two circumstances:

  • Immediately after the driver detects that the printer is not ready and that the problem is not a paper problem (line 609).
  • When the wait-for-interrupt routine's timeout count of 12 seconds is exhausted (lines 616 and 630). At this time, the contents of R1, plus 1, are stored in UCB$W_BOFF (line 631).

When the printer times out, the driver should not modify UCB$W_BOFF. It does so, however, in line 631. The driver should modify the contents of UCB$W_BOFF only when it is certain that the printer printed the character. When the printer times out, this is not the case. Furthermore, the wait-for-interrupt routine preserves only registers R3, R4, and R5, so that only those registers can be used unmodified after the execution of the wait-for-interrupt routine. Thus, the use of R1 in line 631 is an error.

To correct the problem, change the WFIKPCH argument (line 616) so that, when the printer times out, the WFIKPCH macro transfers control to 50$ rather than to 40$.


607
608 30$: BNEQ    40$                  ;If NEQ paper problem
609      ADDW3   #1,R1,UCB$W_BOFF(R5) ;Save number of characters remaining
610      DEVICELOCK -
611              LOCKADDR=UCB$L_DLCK(R5),-  ;Lock device interrupts
612              SAVIPL=-(SP)         ;Save current IPL
613      BITW    #^X80,LP_CSR(R4)     ;Is it ready now?
614      BNEQ    35$                  ;If NEQ, yes, it's ready
615      BISB    #^X40,LP_CSR(R4)     ;Set interrupt enable
616      WFIKPCH 40$,#12              ;Wait for ready interrupt
617      IOFORK                       ;Create a fork process
618      BRB     10$                  ;  ...and start next output
619
620 35$:
621      DEVICEUNLOCK -
622              LOCKADDR=UCB$L_DLCK(R5),-  ;Unlock device interrupts
623              NEWIPL=(SP)+         ;Restore IPL
624      CLRW    LP_CSR(R4)           ;Disable device interrupts
625      BRB     10$                  ;Go transfer more characters
626 ;
627 ; PRINTER HAS PAPER PROBLEM
628 ;
629
630 40$: CLRL    UCB$L_LP_OFLCNT(R5)  ;Clear offline counter
631      ADDW3   #1,R1,UCB$W_BOFF(R5) ;Save number of characters remaining
632 50$: CLRW    LP_CSR(R4)           ;Disable printer interrupt
633      IOFORK                       ;Lower to fork level
634      BBS     #UCB$V_CANCEL,UCB$W_STS(R5),80$  ;If set, cancel I/O operation
635      TSTW    LP_CSR(R4)           ;Printer still have paper problem?
636      BLSS    55$                  ;If LSS yes
637      MOVL    #15,UCB$L_LP_TIMEOUT(R5)  ;Set timeout value
638      BRB     10$                  ; ...and start next output

10 Inducing a System Failure

If the operating system is not performing well and you want to create a dump you can examine, you must induce a system failure. Occasionally, a device driver or other user-written, kernel-mode code can cause the system to execute a loop of code at a high priority, interfering with normal system operation. This can occur even though you have set a breakpoint in the code if the loop is encountered before the breakpoint. To gain control of the system in such circumstances, you must cause the system to fail and then reboot it.

If the system has suspended all noticeable activity (if it is "hung"), see the examples of causing system failures in Section 10.2.

If you are generating a system crash in response to a system hang, be sure to record the PC at the time of the system halt as well as the contents of the general registers. Submit this information to Compaq, along with the Software Performance Report (SPR) and a copy of the generated system dump file.

10.1 Meeting Crash Dump Requirements

The following requirements must be met before the system can write a complete crash dump:

  • You must not halt the system until the console dump messages have been printed in their entirety and the memory contents have been written to the crash dump file. Be sure to allow sufficient time for these events to take place or make sure that all disk activity has stopped before using the console to halt the system.
  • There must be a crash dump file in SYS$SYSTEM: named either SYSDUMP.DMP or PAGEFILE.SYS.
    This dump file must be either large enough to hold the entire contents of memory (as discussed in Section 1.1) or, if the DUMPSTYLE system parameter is set, large enough to accommodate a subset dump (see Section 1.1.2).
    If SYSDUMP.DMP is not present, the operating system attempts to write crash dumps to PAGEFILE.SYS. In this case, the SAVEDUMP system parameter must be 1 (the default is 0).
  • The DUMPBUG system parameter must be 1 (the default is 1).
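
The file and parameter requirements above can be summarized as a small checklist function. The Python sketch below is a model of the rules as stated in this section, not a tool that queries a running system: the function name and arguments are assumptions, the caller supplies the file names and parameter values, and the file-size requirement is deliberately not checked.

```python
def crash_dump_problems(files: set, dumpbug: int, savedump: int) -> list:
    """Return a list of unmet requirements; an empty list means none were found.

    files    -- names of files present in SYS$SYSTEM: (supplied by the caller)
    dumpbug  -- value of the DUMPBUG system parameter
    savedump -- value of the SAVEDUMP system parameter
    """
    problems = []
    if dumpbug != 1:
        problems.append("DUMPBUG must be 1")
    if "SYSDUMP.DMP" not in files:
        if "PAGEFILE.SYS" not in files:
            problems.append("no SYSDUMP.DMP or PAGEFILE.SYS in SYS$SYSTEM:")
        elif savedump != 1:
            problems.append("SAVEDUMP must be 1 when dumps are written to PAGEFILE.SYS")
    return problems


print(crash_dump_problems({"PAGEFILE.SYS"}, dumpbug=1, savedump=0))
```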

