HP OpenVMS Systems: Ask the Wizard
The Question is:

When running a system with a single CPU, a program calling the system service $QIO has no problem. After installing a second CPU, it gets a completion status of 00000000 in the IOSB when calling the system service $CANTIM (cancel timer for timeout) or $CANCEL (cancel QIO).

1. What does this completion status mean?
2. How can I overcome this situation?
3. Do I have to treat it as an error?

The Answer is:

The Wizard would tend to expect that the program in question has one or more latent errors that are uncovered by the differences in execution timing between the SMP and uniprocessor platforms, or by the difference in performance. Both moving to SMP and moving to faster processors are notorious for uncovering latent synchronization errors lurking within various applications. Further, upgrading to an OpenVMS release where the timing of constituent operations has changed can serve as a trigger -- operations that are slower are certainly an obvious cause, but faster completions also regularly trigger synchronization-related errors.

The first thing you will want to do is read and understand the OpenVMS Programming Concepts Manual, and particularly -- for this question -- the chapters that cover asynchronous system traps (ASTs), general process synchronization mechanisms, and interlocked memory synchronization.

You will then want to look for common sources of synchronization errors. Some of the sources for these errors can include:

o Lack of checking for the completion of an asynchronous operation. Variations include:
  o Failure to always use and verify an IOSB. Typically, the IOSB and (as applicable) the EFN are verified with $synch, or another similar synchronization coding technique.
  o Erroneously assuming that the setting of an event flag is an indication that the asynchronous operation has succeeded.
    (Use of the lib$get_ef and lib$free_ef calls to allocate unique event flags is a start, but the application should be coded to assume spurious event flag changes can arise.) Ask The Wizard topics discussing event flags include (446), (640), (687), (811), (819), (923), (1170), (1661), (1894), (2637), (2922), (3531), (4325), (6099), (6138).
  o If you do not want, need, or use an event flag, do not use event flag zero (the default); rather, use EFN$C_ENF, the Do Not Care event flag. (See efndef on V7.1 and later.)
  o Failure to use an IOSB that is valid over the lifetime of the asynchronous call.
  o Erroneously sharing the IOSB across multiple asynchronous calls.
  o Failure to allocate the IOSB in memory that is valid over the lifetime of the call. Using subroutine-local storage -- this is often allocated on the stack and is valid only while the call frame is itself valid -- is one classic example.
  o Accessing the buffer of an asynchronous read (such as a read I/O) before the IOSB has been verified as non-zero, and thus before the current read operation has completed.
  o Accessing the buffer of an asynchronous write (such as a write I/O) before the IOSB has been verified as non-zero, and thus before the last write operation has completed.
  o Overwriting the contents of the I/O write buffer before the IOSB from the last write has been verified as non-zero.
  o Erroneously placing the read or write buffer for an asynchronous operation in volatile storage, or in storage that is erroneously shared with other currently-outstanding asynchronous calls.
  o Assumptions around the synchronous completion of system services not listed as synchronous, and accessing the data or buffers involved before completion, without the expected use of $synch or other completion synchronization.
  o Assumptions around the delivery of or the delivery order of ASTs originating on calls including $setimr, $dclast or $qio, and access to the buffers or the data involved in the call without the expected use of $synch or other synchronization.

o Incorrect shared memory synchronization. Variations include:
  o Failure to use interlocked operations.
  o Failure to correctly account for caching. Depending on the operation and the platform, memory barriers may be required. Some of the possible memory caching policies include no caching, write-through caching, and write-back caching. (Please see Ask The Wizard topics (2681), (6984) and (7383) for further information on the correct use of memory barriers on Alpha systems.)
  o Incorrect use of the interlocked queue operations, erroneously adding or removing entries at any location in the queue other than the header of an interlocked queue.
  o Uninterlocked sharing of any data between any AST(s) and the mainline threads -- all data structures must be interlocked or otherwise entirely re-entrant.
  o On Alpha, failure to use the memory barrier operators (when necessary) to ensure consistent memory contents -- memory barriers are used to properly control the (expected) read and write reordering normally found on Alpha. The barrier will block execution until all pending memory operations have completed. (Again, see topic (2681) for details, and for discussions of the memory barriers and particularly the granularity of the hardware interlocks, please see topics (6984) and (7383).)
  o Failure to account for "tearing" when performing non-aligned (non-naturally aligned) memory access on adjacent areas of memory. You will need to know if the particular platform requires naturally-aligned quadword references, naturally-aligned longword references, or some other value.
    Tearing involves references to memory that are not naturally aligned -- and that are otherwise unsynchronized -- and specifically involves parallel unaligned references to nearby areas of memory.
    * A variation of tearing involves the use of IPL-based or spinlock-based synchronization -- and multiple levels of these synchronization mechanisms -- within a single addressable unit of memory. Individual bits within a status value longword, for instance, cannot be safely synchronized using multiple IPLs or multiple spinlocks.
    * A second and potentially more subtle variation of tearing involves the granularity of reference used by the particular compiler (e.g., CC/GRANULARITY), where the compiler can generate code which can read and re-write adjacent values. If, for instance, the adjacent memory is an (adjacent) device CSR, well, then things can get rather interesting rather quickly.

o Random programming errors. Variations include:
  o Failure to correctly deal with spurious sys$wake requests when using sys$hiber calls. Topics (2637) and (3783) are related. The alternative to a spurious $wake is a lost $wake, and this can cause a process to stall waiting for the lost $wake. The Programming Concepts manual discussion of $hiber and $wake contains further information on this.
  o Failure to correctly size memory allocated and deallocated.
  o Failure to insert application-specific debugging into any large or complex application. This includes logging. (See topic (4129) for information on dynamic activation of the debugger, and for information that permits generating a traceback using a supported and documented API.)
  o Failure to centralize error-prone areas of the code into a few routines, particularly the ability to centralize all memory management calls into a few routines.
    This makes it possible to use "fenceposts" or similar techniques to track down memory pool corruptors -- allocating a "hidden" quadword at the front and the back of any allocation call, filling both quadwords with known patterns unique to the particular memory allocation call, and checking for the pattern on deallocation. (Additional details of using "fenceposts" are included in topic (3257).)
  o Accessing the contents of a descriptor for a dynamic string for write through any means other than the provided string descriptor routines.
  o Writing data through an uninitialized pointer.
  o Writing beyond the end of a data structure.
  o Failure to check for an appropriate return condition value from a subroutine or system service before continuing the execution of a routine.
  o On VAX, an REI instruction must be executed prior to executing any instructions (code) that were written by the application program.

o SYSGEN parameter setting assumptions. Variations include:
  o Failure to specify the entire required process quota list on a sys$creprc call.
  o Failure to specify the mailbox size and buffer quota on a call to sys$crembx.
  o Failure to check the required SYSGEN quota values for the appropriate minimum values on each application or on each system startup.

o Incorrect mixing of threads and ASTs:
  o See topics (4647) and (6099) for details.

o And the Access Violation (ACCVIO), a brief introduction:
  o See previous discussions of the Access Violation (ACCVIO) and decoding the stackdump here in Ask The Wizard, and specifically please see topics (837), (1705), (2195), (2223), (3215), (5533), (6065), (6495), (6776), (7551), and likely a few other topics.
  o For information on the OpenVMS Debugger, on the "divide and conquer" troubleshooting technique, and for general details on how to debug an application, please see topics (7552) and (4129).

Memory allocation routines are commonly referenced as sources of errors by many programmers.
The memory allocation and deallocation routines in the OpenVMS libraries see extensive use throughout the operating system, layered products and applications, and throughout the entire customer application base, and the corresponding incidence of errors in these calls -- while certainly possible -- is exceedingly rare. Most often, the application seeing the error has somehow clobbered a key part of the memory heap or has clobbered part of the stack. This is why the Wizard recommends centralizing the allocation and deallocation routines, and using "fenceposts" (details of fenceposts discussed above).

o Related topics include (2624), (2630) and (3748), as well as (3115), (3257), (4808), (5455), (5640), (6536), (7006), etc.

Also please see the heap analyzer support in the OpenVMS Debugger -- the Debugger is an invaluable tool for locating and resolving errors, and can even be programmed to lurk waiting for an error, or to activate (via a call to lib$signal with SS$_DEBUG) and then display information on the error and the current application context.

Related topics include (2681) and (6099). Also see (2624) and (2630). Also the ACCVIO topics: (837), (1705), (2195), (2223), (3215), (5533), (6065), (6495), (6776), (7551), etc. Also see the ASTs, threads, reentrancy, and shared memory topics: (2681), (4647), (6099), and (6984). And debugging and traceback topics such as (4129) and (7552). For virtual memory debugging and memory heap corruptions, and fenceposts, see (3257).
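
The "fencepost" technique recommended above can be sketched in portable C. This is only an illustration under stated assumptions, not the contents of topic (3257): the names fence_alloc, fence_check and fence_free, the FENCE_MAGIC value, and the block layout are all hypothetical, and the standard C malloc and free calls stand in for whichever allocator the application has centralized. Each block gets a hidden quadword in front of and behind the caller's memory, filled with a pattern derived from the block's own address, and the patterns are verified when the block is released.

```c
#include <assert.h>   /* assert(): handy for callers checking the returned status */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical constant; any value unlikely to occur in real data will do. */
#define FENCE_MAGIC UINT64_C(0xFEEDFACECAFEBEEF)

enum { FENCE_OK = 0, FENCE_FRONT_BAD = 1, FENCE_BACK_BAD = 2 };

/* Assumed layout: [size: 8 bytes][front fence: 8][user data][back fence: 8] */
#define FENCE_PREFIX (2 * sizeof(uint64_t))

/* Pattern unique to one allocation: the magic value mixed with the block address. */
static uint64_t fence_pattern(const void *base)
{
    return FENCE_MAGIC ^ (uint64_t)(uintptr_t)base;
}

/* Allocate size bytes surrounded by two hidden quadword fences. */
void *fence_alloc(size_t size)
{
    if (size > SIZE_MAX - FENCE_PREFIX - sizeof(uint64_t))
        return NULL;                      /* total size would overflow */
    unsigned char *base = malloc(FENCE_PREFIX + size + sizeof(uint64_t));
    if (base == NULL)
        return NULL;
    uint64_t stored = (uint64_t)size;
    uint64_t front = fence_pattern(base);
    uint64_t back = ~front;
    /* memcpy is used because the back fence may not be naturally aligned. */
    memcpy(base, &stored, sizeof stored);
    memcpy(base + sizeof stored, &front, sizeof front);
    memcpy(base + FENCE_PREFIX + size, &back, sizeof back);
    return base + FENCE_PREFIX;
}

/* Verify both fences of a block returned by fence_alloc. */
int fence_check(const void *user)
{
    const unsigned char *base = (const unsigned char *)user - FENCE_PREFIX;
    uint64_t stored, front, back;
    /* Check the front fence first: if it is damaged, the stored size
       in front of it cannot be trusted to locate the back fence. */
    memcpy(&front, base + sizeof(uint64_t), sizeof front);
    if (front != fence_pattern(base))
        return FENCE_FRONT_BAD;
    memcpy(&stored, base, sizeof stored);
    memcpy(&back, base + FENCE_PREFIX + (size_t)stored, sizeof back);
    if (back != ~fence_pattern(base))
        return FENCE_BACK_BAD;
    return FENCE_OK;
}

/* Check the fences, release the block, and report what the check found. */
int fence_free(void *user)
{
    if (user == NULL)
        return FENCE_OK;
    int status = fence_check(user);
    free((unsigned char *)user - FENCE_PREFIX);
    return status;
}
```

An application would route every allocation and deallocation through these two routines and log or signal any status other than FENCE_OK, ideally together with the identity of the caller. Calling fence_check at intermediate points can help narrow down when the corruption occurred.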