
OpenVMS I/O User's Reference Manual



9.19 References

The following publications provide more information on local area networks.

  • IEEE Standards 802.1 (A, B, C, and D), 802.2, and 802.3
  • The Ethernet--Data Link Layer and Physical Layer Specification
  • ANSI X3T9.5 and X3.139
  • Digital FDDI Network Architecture
  • ATM Forum's LANE V1.0 Specification
  • RFC 1577


Chapter 10
Optional Features for Improving I/O Performance

This chapter includes updated information for OpenVMS Version 7.3.

As of Version 7.0, OpenVMS Alpha includes two features that provide dramatically improved I/O performance: Fast I/O and Fast Path. These features are designed to promote OpenVMS as a leading platform for database systems. Performance improvement results from reducing the CPU cost per I/O request and improving symmetric multiprocessing (SMP) scaling of I/O operations. The CPU cost per I/O is reduced by optimizing code for high-volume I/O and by making better use of the SMP CPU memory caches. SMP scaling of I/O is increased by reducing the number of spinlocks taken per I/O and by substituting finer-granularity spinlocks for global spinlocks.

The improvements follow a natural division that already exists between the device-independent and device-dependent layers in the OpenVMS I/O subsystem. The device-independent overhead is addressed by Fast I/O, which is a set of lean system services that can substitute for certain $QIO operations. Using these services requires some coding changes in existing applications, but the changes are usually modest and well contained. The device-dependent overhead is addressed by Fast Path, which is an optional performance feature that creates a "fast path" to the device. It requires no application changes.

Fast I/O and Fast Path can be used independently. However, together they can provide a 45% reduction in CPU cost per I/O on uniprocessor systems and a 52% reduction on multiprocessor systems.

10.1 Fast I/O

Fast I/O is a set of three system services that were developed as a $QIO alternative built for speed. These services are not a $QIO replacement; $QIO is unchanged, and $QIO interoperation with these services is fully supported. Rather, the services substitute for a subset of $QIO operations, namely, only the high-volume read/write I/O requests.

The Fast I/O services support 64-bit addresses for data transfers to and from disk and tape devices.

While Fast I/O services are available on OpenVMS VAX, the performance advantage applies only to OpenVMS Alpha. OpenVMS VAX has a run-time library (RTL) compatibility package that translates the Fast I/O service requests to $QIO system service requests, so one set of source code can be used on both VAX and Alpha systems.

10.1.1 Fast I/O Benefits

The performance benefits of Fast I/O result from streamlining high-volume I/O requests. The Fast I/O system service interfaces are optimized to avoid the overhead of general-purpose services. For example, I/O request packets (IRPs) are now permanently allocated and used repeatedly for I/O rather than allocated and deallocated anew for each I/O.

The greatest benefits stem from having user data buffers and user I/O status structures permanently locked down and mapped using system space. This allows Fast I/O to do the following:

  • For direct I/O, avoid per-I/O buffer lockdown or unlocking.
  • For buffered I/O, avoid allocation and deallocation of a separate system buffer, since the user buffer is always addressable.
  • Complete Fast I/O operations at IPL 8, thereby avoiding the interrupt chaining usually required by the more general-purpose $QIO system service. For each I/O, this eliminates the IPL 4 IOPOST interrupt and a kernel AST.

In total, Fast I/O services eliminate four spinlock acquisitions per I/O (two for the MMG spinlock and two for the SCHED spinlock). The reduction in CPU cost per I/O is 20% for uniprocessor systems and 10% for multiprocessor systems.

10.1.2 Using Buffer Objects

The lockdown of user-process data structures is accomplished by buffer objects. A "buffer object" is process memory whose physical pages have been locked in memory and double-mapped into system space. After creating a buffer object, the process remains fully pageable and swappable, and the process retains normal virtual memory access to its pages in the buffer object.

If the buffer object contains process data structures to be passed to an OpenVMS system service, the OpenVMS system can use the buffer object to avoid any probing, lockdown, and unlocking overhead associated with these process data structures. Additionally, double-mapping into system space allows the OpenVMS system direct access to the process memory from system context.

To date, only the $QIO system service and the Fast I/O services have been changed to accept buffer objects. For example, a buffer object allows a programmer to eliminate I/O memory management overhead. Ordinarily, each page of a user data buffer is probed and then locked down at I/O initiation and unlocked at I/O completion. Instead of incurring this overhead for each I/O, it can be done once, at buffer object creation time. Subsequent I/O operations involving the buffer object can avoid this memory management overhead entirely.

Two system services can be used to create and delete buffer objects, respectively, and both can be called from any access mode. To create a buffer object, the $CREATE_BUFOBJ system service is called. This service expects as input an existing process memory range and returns a buffer handle for the buffer object. The buffer handle is an opaque identifier used to identify the buffer object on future I/O requests. The $DELETE_BUFOBJ system service is used to delete the buffer object and accepts the buffer handle as input. Although image rundown deletes all existing buffer objects, it is good form for the application to clean up properly.

A 64-bit equivalent version of the $CREATE_BUFOBJ system service ($CREATE_BUFOBJ_64) can be used to create buffer objects from the new 64-bit P2 or S2 regions. The $DELETE_BUFOBJ system service can be used to delete 32-bit or 64-bit buffer objects.
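
The following is a minimal C sketch of this create/use/delete sequence, not a tested program. The argument shapes assumed for SYS$CREATE_BUFOBJ (an input virtual address range, a returned range, an access mode, flags, and a returned buffer handle) should be verified against the OpenVMS System Services Reference Manual and <starlet.h>.

#include <starlet.h>    /* system service prototypes */
#include <ssdef.h>      /* SS$_ status codes */
#include <gen64def.h>   /* struct _generic_64 */

#define BUFSIZE (16*512)                 /* Fast I/O transfers are multiples of 512 */

static char iobuf[BUFSIZE];              /* process memory to be locked down */
static struct _generic_64 buf_handle;    /* opaque buffer handle returned */

int make_buffer_object(void)
{
    /* Input range: first and last virtual addresses of the buffer.
       (Exact argument types are an assumption; check starlet.h.) */
    void *inadr[2] = { &iobuf[0], &iobuf[BUFSIZE-1] };
    void *retadr[2];

    /* Lock the pages and double-map them into system space.  User-mode
       callers need the VMS$BUFFER_OBJECT_USER identifier, and the pages
       are charged against MAXBOBMEM. */
    return sys$create_bufobj(inadr, retadr, 0, 0, &buf_handle);
}

void remove_buffer_object(void)
{
    /* Image rundown would also delete the object, but clean up properly. */
    (void) sys$delete_bufobj(&buf_handle);
}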

Buffer objects require system management. Because buffer objects tie up physical memory, extensive use of buffer objects requires system management planning. All the bytes of memory in buffer objects are deducted from the systemwide system parameter MAXBOBMEM (maximum buffer object memory). System managers must set this parameter correctly for the application loads that run on their systems.

The MAXBOBMEM parameter defaults to 100 Alpha pages, but for applications with large buffer pools it will likely be set much larger. To prevent user-mode code from tying up excessive physical memory, user-mode callers of $CREATE_BUFOBJ must hold a system identifier, VMS$BUFFER_OBJECT_USER. This identifier is automatically created in an OpenVMS Version 7.0 upgrade if the file SYS$SYSTEM:RIGHTSLIST.DAT is present. The system manager can use the DCL command SET ACL to assign this identifier to a protected subsystem or application that creates buffer objects from user mode. It may also be appropriate to grant the identifier to a particular user with the Authorize utility command GRANT/IDENTIFIER (for example, to a programmer who is working on a development system).

There is currently a restriction on the type of process memory that can be used for buffer objects: global section memory cannot be made into a buffer object.

10.1.3 Differences Between Fast I/O Services and $QIO

The precise definition of the high-volume I/O operations optimized by Fast I/O services is important. I/O that does not fit this definition either is not possible with the Fast I/O services or is not optimized. The characteristics of the high-volume I/O optimized by Fast I/O services can be seen by contrasting the operation of the Fast I/O system services with that of the $QIO system service, as follows:

  • The $QIO system service I/O status block (IOSB) is replaced by an I/O status area (IOSA) that is larger and quadword aligned. The transfer byte count returned in the IOSA is 64 bits, and the field is aligned on a quadword boundary. Unlike the IOSB, which is optional, the IOSA is required.
  • User data buffers must be aligned to a 512-byte boundary.
  • All user process structures passed to the Fast I/O system services must reside in buffer objects. This includes the user data buffer and the IOSA.
  • Only transfers that are multiples of 512 bytes are supported.
  • Only the following function codes are supported: IO$_READVBLK, IO$_READLBLK, IO$_WRITEVBLK, and IO$_WRITELBLK.
  • Only I/O to disk and tape devices is optimized for performance.
  • No event flags are used with Fast I/O services. If application code must use an event flag in relation to a specific I/O, then the Event No Flag (EFN$C_ENF) can be used. This event flag is a no-overhead EFN for situations where an EFN is required by a system service interface but has no meaning to the application.
    For example, Fast I/O services do not use EFNs, so the application cannot specify a valid EFN associated with the I/O to the $SYNCH system service with which to synchronize I/O completion. To resolve this issue, the application can call the $SYNCH system service, passing EFN$C_ENF and the address of the appropriate IOSA as arguments (see the sketch after this list). Specifying EFN$C_ENF signifies to $SYNCH that no EFN is involved in the synchronization of the I/O. Once the IOSA has been written with a status and byte count, the $SYNCH call returns. The IOSA is thus the central point of synchronization for a given Fast I/O (and the only way to determine whether the asynchronous I/O is complete).
  • To minimize argument passing overhead to these services, the $QIO parameters P3 through P6 are replaced by a single argument that is passed directly by the Fast I/O system services to device drivers. For disk-like devices, this argument is the media address (VBN or LBN) of the transfer. For drivers with complex parameters, this argument is the address of a descriptor or of a buffer specific to the device and function.
  • Segmented transfers are supported by Fast I/O but are not fully optimized. There are two major causes of segmented transfers. The first is disk fragmentation. While this can be an issue, it is assumed that sites seeking maximum performance have eliminated the overhead of segmenting I/O due to fragmentation.
    A second cause of segmenting is issuing an I/O that exceeds the port's maximum limit for a single transfer. Transfers beyond the port maximum limit are segmented into several smaller transfers. Some ports limit transfers to 64K bytes. If the application limits its transfers to less than 64K bytes, this type of segmentation should not be a concern.
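
The $SYNCH pattern described in the list above can be sketched in C as follows; the sys$synch prototype expects an IOSB address, hence the cast, and the IOSA type name from <iosadef.h> is an assumption to verify on your system.

#include <starlet.h>
#include <efndef.h>     /* EFN$C_ENF, the no-overhead event flag */
#include <iosbdef.h>    /* struct _iosb, for the $SYNCH prototype */
#include <iosadef.h>    /* IOSA structure */

extern struct _iosa iosa;    /* IOSA of the outstanding Fast I/O */

void wait_for_fast_io(void)
{
    /* No real EFN is involved; $SYNCH returns once the IOSA has been
       written with a status and byte count. */
    (void) sys$synch(EFN$C_ENF, (struct _iosb *) &iosa);
}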

10.1.4 Using Fast I/O Services

The three Fast I/O system services are:

  • $IO_SETUP---Sets up an I/O.
  • $IO_PERFORM[W]---Performs an I/O request.
  • $IO_CLEANUP---Cleans up an I/O request.

10.1.4.1 Using Fandles

A key concept behind the operation of the Fast I/O services is the file handle, or fandle. A fandle is an opaque token that represents a "setup" I/O. A fandle is needed for each I/O outstanding from a process.

All possible setup, probing, and validation of arguments is performed off the mainline code path, during application startup, with calls to the $IO_SETUP system service. The I/O function, the AST address, the buffer object for the data buffer, and the IOSA buffer object are specified on input to the $IO_SETUP service, and a fandle representing this setup is returned to the application.

To perform an I/O, the $IO_PERFORM system service is called, specifying the fandle, the channel, the data buffer address, the IOSA address, the length of the transfer, and the media address (VBN or LBN) of the transfer.

If the asynchronous version of this system service, $IO_PERFORM, is used to issue the I/O, then the application can wait for I/O completion using a $SYNCH call that specifies EFN$C_ENF and the appropriate IOSA. The synchronous form of the system service, $IO_PERFORMW, is used to issue an I/O and wait for it to complete. Optimum performance comes when the application uses AST completion; that is, the application does not issue an explicit wait for I/O completion.

To clean up a fandle, the fandle can be passed to the $IO_CLEANUP system service.
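
Pulling these calls together, the following is a hedged C sketch of one complete fandle life cycle for a single synchronous write. The argument order shown for SYS$IO_SETUP and SYS$IO_PERFORMW follows the descriptions above; verify the exact prototypes against the OpenVMS System Services Reference Manual and <starlet.h>, and see IO_PERFORM.C in SYS$EXAMPLES for a complete program.

#include <starlet.h>
#include <iodef.h>      /* IO$_WRITEVBLK */
#include <iosadef.h>    /* IOSA structure */
#include <gen64def.h>   /* struct _generic_64 */

extern struct _generic_64 buf_handle;   /* buffer object over iobuf */
extern struct _generic_64 iosa_handle;  /* buffer object containing the IOSA */
extern struct _iosa iosa;               /* quadword aligned, inside iosa_handle */
extern char iobuf[16*512];              /* 512-byte aligned, inside buf_handle */
extern unsigned short chan;             /* channel from a prior file open */

int write_vbn(unsigned __int64 vbn)
{
    struct _generic_64 fandle;
    int status;

    /* Set up once: function code, data and IOSA buffer objects, no AST. */
    status = sys$io_setup(IO$_WRITEVBLK, &buf_handle, &iosa_handle,
                          0 /* astadr */, 0 /* flags */, &fandle);
    if (!(status & 1)) return status;

    /* Issue the I/O and wait.  The final argument is the media address
       (here a VBN), replacing $QIO's P3 through P6. */
    status = sys$io_performw(&fandle, chan, &iosa, iobuf,
                             sizeof iobuf, vbn);

    /* Asynchronous alternative: sys$io_perform() with the same arguments,
       followed later by sys$synch(EFN$C_ENF, ...) as shown earlier. */

    (void) sys$io_cleanup(&fandle);     /* release the fandle */
    return status;
}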

10.1.4.2 Modifying Existing Applications

Modifying an application to use the Fast I/O services requires a few source-code changes. For example:

  1. A programmer adds code to create buffer objects for the IOSAs and data buffers.
  2. The programmer changes the application to use the Fast I/O services. Not all $QIOs need to be converted. Only high-volume read/write I/O requests should be changed.
    A simple example is a "database writer" program, which writes modified pages back to the database. Suppose the writer can handle up to 16 simultaneous writes. At application startup, the programmer would add code to create 16 fandles with 16 $IO_SETUP system service calls.
  3. In the main processing loop within the database writer program, the programmer replaces the $QIO calls with $IO_PERFORM calls. Each $IO_PERFORM call uses one of the 16 available fandles. While the I/O is in progress, the selected fandle is unavailable for use with other I/O requests. The database writer probably uses AST completion, recycling the fandle, data buffer, and IOSA once the completion AST arrives.
    If the database writer routine cannot return until all dirty buffers are written (that is, it must wait for all I/O completions), then $IO_PERFORMW can be used. Alternatively, $IO_PERFORM calls can be followed by $SYNCH system service calls that pass the EFN$C_ENF argument to await I/O completions.
    The database writer will run faster and scale better because I/O requests now use less CPU time.
  4. When the application exits, an $IO_CLEANUP system service call is done for each fandle returned by a prior $IO_SETUP system service call. Then the buffer objects are deleted. Image rundown performs fandle and buffer object cleanup on behalf of the application, but it is good form for the application to clean up properly. A skeleton of this startup-and-cleanup pattern appears after this list.
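
A hedged C skeleton of this conversion follows. The names NWRITES, writer_startup, writer_shutdown, and write_ast are illustrative, not part of any OpenVMS interface, and the SYS$IO_SETUP prototype shape is assumed as above.

#include <starlet.h>
#include <ssdef.h>      /* SS$_NORMAL */
#include <iodef.h>      /* IO$_WRITEVBLK */
#include <iosadef.h>
#include <gen64def.h>

#define NWRITES 16                      /* up to 16 simultaneous writes */

static struct _generic_64 fandles[NWRITES];
extern struct _generic_64 buf_handle, iosa_handle;
extern void write_ast(struct _iosa *iosa);   /* completion AST routine */

int writer_startup(void)        /* steps 1 and 2: off the mainline path */
{
    for (int i = 0; i < NWRITES; i++) {
        int status = sys$io_setup(IO$_WRITEVBLK, &buf_handle, &iosa_handle,
                                  write_ast, 0, &fandles[i]);
        if (!(status & 1)) return status;
    }
    return SS$_NORMAL;
}

void writer_shutdown(void)      /* step 4: one $IO_CLEANUP per fandle */
{
    for (int i = 0; i < NWRITES; i++)
        (void) sys$io_cleanup(&fandles[i]);
    /* ...then sys$delete_bufobj() for each buffer object. */
}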

10.1.4.3 I/O Status Area (IOSA)

The central point of synchronization for a given Fast I/O is its IOSA. The IOSA replaces the $QIO system service's IOSB argument. The IOSA is larger than the IOSB, and its byte count field is 64 bits and quadword aligned. Unlike the $QIO system service, the Fast I/O services require the caller to supply an IOSA and require the IOSA to be part of a buffer object.

The IOSA context field can be used in place of the $QIO system service ASTPRM argument. The $QIO ASTPRM argument is typically used to pass a pointer back to the application on the completion AST to locate the user context needed to resume a stalled user thread. However, for the $IO_PERFORM system service, the AST parameter on the completion AST is always the IOSA. Since there is no user-settable ASTPRM, an application can store a pointer to the user-thread context for this I/O in the IOSA context field and retrieve the pointer from the IOSA in the completion AST.
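
In outline, this pattern might look as follows; the member name iosa$q_context is an assumption based on the description above and should be checked against <iosadef.h> on your system.

#include <iosadef.h>

typedef struct writer_ctx { int slot; /* application-specific state */ } writer_ctx;

/* Before issuing the I/O: stash the user-thread context in the IOSA. */
void tag_iosa(struct _iosa *iosa, writer_ctx *ctx)
{
    iosa->iosa$q_context = (unsigned __int64) ctx;
}

/* Completion AST: the AST parameter is the IOSA itself. */
void write_ast(struct _iosa *iosa)
{
    writer_ctx *ctx = (writer_ctx *) iosa->iosa$q_context;
    /* ...resume the stalled user thread identified by ctx... */
}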

10.1.4.4 $IO_SETUP

The $IO_SETUP system service performs the setup of an I/O and returns a unique identifier for this setup I/O, called a fandle, to be used on future I/Os. The $IO_SETUP arguments used to create a given fandle remain fixed throughout the life of the fandle. This has implications for the number of fandles needed in an application. For example, a single fandle can be used only for reads or only for writes. If an application module has up to 16 simultaneous reads or writes pending, then potentially 32 fandles are needed to avoid any $IO_SETUP calls during mainline processing.

The $IO_SETUP system service supports an expedite flag, which is available to boost the priority of an I/O among the other I/O requests that have been handed off to the controller. Unrestrained use of this flag is useless, because if all I/O is expedited, nothing is expedited. Note that this flag requires the ALTPRI and PHY_IO privileges.

10.1.4.5 $IO_PERFORM[W]

The $IO_PERFORM[W] system service accepts a fandle and five other variable I/O parameters for the high-performance I/O operation. The fandle remains in use by the application until $IO_PERFORMW returns or, if $IO_PERFORM is used, until a completion AST arrives.

The CHAN argument contains the data channel returned to the application by a previous file operation. This argument gives the application the flexibility of using the same fandle for different open files on successive I/Os. However, if the fandle is used repeatedly for the same file or channel, then an internal optimization within $IO_PERFORM is taken.

Note that $IO_PERFORM was designed to have no more than six arguments to take advantage of the OpenVMS Calling Standard, which specifies that calls with up to six arguments can be passed entirely in registers.

10.1.4.6 $IO_CLEANUP

A fandle can be cleaned up by passing the fandle to the $IO_CLEANUP system service.

10.1.4.7 Fast I/O FDT Routine (ACP_STD$FASTIO_BLOCK)

Because $IO_PERFORM supports only four function codes, this system service does not use the generalized function decision table (FDT) dispatching contained in the $QIO system service. Instead, $IO_PERFORM uses a single vector in the driver dispatch table, called DDT$PS_FAST_FDT, for all four supported functions. The DDT$PS_FAST_FDT field is an FDT routine vector that indicates whether the device driver called by $IO_PERFORM is set up to handle Fast I/O operations. A nonzero value for this field indicates that the device driver supports Fast I/O operations and that the I/O can be fully optimized.

If the DDT$PS_FAST_FDT field is zero, then the driver is not set up to handle Fast I/O operations. The $IO_PERFORM system service tolerates such device drivers, but the I/O is only slightly optimized in this circumstance.

The OpenVMS disk and tape drivers that ship as part of OpenVMS Version 7.0 have added the following line to their driver dispatch table (DDTAB) macro:


FAST_FDT=ACP_STD$FASTIO_BLOCK,- ; Fast-IO FDT routine

This line initializes the DDT$PS_FAST_FDT field to the address of the standard Fast I/O FDT routine, ACP_STD$FASTIO_BLOCK.

If you have a disk or tape device driver that can handle Fast I/O operations, you can add this DDTAB macro line to your driver. If you cannot use the standard Fast I/O FDT routine, ACP_STD$FASTIO_BLOCK, you can develop your own based on the model presented in that routine.

10.1.5 Additional Information

Refer to the OpenVMS System Services Reference Manual for additional information about the following Fast I/O system services:

$CREATE_BUFOBJ
$DELETE_BUFOBJ
$CREATE_BUFOBJ_64
$IO_SETUP
$IO_PERFORM
$IO_CLEANUP

To see a sample program that demonstrates the use of buffer objects and the Fast I/O system services, refer to the IO_PERFORM.C program in the SYS$EXAMPLES directory.

10.2 Fast Path (Alpha Only)

Fast Path is an optional, high-performance feature designed to improve I/O performance. Fast Path creates a streamlined path to the device. Fast Path is of interest to any application where enhanced I/O performance is desirable. Two examples are database systems and real-time applications, where the speed of transferring data to disk is often a vital concern.

Using Fast Path features does not require source-code changes. Minor interface changes are available for expert programmers who want to maximize Fast Path benefits.

The following table lists the supported ports for each OpenVMS Alpha version:

Version   Ports
7.3       CIXCD, CIPCA, KGPSA, KZPBA
7.1       CIXCD, CIPCA
7.0       CIXCD

Fast Path is not available on the OpenVMS VAX operating system.

10.2.1 Fast Path Features and Benefits

Fast Path achieves dramatic performance gains by reducing CPU time for I/O requests on both uniprocessor and SMP systems. These savings are on the order of 25% less CPU cost per I/O request on a uniprocessor system and 35% less on a multiprocessor system. The performance benefits are produced by:

  • Reducing code paths through streamlining for the case of high-volume I/O
  • Substituting port-specific spinlocks for global I/O subsystem spinlocks
  • Executing I/O requests for a given port on a specific CPU

The performance improvement can best be seen by contrasting the current OpenVMS I/O scheme with the new Fast Path scheme. While transparent to an OpenVMS user, each disk and tape device is tied to a specific port. All I/O for a device is sent out over its assigned port. Under the current OpenVMS I/O scheme, an I/O can be initiated on any CPU, but I/O completion must occur on the primary CPU. Under Fast Path, all I/O for Fast Path-capable devices (such as disks) on a given port is assigned to a specific CPU, eliminating the requirement for completing the I/O on the primary CPU. This means that the entire I/O can be initiated and completed on a single CPU. Because I/O operations are no longer split among different CPUs, performance increases as memory cache thrashing between CPUs decreases.

Fast Path also removes the primary CPU as a possible SMP bottleneck. Without Fast Path, the primary CPU must be involved in all I/O. Once this CPU becomes saturated, no further increase in I/O throughput is possible. Spreading the I/O load evenly among the CPUs in a multiprocessor system provides greater maximum I/O throughput. This is achieved by assigning each Fast Path port to a specific CPU, referred to as the port's preferred CPU.

With most of the I/O code path executing under port-specific spinlocks and on each port's preferred CPU, a highly scalable SMP model of parallel I/O operation exists. Given multiple ports and CPUs, I/Os can be issued and processed in parallel to a large degree.

Preferred CPU Selection

All Fast Path ports are assignable to CPUs. You can set a system parameter specifying the set of CPUs that are allowed to serve as preferred CPUs. This set is called the set of allowable CPUs. At any point in time, the set of CPUs that can currently have ports assigned to them, called the set of usable CPUs, is the intersection of the set of allowable CPUs and the current set of running CPUs.

Each Fast Path port is initially assigned to a CPU by the FASTPATH_SERVER process, which runs at port initialization time. This process executes an automatic assignment algorithm that spreads Fast Path ports evenly among the usable CPUs. The FASTPATH_SERVER process also runs whenever a secondary CPU is started and whenever the set of system parameters specifying the allowable CPUs is changed.

If the primary CPU is in the set of allowable CPUs, the initial distribution is biased against the primary CPU, in that a port is assigned to the primary CPU only after ports have been assigned to each of the other usable CPUs.

To identify a device or port's current preferred CPU, you can use either $GETDVI or the SHOW DEVICE/FULL command. To identify the Fast Path ports currently assigned to a CPU, use the SHOW CPU/FULL command.

You can directly assign a Fast Path port to a CPU, or request the system to automatically select the port's preferred CPU from a specific set of CPUs. To do this, you either issue a $QIO or use the SET DEVICE/PREFERRED_CPU command. This also sets the port's User Preferred CPU to the selected CPU.

You can clear the port's User Preferred CPU by issuing a $QIO or by using the SET DEVICE/NOPREFERRED_CPU DCL command.

You can redistribute the system-assignable Fast Path ports across a subset of the set of usable CPUs by calling the $IO_FASTPATH system service.

Optimizing Application Performance

Processes running on a port's preferred CPU have an inherent advantage when issuing I/O to that port, in that the overhead of handing the I/O off to the preferred CPU is avoided. An application process can use the $PROCESS_AFFINITY system service to assign itself to the preferred CPU of the device to which the majority of its I/O is sent.
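
As a sketch, binding the current process to a known preferred CPU ID might look as follows. The six-argument form of SYS$PROCESS_AFFINITY and the select/modify bitmask semantics are assumptions to verify against the OpenVMS System Services Reference Manual, and obtaining the preferred CPU ID itself (for example, from $GETDVI) is not shown.

#include <starlet.h>

int bind_to_preferred_cpu(int cpu_id)
{
    unsigned __int64 select_mask = 1ULL << cpu_id;  /* CPU bits to change */
    unsigned __int64 modify_mask = 1ULL << cpu_id;  /* enable that CPU */
    unsigned __int64 prev_mask;                     /* previous affinity mask */
    unsigned int pid = 0;                           /* 0 selects this process */

    /* Assumed argument order: pidadr, prcnam, select_mask, modify_mask,
       prev_mask, flags. */
    return sys$process_affinity(&pid, 0, &select_mask,
                                &modify_mask, &prev_mask, 0);
}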

With proper attention to assignment, a process's execution need never leave the preferred CPU. This presents a scalable process and I/O scheme for maximizing multiprocessor system operation. Like most RISC systems, Alpha system performance is highly dependent on the performance of the CPU memory caches. Process assignment and preferred CPU assignment are two keys to minimizing memory stalls in the application and in the operating system, thereby maximizing multiprocessor system throughput.

