HP OpenVMS Systems
Managing Mission-Critical VMSclusters

Johan Michiels, Nand Staes, Gerrit Woertman, Jan Knol

Abstract

In this paper we present the VMS cockpit concept. The cockpit is a dedicated VMS system that centralizes all management operations. The main task of this cockpit is to monitor the entire OpenVMS production environment and to take care of event notification in a uniform way. The cockpit assists the system manager 24 hours per day and automates routine tasks.

Introduction

The main responsibility of the VMS system manager is to make sure that his OpenVMS environment always delivers the performance and availability levels the business demands. To achieve this, he needs tools that support and automate his job, and that cover all aspects of the production environment. In many ways this can be compared with the cockpit of a plane or boat.

To build such a cockpit, HP Consulting & Integration has made the toolkit CockpitMgr for OpenVMS available to customers. This toolkit has a number of unique and valuable features every VMS system manager needs. The present paper gives a technical overview of those tools.

The CockpitMgr toolkit has already been deployed at many important OpenVMS customer sites, where it is well appreciated. CockpitMgr runs entirely on OpenVMS, provides a solution for the entire production environment, and substitutes for the well-known POLYCENTER products.

1. History

In the early nineties, Digital launched POLYCENTER, a set of products and services that simplify the management of systems, networks and storage. Implementing POLYCENTER solutions enabled the Information Systems professional to provide enhanced services to end users, while controlling the management costs. The most practical and cost-effective way to deploy these POLYCENTER products was achieved by implementing a dedicated system running the full function component of each product, while the agent components were running on the different VMS production systems.
We called this dedicated system, running exclusively management software, the cockpit. Having in-depth experience with POLYCENTER products on Digital's in-house VMS systems, we started delivering cockpit services to our customers. The objectives were to:
· Add some functionality that was missing in the products.

Later on, customers requested new functionality, and it was getting more and more difficult to deliver this within the framework of the existing products. Additionally, technology and cluster configurations changed over the years, demanding new monitoring capabilities. So we decided to gradually write our own monitoring tools, to be able to respond in an easy, flexible and professional way to the requirements of our customers. We kept the original cockpit concept, and continued with the dedicated management system centralizing all management operations. As the cockpit is also the daily working platform for the VMS system manager, we do believe that ideally this system should be running OpenVMS.

Implementation of a cockpit is done with two main goals in mind:
All tools we developed during the past 12 years are bundled in a kit CockpitMgr for OpenVMS. The tools run fully independently of any other management product. The purpose of this document is to make an inventory of the most common needs of a VMS system manager, and to show how the CockpitMgr toolkit provides an answer to those requirements.

2. Cockpit Architecture

2.1. Agent

CockpitMgr provides several agents, each monitoring certain items related to system, network and storage. Those applications run either on the cockpit itself (local agent), or on a VMS system that is monitored and managed by the cockpit (remote agent). A remote agent must be able to communicate in some way with a server or collector on the cockpit. The different agents are described in chapter 3.

2.2. Event

Any detected unexpected behavior, and any subsequent change in this behavior, results in the creation of an event. Such an event can be considered as a set of items, including:

· Node name: The DECnet node name or TCP/IP hostname
· Event name: A descriptive name for the event
· Subsystem: The item of the system being referred to (disk, process, shadow set etc)
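As a minimal sketch, the three event items named in section 2.2 could be modelled as a small record. The class name, field names and the `describe` helper are assumptions chosen for illustration; they are not CockpitMgr's actual event format, and a real event carries more items than the three shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Illustrative event record; names are hypothetical, not CockpitMgr's."""
    node_name: str   # DECnet node name or TCP/IP hostname
    event_name: str  # descriptive name for the event
    subsystem: str   # item referred to (disk, process, shadow set, ...)


def describe(event: Event) -> str:
    """Render an event as a single line of text."""
    return f"{event.node_name} {event.event_name} [{event.subsystem}]"


example = Event(node_name="NODEA", event_name="DISK_FULL", subsystem="disk")
print(describe(example))
```

Making the record immutable (`frozen=True`) reflects the idea that a change in behavior produces a new event rather than a modification of an existing one.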
2.3. Event Notification System A central application on the cockpit keeps track of all events. This application is called the Event Notification System (ENS). The ENS is the kernel of the cockpit:
2.4. Event Notification A notification application can request all or selected events from the ENS, and send the information to one or more persons by different means. Typical event notification can:
An API is available, which makes it easy to develop additional event notification routines. Event notification is discussed in chapter 4.

2.5. Event transfer

Events must be transferred between agents and the ENS, and between the ENS and notification applications. Communication between the ENS and the monitoring and notification utilities occurs through the intra-cluster communication (ICC) system services. ICC is very suitable for large and fast data transfers, and has been available in OpenVMS since V7.2.
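The idea of section 2.4, where a notification application requests all or selected events from the ENS, can be sketched as a small in-memory publish/subscribe store. This is only an illustration under stated assumptions: the `EventStore` class, its `subscribe` and `post` methods, and the dictionary keys are hypothetical and are not the CockpitMgr API, and plain function calls stand in for the ICC transport. Replaying stored events to a new subscriber is modelled on the way the event console receives events from the ENS when it connects.

```python
from typing import Callable, Dict, List, Optional, Tuple

Event = Dict[str, str]
Selector = Callable[[Event], bool]
Callback = Callable[[Event], None]


class EventStore:
    """Hypothetical in-memory stand-in for the ENS; not the CockpitMgr API."""

    def __init__(self) -> None:
        self._events: List[Event] = []
        self._subscribers: List[Tuple[Selector, Callback]] = []

    def subscribe(self, callback: Callback,
                  selector: Optional[Selector] = None) -> None:
        """Register a notifier; with no selector it receives all events."""
        chosen: Selector = selector if selector is not None else (lambda event: True)
        self._subscribers.append((chosen, callback))
        # Replay events that were stored before this subscriber connected.
        for event in self._events:
            if chosen(event):
                callback(event)

    def post(self, event: Event) -> None:
        """Store an event and pass it to every subscriber that selects it."""
        self._events.append(event)
        for selector, callback in self._subscribers:
            if selector(event):
                callback(event)


store = EventStore()
store.post({"node": "NODEA", "event": "DISK_FULL", "priority": "high"})

received: List[Event] = []
store.subscribe(received.append, lambda e: e["priority"] == "high")
store.post({"node": "NODEB", "event": "QUEUE_STOPPED", "priority": "low"})
store.post({"node": "NODEC", "event": "SHADOW_MEMBER_LOST", "priority": "high"})
```

After these calls, `received` holds the two high-priority events only; the low-priority event stays in the store but is not delivered to this subscriber.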
3. Monitoring There are four important modules running on the cockpit. Those applications feed events directly into the ENS:
Other agents running on a managed OpenVMS system are:
Those remote agents communicate information to the cockpit by sending OPCOM messages (to be captured by console manager) or by sending SNMPtraps (to be received and interpreted by the SNMPtrap Listener). 3.1. Console Management OPA0, the OpenVMS system console, has a special function in the operations. The console is:
In the past many VMS customers installed a console management product. Typical well-known products are:
Unfortunately, only a few customers really use all capabilities of those products. Too often the usage is limited to connecting remotely to the system consoles. This section describes what a console management product can offer a system manager. The next version of the CockpitMgr toolkit, to be released early 2003, will contain its own built-in console management functionality. But because of the long history with PCM and ConsoleWorks, the integration modules for those products with CockpitMgr will continue to be supported.

3.1.1. Functionality

A console management product must provide the following functionality:
system, and to perform system upgrades. This implies that he should be physically present in the computer room to have access to the console. To provide remote access to a system console, the RS232 console line is plugged into a port of a terminal server. The console management product then creates a LAT or Telnet connection to the port on the terminal server. Access to the consoles of the OpenVMS production systems can be obtained via the console management utility. Actually, any RS232 port can be connected and accessed via the console management utility.
A console management product is capable of storing all console output on disk. This logging remains available as long as required, and can be used for reference later on.
3.1.2. Scan profiles

From the operational point of view, the scan capabilities are the most important. Lists of specific text strings, the so-called scan profiles, need to be created, maintained and validated for new versions of the operating systems and applications. CockpitMgr comes with several scan profiles, allowing system managers to detect and react to messages originating from the OpenVMS operating system, shadow server, cluster connection manager, VAX and Alpha hardware, and several layered products such as SLS, ABS, Scheduler etc. There is also a scan profile available for storage controllers configured into the console manager.

3.1.3. Remarks
3.2. System Monitor Most system managers periodically perform a number of elementary checks on their systems:
Many system managers developed DCL procedures to automate those checks, and notification of any anomaly is often made by a terminal broadcast or by sending VMSmail. Many people believe system management is no more than performing such elementary checks. Certain so-called enterprise management applications in particular do not provide anything more than frequent pings, and the checking of process availability, free disk space and hardware errors. The CockpitMgr System Monitor takes care of verifying an OpenVMS production system on a continuous basis and in all its aspects. It is fully cluster-aware, and any item can be checked during specific periods of the day or week. It is also capable of triggering corrective actions immediately. This CockpitMgr module significantly increases system management productivity and reduces staff resource usage. 3.2.1. Components The CockpitMgr System Monitor has three main modules:
Communication between System Monitor and Agent is established at regular time intervals, and occurs via either DECnet or TCP/IP. A comprehensive event is generated when no communication can be established between cockpit and managed system, or when a connection is aborted. 3.2.2. What is monitored The System Monitor can instruct the Agent to check the following:
run jobs only at night. The Agent checks whether the status of a batch queue corresponds with the expected status for that moment, and optionally corrects this.
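The batch queue check described above can be sketched as a comparison between the actual queue status and the status expected at a given time of day. The window boundaries (20:00 to 06:00) and the status labels "started" and "stopped" are assumptions made for this example; the actual Agent configuration is not shown in this paper.

```python
from datetime import time

# Hypothetical night window for a queue that runs jobs only at night.
NIGHT_START = time(20, 0)
NIGHT_END = time(6, 0)


def expected_queue_status(now: time) -> str:
    """Return the status the queue should have at the given time of day."""
    # The window wraps around midnight, so the two tests are OR-ed.
    in_night_window = now >= NIGHT_START or now < NIGHT_END
    return "started" if in_night_window else "stopped"


def queue_mismatch(actual_status: str, now: time) -> bool:
    """True when the actual status differs from the expected one."""
    return actual_status != expected_queue_status(now)
```

For example, `queue_mismatch("stopped", time(2, 0))` reports a mismatch, because a night-only queue is expected to be started at 02:00. A corrective action, where configured, would be triggered from that result.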
3.2.3. System Monitor key features
3.3. Network and SAN Switch Monitor

VMScluster configurations have changed drastically in recent years. Two trends can be observed:
Such a disaster-tolerant configuration implies that the network is now an integral part of the VMScluster. Network devices must be operational at all times, and all physical connections must function properly. Appropriate monitoring of network devices and connections is consequently an absolute requirement.
This section describes how network devices, SAN switches, and physical connections can be monitored.

3.3.1. SNMP

The Simple Network Management Protocol (SNMP) enables a management station to query other network components or applications for information concerning their status and activities. Such a query is known as an SNMPget. The items that can be polled are called managed elements.

3.3.2. Network Monitor

The CockpitMgr toolkit extensively uses SNMPgets for the monitoring of selected network devices, SAN switches and their port states. Configuration is very easy and does not require any MIB expertise or knowledge of SNMP technology. Only the following information is required to configure a device to monitor:
CockpitMgr supports only selected device types. More types will be added as they appear in VMS environments.
The CockpitMgr Network Monitor checks for selected ports on those devices:
Additional monitoring is performed on GIGAswitches, including the state of both fans and power supplies, the status of the links, and the status of the slot cards. The monitor takes hunt groups into account.
Several kinds of Ethernet switches need to be monitored here. Any switch that supports the standard bridge MIB can be monitored, and especially the STP port state of selected ports is checked. Extra monitoring capabilities are added for the VNswitch900EX, the VNswitch900XX and the Cisco Catalyst series.
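As a hedged illustration of the STP port state check, the sketch below interprets values of the standard bridge MIB object dot1dStpPortState and compares them with the state expected for each selected port. The function name and the shape of its arguments are assumptions for this example; the SNMPget itself is not performed here, and the polled values are supplied as plain integers.

```python
from typing import Dict, List

# Values of dot1dStpPortState as defined in the standard bridge MIB.
STP_PORT_STATES: Dict[int, str] = {
    1: "disabled",
    2: "blocking",
    3: "listening",
    4: "learning",
    5: "forwarding",
    6: "broken",
}


def check_stp_ports(polled: Dict[int, int],
                    expected: Dict[int, str]) -> List[str]:
    """Compare polled port states with the expected state per selected port.

    polled   maps port number -> raw integer returned by the poll
    expected maps port number -> state name the port should be in
    """
    problems: List[str] = []
    for port, wanted in sorted(expected.items()):
        raw = polled.get(port)
        if raw is None:
            found = "no reply"
        else:
            found = STP_PORT_STATES.get(raw, "unknown")
        if found != wanted:
            problems.append(f"port {port}: expected {wanted}, found {found}")
    return problems
```

Only ports listed in `expected` are checked, which mirrors the idea of monitoring selected ports rather than every port on the switch.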
3.3.3. Fibre Channel Switches Storage Area Networks enable organizations to uncouple the application (server) side from the information (data storage) side. Fibre Channel Switches enable enterprise class scalability and high performance connectivity between servers and StorageWorks storage systems. Fibre Channel Switches run a MIB agent. CockpitMgr checks the following managed elements on such a switch:
3.4. SNMPtrap Listener

Many SNMP agents on network devices and SAN switches can be configured to transmit messages to well-known addresses in response to specific events. Those unsolicited messages, called SNMPtraps, enable quick and possibly automatic reactions to specific conditions. CockpitMgr provides an SNMPtrap Listener, which receives SNMPtraps from various sources. Each trap is analyzed, and the trap data is compared with a repository of known SNMPtraps. If the trap is recognized, a comprehensive event is generated in the ENS.

SNMPtraps may also be generated by certain applications running on OpenVMS systems. A typical application is the StorageWorks Command Console (SWCC) Agent. This application runs on OpenVMS and supervises the storage subsystems. The application is capable of sending SNMPtraps to any listener.

The SNMPtrap Listener is also useful to achieve integration with the SAN appliance. The SAN appliance is a hardware and software solution to configure and manage a complex SAN. If the SAN appliance detects a problem, it can send an SNMPtrap to the cockpit. As such the SAN appliance provides the required alarming in the cockpit related to the complete SAN infrastructure.

3.5. Performance monitoring

OpenVMS systems may occasionally experience performance slowdowns. This performance degradation is typically due to demands placed on one or more system resources, such as CPU, memory and the I/O subsystem. When an end user perceives an abnormally slow system response, he usually calls the system manager and asks him to investigate. The investigation very often starts with executing some basic DCL commands to make a first diagnosis:
The goal of the CockpitMgr Performance Monitor is exactly to perform those actions on a permanent basis. If some abnormal behavior in the system's performance is detected, it warns the system manager, who can investigate and correct before things escalate and production is impacted. Performance-related research certainly benefits from the use of additional products such as DECamds, Availability Manager and ECP. Those products are available from and supported by VMS engineering at no extra cost.

3.5.1. The CockpitMgr Performance Data Collector

The Performance Data Collector records OpenVMS system data for subsequent processing by the Performance Analyzer. This data is required to detect certain conditions, and to generate alarms. E.g. when checking whether the CPU utilization exceeds a certain threshold, we use the average CPU utilization during a given interval. To collect performance data on CPU utilization and Direct I/Os, we currently still use the undocumented system service EXE$GETSPI. This system service is also used by the MONITOR utility. In the future, we intend to use the API of the new TDC product (The Data Collector). The Data Collector stores performance data in an indexed sequential file. The sampling rate is two minutes.
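The threshold check on average CPU utilization can be sketched as follows, using the two-minute sampling rate stated above. The function name, the list-based sample store and the rule of raising no alarm until a full interval of samples exists are assumptions for this example, not the behavior of the CockpitMgr Performance Analyzer.

```python
from statistics import mean
from typing import Sequence

SAMPLE_SECONDS = 120  # one sample every two minutes


def cpu_alarm(samples: Sequence[float],
              interval_seconds: int,
              threshold_percent: float) -> bool:
    """True when the average CPU utilization over the most recent interval
    exceeds the threshold. samples holds utilization percentages, oldest
    first, one per sampling period."""
    count = max(1, interval_seconds // SAMPLE_SECONDS)
    window = samples[-count:]
    if len(window) < count:
        # Not enough data collected yet to cover the whole interval.
        return False
    return mean(window) > threshold_percent
```

Averaging over an interval, rather than testing single samples, keeps a short CPU spike from raising an alarm. With a six-minute interval, `cpu_alarm([10, 95, 97, 99], 360, 90.0)` looks only at the last three samples.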
The collected data can be visualized on a daily basis in a Motif-based graph application. It is not our goal to develop a sophisticated graphing application, as ECP already provides this functionality; the graph utility is provided only as an easy way to check the collected performance data.

3.5.2. CockpitMgr Performance Alarming

The CockpitMgr Performance Alarming is a process which:
The following items are currently checked:
· Has non-paged pool been extended from its original value?
· Are there any processes in a special wait state?

Future releases of CockpitMgr will contain more monitoring and alarming capabilities.

3.6. Monitoring Security Events

OpenVMS is considered a secure operating system. This does not mean that a system manager should not develop and implement a security plan. Security is more than building a fortress around corporate data. It is also finding the best way to give the right people lawful, predictable and reliable access to the right information at all times. Monitoring security events is part of an overall security strategy. CockpitMgr provides a way to obtain flexible monitoring of the security of the information entrusted to your VMScluster.

3.6.1. OpenVMS Auditing

The AUDIT_SERVER process records security-relevant activity as it occurs on the system. Such activity is everything that has to do with user access to the system or to a protected object within the system, and includes:
The AUDIT_SERVER normally writes security event messages to two places:
Our experience with security auditing and alarming is:
3.6.2. The CockpitMgr Security Audit Listener CockpitMgr addresses the concerns mentioned above, and provides a solution to achieve continuous but simplified monitoring of security events. The system manager will more quickly detect suspicious security events, which then can be further examined with the ANALYZE/AUDIT utility. The OPA0 system console can additionally be disabled as security operator terminal, yielding smaller console log files and less processing. The operator terminal and the security audit log are the primary destinations for security event messages. As an additional feature of the security auditing facility, a listener mailbox can be created to receive a binary copy of all security-auditing messages. The CockpitMgr Security Audit Listener is an application that creates and reads this mailbox, processes the auditing information, and generates comprehensive one-line event messages that are sent to the cockpit. A few examples of typical security messages as presented to the system manager, are:
3.7. Logfile Browser

Many OpenVMS systems do a substantial amount of batch processing. E.g. banks run the so-called end-of-day jobs at night, and incorrect termination of those job streams has serious implications for operations the next day. Even when a scheduler application has been deployed, the final status of one job does not necessarily imply that no errors occurred. The best way to verify correct termination is to check the log file. CockpitMgr provides a simple log file browser, with the following features:
It is obvious that the usage of the Logfile Browser is not limited to batch log files. If the Logfile Browser can get shared access to a file, it can search it for those specific character strings. 4. Event Notification Utilities CockpitMgr provides several event notification and reporting capabilities:
Further, you can pass an event to the Automatic Pilot, which can initiate any procedure on either the cockpit or on the OpenVMS system that originated the event. Finally, CockpitMgr provides an API that allows you to create your own event notification routines. 4.1. CockpitMgr Event Console The CockpitMgr Event Console is a Motif application that lists all or selected events in the ENS. During normal working hours, this application is the primary source for event notification to the system manager. The event console can be started at any time. The application connects to the ENS, and the ENS then transfers the events to the event console. The event console displays the events in the order they occurred. If new events appear, those will be added to the event console. There is one line per event. The user can choose which attributes of events are to be displayed, and the color of the line reflects its priority. Also background color and font settings can be changed dynamically and saved. The event console also has a number of push buttons. Those push buttons allow the filtering of events, so that it is easy to list only specific classes of events that are of particular interest. The filtering behind those buttons can be customized. The event console allows somebody to take ownership of events and to delete events.
4.2. Notification to Pager or Cellular Phone

The CockpitMgr Event Console is without any doubt the main utility to check for new events. But a system manager does not always have direct access to this event console, and sometimes there are events that require his immediate attention. Additionally we see that more and more companies attempt to stop or limit the second and third shifts. After normal working hours, limited or even no operator support is present on-site, and a duty role has been implemented. To make sure that a system manager is informed immediately of really important events, CockpitMgr provides the possibility of sending messages to a pager or a cellular phone.

4.2.1. The CockpitMgr Pager Engine

The CockpitMgr Pager Engine takes care of sending selected event messages to a pager or cellular phone. It has the following properties:
4.2.2. Sending a message to a cellular phone

To send a message (a.k.a. an SMS) to a cellular phone, CockpitMgr uses a Cellular Engine Terminal from Siemens (the M20 or the TC35T). Such a device is a compact GSM modem that can be used for transfer of data, voice, SMS and faxes. It can easily be connected to a terminal port of the cockpit system, or to a port of a terminal server. There are several advantages of using a Cellular Engine Terminal:
4.2.3. Sending a message to a pager

Sending a message to a pager can be done by using a modem or via X25. CockpitMgr software modules have been validated to send paging requests to several telecom operators. Unfortunately, we have experienced that most operators implemented the communication between a client and the paging server in slightly different ways. Validating the CockpitMgr Pager Engine for another telecom operator often means a software modification. Sending a message via a modem is rather slow, while X25 software is fast and reliable, but rarely available on the cockpit system. On the other hand, pagers may have better coverage than cellular phones in certain countries.

4.3. Integrating the Cockpit with an Enterprise Manager

Many customers implement a so-called Enterprise Management Framework, where one application provides total enterprise management. Typical well-known enterprise managers are HP OpenView, CA Unicenter, BMC Patrol and Tivoli. Some of those solutions also provide some level of support for OpenVMS. Just like a network manager who swears by CiscoWorks, and a database administrator who runs Oracle Enterprise Manager to simplify everyday database management tasks, the high-end expert VMS system manager considers the cockpit as his dedicated daily work environment. Nevertheless it might be useful to integrate the cockpit with the company's enterprise manager because
CockpitMgr currently provides a very simple, one-directional way of integration between the cockpit and an enterprise manager. As any enterprise manager has a built-in SNMPtrap Listener, it is very easy to send events via SNMPtraps. The cockpit always uses the same trap data format to send SNMPtraps containing full event details, such as event name, node, time stamp, priority, source and text. As such it is fairly easy to format those SNMPtraps within the enterprise management console, and to take additional actions such as icon coloring. For the future, we are considering a more robust integration with enterprise managers, especially with HP OpenView:
4.4. Reporting CockpitMgr provides a reporting utility, which generates detailed and customized reports. When generating a report, you may specify:
If the Compaq Secure Web Server for OpenVMS, based on Apache, has been installed on the cockpit, the reports can be generated and displayed in your preferred web browser. Access to those web pages can be protected by means of the HTACCESS mechanism. CockpitMgr also reports on system downtime.

4.5. Autopilot

CockpitMgr is capable of triggering procedures on any event. Those procedures can be executed either:
5. Conclusion. CockpitMgr for OpenVMS bundles the experience of many VMS system managers into one toolkit. The challenge to those system managers always has been to do more with less staff, while the service to the end-user needed to be improved. CockpitMgr for OpenVMS is today the most complete toolset in the industry, supporting the VMS system manager in the day-to-day operations. The product has been created by VMS system managers, for VMS system managers. And it runs on OpenVMS.