OpenVMS RTL Library (LIB$) Manual
LIB$TPARSE/LIB$TABLE_PARSE
The Table-Driven Finite-State Parser routine is a general-purpose,
table-driven parser implemented as a finite-state automaton, with
extensions that make it suitable for a wide range of applications. It
parses a string and returns a message indicating whether or not the
input string is valid.
Note
LIB$TPARSE is not planned to support arguments passed by 64-bit address
reference or the use of 64-bit descriptors. On Alpha systems,
LIB$TABLE_PARSE supports arguments passed by 64-bit address reference
and the use of 64-bit descriptors.
LIB$T[ABLE_]PARSE is called with the address of an argument block, the
address of a state table, and the address of a keyword table. The input
string is specified as part of the argument block.
The LIB$ facility supports the following two versions of the
Table-Driven Finite-State Parser:
- LIB$TPARSE: Available on VAX systems. LIB$TPARSE is available on
Alpha systems in translated form. In this form, it is applicable to
translated VAX images only.
- LIB$TABLE_PARSE: Available on VAX and Alpha systems.
LIB$TPARSE and LIB$TABLE_PARSE differ mainly in the way they pass
arguments to action routines.
The term LIB$T[ABLE_]PARSE is used here to describe concepts that apply
to both LIB$TPARSE and LIB$TABLE_PARSE.
Format
LIB$TPARSE/LIB$TABLE_PARSE argument-block ,state-table ,key-table
RETURNS
OpenVMS usage: cond_value
type: longword (unsigned)
access: write only
mechanism: by value
Arguments
argument-block
OpenVMS usage: unspecified
type: unspecified
access: modify
mechanism: by reference
LIB$T[ABLE_]PARSE argument block. The argument-block
argument contains the address of this argument block.
The LIB$T[ABLE_]PARSE argument block contains information about the
state of the parse operation. It is a means of communication between
LIB$T[ABLE_]PARSE and the user's program. It is passed as an argument
to all action routines.
You must declare and initialize the argument block. Section
1.4 describes the argument block in detail. Section
2.2 illustrates the coding for an argument block declaration
and discusses its initialization.
LIB$T[ABLE_]PARSE supports the following argument blocks:
- A 32-bit argument block that accommodates longword addresses,
values, and input tokens on both VAX and Alpha systems. On Alpha
systems, this argument block also accommodates a numeric token whose
binary representation is less than 2**64.
- A 64-bit argument block that accommodates quadword addresses,
values, and input tokens on Alpha systems.
state-table
OpenVMS usage: unspecified
type: unspecified
access: read only
mechanism: by reference
Starting state in the state table. The state-table
argument is the address of this starting state. Usually, the name
appearing as the first argument of the $INIT_STATE macro is used.
You must define the state table for your parser. LIB$T[ABLE_]PARSE
provides macros in the MACRO and BLISS languages for this purpose.
Section 1.3 describes these macros.
key-table
OpenVMS usage: unspecified
type: unspecified
access: read only
mechanism: by reference
Keyword table. The key-table argument is the address
of this keyword table. This name must be the same as that which appears
as the second argument of the $INIT_STATE macro.
You need only assign a name to the keyword table. The LIB$T[ABLE_]PARSE
macros allocate and define the table. See Section 4 for
more information about the keyword table.
Description
The following sections explain in detail how LIB$T[ABLE_]PARSE works
and how to call it from both the MACRO assembly language and high-level
languages:
- How LIB$T[ABLE_]PARSE Works --- Describes the data structures used
by LIB$T[ABLE_]PARSE and how LIB$T[ABLE_]PARSE operates on them.
- Coding and Using a Simple State Table --- Explains how to
construct and use a simple state table.
- Using Advanced LIB$T[ABLE_]PARSE Features --- Explains how to use
subexpressions, abbreviations, action routines, and other advanced
features.
- Data Representation --- Includes information for the
low-level-language programmer, such as the binary representation of
state table data.
1 How LIB$T[ABLE_]PARSE Works
LIB$T[ABLE_]PARSE analyzes an input string according to a set of states
and transitions presented in a state table you define. It determines
whether the input string is valid according to the rules you define for
the input language.
There are three parts to any parsing operation:
- The set of symbol types, or alphabet, from which
you can choose the vocabulary of your language. You specify a
symbol type for each transition you define. The symbol type specifies
what constitutes a matching substring from the input string.
LIB$T[ABLE_]PARSE recognizes the ASCII character set and provides
symbolic names for the most common combinations of ASCII characters,
such as alphabetic and alphanumeric strings, OpenVMS symbols, and
numbers. See Section 1.2 for a list of the symbol types that
make up the LIB$T[ABLE_]PARSE alphabet.
- The rules that govern how the alphabet is used, in other words,
the language's grammar. You specify the rules for a language in a
state table. A LIB$T[ABLE_]PARSE state table lists the possible states
for your language. Each state consists of a list of the transitions to
other states and the operations to be performed when a transition is
executed (see Section 1.3).
- The string to be parsed. The argument block specifies the input
string. It also contains additional information about the state of the
parse, such as how much of the string has not been interpreted and what
the current token is (see Section 1.4).
1.1 Overview
Before discussing the alphabet, the state table, and the argument block
in detail, this section provides an overview of how these three parts
work together.
1.1.1 Evaluating the Input String
LIB$T[ABLE_]PARSE evaluates the input string from left to right as it
transitions from state to state. For a particular transition in a
particular state, it evaluates the beginning of the unprocessed part of
the input string against the symbol type you specify for the transition
to determine whether there is a match.
LIB$T[ABLE_]PARSE compares each character of the remaining input
string, from left to right, against the transition's symbol type until
it encounters a character in the input string that does not match. It
takes the substring that matches the symbol type and stores a pointer
to it in the argument block as the current token. In
this way, any character in the input string that does not belong to the
symbol type's constituent character set effectively becomes a separator.
If LIB$T[ABLE_]PARSE finds a match, it executes the transition.
If the input string does not match, LIB$T[ABLE_]PARSE attempts to match
the next transition. It performs the comparison using the transitions
in the order in which you define them for the state.
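The left-to-right comparison described above can be modeled with a short Python sketch. It is an illustration only, not the RTL implementation: the function name match_prefix is hypothetical, and the character set used for TPA$_STRING (ASCII letters and digits) is taken from the description in Table lib-9.

```python
import string

# Assumed constituent set for TPA$_STRING: ASCII letters and digits.
STRING_CHARS = set(string.ascii_letters + string.digits)

def match_prefix(remaining, charset):
    """Return (token, rest), where token is the longest leading run of
    characters from `remaining` that all belong to `charset`."""
    i = 0
    while i < len(remaining) and remaining[i] in charset:
        i += 1
    return remaining[:i], remaining[i:]

# The comma is not in the constituent set, so it acts as a separator.
token, rest = match_prefix("ABC123,XYZ", STRING_CHARS)
print(repr(token), repr(rest))
```

An empty token means that the symbol type did not match, in which case the parser would go on to try the next transition defined for the state.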
1.1.2 Executing a Transition
When LIB$T[ABLE_]PARSE finds a match with a transition, it performs the
following steps:
- Stores a pointer to the current token in the argument block. If the
token matches one of the numeric symbol types, it also stores the
token's binary representation in the argument block.
- Calls the action routine, if any, specified by the transition and
passes it the argument block and any additional user-specified
arguments. You can use an action routine to reject a transition. In
this case, LIB$T[ABLE_]PARSE performs none of the following steps. See
Section 3.1 for more information.
- Performs one of the following operations:
  - Stores the mask, if any, specified by the transition in the
  location specified by the transition.
  - Stores the value of the token in the program location specified by
  the transition.
- Transfers control to the specified state, if any, or to the next
state in the state table.
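The steps above can be sketched in Python as follows. This is a simplified model under stated assumptions, not the actual routine: ArgBlock, Transition, and execute are hypothetical names, the mask store is modeled as a bitwise OR into a single flags field, only decimal tokens are converted, and the alternative of storing the token value in a program location is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ArgBlock:
    """Hypothetical stand-in for the argument block."""
    token: str = ""
    number: Optional[int] = None
    flags: int = 0  # assumed destination for transition masks

@dataclass
class Transition:
    """Hypothetical stand-in for one transition entry."""
    target: Optional[str] = None
    action: Optional[Callable[[ArgBlock], bool]] = None
    mask: int = 0

def execute(tran, block, token, next_state):
    """Run one matched transition; return the state to enter, or None
    if the action routine rejected the transition."""
    block.token = token                       # step 1: record the token
    if token.isascii() and token.isdigit():
        block.number = int(token)             # numeric token: binary value
    if tran.action is not None and not tran.action(block):
        return None                           # step 2: rejected, stop here
    block.flags |= tran.mask                  # step 3: store the mask (assumed OR)
    return tran.target or next_state          # step 4: transfer control
```

When a transition is rejected, the caller would continue with the next transition defined for the same state.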
1.1.3 Exiting LIB$T[ABLE_]PARSE
LIB$T[ABLE_]PARSE continues to match and execute transitions from state
to state until one of the following occurs:
- For a valid match, it executes a user-specified transition to
TPA$_EXIT at main level. It returns the value SS$_NORMAL.
- A transition requests that LIB$T[ABLE_]PARSE consider the string
invalid by specifying a transition to TPA$_FAIL at main level (rather
than at the level of a subexpression). LIB$T[ABLE_]PARSE returns with
the value LIB$_SYNTAXERR. You can also request a transition to
TPA$_FAIL from an action routine. The action routine can provide an
alternate failure status.
- An error occurs at the main level. The error can be:
  - A syntax error. All transitions in the current state fail to match
  the remaining input string. LIB$T[ABLE_]PARSE returns LIB$_SYNTAXERR or
  an alternate failure status returned by an action routine.
  - A state table format error. One of your state table entries is
  invalid. LIB$T[ABLE_]PARSE returns LIB$_INVTYPE.
Note
LIB$T[ABLE_]PARSE generates no signals and establishes no condition
handler; action routines can signal through LIB$T[ABLE_]PARSE back to
the calling program.
When LIB$T[ABLE_]PARSE cannot successfully parse the entire string, it
defines the current token, as follows, and stores it in the argument
block before returning:
- If LIB$T[ABLE_]PARSE fails to match a transition in the current
state, it attempts to define the current token as the beginning of the
remaining input string. You can incorporate this token in an error
message or use it to determine the logical flow of your program.
LIB$T[ABLE_]PARSE attempts to match the characters from the
beginning of the remaining input string, one at a time, against the
TPA$_SYMBOL symbol type until it encounters a character that
does not match. The TPA$_SYMBOL symbol type consists of all the
characters of the standard OpenVMS symbol constituent set.
  - If LIB$T[ABLE_]PARSE successfully matches one or more consecutive
  characters from the input string against TPA$_SYMBOL, then the
  substring that matched TPA$_SYMBOL becomes the current token.
  - If the first character of the remaining input string does not match
  TPA$_SYMBOL, the first character becomes the current token.
- If LIB$T[ABLE_]PARSE matches the symbol type for a transition that
specifies TPA$_FAIL as the next state, it leaves the token that matched
the transition as the current token.
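The rule for defining the current token after a failed match can be sketched in Python. The function name failure_token is hypothetical, and the constituent set is an ASCII-only approximation (letters, digits, dollar sign, and underscore) that leaves out the DEC multinational characters.

```python
import string

# Assumed ASCII subset of the OpenVMS symbol constituent set.
SYMBOL_CHARS = set(string.ascii_letters + string.digits + "$_")

def failure_token(remaining):
    """Return the token reported when no transition in the state matches."""
    i = 0
    while i < len(remaining) and remaining[i] in SYMBOL_CHARS:
        i += 1
    if i > 0:
        return remaining[:i]   # leading run of symbol characters
    return remaining[:1]       # otherwise, just the first character

print(repr(failure_token("FOO$BAR+1")), repr(failure_token("+1")))
```

A calling program might place the returned token in an error message to show where the parse stopped.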
1.2 Alphabet of LIB$T[ABLE_]PARSE
The LIB$T[ABLE_]PARSE alphabet consists of a set of symbol types
defined in Table lib-9. This alphabet includes strings made up of
elements of the ASCII character set. It provides all the basic building
blocks needed for constructing a grammar using the ASCII character set.
The alphabet also includes symbol types that represent the more complex
constructions found in programming and command language grammar.
Use the symbol types that make up the LIB$T[ABLE_]PARSE alphabet to
define a vocabulary and grammar for your language. For each transition
you define, you specify one of the alphabet symbol types.
LIB$T[ABLE_]PARSE compares the characters at the beginning of the
remaining input string with the symbol type of each of the possible
transitions. If LIB$T[ABLE_]PARSE finds a match, it enters the state
specified by that transition.
Table lib-9 The Alphabet of LIB$T[ABLE_]PARSE
Symbol Type |
Characters Matched |
'x'
|
The particular ASCII character. In a state table, it is expressed by
enclosing the character in single quotation marks. The character can be
any member of the 8-bit ASCII code set. LIB$T[ABLE_]PARSE does not
consider uppercase and lowercase alphabetic characters and codes with
different values in bit 7 to be equivalent.
|
TPA$_ANY
|
Any single character.
|
TPA$_ALPHA
|
Any alphabetic character, which includes the DEC multinational
character set.
|
TPA$_DIGIT
|
Any numeric character, that is, 0 through 9.
|
TPA$_STRING
|
Any string of one or more alphanumeric characters, that is, uppercase
or lowercase A through Z, and the numeric characters 0 through 9. The
string can be any length. It is bounded on the right by the first
nonalphanumeric character or by the end of the string.
|
TPA$_SYMBOL
|
Any string of one or more characters of the standard OpenVMS
symbol constituent set, that is, uppercase and lowercase A through Z
and all DEC multinational characters, in addition to the dollar sign
($) and the underscore (_). The string is bounded on the right by some
character not in the symbol constituent set (usually a blank) or by the
end of the string.
|
'keyword'
|
The string of characters enclosed in single quotation marks. A keyword
can consist of one or more characters of the OpenVMS symbol constituent
set, that is, uppercase and lowercase A through Z, the numeric
characters 0 through 9, the dollar sign ($), and the underscore (_).
Uppercase and lowercase alphabetics are treated as different characters.
A state table can contain up to 220 keywords. The keyword is
bounded on the right by a character not in the symbol constituent set
or by the end of the string.
Keywords that are one character in length are expressed in the form
'x*' to distinguish them from the single-character symbol ('x'). They
must be differentiated because they are not the same
in operation. For example, in the input string AB+C, the single
character 'A' would match the first character of this string, whereas
the keyword 'A*' would not, because B in the string is in the symbol
constituent set.
|
TPA$_BLANK
|
Any string of one or more blanks and/or tabs.
|
TPA$_OCTAL
|
Any octal number (that is, any string of one or more numeric characters
0 through 7) whose magnitude is less than 2**32 for a 32-bit argument
block or less than 2**64 for a 64-bit argument block.
|
TPA$_DECIMAL
|
Any decimal number (that is, any string of one or more numeric
characters 0 through 9) whose magnitude is less than 2**32 for a 32-bit
argument block or less than 2**64 for a 64-bit argument block.
|
TPA$_HEX
|
Any hexadecimal number (that is, any string of one or more numeric
characters 0 through 9, A through F) whose magnitude is less than 2**32
for a 32-bit argument block or less than 2**64 for a 64-bit argument
block.
|
(Alpha specific) TPA$_OCTAL_64
|
Any octal number (that is, any string of one or more numeric characters
0 through 7) whose magnitude is less than 2**64.
|
(Alpha specific) TPA$_DECIMAL_64
|
Any decimal number (that is, any string of one or more numeric
characters 0 through 9) whose magnitude is less than 2**64.
|
(Alpha specific) TPA$_HEX_64
|
Any hexadecimal number (that is, any string of one or more numeric
characters 0 through 9, A through F) whose magnitude is less than 2**64.
|
TPA$_FILESPEC
|
Any string that constitutes a valid OpenVMS file specification. The
string is bounded on the right by the first character that either is
not a file specification constituent character or would cause the
string to violate the syntax rules of a file specification.
|
TPA$_NODE
|
Matches a full node specification including the double colon (::).
|
TPA$_NODE_ACS
|
Matches a primary node specification including the access control
string, if any, but not the double colon (::).
|
TPA$_NODE_PRIMARY
|
Matches a primary node specification excluding both the access control
string, if any, and the double colon (::).
|
TPA$_UIC
|
Any string that constitutes a valid OpenVMS numerical UIC
specification, bounded by square brackets or angle brackets. The binary
value of the UIC, converted in octal radix, is placed in the argument
block. The wildcard character (*) is permitted in the group and/or
member fields; its presence results in that field being set to its
largest possible value in the binary representation.
|
TPA$_IDENT
|
Any string that constitutes a valid OpenVMS identifier. Identifiers may
be given as numerical UICs according to the rules for TPA$_UIC, or as
alphabetic identifier names that appear in the system's rights
database. The binary value of the identifier, converted in either octal
or hexadecimal radix or by lookup in the system rights database, is
placed in the argument block. Identifiers can be entered in any of the
following forms:
[n,m] <n,m>
[name1,name2] <name1,name2>
[name] <name>
name
%Xhex-value
You can use a wildcard (*) in place of any occurrence of number or
name in an identifier form.
|
TPA$_LAMBDA
|
The empty string (always matches). As it executes the transition,
LIB$T[ABLE_]PARSE does not remove any characters from the input string.
LAMBDA transitions are useful in getting action routines called under
otherwise awkward circumstances, providing unconditional GOTOs to link
portions of a state table together, and providing default actions in
certain cases.
|
TPA$_EOS
|
The end of the input string.
|
state label
|
The label of a state that functions as a subexpression. A subexpression
is analogous to a subroutine within the state table.
The subexpression facility permits complex syntactic constructs
that appear in many places in grammar to appear only once in the state
table. It also permits a degree of nondeterministic or pushdown parsing
with a parser that is otherwise deterministic and finite-state. See
Section 3.5 for detailed information about subexpressions and
examples of their use.
|
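The difference between the single-character symbol 'x' and the one-character keyword 'x*' in Table lib-9 can be illustrated with a Python sketch. The function names are hypothetical, the constituent set is an ASCII-only approximation, and keyword abbreviation is not modeled.

```python
import string

# Assumed ASCII subset of the OpenVMS symbol constituent set.
SYMBOL_CHARS = set(string.ascii_letters + string.digits + "$_")

def match_char(remaining, ch):
    """Model of 'x': match only the first character of the input."""
    return remaining[:1] == ch

def match_keyword(remaining, keyword):
    """Model of 'keyword': the keyword must be bounded on the right by a
    character outside the symbol constituent set or by the end of the string."""
    if not remaining.startswith(keyword):
        return False
    following = remaining[len(keyword):len(keyword) + 1]
    return following == "" or following not in SYMBOL_CHARS

# For the input AB+C, 'A' matches but the keyword 'A*' does not,
# because B belongs to the symbol constituent set.
print(match_char("AB+C", "A"), match_keyword("AB+C", "A"))
```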
Note
By default, LIB$T[ABLE_]PARSE treats blanks (defined to be either
spaces or tabs) as though they belong to no symbol type constituent
set. Effectively, this makes the blank a separator. LIB$T[ABLE_]PARSE
begins its next comparison with the first nonblank character following
the blanks. To have LIB$T[ABLE_]PARSE evaluate a blank as it would any
other character in the input string, set the TPA$V_BLANKS flag in the
argument block. Section 3.2 provides an example of the use of
this flag.
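The default treatment of blanks can be modeled with a minimal Python sketch. The function name next_input and the blanks_flag parameter are hypothetical; the parameter stands in for the TPA$V_BLANKS flag in the argument block.

```python
def next_input(remaining, blanks_flag=False):
    """Return the input as seen by the next comparison.

    With the flag clear (the default), leading spaces and tabs are
    skipped. With the flag set, blanks are left in place so that they
    are evaluated like any other character.
    """
    if blanks_flag:
        return remaining
    return remaining.lstrip(" \t")

print(repr(next_input("  \tNAME=1")), repr(next_input("  NAME=1", blanks_flag=True)))
```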
1.3 State Tables
This section describes state table generation and the macros used to
construct state tables. Section 2 explains how to use these
macros.
The state table must be set up using either MACRO or BLISS. Everything
else, including any action routines, can be coded in the language of
your choice. Simply compile the state table separately, then link it
with your program.
The body of the state table consists of one or more states, each of
which defines one or more transitions to the same or other states. The
order of the states and the order of the transitions for each state are
important:
- If a transition does not specify a target state, LIB$T[ABLE_]PARSE
transitions to the next state after the current state in the state
table.
- For a given state, LIB$T[ABLE_]PARSE evaluates the input string
against the transitions in the order in which they are defined and
executes the first transition it matches.
- If a state defines more than one transition with symbol types that
match overlapping sets of tokens, the order of transition definitions
within the state is significant. For example, the characters 123
followed by a comma (,) could match TPA$_DECIMAL, TPA$_OCTAL,
TPA$_STRING, or one of several other symbol types.
- It is best to order transitions in order of increasing generality
of their symbol types. For example, the TPA$_SYMBOL symbol type matches
all keyword strings. In general, LIB$T[ABLE_]PARSE never executes a
keyword transition that follows a TPA$_SYMBOL transition. The symbol
types, in order of increasing generality, are as follows:
'keyword'
'x'
TPA$_EOS
TPA$_ALPHA
TPA$_DIGIT
TPA$_BLANK
TPA$_OCTAL
TPA$_OCTAL_64 (Alpha only)
TPA$_DECIMAL
TPA$_DECIMAL_64 (Alpha only)
TPA$_HEX
TPA$_HEX_64 (Alpha only)
TPA$_STRING
TPA$_SYMBOL
TPA$_UIC
TPA$_IDENT
TPA$_NODE_PRIMARY
TPA$_NODE_ACS
TPA$_NODE
TPA$_FILESPEC
TPA$_ANY
TPA$_LAMBDA
Note
The list of symbol types does not include subexpression calls, because
the generality of these calls depends on the symbol types recognized
within the subexpression. If you use action routines to reject certain
transitions, the effective position of the affected symbol type in
this ordering can change. In any case, LIB$T[ABLE_]PARSE executes the
first transition listed in a state that you permit to match the leftmost
portion of the remaining input string.
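The effect of transition order can be illustrated with a Python sketch that tries each symbol type in the order listed and takes the first match. The helper names are hypothetical, only two symbol types are modeled, and the magnitude limit on TPA$_DECIMAL is ignored.

```python
import string

def run_of(charset):
    """Build a matcher that returns the leading run of `charset`, or None."""
    def matcher(remaining):
        i = 0
        while i < len(remaining) and remaining[i] in charset:
            i += 1
        return remaining[:i] or None
    return matcher

# Simplified models of two symbol types from the alphabet.
MATCHERS = {
    "TPA$_DECIMAL": run_of(set(string.digits)),
    "TPA$_STRING": run_of(set(string.ascii_letters + string.digits)),
}

def first_match(state, remaining):
    """Try the state's symbol types in definition order; first match wins."""
    for symbol_type in state:
        token = MATCHERS[symbol_type](remaining)
        if token is not None:
            return symbol_type, token
    return None

# Specific type first: the decimal transition gets a chance to match.
print(first_match(["TPA$_DECIMAL", "TPA$_STRING"], "123,"))
# General type first: the decimal transition is never reached.
print(first_match(["TPA$_STRING", "TPA$_DECIMAL"], "123,"))
```

In the second ordering, TPA$_STRING matches every input that TPA$_DECIMAL would match, which is why the more specific type belongs first.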
1.3.1 MACRO State Table Generation Macro Calls
The OpenVMS system MACRO library contains a set of assembler macros
that allow convenient and readable coding of a LIB$T[ABLE_]PARSE state
table. These macros generate symbol definitions and tables. They do not
produce any executable code or routine calls.
There are four MACRO state table generation macros:
- $INIT_STATE---Initializes the LIB$T[ABLE_]PARSE macros and declares
the beginning of a state table (see Section 1.3.1.1 )
- $STATE---Defines a state (see Section 1.3.1.2 )
- $TRAN---Defines a state transition (see Section 1.3.1.3 )
- $END_STATE---Ends the state table (see Section 1.3.1.4 )
A state table begins with a call to $INIT_STATE and ends with a call to
$END_STATE. Within the state table, define each state by a call to
$STATE immediately followed by as many calls to $TRAN as you need to
define the transitions from that state.
1.3.1.1 $INIT_STATE---Initializes the LIB$T[ABLE_]PARSE Macros
The $INIT_STATE macro declares the beginning of a state table. It
initializes the internals of the table generator macros and declares
the locations of the state table and the keyword table:
- The state table is the structure containing the definitions of the
states and the transitions between them. LIB$T[ABLE_]PARSE builds the
state table as it processes the $STATE and $TRAN macros you use to
define the table.
- The keyword table contains the text of the keywords used in the
state table. LIB$T[ABLE_]PARSE builds the keyword table as it processes
the calls to $TRAN for each state.
Section 4 provides specific information on the allocation
and binary representations of the state table and the keyword table.
This information may be useful in debugging your program.
$INIT_STATE state-table ,key-table
state-table
The name assigned to the state table. LIB$T[ABLE_]PARSE equates this
label to the start of the first state in the state table.
key-table
The name assigned to the keyword table. LIB$T[ABLE_]PARSE equates this
label to the start of the keyword table.
You must supply both the address of the state table and the address of
the keyword table in the call to LIB$T[ABLE_]PARSE to perform a parse.
The $INIT_STATE macro can appear more than once in a program. Each
occurrence defines a separate state table. No part of any state table
can refer to part of any other state table.
1.3.1.2 $STATE---Defines a State
The $STATE macro declares the beginning of a state.
$STATE [label]
label
An optional label for the state. LIB$T[ABLE_]PARSE equates the label,
if present, to the starting address of the state.
1.3.1.3 $TRAN---Defines a State Transition
The $TRAN macro defines a transition from the state in which it is
defined to some other (or to the same) state. The arguments of the
macro define, among other things, the symbol type that causes the
transition to be executed, the state to which to transfer, and the
action routine to call, if any. The transition defined by a $TRAN macro
belongs to the state defined by the last preceding $STATE macro.
$TRAN type [,label] [,action] [,mask] [,msk-adr] [,argument]
type
The symbol type, taken from the LIB$T[ABLE_]PARSE alphabet, that is
recognized by this transition. The transition is taken if the
characters from the beginning of the remaining input string match the
specified symbol type.
|