% Copyright 2012-2024, Alexander Shibakov
% This file is part of SPLinT
%
% SPLinT is free software: you can redistribute it and/or modify
% it under the terms of the GNU General Public License as published by
% the Free Software Foundation, either version 3 of the License, or
% (at your option) any later version.
%
% SPLinT is distributed in the hope that it will be useful,
% but WITHOUT ANY WARRANTY; without even the implied warranty of
% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
% GNU General Public License for more details.
%
% You should have received a copy of the GNU General Public License
% along with SPLinT. If not, see <http://www.gnu.org/licenses/>.

\input limbo.sty
\input frontmatter.sty
\def\optimization{5}
\input yy.sty
\modenormal

% multi-column output
\input dcols.sty

\topskip=9pt % this is a purely aesthetic choice, also negating the \raggedbottom
             % option in cwebmac.tex

% set the typesetting of various token groups
\let\currentparsernamespace\parsernamespace
\let\currenttokeneq\tokeneq
\let\parsernamespace\mainnamespace
\let\hostparsernamespace\mainnamespace % for \nameproc in tokeneqpretty
\let\tokeneq\tokeneqpretty
\let\optstrextra\optstrextraesc
%\traceprettytokenstrue
\input bo.tok % re-use token equivalence table to set the typesetting of tokens
\input btokenset.sty % adjust the typesetting of some tokens
\let\parsernamespace\flexnamespace
\let\hostparsernamespace\flexnamespace
\input fo.tok
\input ftokenset.sty % ... for the flex input lexer
\let\parsernamespace\flexpseudorenamespace
\let\hostparsernamespace\flexpseudorenamespace
\input fretokenset.sty % regular expression names

% index entries
\let\parsernamespace\indexpseudonamespace
\input yypretty.sty
\prettywordpair{TOKEN}{{\tt TOKEN} {\rm(example)}}%
\prettywordpair{token}{{\tt "token"} {\rm(example)}}%
\let\tokeneq\currenttokeneq
\let\parsernamespace\currentparsernamespace
\let\hostparsernamespace\mainnamespace % the namespace where tokens are looked up
                                       % by \nameproc and friends for typesetting purposes

% \immediate\openout\exampletable=\jobname.exl % file for parser output examples

\def\nontitle#1{{\ttl #1}}
\def\cite[#1]{%
    \def\next{#1}\setbox0=\hbox{l}%
    [\ifx\next\empty$\,$\hbox{\vrule width\wd0 height\ht0 depth\dp0}$\,$\else
        \locallink{#1bibref}#1\endlink\fi]%
}

\let\N\textN
\let\N\chapterN
\let\M\textM

\defreserved{Y}{\.{Y}}
\showlastactiontrue
\initauxstream

@**Introduction.
\splint\footnote{I was tempted to call the package {\tt ParLALRgram} which stands for Parsing {\sc LALR} Grammars or {\tt PinT} for `Parsing in \TeX' but both sounded too generic.} (Simple Parsing and Lexing in \TeX, or, following the great GNU tradition of creating recursive names, \splint\ Parses Languages in \TeX) is a system (or rather a m\'elange of systems) designed to facilitate the development of parsing macros in \TeX\ and (to a lesser degree) to assist one in documenting parsers written in other languages. As an application, parsers for \bison\ and \flex\ input file syntax have been developed, along with a macro collection that makes it possible to design and pretty print\footnote{The term {\it pretty printing\/} is used here in its technical sense as one might find that there is nothing pretty about the output of the parsing routines presented in this document.} \bison\ grammars and \flex\ automata using \CWEB. The \.{examples} directory contains a few other parsers designed to pretty print various languages (among them is \ld, the language of the \GNU\ linker).
@s TeX_ TeX
@s TeXa TeX
@s TeXb TeX
@s TeXf TeX
@s TeXfo TeX
@s TeXao TeX

@*1 \eatone{CWEB}\CWEB\ and literate programming.
Writing software in \CWEB\ involves two programs. The first of these is \CTANGLE\ that outputs the actual code, intended to be in \Cee. In reality, \CTANGLE\ cares very little about the language it produces. Among the exceptions are \Cee\ comments and |@[#line@]| directives that might confuse lesser software, but \bison\ is all too happy to swallow them (there are also some \Cee\ specific constructs that \CTANGLE\ tries to recognize). \CTANGLE's main function is to rearrange the text of the program as written by the programmer (in a way that, hopefully, emphasizes the internal logic of the code) into an appropriate sequence (e.g.~all variable declarations must textually precede their use). All that is required to adapt \CTANGLE\ to produce \bison\ output is some very rudimentary post- and pre-processing.

Our main concern is thus \CWEAVE\ that not only pretty prints the program but also creates an index, cross-references all the sections, etc. Getting \CWEAVE\ to pretty print a language other than \Cee\ requires some additional effort. A true digital warrior would probably try to decipher \CWEAVE's output `in the raw' but, alas, my (C)WebFu is not that strong. The loophole comes in the form of a rarely (for a good reason) used \CWEB\ command: the verbatim (\.{@@=...@@>}) output. The material to be output by this construct undergoes minimal processing and is put inside \.{\\vb\{}$\ldots$\.{\}}. All that is needed now is a way to process this virtually straight text inside \TeX.

This manual, as well as nearly every other document that accompanies \splint, is itself a source for a computer program (or, as is the case with this document, several programs) that is extracted using \CTANGLE. We refer an interested reader to \cite[CWEB] for a detailed description of the syntax and use patterns of \CWEB. The following is merely a brief overview of the approach.

Every \CWEB\ document is split into {\em sections}, each divided into three parts (any one of which can be empty): the \TeX\ part, the middle part, and the \Cee\ part (which should more appropriately be called the {\em code\/} or the {\em program\/} part). The \Cee\ part of each\footnote{With the exception of the nameless \.{@@c} (or \.{@@p}) sections.} section carries a name for cross referencing purposes. The sections themselves are automatically numbered by \CWEAVE\ and their code parts may be referenced from other sections, as well as included in other sections' code parts using \CWEB's cross referencing syntax (such as \.{@@<}$\ldots$\.{@@>}). Using the same name for the \Cee\ portion in several sections has the effect of merging the corresponding code fragments. When the section with such a name is used (included) later, all of the concatenated fragments are included as well, even the ones that appear after the point in the \CWEB\ document where such inclusion takes place.

The original \CWEB\ macros (from \.{cwebmac.tex}) used the numbers generated by \CWEAVE\ to refer to specific sections. This was true for the table of contents, as well as the index entries. The macros used by \splint\ adopt a different convention, proposed by N.~Ramsey for his literate programming software, \noweb. In the new system (which will be referred to as the \noweb\ style of cross referencing), each section is labelled by the page number where it starts and an alphabetic character that indicates the order of appearance of the section on the page.
Also following \noweb, the new macros display links between the fragments of the same section in the margins. This allows for quicker navigation between sections of the code and lets the reader get a quick overview of what gets `collected' in a given section. The top level (\.{@@**}) sections, introducing major portions of the code, have also been given a more prominent appearance. They display the chapter number using a large typeface and omit the marginal section reference. References to such sections are typeset as {\it cnnn\/} where {\it nnn\/} represents the chapter number. While such references obscure the page number, hopefully one keeps the number of chapters, as well as such references, small. This modification improves the appearance of the first chapter pages.

\CWEB\ also generates an {\em index\/} of all the identifiers (with some exceptions, such as single letter names) appearing in the \Cee\ portion of each section, {\em except\/} those that appear inside the {\em verbatim@^verbatim block@>\/} portions of the code (i.e.~between \.{@@=} and \.{@@>}). Since \splint\ uses the verbatim blocks extensively, additional indexing facilities have been implemented to provide indexing for the non-\Cee\ languages handled by various \splint\ parsers.

@*1 Pretty (and not so pretty) printing.
Pretty-printing can be narrowly defined as a way to organize the presentation of the program's text. The range of visual devices used for this purpose is usually limited to indentation and discrete line skips, to mimic the capabilities of an old computer terminal. Some authors (see~\cite[ACM]) have replaced the term pretty printing with {\em program visualization\/} to refer to a much broader range of graphic tools for translating the code (and its meaning) into a richer medium. This manual uses the terms {\em pretty printing\/} and {\em program visualization\/} interchangeably.

Pretty printing in the broader sense above has been the subject of research for some time. The monograph~\cite[ACM] develops a methodical (if not formalized) approach to the design of visualization frameworks for programming languages (although the main focus is on procedural \Cee-like languages). A number of papers about pretty printing have appeared since, extending the research to new languages, and suggesting new visualization rules. Unfortunately, most of this research is driven by rules of thumb and anecdotes (the approach fully embraced by this manual), although there have been a few rigorous studies investigating isolated visualization techniques (see, for example, the discussion of variable declaration placement in~\cite[Jo]).

Perhaps the only firm conclusion one can draw from this discussion is that {\em writing\/} the code and {\em reading\/} it are very different activities, so facilitating the former may in turn make the latter more difficult, and vice versa. Some well known languages try to arrive at a compromise where the syntax forces a certain style of presentation on the programmer. An example of a successful language in this group is Python with its meaningful white space. The author does not share the enthusiasm some programmers express for this approach. On the other hand, a language like \Cee\ does not enforce any presentation format\footnote{The `feature' so masterfully exploited by the International Obfuscated \Cee\ Code Contest ({\sc IOCCC}) participants.}.
The authors of \Cee\ even remarked that semicolons and braces were merely a nod to the compiler (or, one might add, static analysis software, see~\cite[KR]). It may thus seem reasonable that such redundant syntax elements may be replaced by different typographic devices (such as judiciously chosen skips and indentation, or the choice of fonts) when (pretty) printing the code. Even the critics of pretty printing usually concede that well indented code is easier to read. The practice of using different typefaces to distinguish between various syntactic elements (such as reserved words and general identifiers) is a subject of some controversy, although not as pronounced as some of the more drastic approaches (such as completely replacing the brace pairs with indentation as practiced by \splint\ for \bison\ input or by the authors of~\cite[ACM] for the control statements in \Cee).

The goal of \splint\ was not to force any particular `pretty printing philosophy' on the programmer (although, if one uses the macros `as is', some form of quiet approval is assumed $\ldots$) but rather to provide one with the tools necessary to implement one's own vision of making the code readable.

One tacit assumption made by the author is that an integral part of any pretty printing strategy is extracting (some) meaning from the raw text. This is done by {\em parsing\/} the program, the subject we discuss next. It should be said that it is the parser design in \TeX\ that \splint\ aims to facilitate, with pretty printing being merely an important application.

@*1 Parsing and parsers.
At an abstract level, a {\em parser@^parser@>\/} is just a routine that transforms text. Naturally, not every possible transformation is beneficial, so, informally, the value of a parser lies in its ability to expose some {\em meaning\/} in the text. If valid texts are reduced to a small finite set (while each text can be arbitrarily long) one can conceivably write a primitive string matching algorithm that recognizes whether any given input is an element of such a set, and if it is, which one. Such `parsers' would be rather limited and are only mentioned to illustrate the point that, in general, the texts being parsed are not required to follow any particular specification. In practice, however, real world parsers rely on the presence of some structure in the input to do their work. The latter can be introduced by supplying a formal (computable) description of every valid input. The `rigidity' of this specification directly affects the sophistication of the parsing algorithm required to process a valid input (or reject an invalid one).

Parsing algorithms normally follow a model where the text is processed a few symbols at a time and the information about the symbols already seen is carried in some easily accessible form. `A few symbols at a time' often translates to `at most one symbol', while `easily accessible' reduces to using a stack-like data structure for bookkeeping.

A popular way of specifying {\em structure\/} is by using a {\em formal grammar@^grammar@>}\footnote{While popular, formal grammars are not the only way of describing a language. For example, `powers of $2$ presented in radix $3$' is a specification that cannot be defined by a context-free grammar, although it is possible to write a (very complex) grammar for it.} that essentially expresses how some (preferably meaningful) parts of the text relate to other parts.
Keeping with the principle of making the information about the seen portions of the input easily accessible, practical grammars are normally required to express the meaning of a fragment in a manner that does not depend on the input that surrounds the fragment (i.e.~to be {\em context-free@^context-free@>}). Real-world languages rarely satisfy this requirement\footnote{Processing |typedef|'s in \Cee\ is a well known case of such a language defect.} thus presenting a challenge to parser generating software that assumes the language is context-free. Even the task of parsing all context-free languages is too ambitious in most practical scenarios, so further limitations on the grammar are normally imposed. One may require that the next action of the parsing algorithm must depend exclusively on the next symbol seen and one of the finitely many {\em states\/} the parser may be in. The action here simply refers to the choice of the next state, as well as the possible decision to consume more input or output a portion of the {\em abstract syntax tree@^abstract syntax tree@>\/} which is discussed below. The same language may have more than one grammar and the choice of the latter normally has a profound effect on the selection of the parsing algorithm. Without getting too deep into the parsing theory, consider the following simple sketch. \medskip \beginprod \format{\inline\flatten} pexp: '(' pexp ')' \ ` astring \ ; % astring: \ ` '*' astring \ ; \endprod \medskip \noindent Informally, the language consists of `strings of $n$ \.{*}'s nested $m$ parentheses deep'. After parsing such a string, one might be interested in the values of $m$ and $n$. The three states the parser may be in are `start', `parsing \prodstyle{pexp}' and `parsing \prodstyle{astring}'. A quick glance at the grammar above shows that switching between the states is straightforward (we omit the discussion of the `start' state for brevity): if the next symbol is \.{(}, parse the next~\prodstyle{pexp}, otherwise, if the next symbol is \.{*}, parse~\prodstyle{astring}. Finally, if the next symbol is \.{)} and we are parsing~\prodstyle{pexp}, finish parsing it and look for the next input, otherwise, we are parsing~\prodstyle{astring}, finish parsing it, make it a~\prodstyle{pexp}, finish parsing a~\prodstyle{pexp} started by a parenthesis, and look for more input. This unnecessarily long (as well as incomplete and imprecise) description serves to present a simple fact that the parsing states are most naturally represented by individual {\em functions\/} resulting in what is known as a {\em recursive descent parser@^recursive descent parser@>\/} in which the call stack is the `data structure' responsible for keeping track of the parser's state. One disadvantage of the algorithm above is that the maximal depth of the call stack reaches $m+n$ which may present a problem for longer strings. Computing $m$ and $n$ above now reduces to incrementing an appropriate variable upon exiting the corresponding function. More important, however, is the observation that this parser specification can be extracted from the grammar in a very straightforward fashion. 
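For illustration only, here is what `parsing \prodstyle{pexp}' and `parsing \prodstyle{astring}' might look like as \Cee\ functions (a sketch, not part of \splint; |next_char|, |consume|, and |expect| stand for whatever input handling machinery is available): |void astring(void) { if (next_char() == '*') { consume(); astring(); n++; } }| and |void pexp(void) { if (next_char() == '(') { consume(); pexp(); expect(')'); m++; } else astring(); }|. Each function returns as soon as its nonterminal has been matched, incrementing the appropriate counter (|m| or |n|) on the way out, while the mutually nested invocations of |pexp| and |astring| account for the call stack depth of $m+n$ mentioned earlier.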
To better illustrate the r\^ole of the grammar in the choice of the parsing algorithm, consider the following syntax for the same language: \medskip \beginprod \format{\inline\flatten} pexp: '(' pexp ')' \ ` astring \ ; % astring: \ ` astring '*' \ ; \endprod \medskip \noindent While the language is unchanged, so the algorithm above still works, the lookahead tokens are not {\em immediately\/} apparent upon looking at the productions. Some preprocessing must take place before one can decide on the choice of the parser states and the appropriate lookahead tokens. Such parser generating algorithms indeed exist and will produce what is known as an {\sc LR} parser for the fragment above (actually, a simpler {\sc LALR} parser may be built for this grammar\footnote{Both of these algorithms will use the parser stack more efficiently, effectively resolving the `call stack depth' issue mentioned earlier.}). One can see that some grammar types may make the selection of the parsing algorithm more involved. Since \splint\ relies on \bison\ for the generation of the parsing algorithm, one must ensure that the grammar is {\sc LALR}$(1)$\footnote{The newest versions of \bison\ are capable of processing a {\em much\/} wider set of grammars, although \splint\ can only handle the \bison\ output for {\sc LALR}$(1)$ parsers.}. @*1 Using the \eatone{bison}\bison\ parser. The process of using \splint\ for writing parsing macros in \TeX\ is treated in considerable detail later in this document. A shorter (albeit somewhat outdated but still applicable) version of this process is outlined in \cite[Sh], included as part of \splint's documentation. We begin, instead, by explaining how one such parser can be used to pretty print a \bison\ grammar. Following the convention mentioned above and putting all non-\Cee\ code inside \CWEAVE's verbatim blocks, consider the following (meaningless) code fragment\footnote{The software included in the package contains a number of preprocessing scripts that reduce the necessity of using the verbatim blocks for every line of the \bison\ code so the snippet above can instead be presented without the distraction of \.{@@=...@@>}, looking more like the `native' \bison\ input}. The fragment contains a mixture of \Cee\ and \bison\ code, the former appears outside of the verbatim blocks. \begindemo ^@@= non_terminal: @@> ^@@= term.1 term.2 {@@> a = b; @@=}@@> ^@@= **H term.3 other_term {@@> $$ = $1; @@=}@@> ^@@= **H still more terms {@@> f($1); @@=}@@> ^@@= ; @@> \enddemo The fragment above will appear as (the output of \CTANGLE\ can be examined in \.{sill.y}) @= @G non_terminal: term.1 term.2 {@> a = b; @=} | term.3 other_term {@> $$ = $1; @=} | still more terms {@> f($1); @=} ; @g @ $\ldots$ if the syntax is correct. In case it is a bit off (note the missing colon after \.{whoops} below), the parser will give up and you will see a different result. The code in the fragment below is easily recognizable, and some parts of it (all of \Cee\ code, in fact) are still pretty printed by \CWEAVE. Only the verbatim portion is left unprocessed. The layout of the original code is reproduced unchanged, including the braces and production separators (i.e.\ \.{\yl}) normally removed by the parser for presentation purposes. @= @G whoops term.1 term.2 {@>@+ a = b; @+@=} | term.3 other_term {@>@+ $$ = $1; @+@=} | still more terms {@>@+ f($1); @+@=} ; @g @ The \TeX\ header that makes such output possible is quite plain. 
In the case of this document it begins as
\begindemo
^\input limbo.sty
^\input frontmatter.sty
^\def\optimization{5}
^\input yy.sty
\nooutput
\enddemo
The first two lines are presented here merely for completeness: there is no parsing-relevant code in them. The third line (\.{\\def\\optimization\{5\}}) may be ignored for now (we discuss some ways the parser code may be sped up \locallink{optimization}later\endlink). The line that follows loads the macros that implement the parsing and scanning machinery.

This is enough to set up all the basic mechanisms used by the parsing and lexing macros. The rest of the header provides a few definitions to fine tune the typesetting of grammar productions. It starts with
\begindemo
^\let\currentparsernamespace\parsernamespace
^    \let\parsernamespace\mainnamespace
^    \let\currenttokeneq\tokeneq
^    \def\tokeneq#1#2{\prettytoken{#1}}
^    \input bo.tok % re-use token equivalence table to set the typesetting of tokens
^    \let\tokeneq\currenttokeneq
^    \input btokenset.sty
\nooutput
\enddemo
We will have a chance to discuss all the \.{\\}$\ldots$\.{namespace} macros later; at this point it will suffice to say that the lines above are responsible for controlling the typesetting of term names. The file \.{bo.tok} consists of a number of lines like the ones below:
\begindemo
^\tokeneq {STRING}{{34}{115}{116}{114}{105}{110}{103}{34}}
^\tokeneq {PERCENT_TOKEN}{{34}{37}{116}{111}{107}{101}{110}{34}}
\nooutput
\enddemo
The cryptic looking sequences of integers above are strings of {\sc ASCII} codes of the letters that form the name that \bison\ uses when it needs to refer to the corresponding token (thus, the second one is \toksa{}\numberstochars{34}{37}{116}{111}{107}{101}{110}{34}\end \.{\the\toksa} which might help explain why such an indirect scheme has been chosen). The macro \.{\\tokeneq} is defined in \.{yymisc.sty}, which in turn is input by \.{yy.sty}, but what about the token names themselves? In this case they were extracted automatically from the \CWEB\ source file by the \locallink{bootstrapping}{\em bootstrapping parser\/} \endlink during the \CWEAVE\ processing stage. All of these definitions can be overwritten to get the desired output (say, one might want to typeset \.{ID} in a roman font, as `identifier'; all that needs to be done to make this possible is to provide a macro that says \.{\\prettywordpair\{ID\}\{\{\\rm identifier\}\}} in an appropriate namespace (usually \.{\\hostparsernamespace})). The file \.{btokenset.sty} input above contains a number of such definitions.

@ To round off this short overview, I must mention a caveat associated with using the macros in this collection: while one of the greatest advantages of using \CWEB\ is its ability to rearrange the code in a very flexible way, the parser will either give up or produce unintended output if this feature is abused while describing the grammar. For example, in the code below
@=
@G
next_term: stuff @> @ @={@> a = f( x ); @=}
@g
@@;
@ the line titled |@| is intended to be a rule defined later.
Notice that while it seems that the parser was able to recognize the first code fragment as a valid \bison\ input, it misplaced the |@|, having erroneously assumed it to be a part of the action code for this grammar (later on we will go into the details of why it is necessary to collect all the non-verbatim output of \CWEAVE, even that which contains no interesting \Cee\ code; hint: it has something to do with money (\.{\$}), also known as math and the way \CWEAVE\ processes the `gaps' between verbatim sections). The production line that follows did not fare as well: the parser gave up. There is simply no point in including such a small language fragment as a valid input for the grammar the parser uses to process the verbatim output.
@=
@G
more stuff in this line {@> @[b = g(y);@]@=}
@g
@ Finally, if you forget that only the verbatim part of the output is looked at by the parser, you might get something unrecognizable, such as
@=
but not all of it
@ To correct this, one can provide a more complete grammar fragment to allow the parser to complete its task successfully. In some cases, this imposes too strict a constraint on the programmer. Instead, the parser that pretty prints \bison\ grammars allows one to add {\it hidden context\/} to the code fragments above. The context is added inside \.{\\vb} sections using \CWEB's \.{@@t}$\ldots$\.{@@>} facility. The \CTANGLE\ output is not affected by this while the code above can now be typeset as:
@=
@G
next_term: stuff @> @t}\vb{\formatlocal{\let\peekstash\stashtoterm}}{@> @ @t}\vb{FAKE}{@> @={@> a = f( x ); @=}
@g
@@;
@ $\ldots$ even a single line can now be displayed properly.
@=
@G
@t}\vb{\formatlocal{\skipheader} FAKE:}{@>
more stuff in this line {@> b = g( y ); @=}
@g
@ With enough hidden context, even a small rule fragment can be typeset as intended. The `action star' was inserted to reveal some of the context.
@=
@G
@t}\vb{\formatlocal{\skipheader} FAKE:}{@>
but not all of it @t}\vb{\{\stashed{$\star$}\}}{@>
@g
@ What makes all of this even more confusing is that \CTANGLE\ will have no trouble outputting this as a(n almost, due to the intentionally bad \.{whoops} production above) valid \bison\ file (as can be checked by looking into \.{sill.y}). The author happens to think that one should not fragment the software into pieces that are too small: \bison\ is not \Cee\ so it makes sense to write \bison\ code differently. However, if the logic behind your code organization demands such fine fragmentation, hidden context provides you with a tool to show it off. A look inside the source of this document shows that adding hidden context can be a bit ugly so it is not recommended for routine use. The short example above is output in the file below.
@(sill.y@>=
@@;

@*1 On debugging.
This concludes a short introduction to the \bison\ grammar pretty printing using this macro collection. It would be incomplete, however, without a short reference to debugging\footnote{At the moment we are discussing debugging the output produced by \CWEAVE\ when the included \bison\ parser is used, {\it not\/} debugging parsers written with the help of this software: the latter topic is covered in more detail later on.}. There is a fair amount of debugging information that the macros can output; unfortunately, very little of it is tailored to the {\em use\/} of the macros in the \bison\ parser. Most of it is designed to help build a {\em new\/} parser.
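The debugging switches mentioned below are ordinary \TeX\ conditionals, so a typical session (a hypothetical sketch; only \.{\\yyflexdebugtrue} among the names below is discussed further in this manual) confines them to the fragment under scrutiny:
\begindemo
^\yyflexdebugtrue  % start tracing the scanner
^ ... the fragment being debugged ...
^\yyflexdebugfalse % ... and silence it again
\enddemo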
If you find that the \bison\ parser gives up too often or even crashes (the latter is most certainly a bug in the \splint\ version of the \bison\ parser itself), the first approach is to make sure that your code {\em compiles}, i.e.\ forget about the printed output and try to see if the `real' \bison\ accepts the code (just the syntax, no need to worry about conflicts and such). If this does not shed any light on why the macros seem to fail, turn on the debugging output by saying \.{\\trace$\ldots$true} to activate the appropriate trace macros. This may produce {\it a lot\/} of output, even for small fragments, so turn it on for only a section at a time. If you need still {\it more\/} details of the inner workings of the parser and the lexer, various other debugging conditionals are available. For example, \.{\\yyflexdebugtrue} turns on the debugging output for the scanner. There are a number of such conditionals that are discussed in the commentary for the appropriate \TeX\ macros. Most of these conditionals are documented in \.{yydebug.sty}, which provides a number of handy shortcuts for a few commonly encountered situations, as well. Remember, what you are seeing at this point is the parsing process of the \bison\ input file, not the one for {\it your\/} grammar (which might not even be complete at this point). However, if all of the above fails, you are on your own: drop me a line if you figure out how to fix any bugs you find. @** Terminology. \namedspot{terminology}This short chapter is an informal listing of a few loose definitions of the concepts used repeatedly in this documentation. Most of this terminology is rather standard. Formal precision is not the goal here, instead, intuitive explanations are substituted whenever possible. {% \def\aterm#1{\item{\sqebullet}{\ttl #1}: \ignorespaces}% \setbox0=\hbox{\sqebullet\enspace} \parindent=0pt \advance\parindent by \wd0 \smallskip \aterm{bison {\rm(as well as} flex{\rm)} parser{\rm(}s{\rm)}} while, strictly speaking, not a formally defined term, this combination will always stand for one of the parsers generated by this package designed to parse a subset of the `official' grammar for \bison\ or \flex\ input files. All of these parsers are described later in this documentation. The term {\it main parser\/} will be used as a substitute in example documentation for the same purpose. \aterm{driver} a generic but poorly defined concept. In this documentation it is used predominantly to mean both the \Cee\ code and the resulting executable that outputs the \TeX\ macros that contain the parser tables, token values, etc., for the parsers built by the user. It is understood that the \Cee\ code of the `driver' is unchanged and the information about the parser itself is obtained by {\it including\/} the \Cee\ file produced by \bison\ in the `driver' (see the examples supplied with the package). \aterm{lexer} a synonym for {\it scanner}, a subroutine that performs the {\it lexical analysis\/} phase of the parsing process, i.e.\ groups various characters from the input stream into parser {\it tokens}. \aterm{namespace} this is an overused bit of terminology meaning a set of names grouped together according to some relatively well defined principle. In a language without a well developed type system (such as \TeX) it is usually accompanied by a specially designed naming scheme. 
{\it Parser namespaces\/} are commonly used in this documentation to mean a collection of all the data structures describing a parser and its state, including tables, stacks, etc., named by using the `root' name (say \.{\\yytable}) and adding the name of the parser (for example, \.{[main]}). To support this naming scheme, a number of macros work in unison to create and rename the `data macros' accordingly\footnote{To be precise, the {\em namespaces\/} in this manual would more appropriately be referred to as {\em named scopes}. The {\em tag namespace\/} in \Cee\ is an example of a (built-in) language namespace where the {\em grammatical r\^ole\/} of the identifier determines its association with the appropriate set.}.

\aterm{parser stack} a collection of parsers, usually derived from a common set of productions, and sharing a common lexer. As the name suggests, the parsers in the collection are tried in order until the input is parsed successfully or every parser has been tried. This terminology may become a source of some confusion, since each parsing algorithm used by \bison\ maintains several stacks. We will always refer to them by naming a specific task the stack is used for (such as the {\em value stack\/} or the {\em state stack}, etc.).

\aterm{pretty printing {\rm or} program visualization} The terms above are used interchangeably in this manual to mean typesetting the program code in a way that emphasizes its meaning as seen by the author of the program\footnote{Or the person typesetting the code.}. It is usually assumed that such meaning is extracted by the software (a specially designed {\em parser\/}) and translated into a suitable visual representation.

\aterm{symbolic switch} a macro (or an associative array of macros) that lets the \TeX\ parser generated by the package associate {\it symbolic term names\/} (called {\it named references\/} in the official \bison\ documentation) with the terms. Unlike the `real' parser, the parser created with this suite requires some extra setup as explained in the included examples (one can also consult the source for this documentation, which creates but does not use a symbolic switch).

\aterm{symbolic term name} (also referred to as a {\it named reference\/} in the \bison\ manual): a (relatively new) way to refer to stack values in \bison. In addition to using the `positional' names such as \.{\$}$n$ to refer to term values, one can utilize the new syntax: \.{\$}\.{[}\\{name}\.{]} (or even \.{\$}\\{name} when the \\{name} has a tame enough syntax). The `\\{name}' can be assigned by the user or can be the name of the nonterminal or token used in the productions.

\aterm{term} in a narrow sense, an `element' of a grammar. Instead of a long winded definition, an example, such as \prodstyle{ID}, should suffice. Terms are further classified into {\it terminals\/} (tokens) and {\it nonterminals\/} (which may be intuitively thought of as composite terms).

\aterm{token} in short, an element of a set. Usually encoded as an integer by most parsers, a {\em token\/} is an indivisible {\em term\/} produced for the parser by the scanner. \TeX's scanner uses a more sophisticated token classification, for example, $($character code, character category$)$ pairs, etc.

}

@** Languages, scanners, parsers, and \TeX.
% Or $\ldots$
\vtop{\halign to\hsize{\kern-1.5pt\it#\hfil\tabskip0pt plus1fil\cr
Tokens and tables keep macros in check.\cr
Make 'em with \bison, use \.{WEAVE} as a tool.\cr
Add \TeX\ and \CTANGLE, and \Cee\ to the pool.\cr
Reduce 'em with actions, look forward, not back.\cr
Macros, productions, recursion and stack!\cr
\noalign{\vskip2pt}
\omit\hfil\eightpoint Computer generated (most likely)\cr}}
\bigskip
\def\recount#1{${}^{(#1)}$}%
\noindent
In order to understand the parsing routines in this collection, it would help to gain some familiarity with the internals of the parsers produced by \bison\ for its intended target: \Cee. A person looking inside a parser delivered by \bison\ would quickly discover that the parsing procedure itself (|yyparse|) occupies a rather small portion of the file. If (s)he were to further reduce the size of the file by removing all the preprocessor directives intended to anticipate every conceivable combination of the operating system, compiler, and \Cee\ dialect, and various reporting and error logging functions, it would become very clear that the most valuable product of \bison's labor is a collection of integer {\it tables\/} that control the actions of the parser routine. Moreover, the routine itself is an extremely concise and well-structured loop composed of |goto|'s and a number of numerical conditionals. If one could think of a way of accessing arrays and processing conditionals in the language of one's choice, once the tables produced by \bison\ have been converted into a form suitable for the consumption by the appropriate language engine, the parser implementation becomes straightforward. Or nearly so.

The {\it scanning\/} (or {\it lexing\/}) step of this process---a way to convert a stream of symbols into a stream of integers---deserves some attention, as well. There are a number of excellent programs written to automate this step in much the same fashion as \bison\ automates the generation of parsers. One such tool, \flex, though (in the opinion of this author) slightly lacking in the simplicity and elegance when compared to \bison, was used to implement the lexer for this software suite. Lexing in \TeX\ will be discussed in considerable detail later in this manual.

The language of interest in our case is, of course, \TeX, so our future discussion will revolve around the five elements mentioned above: \recount{1}data structures (mainly arrays and stacks), \recount{2}converting \bison's output into a form suitable for \TeX's consumption, \recount{3}processing raw streams of \TeX's tokens and converting them into streams of parser tokens, \recount{4}the implementation of \bison's |yyparse| in \TeX, and, finally, \recount{5}producing \TeX\ output via {\it syntax-directed translation} (which requires an appropriate abstraction to represent \bison's actions inside \TeX). We shall begin by discussing the parsing process itself.

@*1 Arrays, stacks, and the parser.
Let us briefly examine the programming environment offered by \TeX. Designed for typesetting, \TeX's remarkable language provides a layer of macro processing atop a set of commands that produce the output fulfilling its primary mission: delivering page layouts. In The \TeX book, the macro {\it expansion\/} is likened to mastication, whereas \TeX's main product, the typographic output, is the result of its `digestion' process.
Not everything that goes through \TeX's digestive tract ends up leaving a trace on the final page: a file full of \.{\\relax}'s will produce no output, even though \.{\\relax} is not a macro, and thus would have to be processed by \TeX\ at the lowest level. It is time to describe the details of defining suitable data structures in \TeX. At first glance, \TeX\ provides rather standard means of organizing and using the memory. At the core of its generic programming environment is an array of \.{\\count}$\,n$ {\it registers\/}, which may be viewed as general purpose integer variables that are randomly accessible by their indices. The integer arithmetic machinery offered by \TeX\ is spartan but is very adequate for the sort of operations a parser would perform: mostly additions and comparisons. Is the \.{\\count} array a good way to store tables in \TeX? Probably not. The first factor is the {\it size\/} of this array: only 256 \.{\\count} registers exist in a standard \TeX\ (the actual number of such registers on a typical machine running \TeX\ is significantly higher but this author is a great believer in standards, and to his knowledge, none of the standardization efforts in the \TeX\ world has resulted in anything even close to the definitive masterpiece that is The \TeX book). The issue of size can be mitigated to some extent by using a number of other similar arrays used by \TeX\ (\.{\\catcode}, \.{\\uccode}, \.{\\dimen}, \.{\\sfcode} and others can be used for this purpose as long as one takes care to restore the `sane' values before the control is handed off to \TeX's typesetting mechanisms). If a table has to span several such arrays, however, the complexity of accessing code would have to increase significantly, and the issue of size would still haunt the programmer. The second factor is the utilization of several registers by \TeX\ for special purposes (in addition, some of these registers can only store a limited range of values). Thus, the first 10 \.{\\count} registers are used by the plain \TeX\ for (well, {\it intended\/} for, anyway) the purposes of page accounting: their values would have to be carefully saved and restored before and after each parsing call, respectively. Other registers (\.{\\catcode} in particular) have even more disrupting effects on \TeX's internal mechanisms. While all of this can be managed (after all, using \TeX\ as an arithmetic engine such as a parser suspends the need for any typographic or other specialized functions controlled by these arrays), the added complexity of using several memory banks simultaneously and the speed penalty caused by the need to save and restore register values make this approach much less attractive. What other means of storing arrays are provided by \TeX? Essentially, only three options remain: \.{\\token} registers, macros holding whole arrays, and associative arrays accessed through \.{\\csname}$\,\ldots\,$\.{\\endcsname}. In the first two cases if care is taken to store such arrays in an appropriate form one can use \TeX's \.{\\ifcase} primitive to access individual elements. The trade-off is the speed of such access: it is {\it linear\/} in the size of the array for most operations, and worse than that for others, such as removing the last item of an array. 
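As a minimal illustration of the two styles just described (the names below are invented for this example and are not the macros used by \splint), a three element array could be stored and accessed as follows:
\begindemo
^\def\tinyarray#1{\ifcase#1 5\or 8\or 13\else 0\fi}
^\count255=\tinyarray{2}\relax % linear access: \count255 is now 13
^\expandafter\def\csname tiny[2]\endcsname{13}
^\count255=\csname tiny[2]\endcsname\relax % the associative version
\enddemo
The second form is the one employed by the optimized parser, at the price of populating \TeX's hash table with one control sequence per array element.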
Using clever ways of organizing such arrays, one can improve the linear access time to $O(\log n)$ by simply modifying the access macros, but at the moment, a straightforward \.{\\ifcase} is used after expanding a list macro or the contents of a \.{\\token}$\,n$ register in an {\it un\/}optimized parser. An {\it optimized\/} parser uses associative arrays.

The array discussion above is just as applicable to {\it stacks\/} (indeed, an array is the most common form of stack implementation). Since stacks pop up and disappear frequently (what else are stacks to do?), list macros are usually used to store them. The optimized parser uses a separate \.{\\count} register to keep track of the top of the stack in the corresponding associative array\footnote{Which means, unfortunately, that making such fully optimized parser {\em reentrant\/} would take an extraordinary amount of effort. Hence, if reentrancy is a requirement, stacks are better kept inside list macros.}.

Let us now switch our attention to the code that implements the parser and scanner {\it functions\/}. If one has spent some time writing \TeX\ macros of any sophistication (or any macros, for that matter), (s)he must be familiar with the general feeling of frustration and the desire to `just call a function here and move on'. Macros\footnote{Formally defined as `$\ldots$ special compile-time functions that consume and produce {\em syntax objects}' in~\cite[DHB].} produce {\it tokens\/}, however, and tokens must either expand to nothing or stay and be contributed to your input, or worse, be out of place and produce an error. One way to sustain a stream of execution with macros is {\it tail recursion\/} (i.e.~always expanding the {\it last token left standing}).

As we have already discussed, \bison's |yyparse()| is a well laid out loop organized as a sequence of |goto|'s (no reason to become religious about structured programming here). This fact, and the following well known trick, make \Cee\ to \TeX\ translation nearly straightforward. The macro \TeX niques employed by the sample code below are further discussed elsewhere in this manual.

% The macro mess below looks painful but this is the only place such layout is used
% The approach can be easily generalized and put in limbo.sty but it seems
% a bit redundant at this point.
\newcount\piccount
\newdimen\lasthsize

\setbox5=\vtop{
\demomargin=0pt
\let\demoastyle\empty
\begindemo
^label A: ...
\nooutput
^    if**L**Krm(condition)**N
^        goto C;
\nooutput
^label B: ...
\nooutput
^    goto A;
\nooutput
^label C: ...
\nooutput
\enddemo
}
\dp5=\z@@

\setbox3=\vtop{
\demomargin=0pt
\let\demoastyle\empty
\begindemo
^\if**L**Krm(condition)**N
^    \let\next=\labelC
^\else
^    \let\next=\labelAtail
\enddemo
}
\dp3=\z@@
\newdimen\lastdepth

\def\startfitpar{%
    \bgroup
    \lasthsize=\hsize
    \advance\lasthsize-1.5in
    \vsize=\baselineskip
    \topskip=\z@@
    \setbox0\box2 % empty it
    % this sounds good at first but there is no good way to pull the insertions out after the
    % box manipulations that follow;
    % insertions will thus be contributed to whatever page was being worked on when the
    % picture insertions {\it started}; hence, if these happen to start at the very top of the page,
    % any insertion that follows will be contributed to the previous page; we correct this for footnotes
    % below
    % \holdinginserts=1
    \output{%
        \global\setbox2=\vbox{
            \ifvoid2
            \else
                \prevdepth=\dp2
                \unvbox2
            \fi
            \lastdepth=\dp255
            \unvbox255
            % this would be tempting, however, the \eject that follows should disappear
            % in addition, one really should not be playing with page breaking in the middle of
            % such tricky insertions
            % \penalty\outputpenalty
            % \kern-\lastdepth % to make sure \baselineskip is accounted for
        }%
    }\eject
    \output{%
        \setbox0=\vbox{%
            \unvbox255%
        }% \lastbox would almost work ... if not for insertions
        \global\advance\piccount1
        \global\setbox2=\vbox{%
            \prevdepth=\dp2
            \unvbox2
            \hbox to\hsize{%
                \ifnum\piccount<15
                    \hbox to1.5in{%
                        \ifnum\piccount=1
                            \ \box5
                        \fi
                        \hfill}%
                \fi
                \box0
                \hfill
                \ifnum\piccount=1
                    \box3 \ %
                \fi
                \ifvoid\footins % reinsert footnotes
                \else
                    \insert\footins{\unvbox\footins}%
                \fi
            }%
        }%
    }%
    \parshape=15 0pt 2.7in 0pt 2.7in 0pt 2.7in 0pt 2.7in 0pt 2.7in 0pt 2.7in 0pt 2.7in 0pt \lasthsize 0pt \lasthsize 0pt \lasthsize 0pt \lasthsize 0pt \lasthsize 0pt \lasthsize 0pt \lasthsize 0pt \hsize
}

\def\endfitpar{%
    \par
    \eject
    \egroup
    % see the comment above
    % \holdinginserts=0
    \prevdepth=\dp2
    \unvbox2
}

\startfitpar
\noindent Given the code on the left (where |goto|'s are the only means of branching but can appear inside conditionals), one way to translate it into \TeX\ is to define a set of macros (call them \.{\\labelA}, \.{\\labelAtail} and so forth for clarity) that end in \.{\\next} (a common name for this purpose). Now, \.{\\labelA} will implement the code that comes between \.{label A:} and \.{goto C;}, whereas \.{\\labelAtail} is responsible for the code after \.{goto C;} and before \.{label B:} (provided no other |goto|'s intervene, which can always be arranged). The conditional which precedes \.{goto C;} can now be written in \TeX\ as presented on the right, where (condition) is an appropriate translation of the corresponding condition in the code being translated (usually, one of `$=$' or `$\not=$'). Further details can be extracted from the \TeX\ code that implements these functions where the corresponding \Cee\ code is presented alongside the macros that mimic its functionality%
\footnote{Running the risk of overloading the reader with details, the author would like to note that the actual implementation follows a {\it slightly\/} different route in order to avoid any \.{\\let} assignments or changing the meaning of \.{\\next}.}.
This concludes the overview of the general approach. It is time to consider the way characters get consumed on the lower levels of the macro hierarchy and the interaction between the different layers of the package.
\endfitpar

@*1 \TeX\ into tokens.
Thus far we have covered the ideas behind items \recount{1} and \recount{4} on our list.
It is time to discuss the lowest level of processing performed by these macros: converting \TeX's tokens into the tokens consumed by the parser, i.e.\ part \recount{3} of the plan. Perhaps, it would be most appropriate to begin by reviewing the concept of a {\it token}. As commonly defined, a token is simply an element of a set (see the section on \locallink{terminology}terminology\endlink\ earlier in this manual). Depending on how much structure the said set possesses, a token can be represented by an integer or a more complicated data structure. In the discussion below, we will be dealing with two kinds of tokens: the tokens consumed by the parsers and the \TeX\ tokens seen by the input routines. The latter play the r\^ole of {\it characters\/} that combine to become the former. Since \bison's internal representation for its tokens is non-negative integers, this is what the scanner must produce.

\TeX's tokens are a good deal more sophisticated: they can be either pairs $(c_{\rm ch}, c_{\rm cat})$, where $c_{\rm ch}$ is the character code and $c_{\rm cat}$ is \TeX's category code ($1$ and $2$ for group characters, $5$ for end of line, etc.), or {\it control sequences\/}, such as \.{\\relax}. Some of these tokens (control sequences and {\it active}, i.e.~category~13 characters) can have complicated internal structure (expansion). The situation is further complicated by \TeX's \.{\\let} facility, which can create `character-like' control sequences, and the lack of conditionals to distinguish them from the `real' characters. Finally, not all pairs can appear as part of the input (say, there is no $(n, 0)$ token for any $n$, in the terminology above).

The scanner expects to see {\it characters} in its input, which are represented by their {\sc ASCII} codes, i.e.~integers between $0$ and $255$ (actually, a more general notion of the Unicode character is supported but we will not discuss it further). Before character codes appear as the input to the scanner, however, and make its integer table-driven mechanism `tick', a lot of work must be done to collect and process the stream of \TeX\ tokens produced after \CWEAVE\ is done with your input. This work becomes even more complicated when the typesetting routines that interpret the parser's output must sneak outside of the parsed stream of text (which is structured by the parser) and insert the original \TeX\ code produced by \CWEAVE\ into the page.

\splint\ comes with a customizable input routine of moderate complexity (\.{\\yyinput}) that classifies all \TeX\ tokens into seven categories: `normal' spaces (i.e.~category~10 tokens, skipped by \TeX's parameter scanning mechanism), `explicit' spaces (includes the control sequences \.{\\let} to \.{\ }, as well as \.{\\\ }), groups ({\it avoid} using \.{\\bgroup} and \.{\\egroup} in your input but `real', \.{\{}$\ldots$\.{\}} groups are fine), active characters, normal characters (of all character categories that can appear in \TeX\ input, including \.{\$}, \.{\^}, \.{\#}, \.{a}--\.{Z}, etc.), single letter control sequences, and multi-letter control sequences. Each of these categories can be processed separately to `fine-tune' the input routine to the problem at hand. The input routine is not very fast; instead, flexibility was the main goal. Therefore, if speed is desirable, a customized input routine is a great place to start. As an example, a minimalistic \.{\\yyinputtrivial} macro is included.
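To give the flavor of such a routine (this is a sketch in the spirit of \.{\\yyinputtrivial}, not its actual code, and the names are invented for the occasion), a bare-bones input loop might read
\begindemo
^\def\miniinput{\afterassignment\minisee\let\next= }
^\def\minisee{%
^    \ifx\next\yyeof
^        % done: hand the collected material to the lexer
^    \else
^        % ... examine and record \next here ...
^        \expandafter\miniinput
^    \fi
^}
\enddemo
with \.{\\yyeof} acting as the stop signal (its r\^ole is explained in the fine print below).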
When \.{\\yyinput} `returns' by calling \.{\\yyreturn} (which is a macro you design), your lexing routines have access to three registers: \.{\\yycp@@}, that holds the character value of the character just consumed by \.{\\yyinput}, \.{\\yybyte}, that most of the time holds the token just removed from the input, and \namedspot{yybytepure-discussion}\.{\\yybytepure}, that (again, with very few exceptions) holds a `normalized' version of the read character (i.e.~a character of the same character code as \.{\\yycp@@}, and category~12 (to be even more precise (and to use nested parentheses), `normalized' characters have the same category code as that of `\.{.}' at the point where \.{yyinput.sty} is read)). Most of the time it is the character code one needs (say, in the case of \.{\\\{}, \.{\\\}}, \.{\\\&} and so on) but under some circumstances the distinction is important (outside of \.{\\vb\{}$\ldots$\.{\}}, the sequence \.{\\1} has nothing to do with the digit `\.{1}'). This mechanism makes it easy to examine the consumed token. It also forms the foundation of the `hidden context' passing mechanism described later. The remainder of this section discusses the internals of \.{\\yyinput} and some of the design trade-offs one has to make while working on processing general \TeX\ token streams. It is typeset in `small print' and can be skipped if desired. \smallskip \begingroup \abovedisplayskip=5pt% \abovedisplayshortskip=2pt% \belowdisplayskip=5pt% \belowdisplayshortskip=2pt% \fnotesstart=1 \fnotesspan=2 \noofcolumns=2 \icgap=1em% \eightpoint \linecount=73 \setmcparams \def\.#1{{\chardef\\=`\\\chardef\&=`\&\tt #1}}% \dsskip=0pt% \begindoublecols To examine every token in its path (including spaces that are easy to skip), the input routine uses one of the two well-known {\sc \TeX}nologies: \.{\\futurelet\\next\\examinenext} or its equivalent \.{\\afterassignment\\examinenext\\let\\next=}\hbox{\tt\char"20}. Recursively inserting one of these sequences, \.{\\yyinput} can go through any list of tokens, as long as it knows where to stop (i.e.~return an end of file character). The signal to stop is provided by the \.{\\yyeof} sequence, which should not appear in any `ordinary' text presented for parsing, other than for the purpose of providing such a stop signal. Even the dependence on \.{\\yyeof} can be eliminated if one is willing to invest the time in writing macros that juggle \TeX's \.{\\token} registers and only limit oneself to input from such registers (which is, aside from an obvious efficiency hit, a strain on \TeX's memory, as you have to store multiple (3 in the general case) copies of your input to be able to back up when the lexer makes a wrong choice). Another approach to avoid the use of stop tokens is to store the whole input as a {\it parameter\/} for the appropriate macro. This scheme is remarkably powerful and can produce {\it expandable\/} versions of very complicated routines, although the amount of effort required to write such macros grows at a frightening rate. As the text inside \.{\\vb\{}$\ldots$\.{\}} is nearly always well structured, the care that \.{\\yyinput} takes in processing such character lists is an overkill. In a more `hostile' environment (such as the one encountered by the now obsolete \.{\\Tex} macros), however, this extra attention to detail pays off in the form of a more robust input mechanism. One subtlety deserves a special mention here, as it can be important to the designer of `higher-level' scanning macros. 
Two types of tokens are extremely difficult to deal with whenever \TeX's own lexing mechanisms are used: (implicit) spaces and even more so, braces. We will only discuss braces here; however, almost everything that follows applies equally well to spaces (category 10 tokens to be precise), with a few simplifications (or complications, in a couple of places). To understand the difficulty, let's consider one of the approaches above:
$$
\.{\\futurelet\\next\\examinenext}.
$$
The macro \.{\\examinenext} usually looks at \.{\\next} and inserts another macro (usually also called \.{\\next}) at the very end of its expansion list. This macro usually takes one parameter, to consume the next token. This mechanism works flawlessly, until the lexer encounters a \.{\{}br\.{,}sp\.{\}}ace. The \.{\\next} sequence, seen by \.{\\examinenext}, contains a lot of information about the brace ahead: it knows its category code (left brace, so $1$), its character code (in case there was, say, a \.{\\catcode`\\[=1{\tt\char`\ }} earlier) but not whether it is a `real' brace (i.e.\ a character \.{\{}$_1$) or an implicit one (a \.{\\bgroup}). There is no way to find that out until the control sequence `launched' by \.{\\examinenext} sees the token as a parameter.

If the next token is a `real' brace, however, \.{\\examinenext}'s successor will never see the token itself: the braces are stripped by \TeX's scanning mechanism. Even if it finds a \.{\\bgroup} as the parameter, there is no guarantee that the actual input was not \.{\{\\bgroup\}}. One way to handle this is by applying \.{\\string} before consuming the next token. If prior to expanding \.{\\string} care has been taken to set the \.{\\escapechar} appropriately (remember, we know the character code of the next token in advance), as soon as one sees a character with \.{\\escapechar}'s character code, (s)he knows that an implicit brace has just been seen.

One added complication to all this is that a very determined programmer can insert an {\it active\/} character (using, say, the \.{\\uccode} mechanism) that has the {\it same\/} character code as the {\it brace\/} token that it has been \.{\\let} to! Even setting this disturbing possibility aside, the \.{\\string} mechanism (or, its cousin, \.{\\meaning}) is far from perfect: both produce a sequence of category 12 and 10 tokens that are mixed into the original input. If it is indeed a brace character that we just saw, we can consume the next token and move on, but what if this was a control sequence? After all, just as easily as \.{\\string} makes a sequence into characters, \.{\\csname}$\,\ldots\,$\.{\\endcsname} pair will make any sequence of characters into a control sequence, so determining the end of the character sequence produced by \.{\\string} may prove impossible.
$$
\hbox{Huh~$\ldots$}
$$
What we need is a backup mechanism: keeping a copy of the token sequence ahead, one can use \.{\\string} to see whether the next token is a real brace first, and if it is, consume it and move on (the active character case can be handled as the implicit case below, with one extra backup to count how many tokens have been consumed). At this point the brace has to be {\it reinserted\/} in case, at some point, a future `back up' requires that the rest of the tokens are removed from the output (to avoid `\.{Too many \}'s}' complaints from \TeX). This can be done by using the \.{\\iftrue\{\\else\}\\fi} trick (and a generous sprinkling of \.{\\expandafter}s).
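To spell the trick out (the macro name below is invented for this illustration), a definition such as
$$
\.{\\def\\reinsertbrace\{\\iftrue\{\\else\}\\fi\}}
$$
looks balanced to \TeX\ while the body is being absorbed, yet, upon expansion, the \.{\\iftrue} selects the lone \.{\{} and the matching \.{\}} is skipped together with the \.{\\else} branch, leaving a single unmatched left brace in the token stream.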
Of course, some bookkeeping is needed to keep track of how deep inside the braced groups we are. For an implicit brace, more work is needed: read all the characters that \.{\\string} produced (and maybe more), then remember the number of characters consumed. Remove the rest of the input using the method described above and restart the scanning from the same point knowing that the next token can be scanned as a parameter. Another strategy is to design a general enough macro that counts tokens in a token register and simply recount the tokens after every brace was consumed. Either way, it takes a lot of work. If anyone would like to pursue the counting strategy, simple counting macros are provided in \.{/examples/count/count.sty}. The macros in this example supply a very general counting mechanism that does not depend on \.{\\yyeof} (or {\it any\/} other token) being `special' and can count the tokens in any token register, as long as none of those tokens is an \.{\\outer} control sequence. In other words, if the macro is used immediately after the assignment to the token register, it should always produce a correct count.

Needless to say, if such a general mechanism is desired, one has to look elsewhere. The added complications of treating spaces (\TeX\ tends to ignore them most of the time) make this a torturous exercise in \TeX's macro wizardry. The included \.{\\yyinput} has two ways of dealing with braces: strip them or view the whole group as a token. Pick one or write a different \.{\\yyinput}. Spaces, implicit or explicit, are reported as a specially selected character code and consumed with a likeness of \.{\\afterassignment\\moveon\\let\\next={\tt\char`\ }}. This behavior can be adjusted if needed.

Now that a steady stream of character codes is arriving at \.{\\yylex} after \.{\\yyreturn}, the job of converting it into numerical tokens is performed by the {\it scanner} (or {\it lexer\/}, or {\it tokenizer\/}, or even {\it tokener}), discussed in the next section.
\enddoublecols
\endgroup

@*1 Lexing in \TeX.
In a typical system that uses a parser to process text, the parsing pass is usually split into several stages: the raw input, the lexical analysis (or simply {\it lexing}), and the parsing proper. The {\it lexing\/} pass (also called {\it scanning}; we use these terms interchangeably) clumps various sequences of characters into {\it tokens\/} to facilitate the parsing stage. The reasons for this particular hierarchy are largely pragmatic and are partially historic (there is no reason that {\it parsing\/} cannot be done in multiple phases, as well, although it usually isn't). If one recalls a few basic facts from the formal language theory, it becomes obvious that a lexer, which parses {\it regular\/} languages, can be (in theory) replaced by an {\sc LALR} parser, which parses {\it context-free\/} ones (or some subset thereof, which is still a superset of all regular languages).

A common justification given for creating specialized lexers is efficiency and speed. The reality is somewhat more subtle. While we do care about the efficiency of parsing in \TeX, having a specialized scanner is important for a number of different reasons. The real advantage of having a dedicated scanner is the ease with which it can match incomplete inputs and back up. A parser can, of course, {\it recognize\/} any valid input that is also acceptable to a lexer, as well as {\it reject\/} any input that does not form a valid token.
Between those two extremes, however, lies a whole realm of options that a traditional parser will
have great difficulty exploring. Thus, to mention just one example, it is relatively easy to set
up a {\sc DFA}\footnote{Which stands for Deterministic Finite Automaton, a common (and
mathematically unique) way of implementing a scanner for regular languages. Incidentally,
{\sc LALR}, mentioned above, is short for Look-Ahead Left to Right.} so that the {\it longest\/}
matching input is accepted. The only straightforward way to do this with a traditional parser is
to parse longer and longer inputs again and again. While this process can be optimized to a
certain degree, the fact that a parser has a {\it stack\/} to maintain limits its ability to back
up\footnote{It should also be mentioned that the fact that the lexing pass takes place prior to
the parser consuming the resulting tokens allows one to process some grammars that are not
context free. See, for example, the {\em parser hack\/} used to process |typedef|s in \Cee.}. As
an aside, the mechanism by which \CWEB\ assembles its `scraps' into chunks of recognized code is
essentially iterative lexing, very similar to what a human does to make sense of complicated
texts. Instead of trying to match the longest running piece of text, \CWEB\ simply looks for
patterns to combine inputs into larger chunks, which can later be further combined. Note that
this is not quite the same as the approach taken by, say, {\sc GLR} parsers, where the parser
must match the {\it whole\/} input or declare a failure. Where a \CWEB-type parser may settle for
the first available match (or the longest available), a {\sc GLR} parser must try {\it all\/}
possible matches or use an algorithm to reject the majority of the ones that are bound to fail in
the end. This `\CWEB\ way' is also different from a traditional `strict' {\sc LR} parser/scanner
approach and certainly deserves serious consideration when the text to be parsed possesses some
rigid structure, but the parser is only allowed to process it one small fragment at a time.
Returning to the present macro suite, the lexer produced by \flex\ uses integer tables similar to
those employed by \bison, so the usual {\sc\TeX}niques used in implementing \.{\\yyparse} are
fully applicable to \.{\\yylex}. An additional advantage provided by having a \flex\ scanner
implemented as part of the suite is the availability of the original \bison\ scanner written in
\Cee\ for use by the macro package. This said, the code generated by \flex\ contains a few
idiosyncrasies not present in the \bison\ output. These `quirks' mostly involve handling of end
of input and error conditions. A quick glance at the \.{\\yylex} implementation will reveal a
rather extensive collection of macros designed to deal with end of input actions. Another
difficulty one has to face in translating \flex\ output into \TeX\ is a somewhat unstructured
namespace delivered in the final output (this is partially due to the \POSIX\ standard that
\flex\ strives to follow). One consequence of this `messy' approach is that the writer of a
\flex\ scanner targeted to \TeX\ has to declare \flex\ `states' (more properly called
{\it subautomata}) twice: first for the benefit of \flex\ itself, and then again, in the
{\it \Cee\ preamble\/} portion of the code, to output the states to be used by the action code
in the lexer. The \.{Define\_State($\ldots$)} macro is provided for this purpose.
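
Thus, for a scanner that uses an exclusive state to process, say, comments, the two declarations
might look as follows (the state name is invented for this illustration, and the exact shape of
the second declaration is best checked against the examples supplied with the package):
$$\vbox{\halign{#\hfil&\qquad\rm#\hfil\cr
\.{\%x COMMENT}&in the \flex\ declarations section\cr
\.{Define\_State( COMMENT )}&in the {\it \Cee\ preamble\/} portion of the code\cr
}}$$
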
This macro can be used explicitly by the programmer or be inserted by a specially designed
parser. Using \CWEB\ helps to keep these declarations together. The `hand-off' from the scanner
to the parser is implemented through a pair of registers: \.{\\yylval}, a token register
containing the value of the returned token, and \.{\\yychar}, a \.{\\count} register that
contains the numerical value of the token to be returned. Upon matching a token, the scanner
passes one crucial piece of information to the programmer: the character sequence representing
the token just matched (\.{\\yytext}). This is not the whole story, though, as there are three
more token sequences that are made available to the parser writer whenever a token is matched.
The first of these is simply a `normalized' version of \.{\\yytext} (called \.{\\yytextpure}).
In most cases it is a sequence of \TeX\ tokens with the same character codes as the one in
\.{\\yytext} but with their category codes set to 12 (see the discussion of \.{\\yybytepure}
\locallink{yybytepure-discussion}above\endlink). In cases when the tokens in \.{\\yytext} are
{\it not} $(c_{\rm ch}, c_{\rm cat})$ pairs, a few simple conventions are followed, some of which
will be explained below. This sequence is provided merely for convenience and its typical use is
to generate a key for an associative array. The other two sequences are special `stream pointers'
that provide access to the extended scanner mechanism in order to implement the passing of the
`formatting hints' to the parser, as well as to incorporate \CWEAVE\ formatted code into the
input, without introducing any changes to the original grammar. As the mechanism itself and the
motivation behind it are somewhat subtle, let us spend a few moments discussing the range of
formatting options desirable in a generic pretty-printer. Unlike the strict parsers employed by
most compilers, a parser designed for pretty printing cannot afford to be too picky about the
structure of its input (\cite[Go] calls such parsers `loose'). To provide a simple illustration,
an isolated identifier, such as `\.{lg\_integer}', can be a type name, a variable name, or a
structure tag (in a language like \Cee\ for example). If one expects the pretty printer to
typeset this identifier in a correct style, some context must be supplied, as well. There are
several strategies a pretty printer can employ to get hold of the necessary context. Perhaps the
simplest way to handle this, and to reduce the complexity of the pretty printing algorithm, is to
insist on the programmer providing enough context for the parser to do its job. For short
examples like the one above, this may be an acceptable strategy. Unfortunately, it is easy to
come up with longer snippets of grammatically deficient text that a pretty printer should be
expected to handle. Some pretty printers, such as the one employed by \CWEB\ and its ilk (the
original \.{WEB}, \.{FWEB}), use a very flexible bottom-up technique that tries to make sense of
as large a portion of the text as it can before outputting the result (see also \cite[Wo], which
implements a similar algorithm in \LaTeX). The expectation is that this algorithm will handle the
majority (about 90\%? it would be interesting to carry out a study in the spirit of the ones
discussed in \cite[Jo] to find out) of the cases, with the remaining few left for the author to
correct. The question is, how can such a correction be applied? \CWEB\ itself provides two rather
different mechanisms for handling these exceptions.
The first uses direct typesetting commands (for example, \.{@@/} and \.{@@\#} for canceling and
introducing a line break, resp.) to change the typographic output. The second (preferred) way is
to supply {\it hidden context\/} to the pretty-printer. Two commands, \.{@@;} and
\.{@@[}$\ldots$\.{@@]}, are used for this purpose. The former introduces a `virtual semicolon'
that acts in every way like a real one except it is not typeset (it is not output in the source
file generated by \CTANGLE\ either, but this has nothing to do with pretty printing, so I will
not mention \CTANGLE\ anymore). For instance, from the parser's point of view, if the preceding
text was parsed as a `scrap' of type {\it exp}, the addition of \.{@@;} will make it into a
`scrap' of type {\it stmt\/} in \CWEB's parlance. The second construct, \.{@@[}$\ldots$\.{@@]},
is used to create an {\it exp\/} scrap out of whatever happens to be inside the brackets. This is
a powerful tool at the author's disposal. Stylistically, such context hints are the right way to
handle exceptions, since using them forces the writer to emphasize the {\it logical\/} structure
of the formal text. If the pretty printing style is changed later on, the texts with such hidden
contexts should be able to survive intact in the final document (as an example, using a break
after every statement in \Cee\ may no longer be considered appropriate, so any forced break
introduced to support this convention would now have to be removed, whereas \.{@@;}'s would
simply quietly disappear into the background). The same hidden context idea has another important
advantage: with careful grammar fragmenting (facilitated by \CWEB's or any other literate
programming tool's `hypertext' structure) and a more diverse hidden context (or even arbitrary
hidden text) mechanism, it is possible to use a strict parser to parse incomplete language
fragments. For example, the productions that are needed to parse \Cee's expressions form a
complete subset of the grammar. If the grammar's `start' symbol is changed to
\prodstyle{expression} (instead of \prodstyle{translation-unit}, as it is in the full \Cee\
grammar), a variety of incomplete \Cee\ fragments can now be parsed and pretty-printed. Whenever
such granularity is still too `coarse', carefully supplied hidden context will give the pretty
printer enough information to adequately process each fragment. A number of such
{\it sub}-parsers\namedspot{parser.stacks} can be tried on each fragment (this may sound
computationally expensive, however, in practice, a carefully chosen hierarchy of parsers will
finish the job rather quickly) until a correct parser produces the desired output (this approach
is similar to, although not quite the same as, the one employed by {\it General LR parsers}).
This somewhat lengthy discussion brings us to the question directly related to the tools
described in this manual: how does one provide typographical hints or hidden context to the
parser? One obvious solution is to build such hints directly into the grammar. The parser
designer can, for instance, add new tokens (say, \.{BREAK\_LINE}) to the grammar and extend the
production set to incorporate the new additions. The risk of introducing new conflicts into the
grammar is low (although not entirely non-existent, due to the lookahead limitations of
{\sc LR}($1$) grammars) and the changes required are easy, although very tedious, to incorporate.
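
For a taste of what this implies, consider what happens to a single (invented) production once a
\.{BREAK\_LINE} token is allowed to appear between any two of its terms:
$$\vbox{\halign{#\hfil\cr
\.{expression:}\cr
\qquad\.{expression '+' term}\cr
\quad\.{\yl{} expression BREAK\_LINE '+' term}\cr
\quad\.{\yl{} expression '+' BREAK\_LINE term}\cr
}}$$
Every other production that might need a break has to be extended in the same fashion, and the
number of variants grows with the length of the right hand side: the `hint' has become part of
the language.
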
In addition to being labor intensive, this solution has two other significant shortcomings: it
alters the original grammar and hides its logical structure; it also `bakes' the pretty-printing
conventions into the language structure (making the `hidden' context much less `stealthy'). It
does avoid the `synchronicity problem' mentioned below. A marginally better technique is to
introduce a new regular expression recognizable by the scanner, which will then do all the
necessary bookkeeping upon matching the sequence. All the difficulties with altering the grammar
mentioned above apply in this case, as well, only at the `lexical analysis level'. At a minimum,
the set of tokens matched by the scanner would have to be altered. A much more satisfying
approach, however, involves inserting the hints at the input stage and passing this information
to the scanner and the parser as part of the token `values'. The hints themselves can masquerade
as characters ignored by the scanner (white space\footnote{Or even the `intercharacter space', to
make the hints truly invisible to the scanner.}, for example) and be preprocessed by a specially
designed input routine. The scanner then simply passes on the values to the parser. This makes
hints, in effect, invisible. The difficulty now lies in synchronizing the token production with
the parser. This subtle complication is very familiar to anyone who has designed \TeX's output
routines: the parser and the lexer are not synchronous, in the sense that the scanner might be
reading several (in the case of the general {\sc LR}$(n)$ parsers) tokens\footnote{Even if one
were to somehow mitigate the effects of the lookahead {\it in the parser\/}, the scanner would
still have to read the characters of the current token up to (and, in some cases, beyond) the
(token's) boundary which, in most cases, is the whitespace, possibly hiding the next hint.} ahead
of the parser before deciding on how to proceed (the same way \TeX\ can consume a whole
paragraph's worth of text before exercising its page builder). If we simple-mindedly let the
scanner return every hint it has encountered so far, we may end up feeding the parser the hints
meant for the token that appears {\it after\/} the fragment the parser is currently working on.
In other words, when the scanner `backs up', it must correctly back up the hints as well. This is
exactly what the scanner produced by the tools in this package does: along with the main stream
of tokens meant for the parser, it produces two\footnote{There would be no difficulty in
splitting either of these streams into multiple `substreams' by modifying the stream extraction
macros accordingly.} hidden streams\namedspot{parser.streams} (called the \.{\\yyformat} stream
and the \.{\\yystash} stream) and provides the parser with two strings (currently only strings of
digits are used, although arbitrary sequences of \TeX\ tokens can be used as pointers) with the
promise that {\it all the `hints' between the beginning of the corresponding stream and the point
labeled by the current stream pointer appeared among the characters up to and, possibly,
including the ones matched as the current token}. The macros to extract the relevant parts of the
streams (\.{\\consumelistupto} and its cousins) are provided for the convenience of the parser
designer. What the parser does with these pointers and the information they provide is up to the
parser designer (the parsers for \flex\ and \bison\ syntax in this package use somewhat different
strategies).
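
The following toy model (entirely invented, and much cruder than the actual stream machinery in
\.{yyinput.sty}) may help in visualizing the invariant above: the stream is a list of braced
hints, and the pointer is simply the number of hints recorded so far.
$$\vbox{\halign{#\hfil\cr
\.{\\newtoks\\hintstream \\newcount\\hintptr}\cr
\.{\\def\\recordhint\#1\{\% called by the input routine}\cr
\quad\.{\\global\\hintstream\\expandafter\{\\the\\hintstream\{\#1\}\}\%}\cr
\quad\.{\\global\\advance\\hintptr by1 \}}\cr
}}$$
Whenever the scanner matches a token, it attaches the {\it current\/} value of \.{\\hintptr} to
it; the hints themselves stay behind, in \.{\\hintstream}. By the time the parser consumes the
token, every hint with an index up to the attached value is known to have originated at or before
the token's characters, and `backing up' amounts to nothing more than returning to a previously
recorded value of the pointer.
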
The \.{\\yystash} stream currently collects all the typesetting commands inserted by \CWEB\ to be
possibly used in displaying the action code in \bison\ productions, for example. Because of this,
it may appear in somewhat unexpected places, introducing spaces where the programmer did not
necessarily intend them (such as at the beginning of a line). To mitigate this problem, the
\.{\\yystash} stream macros are implemented to be entirely invisible to the lexer. Making them
produce spaces is also possible, and some examples are provided in \.{symbols.sty}. The
interested reader can consult the input routine macros in \.{yyinput.sty} for the details of the
internal representation of the streams. In the interest of full disclosure, it should be pointed
out that this simple technique introduces a significant strain on \TeX's computational resources:
the lowest level macros, the ones that handle character input and are thus executed (in some
cases multiple times) for {\it every\/} character in the input stream, are rather complicated
and, therefore, slow. Whenever the use of such streams is not desired, a simpler input routine
can be written to speed up the process (see \.{\\yyinputtrivial} for a working example of such a
macro). Finally, while probably not directly related to the present discussion, this approach has
one more interesting feature: after the parser is finished, the parser output and the streams
exist `statically', fully available for any last minute postprocessing or for debugging purposes,
if necessary\footnote{One may think of the parser output as an {\it executable abstract syntax
tree\/} (\AST\ or \EAST\ if one likes cute abbreviations, or even e\AST\ for an extra touch of
modernity). This parser feature is used to implement the facility that allows easy referencing of
productions in the section that implements the action code for one. See \.{yyunion.sty} and the
source of this file (\.{splint.w}) for details.}. Under most circumstances, the parser output is
`executed' and the macros in the output are the ones reading the various streams using the
pointers supplied at the parsing stage (at least, this is the case for all the parsers supplied
with the package).

@*1 Inside semantic actions: switch statements and `functions' in \TeX. So far we have looked at
the lexer for your input and a grammar ready to be put into action (we will talk about actions a
few moments later). It is time to discuss how the tables produced by \bison\ get converted into
\TeX\ {\it macros\/} that drive the parser in {\it \TeX}. The tables that drive the \bison\ input
parsers are collected in \.{\{b,d,f,g,n\}yytab.tex}, \.{small\_tab.tex} and other similarly named
files\footnote{Incidentally, one of the names above is not used anymore. See the \.{cweb}
directory after a successful build to find out which. Aren't footnotes great?!}. Each one of
these files contains the tables that implement a specific parser used during different stages of
processing. Their exact function is well explained in the source file produced by \bison\
({\it how\/} this is done is detailed elsewhere; see \cite[Ah] for a good reference). It suffices
to mention here that there are three types of tables in this file: \recount{1}numerical tables
such as \.{\\yytable} and \.{\\yycheck} (both are either \TeX's token registers in an unoptimized
parser or associative arrays in an optimized version, as discussed below), \recount{2}a string
array \.{\\yytname}, and \recount{3}an action switch.
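
Before moving on to the action switch, the following toy comparison (with invented names; the
real tables are generated mechanically) may give a feel for the difference between the two
representations of the numerical tables just mentioned:
$$\vbox{\halign{#\hfil\cr
\.{\% unoptimized: the whole table lives in one token register}\cr
\.{\% and is expanded in full to extract a single entry}\cr
\.{\\newtoks\\toytable \\toytable=\{\{17\}\{5\}\{12\}\}}\cr
\noalign{\smallskip}
\.{\% optimized: one control sequence per entry, fetched directly}\cr
\.{\\expandafter\\def\\csname toytable[2]\\endcsname\{5\}}\cr
\.{\\csname toytable[2]\\endcsname \% expands to 5 at once}\cr
}}$$
The `associative arrays' of the optimized parsers are families of control sequences of the second
kind; the section on `optimization' below explains when they are generated.
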
The action switch is what gets called when the parser does a {\it reduction}. It is easy to
notice that the numerical tables come `premade', whereas the string array consisting of token
names is difficult to recognize. This is intentional: this form of initialization is designed to
allow the widest range of characters to appear inside names. The macros that do this reside in
\.{yymisc.sty}. The generated table files also contain constant and token declarations used by
the parser. The description of the process used to output \bison\ tables in an appropriate form
continues in the section about \locallink{bsfile}outputting \TeX\ tables\endlink; we pick it up
here with the description of the syntax-directed translation and the actions. The line
$$
\.{\\switchon\\next\\in\\currentswitch}
$$
is responsible for calling an appropriate action in the current switch, as is easy to infer. A
{\it switch\/} is also a macro that consists of strings of \TeX\ tokens intermixed with \TeX\
macros inside braces. Each group of macros gets executed whenever the character or the group of
characters in \.{\\next} matches a substring preceding the braced group. If there are two
different substrings that match, only the earliest group of macros gets expanded. Before a state
is used, a special control sequence, \.{\\setspecialcharsfrom\\switchname}, can be used to put
the \TeX\ tokens in a form suitable for consumption by \.{\\switchon}'s. The most important step
it performs is to {\it turn every token in the list into a character with the same character code
and category~12\/}. Thus \.{\\\{} becomes \.{\{}$_{12}$. There are other ways of inserting tokens
into a state: enclosing a token or a string of tokens in \.{\\raw}$\ldots$\.{\\raw} adds it to
the state macro unchanged. If you have a sequence of category 12 characters you want to add to
the state, put it after \.{\\classexpand} (such sequences are usually prepared by the
\.{\\setspecialchars} macro that uses the token tables generated by \bison\ from your grammar).
You can give a case a readable label (say, \.{brackets}) and enclose this label in
\.{\\raw}$\ldots$\.{\\raw}. A word of caution: an `\.{a}' inside of \.{\\raw}$\ldots$\.{\\raw}
(which is most likely an \.{a}$_{11}$ unless you played with the category codes before loading
the \.{\\switchon} macros) and the one outside it are two different characters, as one is no
longer a letter (category 11) in the eyes of \TeX\ whereas the other one still is. For this
reason one should not use characters other than letters in h\.{\{}is\.{,}er\.{\}} state
{\em names\/} if such characters can form tokens by themselves: the way a state picks an action
does not distinguish between, say, a `\.{(}' in `\.{(letter)}' and a standalone `\.{(}', and may
pick an action that you did not intend\footnote{One way to mitigate this is by putting such named
states at the end of the switch, {\em after\/} the actions labelled by the standalone characters.
There is nothing special about non-letter characters, of course. To continue the \.{letter}
example, placing a state named \.{let} {\em after\/} the \.{letter} one will make it essentially
invisible to the switch macros for the reasons explained before this footnote.}.
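
Schematically (with invented action macros; real states are usually assembled from the token
tables generated by \bison), setting up and using a state might look like this:
$$\vbox{\halign{#\hfil\cr
\.{\\def\\nextstate\{\%}\cr
\quad\.{+\{\\doplus\}\%}\cr
\quad\.{-\{\\dominus\}\%}\cr
\quad\.{\\raw(letter)\\raw\{\\doletter\}\%}\cr
\.{\}}\cr
\noalign{\smallskip}
\.{\\setspecialcharsfrom\\nextstate}\cr
\.{\\switchon\\next\\in\\nextstate}\cr
}}$$
Here a `\.{+}' in \.{\\next} selects \.{\\doplus}, while the named case is picked when the
(previously classified) string \.{(letter)} appears in \.{\\next}. Note that the parentheses in
the label are as `real' to the switch as any other character.
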
This applies even if `\.{(}' is not among the characters explicitly inserted in the state macro:
if an action for a given character is not found in the state macro, the \.{\\switchon} macro will
insert a current \.{\\default} action instead, which most often you would want to be \.{\\yylex}
or \.{\\yyinput} (i.e.\ skip this token). If a single `\.{(}' or `\.{)}' matches the braced group
that follows `\.{(letter)}', chaos may ensue (most likely \TeX\ will keep reading past the
\.{\\end} or \.{\\yyeof} that should have terminated the input). Make the names of character
categories as unique as possible: the \.{\\switchon} macro is simply a string-matching mechanism,
with the added differentiation between characters of different categories. Finally, the construct
\.{\\statecomment}{\it anything\/}\.{\\statecomment} allows you to insert comments in the state
sequence (note that the state {\it name\/} is put at the beginning of the state macro (by
\.{\\setspecialcharsfrom}) in the form of a special control sequence that expands to nothing:
this elaborate scheme is needed because another control sequence can be \.{\\let} to the state
macro, which would make the debugging information difficult to decipher). The debugging mode for
the lexer implemented with these macros is activated by \.{\\tracedfatrue}. The functionality of
the \.{\\switchon} macros (as well as that of \.{\\switchonwithtype}, which is capable of some
rudimentary type checking) has been implemented in a number of other macro packages (see
\cite[Fi], which discusses the well-known and widely used \.{\\CASE} and \.{\\FIND} macros). The
macros in this collection have the additional property that the only assignments that persist
after the \.{\\switchon} completes are the ones performed by the user code inside the selected
case. This last property of the switch macros is implemented using another mechanism that is part
of this macro suite: the `subroutine-like' macros, \.{\\begingroup}$\ldots$\.{\\tokreturn}. For
examples, the interested reader can take a look at the macros included with the package. A
typical use is \.{\\begingroup}$\ldots$\.{\\tokreturn\{\}\{\\toks0 \}\{\}}, which will preserve
all the changes to \.{\\toks0} and have no other side effects (if, for example, in typical \TeX\
vernacular, \.{\\next} is used to implement tail recursion inside the group, after the
\.{\\tokreturn}, \.{\\next} will still have the same value it had before the group was entered).
This functionality comes at the expense of some computational efficiency. This covers most of the
routine computations inside semantic actions; all that is left is a way to `tap' into the stack
automaton built by \bison\ using an interface similar to the special \.{\$$n$} variables utilized
by the `genuine' \bison\ parsers (i.e.\ written in \Cee\ or any other target language supported
by \bison). This r\^ole is played by several varieties of \.{\\yy$\,p$} command sequences (for
the sake of completeness, $p$ stands for one of \.{($n$)}, \.{[{\rm name}]}, \.{]{\rm name}[} or
$n$, where $n$ is a string of digits, and a `name' is any name acceptable as a symbolic name for
a term in \bison). Instead of going into the minutia of various flavors of \.{\\yy}-macros, let
me just mention that one can get by with only two `idioms' and still be able to write parsers of
arbitrary sophistication: \.{\\yy($n$)} can be treated as a token register containing the value
of the $n$-th term of the rule's right hand side, $n>0$. The left hand side of a production is
accessed through \.{\\yyval}.
A convenient shortcut is \.{\\yy0\{{\rm \TeX\space material}\}}, which will expand (as in
\.{\\edef}) the `\TeX\ material' inside the braces and store the result in \.{\\yyval} (note that
this only works if a sequence of \.{0}s, possibly followed or preceded by spaces, is all that
appears between \.{\\yy} and the opening brace; see the discussion of the \.{\\bb} macros below
for a description of what happens in other cases). Thus, a simple way to concatenate the values
of the first two production terms is \.{\\yy0\{\\the\\yy(1)\\the\\yy(2)\}}. The included \bison\
parser can also be used to provide support for `symbolic names', analogous to \bison's
\.{{\$}[{\rm name}]}, but a bit more effort is required on the user's part to initialize such
support. Using symbolic names can make the parser more readable and maintainable, however. There
is also a \.{\\bb$\,n$} macro, which has no analogue in the `real' \bison\ parsers, and provides
access to the term values in the `natural order' (e.g.~\.{\\bb1} is the last term in the part of
the production preceding the action). Its intended use is with the `inline' rules (see the main
parser for such examples). As of version \.{3.0} \bison\ no longer outputs |yyrhs| and |yyprhs|,
which makes it impossible to produce the |yyrthree| array necessary for processing such rules in
the `left to right' order. One might also note that the new notation is better suited for the
inline rules, since the value that is pushed on the stack is that of \.{\\bb0}, i.e.~the term
implicitly inserted by \bison. Be aware that there are no \.{\\bb[$\cdot$]} or \.{\\bb($\cdot$)}
versions of these macros, for obvious reasons\footnote{The obvious reason is the mechanism used
by \.{\\yy[$\cdot$]} and \.{\\yy($\cdot$)} to make them expandable. These macros are basically
references to the appropriate token registers. Since the \.{\\bb} macros were designed for the
environment where \.{\\yylen} is unknown, establishing such references before the action is
executed is not possible. A less obvious reason is the author's lazy approach.}. A less obvious
feature of these macros is their `nonexpandable' nature. This means they cannot be used inside
\.{\\edef} (just like their \.{\\yy$\,n$} counterparts, they make several assignments that will
not be executed by \.{\\edef}). Thus, the most common use pattern is \.{\\bb$\,n$\{\\toks$\,m$\}}
(where $n>0$) with a subsequent expansion of \.{\\toks$\,m$}\footnote{Similar to how
\.{\\yy$\,n$} macros work, whenever $n>0$, this pattern simply puts the contents of the braced
group that follows in front of a (braced) single expansion of the appropriate value on the stack.
If, as in the example above, the contents of the braced group are \.{\\toks$\,m$}, this
effectively stores the value of the braced group in the token register. In the case $n=0$ the
meaning is different: the stack value {\em corresponding to the implicit term\/} becomes the
expanded (by \.{\\edef}) contents of the braced group following \.{\\bb$\,n$}.}. Making these
macros expandable is certainly possible but does not seem crucial for the intended limited use
pattern. Finally, the scripts distributed with \splint\ include a postprocessor (see the
appropriate~\.{Makefile} for further details) that allows the use of the `native' \bison\ term
references (i.e.\ of the form \.{\char`\$}$\ldots$) to access the value stack\footnote{Incidentally,
\bison\ itself uses a preprocessor (\.{M4}) to turn its term references into valid \Cee.}.
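
Putting some of the pieces together: inside a \CWEB\ section, the same concatenation idiom,
written the way actions appear in this package's grammar files, would read
$$
\.{TeX\_( "/yy0\{/the/yy(1)/the/yy(2)\}" );}
$$
with `\.{/}' playing the r\^ole of the escape character; the \.{TeX\_($\,\ldots\,$)} macro and
the reasons for this choice of the escape character are discussed in a section below.
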
Utilizing the postprocessor allows any syntax for term references used by \bison\ to be used
inside the \.{TeX\_($\,\ldots\,$)} \Cee\ macros. In this case, a typical idiom is
\.{\\the\char`\$[term\_name]} to get the value of \prodstyle{term_name}. While storing a new
value for the term (i.e.\ modifying the value stack) is also possible, it is not very
straightforward and thus not recommended. This applies to all forms of term references discussed
above. Naturally, a parser writer may need a number of other data abstractions to complete the
task. Since these are highly dependent on the nature of the processing the parser is supposed to
provide, we refer the interested reader to the parsers included in the package as a source of
examples of such specialized data structures. One last remark about the parser operation is worth
making here: the parser automaton itself does not make any \.{\\global} assignments. This (along
with some careful semantic action writing) can be used to `localize' the effects of the parser
operation and, most importantly, to create `reentrant' parsers that can, e.g., call
{\it themselves\/} recursively.

@*1 `Optimization'. \namedspot{optimization}By default, the generated parser and scanner keep all
of their tables in separate token registers. Each stack is kept in a single macro (this
description is further complicated by the support for parser {\it namespaces\/} that exists even
for unoptimized parsers, but this subtlety will not be mentioned again---see the macros in the
package for further details). Thus, every time a table is accessed, it has to be expanded, making
the table access latency linear in {\it the size of the table}. The same holds for stacks and the
action `switches', of course. While keeping the parser tables (which are immutable) in token
registers does not have any better rationale than saving the control sequence memory (the most
abundant memory in \TeX), this way of storing {\it stacks} does have an advantage when multiple
parsers get to play simultaneously. All one has to do to switch from one parser to another is to
save the state by renaming the stack control sequences. When the parser and scanner are
`optimized', all these control sequences are `spread over' appropriate associative arrays. One
caveat to be aware of: the action switches for both the parser and the scanner have to be output
differently (a command line option is used to control this) for optimized and unoptimized
parsers. While it is certainly possible to optimize only some of the parsers (if your document
uses more than one) or even only some {\it parts\/} of a given parser (or scanner), the details
of how to do this are rather technical and are left for the reader to discover by reading the
examples supplied with the package. At least at the beginning, it is easier to simply set the
highest optimization level and use it consistently throughout the document.

@*1 {\it \TeX\/} with a different {\sl slant} or do you C an escape?. Some \TeX\ productions
below probably look like alien script. The authors of \cite[Er] cite a number of reasons to view
pretty printing of \TeX\ in general as a nearly impossible task. The macros included with the
package follow a very straightforward strategy and do not try to be very comprehensive. Instead,
the burden of presenting \TeX\ code in a readable form is placed on the programmer. Appropriate
hints can be supplied by means of indenting the code, using assignments ($=$) where appropriate,
etc.
If you would rather look at straight \TeX\ instead, the line \.{\\def\\texnspace\{other\}} at the
beginning of this section can be uncommented and
{\let\writetexidxentry\writetextxtidxentry
|TeX_( "/noexpand/inmath{/yy0{/yy1{}}}" );|} becomes
{%
\def\texnspace{other}%
\def\texispace{other}% for the index
\let\writetexidxentry\writetextxtidxentry % for the index appearance
|TeX_( "/noexpand/inmath{/yy0{/yy1{}}}" );|}.
There is, however, more to this story. A look at the actual file will reveal that the line above
was typed as
$$
\.{TeX\_( "/noexpand/inmath\{/yy0\{/yy1\{\}\}\}" );}
$$
The `escape character' is leaning the other way! The lore of \TeX\ is uncompromising: `\.{\\}' is
{\it the\/} escape character. What is the reason to avoid it in this case? The mystery is not
very deep: `\.{/}' was chosen as an escape character by the parser macros (a quick glance at
\.{?yytab.tex} will reveal as much). There is, of course, nothing sacred (other than tradition,
which this author is trying his hardest to follow) about what character code the escape character
has. The reason to look for an alternative is straightforward: `\.{\\}' is a special character in
\Cee, as well (also an `escape', in fact). The line \.{TeX\_( "..." );} is a {\it macro call\/},
but $\ldots$ in \Cee. This macro simply prints out (almost `as-is') the material in parentheses.
An attempt at \.{TeX\_( "\\noexpand" );} would result in
\numberlinestrue
\begindemo
^
^oexpand
\enddemo
\numberlinesfalse
Other escape combinations\footnote{Here is a full list of {\it defined\/} escaped characters in
\Cee: \.{\\a}, \.{\\b}, \.{\\f}, \.{\\n}, \.{\\r}, \.{\\t}, \.{\\v}, \.{\\}{$[$\it octal
digit\/$]$}, \.{\\'}, \.{\\"}, \.{\\?}, \.{\\\\}, \.{\\x}, \.{\\u}, \.{\\U}. Note that the last
three combinations must be followed by a specific string of characters to appear in the input
without generating errors.} are even worse: most are simply undefined. If anyone feels trapped
without an escape, however, the same line can be typed as
$$
\.{TeX\_( "\\\\noexpand\\\\inmath\{\\\\yy0\{\\\\yy1\{\}\}\}" );}
$$
Twice the escape! If one looks even closer at the code, another oddity stands out: there are no
\.{\$}'s anywhere in sight. The big money, \.{\$}, is a beloved character in \bison. It is used
in action code to reference the values of the appropriate terms in a production. If mathematics
pays your bills, use \.{\\inmath} instead.

@i bo.x
@i lo.x
@i fo.x
@i so.x
@i np.x

@i common.w
@i bs.w
@i fk.w
@i philosophy.w
@i checklists.w
@i references.w

\let\hostparsernamespace\mainnamespace % to typeset examples in the text
                                       % properly
@**Index.
\global\let\secrangedisplay\empty% do not show the current section range anymore

This section is, perhaps, the most valuable product of \CWEB's labors. It lists references to
definitions (set in {\it italic}) as well as uses for each \Cee\ identifier used in the source.
Special facilities have been added to extend the indexing to \bison\ grammar terms, \flex\
regular expression names and state names, as well as \flex\ options, and \TeX\ control sequences
encountered in \bison\ actions. Definitions of tokens (via \prodstyle{\%token},
\prodstyle{\%nterm} and \prodstyle{\%type} directives) are
%$\underline{\hbox{underlined}}$
typeset in {\bf bold}. The \bison\ and \TeX\ entries are put in distinct sections of the index in
order to keep the separation between the \Cee\ entries and the rest.
It may be worth noting that the {\it definition\/} of the symbol is listed under both its `macro
name' (such as \.{CHAR}, typeset as \prodstyle{CHAR} in the case of the grammar below), as well
as its `string' name if present (to continue the previous example, \.{"char"} is synonymous with
\prodstyle{CHAR} after a declaration such as `\prodstyle{\%token} \prodstyle{CHAR} \.{"char"}'),
while the {\it use\/} of the term lists whichever token form was referenced at the point of use
(both forms are accessible when the entry is typeset for the index and a macro can be written to
mention the other form as well). The original syntax of \bison\ allows the programmer to declare
tokens such as \prodstyle{'\{'} and \prodstyle{'\}'}, and the indexing macros honor this
convention, even though such declarations are redundant in a typeless environment such as the one
the present typesetting parser operates in. The indexing macros also record the use of such
character tokens. The quotes indicate that the `string' form of the token's name was used. A
section set in {\it italic\/} references the point where the corresponding term appeared on the
left hand side of a production. A production:
\let\TeXx\TeXxi
\def\gatoks{%
\omit\hfil&\omit\hfil&\omit\hfil\hbox to2em{\hfil}&\omit\hfil\cr
\noalign{\vskip-\baselineskip}%
}%
\beginmprod
left_hand_side: term.1 term.2 term.3 \{\stashed{|TeX_("/do/something/with/yy(1)");|}\}
\endmprod
inside the \TeX\ part of a \CWEB\ section will generate several index entries as well, including
the entries for any material inside the action, mimicking \CWEB's behavior for the {\it inline
\Cee\/} (\.{\yl}$\ldots$\.{\yl}). Such entries (except for the references to the \Cee\ code
inside actions) are labeled with $^\circ$, to provide a reminder of their origin.

This parser collection, as well as the indexing facilities therein, has been designed to showcase
the broadest range of options available to the user and thus it does not always exhibit the most
sensible choices one could make (for example, using a full-blown parser for term {\it names\/} is
poor design, but it was picked to demonstrate multiple parsers in one program). The same applies
to the way the index is constructed (it would be easy to only use the `string' name of the token
if available, thus avoiding referencing the same token twice).

\TeX\ control sequences are listed following the index of all \bison\ and \flex\ entries. The
different sections of the index are separated by a {\it dinkus\/} (\dinkus)@^dinkus (\dinkus)@>.
Since it is nearly impossible to determine at what point a \TeX\ macro is defined (and most of
them are defined outside of the \CWEB\ sources), only their uses are listed (to be more precise,
{\it every\/} appearance of a macro is assumed to be its use). In a few cases, a `graphic'
representation for a control sequence appears in the index (for example, \texref{/getfirst}
represents {\let\writetexidxentry\writetextxtidxentry
\def\texnspace{other}\def\texispace{other}\inlineTeXx{/getfirst}$\!$}). The index entries are
ordered alphabetically. This ordering may not be entirely obvious in the cases when the
`graphical representation' of the corresponding token manifests a significant departure from its
string version (such as \texref{/yy(1)} instead of {\def\texnspace{other}\def\texispace{other}%
\let\writetexidxentry\writetextxtidxentry
|TeX_("/yy(1)");|$\!$}).
Incidentally, for the examples on this page (as well as an example in the section about \TeX\
pretty-printing), both the `graphic' and the `text' versions of the control sequence are indexed.
It is instructive to verify that their location in the index corresponds to the `spelling' of
their visual representation (thus, \texref{/getfirst} appears under `p'). One should also be
aware that the indexing of some terms has been suppressed, since they appear too often.
% TODO: explain the visibility system. Note the anomalous order of \prodstyle{term.1}
% vs.~\prodstyle{term0} due to the dot in \.{term.1}, which is otherwise invisible. Underscore the
% importance of following a consistent naming scheme, including the `stringized' versions
% of token names.
@q Include the list of index section markers; this is a hack to get around @>
@q the lack of control over the generation of \CWEB's index; the correct order @>
@q of index entries depends on the placement of this inclusion @>
@i alphas.hx

\makeunindexablex{{\csstring\the}{\csstring\nx}{\csstring\yy}{\csstring\yylexnext}%
{\csstring\else}{\csstring\fi}{\csstring\yyBEGIN}{\csstring\next}}