@q file: intro.w@> @q% Copyright Dave Bone 1998 - 2015@> @q% /*@> @q% This Source Code Form is subject to the terms of the Mozilla Public@> @q% License, v. 2.0. If a copy of the MPL was not distributed with this@> @q% file, You can obtain one at http://mozilla.org/MPL/2.0/.@> @q% */@> @** Summary of \O2 --- Yacco${_2}$'s nickname.\fbreak The compiler / compiler's formal name is \Yacco2 but call me \O2. \Yacco2 can be morphed many ways; here are some hints: sound of a cold, molecular contortions. Do your own expletives. @*2 Component overview running \O2.\fbreak I use a ``.lex'' extension to distinguish a grammar file. This is not hard coded. You can choose your own memory mnemonics for any of my files. The ``.T'' file extension identifies the Terminal vocabulary. Its components are described later in the document. The Lrk and Rc terminals are pre-assembled and reside in the ``\Whereyacco2/library/grammars'' account. My original thought was to allow the compiler writer to experiment with his own terminal definitions for all classes: LR constants, raw characters, errors, and terminals. From experience, only the last 2 classes are local to each language being defined. \fbreak \fbreak \convertMPtoPDF{"/usr/local/yacco2/diagrams/o2_overview.1"}{1}{1} \fbreak \fbreak Please note, the ``header'' files are absent from the diagram due to space constraints. The salient ones to note are:\fbreak \ptindent{1) the enumeration of the vocabulary's symbols} \ptindent{2) headers for fsm, and the terminal classifications: lrk,error,rc,and T} Per compiled ``xxx.lex'' grammar, the 2 status files ``xxx\_tracings.log'' and ``xxx\_errors.log'' hold the text-only compiled results lodged within the local grammar's folder. ``T-Alphabet'' gets gened when the ``-t'' option is inputted to gen the Terminal vocabulary. It is a file by defined order of all the terminals's literal names used as comments against the outputted lookahead tables to make sense of their compressed set definitions. Its file name is built from the grammar's ``T-enumeration'' construct using its filename and adding a ``.fsc'' extension. \Olinker cross checks the number of terminals defined in the ``no-of-T'' value in each grammar's ``fsc'' file against this file. Out-of-sync values indicates that new terminals have been added to either the ``error'' or ``terminal'' class terminals vocabulary without gening up the grammar with the ``-t'' and/or ``-err'' option(s). The ``yyy.fsc'' file is the grammar writer's handcrafted file containing references to these ``xxx.fsc'' files and the T-Alphabet file for \Olinker to compile. Its name can be anything but i use the ``.fsc'' as a memory jog. Please see \Olinker's documentation on its raison d'\^etre and make file comments. @*3 Tracing facilities.\fbreak Some of the more important tracing facilities are as follows where their mnemonic replaces the ``xx'': \fbreak \ptindent{TH --- dynamic trace of the grammar's parse stack when its ``debug option is true} \ptindent{T --- trace terminals fetched across all grammars} \ptindent{AR --- trace arbitration when grammar's debug switch is true} \ptindent{MSG --- dynamic threading messages between the co-operatives} Please see \O2's library documentation concerning each tracing variable when set to 1 within your program by the programmer: ``yacco2::YACCO2\_xx\_\_ = 1;'' starts their specific scribblings. There are other less important trace variables not listed above. The turned on \O2's library trace facility will log to ``xxx\_tracings.log' file where `xxx' represents the grammar being parsed without its extension. As the ``xxx\_tracings.log'' is text-only content, this allows the use of a general text editor to browse its material. If the editor has indigestion due to its volume, a script can be written to postprocess it for study by the ``sed'' / ``grep'' combo or using just the ``split'' utility. @*2 Grammar anatomy.\fbreak The grammar is composed of your traditional components: start rule, non-terminal vocabulary, terminal vocabulary, and 2 additional parts: fsm and syntax directed code. ``fsm'' (short for finite state machine) is a packaging agent. It houses all the grammar's software generated parts along with the c++ syntax directed code within the grammar associated with their directives. These directives are local to the grammar's rules, subrules, and possibly the grammar's start-run-finish sequence that is handled within the fsm. ``fsm'' supplies the grammar's c++ namespace, class name, and filename prefix to output the components to. @*2 Terminal vocabulary.\fbreak From the diagram below, the ``enumeration'' component is a packaging agent that receives the outputed enumeration definitions for each terminal class. The counting scheme uses the natural numbers starting from 0 listing the ``lrk'' constants, followed by each of the other components's terminals. The last component ``terminals'' is your regular terminal definitions that gets assembled from the lexical or syntactical passes, and possibly out into etherland of abstraction. All vocabulary elements are tagged this way. It is the glue to all the emitted tables. For the record, each grammar's non-terminal vocabulary (rules) are enumerated after the terminal enumeration count and are defined within the grammar's fsm class definition. The rules's subrules are also enumerated and defined there. They are not dependent on the Terminal vocabulary. \fbreak \fbreak \convertMPtoPDF{"/usr/local/yacco2/diagrams/o2_Toverview.1"}{1}{1} \fbreak @*2 Overview of generating the grammar's pdf and postscript(ps) documents.\fbreak There are 2 generated documents using the ``-p'' option emitting ``cweave'' content and associated ``mpost'' diagrams for compilation:\fbreak \INDENT{.35in}{1) grammar with its syntax direct code,} \INDENT{.5in}{emitted \O2linker file, and gened lr1 state network} \INDENT{.35in}{2) various cross references against the grammar, and lr1 state network} \INDENT{.7in}{- symbols used from each rule's subrules symbol string position} \INDENT{.7in}{- additional information supporting the lr1 state network in the 1st document} \INDENT{1in}{$\bullet$ each lr1 state's rules follow sets} \INDENT{1in}{$\bullet$ reducing states subrules with references to their contributors' follow sets} \INDENT{1in}{$\bullet$ global lookahead sets with their yield used by the parse reduce operation} The below diagram shows the manufacturing of a grammar's document. \fbreak \fbreak \fbreak \convertMPtoPDF{"/usr/local/yacco2/diagrams/docgeneration.1"}{1}{1} \fbreak \fbreak Also for each ``pdf'' document generated there is a postscript document to remove the dependency of a ``pdf reader'' and its ``gui'' interaction when wanting to spool the document for print. For example on my Sun Solaris, the program ``pdftops'' takes a ``pdf'' document and creates an equivalent ``ps'' document. Spooling it to a print would use the command line ``lp xxx.ps''. @*2 A sample \O2 script where the options are described.\fbreak \O2's input data template is $[options]$ filename where options are optional as they have preconfigured values. Here are the switches that can be inputted to \O2 using the Unix approach to turn on the specific option. Each option must be inputted with its own $-$ sign.\fbreak {\bf Options}: if they are not present, do not generate\fbreak \ptindent{1) -t --- generate the Terminal vocabulary} \ptindent{2) -err --- generate the Error vocabulary} \ptindent{3) -lrk --- generate the lr k vocabulary: Deprecated not supported} \ptindent{4) -rc --- generate the Raw characters vocabulary: Deprecated not supported} \ptindent{5) -p --- generate the grammar's documents} Points 1 and 2 are usually stable and do not need to be gened. Do so when these vocabularies have been modified. Don't forget to regen all the grammars and re-run \O2{}linker to reprocess the ``fsc'' files if the number of Terminals in the vocabularies has changed as all the Lookahead sets are now different along with the enumeration scheme that ties them together. Points 3 and 4 cannot be used as they reside in ``\Whereyacco2/library/grammars'' pregened. I included them as a memory jog to my experiments; if u try to input these deprecated options u'll get an error message. Other options were experimented with but found boarderline marginal: gen namespace, gen the grammar, and turn on debug of grammar instead of using the editor cycle to modify the grammar to be traced. Now namespace and grammar are always generated, and hello editor. So out damn spot. Here is a batch command file that runs on a Microsoft's NT/XP desktop. The same can be done within a ``Unix'' flavour script language like ``Bash''. Though it does not illustrate a conditional test as to whether the script should continue when the grammar is faulty, \O2 returns a 0 to indicate a healthy grammar and a 1 to indicate a sick grammar: The gory details are in the error log. \let\setuplistinghook = \linenumberedlisting \listing{"/usr/local/yacco2/diagrams/o2run_example.txt"} \fbreak \let\setuplistinghook = \relax The above example uses line numbers delimited by ``:'' at the start of each line for commentary purposes. Line 3 sets the directory where \O2 resides and repository for the temporary files from \O2, ``mpost'' that draws the grammar diagrams, while ``cweave'' generates the ``xxx.tex'' file for ``pdftex'' who completes the document for an ``Adobe'' reader program: for example ``xpdf'' of open source or Adobe's reader. ``cweave'' is one of the programs by Donald E. Knuth and Silvio Levy from their book ``The CWEB System of Structured Documentation''. Go to the web site ``www.tex.org'' for more information on how to obtain ``CWEB''. The same comments apply to ``mpost'' written by John D. Hobby of `Bell Labs''. This is a remake of ``MetaFont''language / ``MetaPost'' program by Donald E. Knuth. These programs are grrrreat! More people should be using them. The emitted grammar files get placed in the same directory of the inputted grammar to \O2. Line 6 runs \O2 with its inputted grammar file ``/yacco2/compiler/grammars/enumerate\_grammar.lex'' and switches to gen up the Terminal and Error vocabularies and gen a printed set of documents. \O2 also generates the documentation files ``enumerate\_grammar.mp'' and ``enumerate\_grammar.w'' files. Lines 7--9 are command lines to create the output document. In the example, ``enumerate\_grammar.pdf'' is the final file document for printing. ``enumerate\_grammar.xx'' are figure files generated by mpost from file ``enumerate\_grammar.mp''. These files are referenced in the ``enumerate\_grammar.w'' file by cweave who produces an ``enumerate\_grammar.tex'' file for program pdftex. All this to say that there can be many generated files before the document is complete. Please note the other cross reference document is not shown but follows the same run pattern. @*2 Some definitions.\fbreak \fbreak Non-terminal:\fbreak This is your normal grammar definition. I interchange this term with ``rule''. They are the same. Depending on the context, i also use rule in the same sense of a grammar's production. To refine the context, the term ``subrule'' indicates one of a rule's productions.\fbreak \fbreak Subrule:\fbreak Equivalent to a grammar's production. It is one of a rule's right-hand-side string of symbols drawn from the non-terminal or Terminal alphabets. The string can be empty indicating epsilon.\fbreak Please see the mavelous book ``Formal Languages and Their Relationn to Automata'' by Hopcroft and Ullman for a complete discussion on grammars and their makeup. Excellent reading for a 1968 vintage on automata.\fbreak \fbreak Here are some basic definitions used by my lr1 generator.\fbreak \fbreak First set:\fbreak Please see |first_set_rules.lex| grammar for a more thorough discussion. First set is the set of terminals that begins a string of symbols. When the symbol is a rule, then all its subrules contribute to the first set. This is a recursive definition as the rule's subrules can also bring in other rules' subrules string of symbols that contribute to it. If the string's start symbol is a rule and its epsilonable, then its right neighbour also contributes. Again if its a rule and epsilonable its right neighbour is a contributor: ahh recursive definitions. \fbreak \fbreak Follow set:\fbreak The rule's first set of production strings to the right of a lr state's configuration. Here is a simple arithmetic grammar to illustrate follow sets for the ``Closure-only'' state where all production strings have their configuration position at their very beginning illustrated by ''$ _.$''. I use a form of Dewey decimal notation to reference the production's configuration. For example ``E.1.2'' means the production of rule ``E'' referencing subrule 1 of its second symbol is being refered to. In this example below the referenced symbol is $+$. How follow sets are arrived at is discussed in ``Overview of \O2's state generated components''. \fbreak \settabs\+\indent&1in\qquad&\qquad&\qquad&\cr \+&\it Rule&&{\it subrule's symbols}\cr \+&S &$\rightarrow$ &$ _.$E $\bot$\cr \+&E &$\rightarrow$ &$ _.$E $+$ T\cr \+&&$\rightarrow$ &$ _.$T\cr \+&T&$\rightarrow$ &$ _.$T $*$ F\cr \+&&$\rightarrow$ &$ _.$F\cr \+&F&$\rightarrow$ &$ _.$( E )\cr \+&&$\rightarrow$ &$ _.$id\cr \fbreak \settabs\+\indent&mmmmmm&\qquad&followset&\qquad&R.SR.Pos follow setxxxxxx &\qquad&transitions&\cr %\settabs 4 \columns \+&\it Rule&&{\it follow set}&& {\it R.SR.Pos of follow set}&&{\it with transitions}\cr \+&E && $\bot\ +$ &&$\bot$ by S.1.2, $+$ from E.1.2&& $\bot\ +$\cr \+&$\uparrow$ && && E.2.2\cr \+&T&& $*$&&$*$ by T.1.2&&$*\ \bot\ +$\cr \+&$\uparrow$ && && T.2.2\cr \+&F&& $\emptyset$&&&&$*\ \bot\ +$\cr \fbreak \hbox{\hskip.5in Table of follow sets for the ``Start state'' of the above grammar}\fbreak There are 3 subtleties that are watched for in the follow set calculation:\fbreak \ptindent{1) rule symbol --- use its ``first set''} \ptindent{2) epsilonable rule symbol --- continue to next symbol in follow string for assessment} \ptindent{3) end of symbol string reached --- transition} Point 3 requires some explanation. Its condition means that the rule's right-hand-side has been consumed (or is epsilon) so what's it follow set? Nothing? No it's the subrule's rule that spawned it that provides more follow set context. This context resides in the ``closure'' state of this rule. So now there is a transition to this rule's follow set. This is the transitive closure of spawning contexts. The Table of follow sets shows these transitions with the $\uparrow$ symbol. Epsilon rules are chameleon in nature: they supply their first sets and also disappear and so u must continue to the next symbol in the follow string to complete the follow set while observing the end-of-string condition to follow its transitions. @*2 Catalogue of \O2's files.\fbreak Cweb Documents:\fbreak \ptindent{1) \Yacco2 parse library} \ptindent{2) \o2{}extern --- external routines} \ptindent{3) \Yacco2{}stbl --- symbol table} \O2's Input files to |cweb|:\fbreak \ptindent{1) |o2.w| --- master file that starts things off} \ptindent{2) |intro.w| --- introduction } \ptindent{3) |defs.w| --- basic definitions to gen lr1 network} \ptindent{4) |prog.w| --- \o2 cweb code } \ptindent{5) |bug.w| --- confessions} \ptindent{6) |o2_defs.w| --- details} \ptindent{7) |includes.w| --- bring in those grammars for the parsing} \ptindent{8) |o2externs.w| --- external routines} |cweb| generated files:\fbreak \ptindent{1) |o2.h| --- compiler definitions} \ptindent{2) |o2.cpp| --- \O2 program} \ptindent{3) |o2_defs.cpp| --- structure implementations} \ptindent{4) |o2_externs.h| --- global definitions used across \o2's source code} \O2's generated files where xxx is the grammar's name being compiled:\fbreak \ptindent{1) |xxx.fsc| --- grammar's first set confessions for Linker} \ptindent{2) |xxx.h| --- grammar's header file} \ptindent{3) |xxx.cpp| --- automaton code} \ptindent{4) |xxxsym.cpp| --- automaton symbols} \ptindent{5) |xxxtbl.cpp| --- automaton's state definitions} \Yacco2 library memorabilia:\fbreak \ptindent{1) yacco2 --- library namespace} \ptindent{2) ``\Whereyacco2/library'' --- yacco2's library directory} \ptindent{3) ${< yacco2.h >}$ --- Yacco2's library header file} \ptindent{4) ``library directory/xxxx'' - xxxx is the debug or release of the object library} Dependency files from \Yacco2 sub-systems:\fbreak \ptindent{|yacco2.h| - basic definitions used by Yacco2} \ptindent{|yacco2_T_enumeration.h| - terminal enumeration for Yacco2's terminal grammar alphabet} \ptindent{|yacco2_err_symbols.h| - error terminal definitions from Yacco2's grammar alphabet} \ptindent{|yacco2_characters.h| - raw character definitions from Yacco2's grammar alphabet} \ptindent{|yacco2_k_symbols.h| - constant terminal definitions from Yacco2's grammar alphabet} \ptindent{|yacco2_terminals.h| - regular terminal definitions from Yacco2's grammar alphabet} \ptindent{|*.h| - assorted grammar definitions from Yacco2 to parse} \ptindent{|o2_externs.h| - external support routines for \O2} Grammars\fbreak \ptindent{|pass3.lex| --- lex and syntactic phase of grammar} \ptindent{|la_expr_source.lex| --- lexical phase of lookahead expression} \ptindent{|la_expr.lex| -- syntactic phase of lookahead expression} \ptindent{|enumerate_T_alphabet.lex| -- logic grammar to assign each T a number from 0..n} \ptindent{|epsilon_rules.lex| -- grammar determines epsilon per rule and pathological conditions} \ptindent{|first_set.lex| -- logic grammar to calculate each rule's first set} \ptindent{|prt_fs_of_rules.lex| -- logic grammar to print each rule's first set} \ptindent{|enumerate_grammar.lex| -- dump aid: enumerate grammar's components} Globals\fbreak \ptindent{|LR1_STATES| --- list of gened lr1 states} \ptindent{|LR1_COMMON_STATES| --- common states map having same vectored into symbol} \ptindent{|START_OF_RULES_ENUM| --- used in shift / reduce conflict evaluation} Comments:\fbreak My external routines use the all upper case approach to names. I know it's like shouting but it clues the reader where the heck the routine comes from. I could have tempered the all caps approach to a capital letter but i'm myopic and becoming visually golden in age. So my excuses to the reader for this tasteless approach. @*2 \O2's language.\fbreak There are 3 languages that are actually parsed: 2 in preparation --- command line and its contents, and the grammar file. A grammar is divided into 4 parts:\fbreak \ptindent{a) Finite automaton definition --- basic statements about the grammar} \ptindent{b) Parallel parse that defines a threading grammar} \ptindent{c) Terminal vocabulary: errors, lr k, raw characters, and terminals} \ptindent{d) Rule definitions} \let\setuplistinghook = \linenumberedlisting \listing{"/usr/local/yacco2/diagrams/eol.txt"} \fbreak \let\setuplistinghook = \relax The above source listing is an example of a threaded grammar. Starting each source line is a line number suffixed by `:' present only for discussion purposes. Line numbers 6--10 defines the fsm component. Lines 11--19 indicates that the grammar is a thread. Though the terminal vocabulary definitions are hidden by line 20, it illustrates the file include feature of \O2. Lines 22--39 are the rule definitions. Each grammar's section has a defining keyword like ``fsm'', ``parallel-parser'', ``rules'' that introduces the part being defined. @*2 C macros.\fbreak Originally there were conditionally defined trace variables that controlled the inclusion of trace code. This was a pain-in-the-seat so now they are global variables that test their values. I felt the slight speed bump merited the facility without the combinatorics of libraries needed for distribution. |YACCO2_define_trace_variables| macro defines these global variables used by \O2's tracing purposes. U can roll your own or just include the macro in your code. These variables are dormant until their values are not zero. Without their inclusion, a linker message of unresolved variable will be regurgitated: they must be present when using the \O2 library. It's an easy way to define them within your program. Please see \O2 library documentation for a discussion on each trace variable. To activate a specific tracing, assignment a non zero value to the selected trace variable: set it to 1. Here is their catalogue:\fbreak \ptindent{|YACCO2_T__| --- trace terminal when fetched} \ptindent{|YACCO2_TLEX__| --- trace macros of emitted grammar: rules and user emergency macros} \ptindent{|YACCO2_MSG__| --- trace thread messages} \ptindent{|YACCO2_MU_TRACING__| --- trace acquire / release of trace mutex} \ptindent{|YACCO2_MU_TH_TBL__| --- trace acquire / release mutex of thread table} \ptindent{|YACCO2_MU_GRAMMAR__| --- trace acquire / release each grammar's mutex} \ptindent{|YACCO2_TH__| --- trace the parse stack: fsa and syntax directed activities } \ptindent{|YACCO2_AR__| --- trace arbitrator procedure} \ptindent{|YACCO2_THP__| --- trace thread performance} They are enrobed by namespace yacco2. To set the trace variable be sure the namespace is delared: either explicitly as in: \fbreak \ptindent{|yacco2::YACCO2_T__| = 1;} or implicitly by a ``using namespace yacco2;'' statement somewhere preceding the assignment:\fbreak \ptindent{using namespace yacco2;} \ptindent{...} \ptindent{|YACCO2_T__| = 1;} \fbreak Each traced output line identifies its type by the trace variable turned on. As tracing can be very very volumnious, post evaluating the output thru a Bash type filter script makes the log output manageable. I say this from experience as some editors blow up due to the size of the traced file. Names withheld to protect the innocent.