@q file: intro.w@>
@q%   Copyright Dave Bone 1998 - 2015@>
@q% /*@>
@q%    This Source Code Form is subject to the terms of the Mozilla Public@>
@q%    License, v. 2.0. If a copy of the MPL was not distributed with this@>
@q%    file, You can obtain one at http://mozilla.org/MPL/2.0/.@>
@q% */@>
@** Summary of \O2 --- Yacco${_2}$'s nickname.\fbreak
The compiler / compiler's formal name is \Yacco2 but call me \O2.
\Yacco2 can be morphed many ways; here are some hints: sound of a cold, molecular contortions.
Do your own expletives.
@*2 Component overview running \O2.\fbreak
I use a ``.lex'' extension to distinguish a grammar file.
This is not hard coded. You can choose your own memory mnemonics for any of my files.
The ``.T'' file extension identifies the Terminal vocabulary.
Its components are described later in the document.
The Lrk and Rc terminals are pre-assembled and reside in the 
``\Whereyacco2/library/grammars'' account.
My original thought was to allow the compiler writer to experiment with his own 
terminal definitions for all classes: LR constants, raw characters, errors, and terminals.
From experience, only the last 2 classes are local to each language being defined.
\fbreak
\fbreak
\convertMPtoPDF{"/usr/local/yacco2/diagrams/o2_overview.1"}{1}{1}
\fbreak
\fbreak
Please note, the ``header'' files are absent from the diagram due to space constraints.
The salient ones to note are:\fbreak
\ptindent{1) the enumeration of the vocabulary's symbols}
\ptindent{2) headers for fsm, and the terminal classifications: lrk,error,rc,and T}
Per compiled ``xxx.lex'' grammar, the 2 status files ``xxx\_tracings.log'' and ``xxx\_errors.log'' 
hold the text-only compiled results lodged within the local grammar's folder.
``T-Alphabet'' gets gened when the ``-t'' option is inputted
to gen the Terminal vocabulary.
It is a file by defined order of all the terminals's literal names
 used as comments against the outputted lookahead tables 
to make sense of their compressed set definitions.
Its file name is built from the grammar's ``T-enumeration'' construct 
using its filename and adding 
a ``.fsc'' extension. 
\Olinker cross checks the number of terminals defined
 in the ``no-of-T'' value in each grammar's ``fsc'' file against this file.
Out-of-sync values indicates that new terminals have been added to
either the ``error'' or ``terminal'' class terminals vocabulary
 without gening up the grammar with the ``-t'' and/or ``-err'' option(s).
The ``yyy.fsc'' file is the grammar writer's handcrafted file containing
references to these ``xxx.fsc'' files 
and the T-Alphabet file for \Olinker to
compile. Its name can be anything but i use the ``.fsc'' as a memory jog.
Please see \Olinker's documentation on its raison d'\^etre and make file comments.

@*3 Tracing facilities.\fbreak
Some of the more important tracing facilities are as follows where their mnemonic
replaces the ``xx'': \fbreak 
\ptindent{TH --- dynamic trace of the grammar's parse stack when its ``debug option is true}
\ptindent{T --- trace terminals fetched across all grammars}
\ptindent{AR --- trace arbitration when grammar's debug switch is true}
\ptindent{MSG --- dynamic threading messages between the co-operatives}
Please see \O2's library documentation concerning 
each tracing variable when set to 1 within your 
program by the programmer: ``yacco2::YACCO2\_xx\_\_ = 1;''
starts their specific scribblings.
There are other less important trace variables not listed above. 
The turned on \O2's library trace facility will log to ``xxx\_tracings.log' file where `xxx' 
represents the grammar being parsed without its extension. 
As the ``xxx\_tracings.log'' is text-only content, this allows the use of a general text editor
to browse its material.
If the editor has indigestion due to its volume, a script can be written to postprocess 
it for study  by the ``sed'' / ``grep'' combo or using just the ``split'' utility. 

@*2 Grammar anatomy.\fbreak
The grammar is composed of your traditional 
components: start rule, non-terminal vocabulary,
terminal vocabulary, and 2 additional parts: fsm and syntax directed code.
``fsm'' (short for finite state machine) is a packaging agent.
It houses all the grammar's software generated parts along with 
the c++ syntax directed code within the grammar
 associated with their directives.
These directives are local to the grammar's rules, subrules, and possibly
 the grammar's start-run-finish sequence that is handled within the fsm. 
``fsm'' supplies the grammar's c++ namespace,
 class name, and filename prefix  to output the components to.

@*2 Terminal vocabulary.\fbreak
From the diagram below, the ``enumeration'' component is a packaging agent that
receives the outputed enumeration definitions for each terminal class.
The counting scheme uses the natural numbers
starting from 0 listing the ``lrk'' constants, followed by each of the 
other components's terminals.
The last component ``terminals'' is your regular terminal definitions that gets assembled from
the lexical or syntactical passes, and possibly out into etherland of abstraction.
All vocabulary elements are tagged this way.
It is the glue to all the emitted tables.
For the record, each grammar's non-terminal vocabulary (rules)
 are enumerated after the terminal enumeration count
and are defined within the grammar's fsm class definition. 
The rules's subrules are also enumerated and defined there. 
They are not dependent on the Terminal vocabulary.
\fbreak
\fbreak
\convertMPtoPDF{"/usr/local/yacco2/diagrams/o2_Toverview.1"}{1}{1}
\fbreak


@*2 Overview of generating the grammar's pdf and postscript(ps) documents.\fbreak
There are 2 generated documents using the ``-p'' option emitting 
``cweave'' content and associated ``mpost'' diagrams for compilation:\fbreak
\INDENT{.35in}{1) grammar with its syntax direct code,}
\INDENT{.5in}{emitted \O2linker file, and gened lr1 state network}
\INDENT{.35in}{2) various cross references against the grammar, and lr1 state network}
\INDENT{.7in}{- symbols used from each rule's subrules symbol string position}
\INDENT{.7in}{- additional information supporting the lr1 state network in the 1st document}
\INDENT{1in}{$\bullet$ each lr1 state's rules follow sets}
\INDENT{1in}{$\bullet$ reducing states subrules with references to their contributors' follow sets}
\INDENT{1in}{$\bullet$ global lookahead sets with their yield used by the parse reduce operation}
The below diagram shows the manufacturing
of a grammar's document.
\fbreak
\fbreak
\fbreak
\convertMPtoPDF{"/usr/local/yacco2/diagrams/docgeneration.1"}{1}{1}
\fbreak
\fbreak
Also for each ``pdf'' document generated there is a postscript document to
remove the dependency of a ``pdf reader'' and its ``gui'' interaction when wanting to 
spool the document for print.
For example on my Sun Solaris, the program ``pdftops'' 
takes a ``pdf'' document and creates an equivalent ``ps'' document.
Spooling it to a print would use the command line ``lp xxx.ps''.

@*2 A sample \O2 script where the options are described.\fbreak
\O2's input data template is $[options]$ filename where options are optional as
they have preconfigured values.
Here are the switches that can be inputted to \O2 using the Unix approach to turn
on the specific option. Each option must be inputted with its own $-$ sign.\fbreak
{\bf Options}: if they are not present, do not generate\fbreak
\ptindent{1) -t --- generate the Terminal vocabulary}
\ptindent{2) -err --- generate the Error vocabulary}
\ptindent{3) -lrk --- generate the lr k vocabulary: Deprecated not supported}
\ptindent{4) -rc --- generate the Raw characters vocabulary: Deprecated not supported}
\ptindent{5) -p --- generate the grammar's documents}
Points 1 and 2 are usually stable and do not need to be gened. Do so when these
vocabularies have been modified. Don't forget to regen all the grammars 
and re-run \O2{}linker to reprocess the ``fsc'' files if the number of
Terminals in the vocabularies has changed as all the Lookahead sets are now different along
with the enumeration scheme that ties them together.  
Points 3 and 4 cannot be used as they reside in ``\Whereyacco2/library/grammars'' pregened.
I included them as a memory jog to my 
experiments; if u try to input these deprecated options u'll get an error message.
Other options were experimented with but found boarderline marginal: 
gen namespace, gen the grammar, and turn on debug of grammar instead of using the editor cycle to 
modify the grammar to be traced. Now namespace and grammar are always generated, and hello editor.
So out damn spot.

Here is a batch command file that runs on a Microsoft's NT/XP desktop.
The same can be done within a ``Unix'' flavour script language like ``Bash''.
Though it does not illustrate a conditional test as to whether 
the script should continue when the grammar is faulty, 
\O2 returns a 0 to indicate a healthy grammar
and a 1 to indicate a sick  grammar:
The gory details are in the error log. 
\let\setuplistinghook = \linenumberedlisting
\listing{"/usr/local/yacco2/diagrams/o2run_example.txt"}
\fbreak
\let\setuplistinghook = \relax
The above example uses line numbers delimited by ``:'' at the start of each
line  for commentary purposes.
Line 3 sets the directory where \O2 resides and repository
for the temporary files from \O2, ``mpost'' that draws the grammar diagrams,
while ``cweave'' generates the ``xxx.tex'' file for ``pdftex'' 
who completes the document for an ``Adobe'' reader program: for example ``xpdf'' of open source
or Adobe's reader.
``cweave'' is one of the programs by Donald E. Knuth and Silvio Levy
from their book ``The CWEB System of Structured Documentation''.
Go to the web site ``www.tex.org'' for more information on how to obtain
``CWEB''. The same comments apply to ``mpost'' written by John D. Hobby of 
`Bell Labs''. 
This is a remake of  ``MetaFont''language / ``MetaPost'' program by Donald E. Knuth. 
These programs are grrrreat! More people should be using them.
The emitted grammar files get placed in the same directory 
of the inputted grammar to \O2.

Line 6 runs \O2 with its inputted grammar file 
``/yacco2/compiler/grammars/enumerate\_grammar.lex'' 
and switches to gen up the Terminal and Error vocabularies 
and gen a printed set of documents.
\O2 also generates the documentation files
 ``enumerate\_grammar.mp'' and ``enumerate\_grammar.w'' files.
Lines 7--9 are command lines to create the output document.
In the example, ``enumerate\_grammar.pdf'' is the final
file document for printing.
``enumerate\_grammar.xx'' are figure files generated by mpost from file
``enumerate\_grammar.mp''.
These files are referenced in the ``enumerate\_grammar.w'' 
file by cweave who produces an ``enumerate\_grammar.tex'' file for program pdftex.
All this to say that there can be many generated files before the document is complete.
Please note the other cross reference document 
is not shown but follows the same run pattern.

@*2 Some definitions.\fbreak
\fbreak
Non-terminal:\fbreak
This is your normal grammar definition. I interchange this term with ``rule''. 
They are the same. Depending on the context, i also use  rule 
in the same sense of a grammar's production. 
To refine the context, the term ``subrule'' indicates one of a rule's productions.\fbreak
\fbreak
Subrule:\fbreak
Equivalent to a grammar's production.
It is one of a rule's right-hand-side string of symbols drawn from 
the non-terminal or Terminal alphabets.
The string can be empty indicating epsilon.\fbreak

Please see the mavelous book ``Formal Languages and Their Relationn to Automata'' 
by Hopcroft and Ullman for a complete discussion on grammars and their makeup.
Excellent reading for a 1968 vintage on automata.\fbreak
\fbreak
Here are some basic definitions used by my lr1 generator.\fbreak
\fbreak
First set:\fbreak
Please see |first_set_rules.lex| grammar for a more thorough discussion.
First set is the set of terminals that begins a string of symbols. 
When the symbol is a rule, then all its subrules contribute to the first set.
This is a recursive definition as the rule's subrules can also bring in other 
rules' subrules string of symbols that contribute to it.
If the string's start symbol is a rule and its epsilonable,
then its right neighbour also contributes. Again if its a rule and epsilonable
its right neighbour is a contributor: ahh recursive definitions.
\fbreak
\fbreak
Follow set:\fbreak
The rule's first set of production strings to the right of a lr state's configuration.
Here is a simple arithmetic grammar to illustrate follow sets for the ``Closure-only'' state
where all production strings have their configuration
position at their very beginning illustrated by ''$ _.$''. 
I use a form of Dewey decimal notation to reference the production's configuration.
For example ``E.1.2'' means the production of rule ``E'' referencing subrule 1 
of its second symbol is being refered to.
In this example below the referenced symbol is $+$.
How follow sets are arrived at is discussed in ``Overview of \O2's state generated components''.
 \fbreak

\settabs\+\indent&1in\qquad&\qquad&\qquad&\cr
\+&\it Rule&&{\it subrule's symbols}\cr
\+&S &$\rightarrow$ &$ _.$E $\bot$\cr
\+&E &$\rightarrow$ &$ _.$E $+$ T\cr
\+&&$\rightarrow$ &$ _.$T\cr
\+&T&$\rightarrow$ &$ _.$T $*$ F\cr
\+&&$\rightarrow$ &$ _.$F\cr
\+&F&$\rightarrow$ &$ _.$( E )\cr
\+&&$\rightarrow$ &$ _.$id\cr
\fbreak
\settabs\+\indent&mmmmmm&\qquad&followset&\qquad&R.SR.Pos follow setxxxxxx &\qquad&transitions&\cr
%\settabs 4 \columns
\+&\it Rule&&{\it follow set}&& {\it R.SR.Pos of follow set}&&{\it with transitions}\cr
\+&E && $\bot\  +$ &&$\bot$ by S.1.2, $+$ from E.1.2&& $\bot\  +$\cr
\+&$\uparrow$ && && E.2.2\cr
\+&T&& $*$&&$*$ by T.1.2&&$*\ \bot\  +$\cr
\+&$\uparrow$ && && T.2.2\cr
\+&F&& $\emptyset$&&&&$*\ \bot\  +$\cr
\fbreak
\hbox{\hskip.5in Table of follow sets for the ``Start state'' of the above grammar}\fbreak

There are 3  subtleties that are watched for in the follow set calculation:\fbreak
\ptindent{1) rule symbol --- use its ``first set''}
\ptindent{2) epsilonable rule symbol --- continue to next symbol in follow string for assessment}
\ptindent{3) end of symbol string reached --- transition}
Point 3 requires some explanation.
Its condition means that the rule's right-hand-side has been consumed (or is epsilon)
 so what's
it follow set? Nothing?
No it's the subrule's rule that spawned it that provides more follow set context.
This context resides in the ``closure'' state of this rule.
So now there is a transition to this rule's follow set.
This is the transitive closure of spawning contexts.
The Table of follow sets shows these transitions with the $\uparrow$ symbol.
Epsilon rules are chameleon in nature: they supply their first sets and
also disappear and so u must continue to the next symbol in the follow string
to complete the follow set while observing the 
end-of-string condition to follow its transitions.

@*2 Catalogue of \O2's files.\fbreak
Cweb Documents:\fbreak
\ptindent{1) \Yacco2 parse library}
\ptindent{2) \o2{}extern --- external routines}
\ptindent{3) \Yacco2{}stbl --- symbol table}

\O2's Input files to |cweb|:\fbreak
\ptindent{1) |o2.w| --- master file that starts things off}
\ptindent{2) |intro.w| --- introduction }
\ptindent{3) |defs.w| --- basic definitions to gen lr1 network}
\ptindent{4) |prog.w| --- \o2 cweb code }
\ptindent{5) |bug.w| --- confessions}
\ptindent{6) |o2_defs.w| --- details}
\ptindent{7) |includes.w| --- bring in those grammars for the parsing}
\ptindent{8) |o2externs.w| --- external routines}
|cweb| generated files:\fbreak
\ptindent{1) |o2.h| --- compiler definitions}
\ptindent{2) |o2.cpp| --- \O2 program}
\ptindent{3) |o2_defs.cpp| --- structure implementations}
\ptindent{4) |o2_externs.h| --- global definitions used across \o2's source code}

\O2's generated files where xxx is the grammar's name being compiled:\fbreak
\ptindent{1) |xxx.fsc| --- grammar's first set confessions for Linker}
\ptindent{2) |xxx.h| --- grammar's header file}
\ptindent{3) |xxx.cpp| --- automaton code}
\ptindent{4) |xxxsym.cpp| --- automaton symbols}
\ptindent{5) |xxxtbl.cpp| --- automaton's state definitions}

\Yacco2 library memorabilia:\fbreak
\ptindent{1) yacco2 --- library namespace}
\ptindent{2) ``\Whereyacco2/library'' --- yacco2's library directory}
\ptindent{3) ${< yacco2.h >}$  --- Yacco2's library header file}
\ptindent{4) ``library directory/xxxx'' - xxxx is the debug or release of the object library}

Dependency files from \Yacco2 sub-systems:\fbreak
\ptindent{|yacco2.h| - basic definitions used by Yacco2}
\ptindent{|yacco2_T_enumeration.h| - terminal enumeration for Yacco2's terminal grammar alphabet}
\ptindent{|yacco2_err_symbols.h| - error terminal definitions from Yacco2's grammar alphabet}
\ptindent{|yacco2_characters.h| - raw character definitions from Yacco2's grammar alphabet}
\ptindent{|yacco2_k_symbols.h| - constant terminal definitions from Yacco2's grammar alphabet}
\ptindent{|yacco2_terminals.h| - regular terminal definitions from Yacco2's grammar alphabet}
\ptindent{|*.h| - assorted grammar definitions from Yacco2 to parse}
\ptindent{|o2_externs.h| - external support routines for \O2}

Grammars\fbreak
\ptindent{|pass3.lex| --- lex and syntactic phase of grammar}
\ptindent{|la_expr_source.lex| --- lexical phase of lookahead expression}
\ptindent{|la_expr.lex| -- syntactic phase of  lookahead expression}
\ptindent{|enumerate_T_alphabet.lex| -- logic grammar to assign each T a number from 0..n}
\ptindent{|epsilon_rules.lex| -- grammar determines epsilon per rule and pathological conditions}
\ptindent{|first_set.lex| -- logic grammar to calculate each rule's first set}
\ptindent{|prt_fs_of_rules.lex| -- logic grammar to print each rule's first set}
\ptindent{|enumerate_grammar.lex| -- dump aid: enumerate grammar's components}

Globals\fbreak
\ptindent{|LR1_STATES| --- list of gened lr1 states}
\ptindent{|LR1_COMMON_STATES| --- common states map having same vectored into symbol}
\ptindent{|START_OF_RULES_ENUM| --- used in shift / reduce conflict evaluation}

Comments:\fbreak
My external routines use the all upper case approach to names. I know it's like shouting but it
clues the reader where the heck the routine comes from. I could have
tempered the all caps approach to a capital letter 
but i'm myopic and becoming
visually golden in age. 
So my excuses to the reader for this tasteless approach.  

@*2 \O2's language.\fbreak
There are 3 languages that are actually parsed: 2 in preparation --- 
command line and its contents, and the grammar file.
A grammar is divided into 4 parts:\fbreak
\ptindent{a) Finite automaton definition --- basic statements about the grammar}
\ptindent{b) Parallel parse that defines a threading grammar}
\ptindent{c) Terminal vocabulary: errors, lr k, raw characters, and terminals}
\ptindent{d) Rule definitions}

\let\setuplistinghook = \linenumberedlisting
\listing{"/usr/local/yacco2/diagrams/eol.txt"}
\fbreak
\let\setuplistinghook = \relax

The above source listing is an example of a threaded grammar.
Starting each source line is a line number suffixed by `:' present 
only for discussion purposes.
Line numbers 6--10 defines the fsm component. 
Lines 11--19 indicates that the grammar is a thread.
Though the terminal vocabulary definitions are hidden
by line 20, it illustrates the file include feature of \O2.
Lines 22--39 are the rule definitions.
Each grammar's section has a defining keyword like ``fsm'', ``parallel-parser'', ``rules''
that introduces the part being defined. 


@*2 C macros.\fbreak
Originally there were conditionally defined trace variables
that controlled the inclusion of trace code.
This was a pain-in-the-seat so now they are global
variables that test their values.
I felt the slight speed bump merited the facility without the 
combinatorics of libraries needed for distribution.
|YACCO2_define_trace_variables| macro defines these global variables used
by \O2's tracing purposes.
U can roll your own or just include the macro in your code.
These variables are dormant until their values are not zero.
Without their inclusion, a linker message of unresolved variable will be
regurgitated: they must be present
when using the \O2 library. It's an easy way to define them within your program.
Please see \O2 library documentation for a discussion on each trace variable.
To activate a specific tracing, assignment a non zero value to
the selected trace variable: set it to 1.
Here is their catalogue:\fbreak
\ptindent{|YACCO2_T__| --- trace terminal when fetched}
\ptindent{|YACCO2_TLEX__| --- trace macros of emitted grammar: rules and user emergency macros}
\ptindent{|YACCO2_MSG__| --- trace thread messages}
\ptindent{|YACCO2_MU_TRACING__| --- trace acquire / release of trace mutex}
\ptindent{|YACCO2_MU_TH_TBL__| --- trace acquire / release mutex of thread table}
\ptindent{|YACCO2_MU_GRAMMAR__| --- trace acquire / release each grammar's mutex}
\ptindent{|YACCO2_TH__| --- trace the parse stack: fsa and syntax directed activities }
\ptindent{|YACCO2_AR__| --- trace arbitrator procedure}
\ptindent{|YACCO2_THP__| --- trace thread performance}
They are enrobed by namespace yacco2.
To set the trace variable be sure the namespace is delared: either explicitly
as in: \fbreak
\ptindent{|yacco2::YACCO2_T__| = 1;}
 or implicitly by a ``using namespace yacco2;'' statement
somewhere preceding the assignment:\fbreak
\ptindent{using namespace yacco2;}
\ptindent{...}
\ptindent{|YACCO2_T__| = 1;} 
\fbreak
Each traced output line identifies its type by the trace variable turned on.
As tracing can be very very volumnious, post evaluating the output thru 
a Bash type filter script makes the log output manageable.
I say this from experience as some editors blow up due to the size of the traced file. 
Names withheld to protect the innocent.