% This program is not copyrighted and can be used freely. % This is PATGEN.WEB in text format, as of August 23, 2004. % Version 1.0 was finished in 1983. % Version 2.0 major revision for `8-bit TeX' (November 8, 1991). % Version 2.1 allows left/right_hypen_min from terminal (April, 1992). % Version 2.2 added `close_in(dictionary)' (August, 1996). % Version 2.3 avoided division by zero - Karl Berry (October, 1996). % Version 2.4 avoided use of uninitialized variables (September, 2009). % A few typos fixed, no version change (April, 2020). % Here is TeX material that gets inserted after \input webmac \def\hang{\hangindent 3em\indent\ignorespaces} \def\PASCAL{Pascal} \def\title{PATGEN} \def\contentspagenumber{45} % should be odd \def\topofcontents{ \line{\tenit Appendix\hfil \mainfont\contentspagenumber} \vfill \null\vskip 40pt \centerline{\titlefont {\ttitlefont PAT}tern {\ttitlefont GEN}eration program} \vskip 8pt \centerline{\titlefont for the \TeX 82 hyphenator} \vskip 15pt \centerline{(Version 2.4, April 2020)} \vfill} \pageno=\contentspagenumber \advance\pageno by 1 @* Introduction. This program takes a list of hyphenated words and generates a set of patterns that can be used by the \TeX 82 hyphenation algorithm. The patterns consist of strings of letters and digits, where a digit indicates a `hyphenation value' for some intercharacter position. For example, the pattern \.{3t2ion} specifies that if the string \.{tion} occurs in a word, we should assign a hyphenation value of 3 to the position immediately before the \.{t}, and a value of 2 to the position between the \.{t} and the \.{i}. To hyphenate a word, we find all patterns that match within the word and determine the hyphenation values for each intercharacter position. If more than one pattern applies to a given position, we take the maximum of the values specified (i.e., the higher value takes priority). If the resulting hyphenation value is odd, this position is a feasible breakpoint; if the value is even or if no value has been specified, we are not allowed to break at this position. In order to find quickly the patterns that match in a given word and to compute the associated hyphenation values, the patterns generated by this program are compiled by \.{INITEX} into a compact version of a finite state machine. For further details, see the \TeX 82 source. The |banner| string defined here should be changed whenever \.{PATGEN} gets modified. @d banner=='This is PATGEN, Version 2.4' {printed when the program starts} @ The original version~1 of \.{PATGEN} was written by Frank~M. Liang @^Liang, Franklin Mark@> in 1982; a major revision (version~2) by Peter Breitenlohner in 1991 @^Breitenlohner, Peter@> is mostly related to the new features of `8-bit \TeX' (version~3 of \TeX 82). The differences between versions~1 and~2 fall into several categories (all of Liang's algorithms have been left essentially unchanged): (1)~enhancements related to 8-bit \TeX, e.g., the introduction of 8-bit |ASCII_code| values and of \.{\\lefthyphenmin} and \.{\\righthyphenmin}; (2)~a modification of the input and output procedures which should make language specific modifications of this program unnecessary (information about the external representation of all `letters' used by a particular language is obtained from the |translate| file); (3)~removal of ANSI standard \PASCAL\ and range check violations; (4)~removal of uninitialized variables; (5)~minor modifications in order to simplify system-dependent modifications. @^range check violations@> @ This program is written in standard \PASCAL, except where it is necessary to use extensions. All places where nonstandard constructions are used have been listed in the index under ``system dependencies.'' @!@^system dependencies@> The program uses \PASCAL's standard |input| and |output| files to read from and write to the user's terminal. @d print(#)==write(output,#) @d print_ln(#)==write_ln(output,#) @d get_input(#)==read(input,#) @d get_input_ln(#)== begin if eoln(input) then read_ln(input); read(input,#); end @# @d end_of_PATGEN=9999 @p @@/ program PATGEN(@!dictionary,@!patterns,@!translate,@!patout); label end_of_PATGEN; const @@/ type @@/ var @@/ procedure initialize; {this procedure gets things started properly} var @@/ begin print_ln(banner);@/ @@/ end; @ The patterns are generated in a series of sequential passes through the dictionary. In each pass, we collect count statistics for a particular type of pattern, taking into account the effect of patterns chosen in previous passes. At the end of a pass, the counts are examined and new patterns are selected. Patterns are chosen one level at a time, in order of increasing hyphenation value. In the sample run shown below, the parameters |hyph_start| and |hyph_finish| specify the first and last levels, respectively, to be generated. Patterns at each level are chosen in order of increasing pattern length (usually starting with length~2). This is controlled by the parameters |pat_start| and |pat_finish| specified at the beginning of each level. Furthermore patterns of the same length applying to different intercharacter positions are chosen in separate passes through the dictionary. Since patterns of length $n$ may apply to $n+1$ different positions, choosing a set of patterns of lengths $2$ through $n$ for a given level requires $(n+1)(n+2)/2-3$ passes through the word list. At each level, the selection of patterns is controlled by the three parameters |good_wt|, |bad_wt|, and |thresh|. A hyphenating pattern will be selected if |good*good_wt-bad*bad_wt>=thresh|, where |good| and |bad| are the number of times the pattern could and could not be hyphenated, respectively, at a particular point. For inhibiting patterns, |good| is the number of errors inhibited, and |bad| is the number of previously found hyphens inhibited. @= @!pat_start, @!pat_finish: dot_type; @!hyph_start, @!hyph_finish: val_type; @!good_wt, @!bad_wt, @!thresh: integer; @ The proper choice of the parameters to achieve a desired degree of hyphenation is discussed in Chapter~4. Below we show part of a sample run of \.{PATGEN}, with the user's inputs underlined. $$\vbox{\halign{\.{#\hfil}\cr $\underline{\smash{\.{ex patgen}}}$\cr DICTIONARY : $\underline{\smash{\.{murray.hyf}}}$\cr PATTERNS : $\underline{\smash{\.{nul:}}}$\cr TRANSLATE : $\underline{\smash{\.{nul:}}}$\cr PATOUT : $\underline{\smash{\.{murray.pat}}}$\cr This is PATGEN, Version 2.0\cr left\_hyphen\_min = 2, right\_hyphen\_min = 3, 26 letters\cr 0 patterns read in\cr pattern trie has 256 nodes, trie\_max = 256, 0 outputs\cr hyph\_start, hyph\_finish: $\underline{\.{1 1}}$\cr pat\_start, pat\_finish: $\underline{\.{2 3}}$\cr good weight, bad weight, threshold: $\underline{\.{1 3 3}}$\cr processing dictionary with pat\_len = 2, pat\_dot = 1\cr \cr 0 good, 0 bad, 3265 missed\cr 0.00 \%, 0.00 \%, 100.00 \%\cr 338 patterns, 466 nodes in count trie, triec\_max = 983\cr 46 good and 152 bad patterns added (more to come)\cr finding 715 good and 62 bad hyphens, efficiency = 10.72\cr pattern trie has 326 nodes, trie\_max = 509, 2 outputs\cr processing dictionary with pat\_len = 2, pat\_dot = 0\cr \cr \hskip 1.5em ...\cr \cr 1592 nodes and 39 outputs deleted\cr total of 220 patterns at hyph\_level 1\cr hyphenate word list? $\underline{\smash{\.{y}}}$\cr writing pattmp.1\cr \cr 2529 good, 243 bad, 736 missed\cr 77.46 \%, 7.44 \%, 22.54 \%\cr}}$$ @ Note that before beginning a pattern selection run, a file of existing patterns may be read in. In order for pattern selection to work properly, this file should only contain patterns with hyphenation values less than |hyph_start|. Each word in the dictionary is hyphenated according to the existing set of patterns (including those chosen on previous passes of the current run) before pattern statistics are collected. Also, a hyphenated word list may be written out at the end of a run. This list can be read back in as the `dictionary' to continue pattern selection from this point. In addition to ordinary hyphens (|'-'|) the new list will contain two additional kinds of ``hyphens'' between letters, namely hyphens that have been found by previously generated patterns, as well as erroneous hyphens that have been inserted by those patterns. These are represented by the symbols |'*'| and |'.'|, respectively. The three characters |'-'|, |'*'|, and |'.'| are, in fact, just the default values used to represent the three kinds of hyphens, the |translate| file may specify different characters to be used instead of them. In addition, a word list can include hyphen weights, both for entire words and for individual hyphen positions. (The syntax for this is explained in the dictionary processing routines.) Thus common words can be weighted more heavily, or, more generally, words can be weighted according to their frequency of occurrence, if such information is available. The use of hyphen weights combined with an appropriate setting of the pattern selection threshold can be used to guarantee hyphenation of certain words or certain hyphen positions within a word. @ Below we show the first few lines of a typical word list, before and after generating a level of patterns. $$\vbox{\halign{\tabskip 1in\.{#\hfil}&\.{#\hfil}\cr abil-i-ty& abil*i*ty\cr ab-sence& ab*sence\cr ab-stract& ab*stract\cr ac-a-dem-ic& ac-a-d.em-ic\cr ac-cept& ac*cept\cr ac-cept-able& ac*cept-able\cr ac-cept-ed& ac*cept*ed\cr \hskip 1.5em ...&\hskip 1.5em ...\cr }}$$ @ We augment \PASCAL 's control structures a bit using |goto|\unskip's and the following symbolic labels. @d exit=10 {go here to leave a procedure} @d continue=22 {go here to resume a loop} @d done=30 {go here to exit a loop} @d found=40 {go here when you've found it} @d not_found=41 {go here when you've found something else} @ Here are some macros for common programming idioms. @d incr(#)==#:=#+1 {increase a variable by unity} @d decr(#)==#:=#-1 {decrease a variable by unity} @# @d Incr_Decr_end(#)==# @d Incr(#)==#:=#+Incr_Decr_end {we use |Incr(a)(b)| to increase \dots} @d Decr(#)==#:=#-Incr_Decr_end {\dots\ and |Decr(a)(b)| to decrease variable |a| by |b|; this can be optimized for some compilers} @# @d loop == @+ while true do@+ {repeat over and over until a |goto| happens} @d do_nothing == {empty statement} @d return==goto exit {terminate a procedure call} @f return==nil @f loop == xclause @ In case of serious problems \.{PATGEN} will give up, after issuing an error message about what caused the error. Such errors might be discovered inside of subroutines inside of subroutines, so a \.{WEB} macro called |jump_out| has been introduced. This macro, which transfers control to the label |end_of_PATGEN| at the end of the program, contains the only non-local |@!goto| statement in \.{PATGEN}. Some \PASCAL\ compilers do not implement non-local |goto| statements. In such cases the |goto end_of_PATGEN| in the definition of |jump_out| should simply be replaced by a call on some system procedure that quietly terminates the program. @^system dependencies@> An overflow stop occurs if \.{PATGEN}'s tables aren't large enough. @d jump_out==goto end_of_PATGEN {terminates \.{PATGEN}} @# @d error(#)==begin print_ln(#); jump_out; end @d overflow(#)==error('PATGEN capacity exceeded, sorry [',#,'].') @.PATGEN capacity exceeded ...@> @ @= @{@&$C-,A+,D-@} {no range check, catch arithmetic overflow, no debug overhead} @^system dependencies@> @* The character set. Since different \PASCAL\ systems may use different character sets, we use the name |text_char| to stand for the data type of characters appearing in external text files. We also assume that |text_char| consists of the elements |chr(first_text_char)| through |chr(last_text_char)|, inclusive. The definitions below should be adjusted if necessary. @^system dependencies@> @^character set dependencies@> Internally, characters will be represented using the type |ASCII_code|. Note, however, that only some of the standard ASCII characters are assigned a fixed |ASCII_code|; all other characters are assigned an |ASCII_code| dynamically when they are first read from the |translate| file specifying the external representation of the `letters' used by a particular language. For the sake of generality the standard version of this program allows for 256 different |ASCII_code| values, but 128 of them would probably suffice for all practical purposes. @d first_text_char=0 {ordinal number of the smallest element of |text_char|} @d last_text_char=255 {ordinal number of the largest element of |text_char|} @# @d last_ASCII_code=255 {the highest allowed |ASCII_code| value} @= @!text_char=char; {the data type of characters in text files} @!ASCII_code=0..last_ASCII_code; {internal representation of input characters} @!text_file=text; @ Some \PASCAL s can store only signed eight-bit quantities (|-128..127|) but not unsigned ones (|0..255|) in one byte. If storage is tight we must, for such \PASCAL s, either restrict |ASCII_code| to the range |0..127| (with some loss of generality) or convert between |ASCII_code| and |packed_ASCII_code| and vice versa by subtracting or adding an offset. (Or we might define |packed_ASCII_code| as |char| and use suitable typecasts for the conversion.) Only the type |packed_ASCII_code| will be used for large arrays and the \.{WEB} macros |si| and |so| will always be used to convert an |ASCII_code| into a |packed_ASCII_code| and vice versa. @^system dependencies@> @d min_packed=0 {change this to `$\\{min\_packed}=-128$' when necessary; and don't forget to change the definitions of |si| and |so| below accordingly} @# @d si(#)==# {converts |ASCII_code| to |packed_ASCII_code|} @d so(#)==# {converts |packed_ASCII_code| to |ASCII_code|} @= @!packed_ASCII_code=min_packed..last_ASCII_code+min_packed; @ We want to make sure that the ``constants'' defined in this program satisfy all the required relations. Some of them are needed to avoid time-consuming checks while processing the dictionary and\slash or to prevent range check and array bound violations. @^range check violations@> Here we check that the definitions of |ASCII_code| and |packed_ASCII_code| are consistent with those of |si| and |so|. @= bad:=0;@/ if last_ASCII_code<127 then bad:=1; if (si(0)<>min_packed)or(so(min_packed)<>0) then bad:=2;@/ @@; if bad>0 then error('Bad constants---case ',bad:1); @.Bad constants@> @ @= @!bad:integer; @!i:text_char; @!j:ASCII_code; @ We convert between |ASCII_code| and the user's external character set by means of arrays |xord| and |xchr| that are analogous to \PASCAL's |ord| and |chr| functions. @= @!xord: array [text_char] of ASCII_code; {specifies conversion of input characters} @!xchr: array [ASCII_code] of text_char; {specifies conversion of output characters} @ The following code initializes the |xchr| array with some of the standard ASCII characters. @= for j:=0 to last_ASCII_code do xchr[j]:=' '; xchr["."]:='.';@/ xchr["0"]:='0'; xchr["1"]:='1'; xchr["2"]:='2'; xchr["3"]:='3'; xchr["4"]:='4'; xchr["5"]:='5'; xchr["6"]:='6'; xchr["7"]:='7'; xchr["8"]:='8'; xchr["9"]:='9';@/ xchr["A"]:='A'; xchr["B"]:='B'; xchr["C"]:='C'; xchr["D"]:='D'; xchr["E"]:='E'; xchr["F"]:='F'; xchr["G"]:='G'; xchr["H"]:='H'; xchr["I"]:='I'; xchr["J"]:='J'; xchr["K"]:='K'; xchr["L"]:='L'; xchr["M"]:='M'; xchr["N"]:='N'; xchr["O"]:='O'; xchr["P"]:='P'; xchr["Q"]:='Q'; xchr["R"]:='R'; xchr["S"]:='S'; xchr["T"]:='T'; xchr["U"]:='U'; xchr["V"]:='V'; xchr["W"]:='W'; xchr["X"]:='X'; xchr["Y"]:='Y'; xchr["Z"]:='Z';@/ xchr["a"]:='a'; xchr["b"]:='b'; xchr["c"]:='c'; xchr["d"]:='d'; xchr["e"]:='e'; xchr["f"]:='f'; xchr["g"]:='g'; xchr["h"]:='h'; xchr["i"]:='i'; xchr["j"]:='j'; xchr["k"]:='k'; xchr["l"]:='l'; xchr["m"]:='m'; xchr["n"]:='n'; xchr["o"]:='o'; xchr["p"]:='p'; xchr["q"]:='q'; xchr["r"]:='r'; xchr["s"]:='s'; xchr["t"]:='t'; xchr["u"]:='u'; xchr["v"]:='v'; xchr["w"]:='w'; xchr["x"]:='x'; xchr["y"]:='y'; xchr["z"]:='z'; @ The following system-independent code makes the |xord| array contain a suitable inverse to the information in |xchr|. @d invalid_code=0 {|ASCII_code| that should not appear} @d tab_char=@'11 {|ord| of tab character; tab characters seem to be unavoidable with files from UNIX systems} @^system dependencies@> @^character set dependencies@> @= for i:=chr(first_text_char) to chr(last_text_char) do xord[i]:=invalid_code; for j:=0 to last_ASCII_code do xord[xchr[j]]:=j; xord[' ']:=" "; xord[chr(tab_char)]:=" "; @ So far each invalid |ASCII_code| has been assigned the character |' '| and all invalid characters have been assigned |ASCII_code=invalid_code|. The |get_ASCII| function, used only while reading the |translate| file, returns the |ASCII_code| corresponding to a character, assigning a new |ASCII_code| first if necessary. @d num_ASCII_codes=last_ASCII_code+1 {number of different |ASCII_code| values} @p function get_ASCII(@!c:text_char):ASCII_code; label found; var i: ASCII_code; begin i:=xord[c]; if i=invalid_code then begin while i" ") then goto found; end; overflow(num_ASCII_codes:1,' characters'); found: xord[c]:=i; xchr[i]:=c; end; get_ASCII:=i; end; @ The \TeX 82 hyphenation algorithm operates on `hyphenable words' converted temporarily to lower case, i.e., they may consist of up to 255 different `letters' corresponding to \.{\\lccode}s |1..255|. These \.{\\lccode}s could, in principle, be language dependent but this might lead to undesirable results when hyphenating multilingual paragraphs. No more than 245 different letters can occur in hyphenation patterns since the characters |'0'..'9'| and |'.'| play a special r\^^Dole when reading patterns. For the purpose of this program each letter is represented internally by a unique |internal_code>=2| (|internal_code=1| is the |edge_of_word| indicator); |internal_code| values |2..127| will probably suffice for all practical purposes, but we allow the range |2..last_ASCII_code| for the sake of generality. Syntactically |internal_code| and |ASCII_code| are the same, we will use one or the other name according to the semantic context. @d edge_of_word=1 {|internal_code| for start and end of a word} @= @!internal_code=ASCII_code; @!packed_internal_code=packed_ASCII_code; @ Note that an |internal_code| used by this program is in general quite different from the |ASCII_code| (or rather \.{\\lccode}) used by \TeX 82. This program allows the input of characters (from the |dictionary| and |patterns| file) corresponding to an |internal_code| in either lower or upper case form; the output (to the |patout| and |pattmp| file) will always be in lower case form. Unfortunately there does not (yet?) exist a standardized and widely accepted 8-bit character set (or a unique one-to-one translation between such sets). On the other hand macro expansion takes place in \TeX 82 when reading hyphenable words and when reading patterns. Thus the lower and upper case versions of all `letters' used by a particular language can (and for the sake of portability should) be represented entirely in terms of the standard ASCII character set; either directly as characters or via macros (or active characters) with or without arguments. The macro definitions for such a representation will in general be language dependent. For the purpose of this program the external representation of the lower and upper case version of a letter (i.e., |internal_code|) consists of a unique sequence of characters (or \\{ASCII\_codes}), the only restriction being that no such sequence must be a subsequence of an other one. Moreover such sequences must not start with |' '|, |'.'|, |'0'..'9'| or with one of the three characters (|'-'|, |'*'|, and |'.'|) representing hyphens in the |dictionary| file; a sequence may, however, end with a mandatory |' '| as, e.g., the sequence |'\ss '|. The language dependent values of \.{\\lefthyphenmin} and \.{\\righthyphenmin} as well as the external representation of the lower and upper case letters and their collating sequence are specified in the |translate| file, thus making any language dependent modifications of this program unnecessary. If the |translate| file is empty (or does not exist) the values \.{\\lefthyphenmin=2} and \.{\\righthyphenmin=3} and |internal_code| values |2..27| with the one character external representations |'a'..'z'| and |'A'..'Z'| will be used as defaults. Incidentally this program can be used to convert a |dictionary| and |patterns| file from one (``upper case'') to another (``lower case'') external representation of letters. @ When reading the |dictionary| (and |patterns|) file sequences of characters must be recognized and converted to their corresponding |internal_code|. This conversion is part of \.{PATGEN}s inner loop and @^inner loop@> must therefore be done as efficient as possible. Thus we will mostly bypass the conversion from character to |ASCII_code| and convert directly to the corresponding |internal_code| using the |xclass| and |xint| arrays. Six types of characters are distinguished by their |xclass|: \yskip\hang |space_class| character |' '| terminates a pattern or word. \yskip\hang |digit_class| characters |'0'..'9'| are hyphen values for a pattern or hyphen weights for a word; their |xint| is the corresponding numeric value |0..9|. \yskip\hang |hyf_class| characters (|'.'|, |'-'|, and |'*'|) are `dots' and indicate hyphens in a word; their |xint| is the corresponding numeric value |err_hyf..found_hyf|. \yskip\hang |letter_class| characters represent a letter; their |xint| is the corresponding |internal_code|. \yskip\hang |escape_class| characters indicate the start of a multi-character sequence representing a letter. \yskip\hang |invalid_class| characters should not occur except as part of multi-character sequences. @d space_class=0 {the character |' '|} @d digit_class=1 {the characters |'0'..'9'|} @d hyf_class=2 {the `hyphen' characters (|'.'|, |'-'|, and |'*'|)} @d letter_class=3 {characters representing a letter} @d escape_class=4 {characters that start a multi-character sequence representing a letter} @d invalid_class=5 {characters that normally should not occur} @# @d no_hyf=0 {no hyphen} @d err_hyf=1 {erroneous hyphen} @d is_hyf=2 {hyphen} @d found_hyf=3 {found hyphen} @= @!class_type=space_class..invalid_class; {class of a character} @!digit=0..9; {a hyphen weight (or word weight)} @!hyf_type=no_hyf..found_hyf; {type of a hyphen} @ In addition we will use the |xext|, |xdig|, and |xdot| arrays to convert from the internal representation to the corresponding characters. @= @!xclass: array [text_char] of class_type; {specifies the class of a character} @!xint: array [text_char] of internal_code; {specifies the |internal_code| for a character} @!xdig: array [0..9] of text_char; {specifies conversion of output characters} @!xext: array [internal_code] of text_char; {specifies conversion of output characters} @!xhyf: array [err_hyf..found_hyf] of text_char; {specifies conversion of output characters} @ @= for i:=chr(first_text_char) to chr(last_text_char) do begin xclass[i]:=invalid_class; xint[i]:=0; end; xclass[' ']:=space_class; for j:=0 to last_ASCII_code do xext[j]:=' '; xext[edge_of_word]:='.'; for j:=0 to 9 do begin xdig[j]:=xchr[j+"0"]; xclass[xdig[j]]:=digit_class; xint[xdig[j]]:=j; end; xhyf[err_hyf]:='.'; xhyf[is_hyf]:='-'; xhyf[found_hyf]:='*'; {default representation for hyphens} @ We assume that words use only the letters |cmin+1| through |cmax|. This allows us to save some time on trie operations that involve searching for packed transitions belonging to a particular state. @d cmin=edge_of_word @= @!cmax: internal_code; {largest |internal_code| or |ASCII_code|} @* Data structures. The main data structure used in this program is a dynamic packed trie. In fact we use two of them, one for the set of patterns selected so far, and one for the patterns being considered in the current pass. For a pattern $p_1\ldots p_k$, the information associated with that pattern is accessed by setting |@t$t_1$@>:=trie_root+@t$p_1$@>| and then, for |1:=trie_link(@t$t_{i-1}$@>)+ @t$p_i$@>|; the pattern information is then stored in a location addressed by |@t$t_k$@>|. Since all trie nodes are packed into a single array, in order to distinguish nodes belonging to different trie families, a special field is provided such that |trie_char@t$(t_i)=si(p_i)$@>| for all |i|. In addition the trie must support dynamic insertions and deletions. This is done by maintaining a doubly linked list of unoccupied cells and repacking trie families as necessary when insertions are made. Each trie node consists of three fields: the character |trie_char|, and the two link fields |trie_link| and |trie_back|. In addition there is a separate boolean array |trie_base_used|. When a node is unoccupied, |trie_char=min_packed| and the link fields point to the next and previous unoccupied nodes, respectively, in the doubly linked list. When a node is occupied, |trie_link| points to the next trie family, and |trie_back| (renamed |trie_outp|) contains the output associated with this transition. The |trie_base_used| bit indicates that some family has been packed at this base location, and is used to prevent two families from being packed at the same location. @ The sizes of the pattern tries may have to be adjusted depending on the particular application (i.e., the parameter settings and the size of the dictionary). The sizes below were sufficient to generate the original set of english \TeX 82 hyphenation patterns (file \.{hyphen.tex}). @= @!trie_size=55000; {space for pattern trie} @!triec_size=26000; {space for pattern count trie, must be less than |trie_size| and greater than the number of occurrences of any pattern in the dictionary} @!max_ops=4080; {size of output hash table, should be a multiple of 510} @!max_val=10; {maximum number of levels$+1$, also used to denote bad patterns} @!max_dot=15; {maximum pattern length, also maximum length of external representation of a `letter'} @!max_len=50; {maximum word length} @!max_buf_len=80; {maximum length of input lines, must be at least |max_len|} @ @= if (triec_size<4096)or(trie_sizetrie_size then bad:=4; if max_val>10 then bad:=5; if max_buf_len= @!q_index=1..last_ASCII_code; {number of transitions in a state} @!val_type=0..max_val; {hyphenation values} @!dot_type=0..max_dot; {dot positions} @!op_type=0..max_ops; {index into output hash table} @!word_index=0..max_len; {index into |word|} @!trie_pointer=0..trie_size; @!triec_pointer=0..triec_size;@/ @!op_word=packed record dot: dot_type; val: val_type; op: op_type end; @ Trie is actually stored with its components in separate packed arrays, in order to save space and time (although this depends on the computer's word size and the size of the trie pointers). @= @!trie_c: packed array[trie_pointer] of packed_internal_code; @!trie_l, @!trie_r: packed array[trie_pointer] of trie_pointer; @!trie_taken: packed array[trie_pointer] of boolean; @!triec_c: packed array[triec_pointer] of packed_internal_code; @!triec_l, @!triec_r: packed array[triec_pointer] of triec_pointer; @!triec_taken: packed array[triec_pointer] of boolean; @!ops: array[op_type] of op_word; {output hash table} @ When some trie state is being worked on, an unpacked version of the state is kept in positions |1..qmax| of the global arrays |trieq_c|, |trieq_l|, and |trieq_r|. The character fields need not be in any particular order. @= @!trieq_c: array[q_index] of internal_code; {character fields of a single trie state} @!trieq_l, @!trieq_r: array[q_index] of trie_pointer; {link fields} @!qmax: q_index; {number of transitions in an unpacked state} @!qmax_thresh: q_index; {controls density of first-fit packing} @ Trie fields are accessed using the following macros. @d trie_char(#)==trie_c[#] @d trie_link(#)==trie_l[#] @d trie_back(#)==trie_r[#] @d trie_outp(#)==trie_r[#] @d trie_base_used(#)==trie_taken[#] @# @d triec_char(#)==triec_c[#] @d triec_link(#)==triec_l[#] @d triec_back(#)==triec_r[#] @d triec_good(#)==triec_l[#] @d triec_bad(#)==triec_r[#] @d triec_base_used(#)==triec_taken[#] @# @d q_char(#)==trieq_c[#] @d q_link(#)==trieq_l[#] @d q_back(#)==trieq_r[#] @d q_outp(#)==trieq_r[#] @# @d hyf_val(#)==ops[#].val @d hyf_dot(#)==ops[#].dot @d hyf_nxt(#)==ops[#].op @* Routines for pattern trie. The pattern trie holds the set of patterns chosen prior to the current pass, including bad or ``hopeless'' patterns at the current level that occur too few times in the dictionary to be of use. Each transition of the trie includes an output field pointing to the hyphenation information associated with this transition. @= @!trie_max: trie_pointer; {maximum occupied trie node} @!trie_bmax: trie_pointer; {maximum base of trie family} @!trie_count: trie_pointer; {number of occupied trie nodes, for space usage statistics} @!op_count: op_type; {number of outputs in hash table} @ Initially, the dynamic packed trie has just one state, namely the root, with all transitions present (but with null links). This is convenient because the root will never need to be repacked and also we won't have to check that the base is nonnegative when packing other states. Moreover in many cases we need not check for a vanishing link field: if |trie_link(t)=0| then a subsequent test for |trie_char(trie_link(t)+c)=si(c)| will always fail due to |trie_root=1|. @d trie_root=1 @p procedure init_pattern_trie; var c: internal_code; @!h: op_type; begin for c:=0 to last_ASCII_code do begin trie_char(trie_root+c):=si(c); {indicates node occupied; fake for |c=0|} trie_link(trie_root+c):=0; trie_outp(trie_root+c):=0; trie_base_used(trie_root+c):=false; end; trie_base_used(trie_root):=true; trie_bmax:=trie_root; trie_max:=trie_root+last_ASCII_code; trie_count:=num_ASCII_codes;@/ qmax_thresh:=5;@/ trie_link(0):=trie_max+1; trie_back(trie_max+1):=0;@/ {|trie_link(0)| is used as the head of the doubly linked list of unoccupied cells} for h:=1 to max_ops do hyf_val(h):=0; {clear output hash table} op_count:=0; end; @ The |first_fit| procedure finds a hole in the packed trie into which the state in |trieq_c|, |trieq_l|, and |trieq_r| will fit. This is normally done by going through the linked list of unoccupied cells and testing if the state will fit at each position. However if a state has too many transitions (and is therefore unlikely to fit among existing transitions) we don't bother and instead just pack it immediately to the right of the occupied region (starting at |trie_max+1|). @p function first_fit: trie_pointer; label found, not_found; var s, @!t: trie_pointer; @!q: q_index; begin @; for q:=1 to qmax do {pack it} begin t:=s+q_char(q);@/ trie_link(trie_back(t)):=trie_link(t); trie_back(trie_link(t)):=trie_back(t); {link around filled cell} trie_char(t):=si(q_char(q)); trie_link(t):=q_link(q); trie_outp(t):=q_outp(q); if t>trie_max then trie_max:=t; end; trie_base_used(s):=true; first_fit:=s end; @ The threshold for large states is initially 5 transitions. If more than one level of patterns is being generated, the threshold is set to 7 on subsequent levels because the pattern trie will be sparser after bad patterns are deleted (see |delete_bad_patterns|). @= if qmax>qmax_thresh then t:=trie_back(trie_max+1) @+else t:=0; loop begin t:=trie_link(t); s:=t-q_char(1); {get next unoccupied cell} @; if trie_base_used(s) then goto not_found; for q:=qmax downto 2 do {check if state fits here} if trie_char(s+q_char(q))<>min_packed then goto not_found; goto found; not_found: end; found: @ The trie is only initialized (as a doubly linked list of empty cells) as far as necessary. Here we extend the initialization if necessary, and check for overflow. @= if s>trie_size-num_ASCII_codes then overflow(trie_size:1,' pattern trie nodes'); while trie_bmax1 then decr(h) @+else h:=max_ops; {try again} end; exit: end; @ @= @!pat: array[dot_type] of internal_code; {current pattern} @!pat_len: dot_type; {pattern length} @ Now that we have provided the necessary routines for manipulating the dynamic packed trie, here is a procedure that inserts a pattern of length |pat_len|, stored in the |pat| array, into the pattern trie. It also adds a new output. @p procedure insert_pattern(@!val: val_type; @!dot: dot_type); var i: dot_type; @!s, @!t: trie_pointer; begin i:=1; s:=trie_root+pat[i]; t:=trie_link(s); while (t>0) and (ipat[i] then @; s:=t; t:=trie_link(s); end; q_link(1):=0; q_outp(1):=0; qmax:=1; while i= begin if trie_char(t)=min_packed then begin {we're lucky, no repacking needed} trie_link(trie_back(t)):=trie_link(t); trie_back(trie_link(t)):=trie_back(t);@/ trie_char(t):=si(pat[i]); trie_link(t):=0; trie_outp(t):=0; if t>trie_max then trie_max:=t; end else begin {whoops, have to repack} unpack(t-pat[i]);@/ q_char(qmax):=pat[i]; q_link(qmax):=0; q_outp(qmax):=0;@/ t:=first_fit; trie_link(s):=t; Incr(t)(pat[i]); end; incr(trie_count); end @* Routines for pattern count trie. The pattern count trie is used to store the set of patterns considered in the current pass, along with the counts of good and bad instances. The fields of this trie are the same as the pattern trie, except that there is no output field, and leaf nodes are also used to store counts (|triec_good| and |triec_bad|). Except where noted, the following routines are analogous to the pattern trie routines. @= @!triec_max, @!triec_bmax, @!triec_count: triec_pointer; {same as for pattern trie} @!triec_kmax: triec_pointer; {shows growth of trie during pass} @!pat_count: integer; {number of patterns in count trie} @ [See |init_pattern_trie|.] The variable |triec_kmax| always contains the size of the count trie rounded up to the next multiple of 4096, and is used to show the growth of the trie during each pass. @d triec_root=1 @p procedure init_count_trie; var c: internal_code; begin for c:=0 to last_ASCII_code do begin triec_char(triec_root+c):=si(c);@/ triec_link(triec_root+c):=0; triec_back(triec_root+c):=0; triec_base_used(triec_root+c):=false; end; triec_base_used(triec_root):=true; triec_bmax:=triec_root; triec_max:=triec_root+last_ASCII_code; triec_count:=num_ASCII_codes; triec_kmax:=4096;@/ triec_link(0):=triec_max+1; triec_back(triec_max+1):=0;@/ pat_count:=0; end; @ [See |first_fit|.] @p function firstc_fit: triec_pointer; label found, not_found; var a, @!b: triec_pointer; @!q: q_index; begin @; for q:=1 to qmax do {pack it} begin a:=b+q_char(q);@/ triec_link(triec_back(a)):=triec_link(a); triec_back(triec_link(a)):=triec_back(a);@/ triec_char(a):=si(q_char(q)); triec_link(a):=q_link(q); triec_back(a):=q_back(q); if a>triec_max then triec_max:=a; end; triec_base_used(b):=true; firstc_fit:=b end; @ The threshold for attempting a first-fit packing is 3 transitions, which is lower than for the pattern trie because speed is more important here. @= if qmax>3 then a:=triec_back(triec_max+1) @+else a:=0; loop begin a:=triec_link(a); b:=a-q_char(1);@/ @; if triec_base_used(b) then goto not_found; for q:=qmax downto 2 do if triec_char(b+q_char(q))<>min_packed then goto not_found; goto found; not_found: end; found: @ @= if b>triec_kmax-num_ASCII_codes then begin if triec_kmax=triec_size then overflow(triec_size:1,' count trie nodes'); print(triec_kmax div 1024:1, 'K '); if triec_kmax>triec_size-4096 then triec_kmax:=triec_size else Incr(triec_kmax)(4096); end; while triec_bmax0) and (sposword[spos] then @; b:=a; a:=triec_link(a); end; q_link(1):=0; q_back(1):=0; qmax:=1; while spos= begin if triec_char(a)=min_packed then {lucky} begin triec_link(triec_back(a)):=triec_link(a); triec_back(triec_link(a)):=triec_back(a); triec_char(a):=si(word[spos]); triec_link(a):=0; triec_back(a):=0; if a>triec_max then triec_max:=a; end else begin {have to repack} unpackc(a-word[spos]);@/ q_char(qmax):=word[spos]; q_link(qmax):=0; q_back(qmax):=0; a:=firstc_fit; triec_link(b):=a; Incr(a)(word[spos]); end; incr(triec_count); end @* Input and output. For some \PASCAL\ systems output files must be closed before the program terminates; it may also be necessary to close input files. Since standard \PASCAL\ does not provide for this, we use \.{WEB} macros and will say |close_out(f)| resp.\ |close_in(f)|; these macros should not produce errors or system messages, even if a file could not be opened successfully. @^system dependencies@> @d close_out(#)==close(#) {close an output file} @d close_in(#)==do_nothing {close an input file} @= @!dictionary, @!patterns, @!translate, @!patout, @!pattmp: text_file; @ When reading a line from one of the input files (|dictionary|, |patterns|, or |translate|) the characters read from that line (padded with blanks if necessary) are to be placed into the |buf| array. Reading lines from the |dictionary| file should be as efficient as possible since this is part of \.{PATGEN}'s ``inner loop''. Standard \PASCAL, unfortunately, does not provide for this; consequently the \.{WEB} macro |read_buf| defined below should be optimized if possible. For many \PASCAL's this can be done with |read_ln(f,buf)| where |buf| is declared as \PASCAL\ string (i.e., as \&{packed} \&{array} |[1..any]| \&{of} |char|), for others a string type with dynamic length can be used. @^inner loop@>@^system dependencies@> @d read_buf(#)== {reads a line from input file |#| into |buf| array} begin buf_ptr:=0; while not eoln(#) do begin if (buf_ptr>=max_buf_len) then bad_input('Line too long'); @.Line too long@> incr(buf_ptr); read(#,buf[buf_ptr]); end; read_ln(#); while buf_ptr= @!buf: array[1..max_buf_len] of text_char; {array to hold lines of input} @!buf_ptr: 0..max_buf_len; {index into |buf|} @ When an error is caused by bad input data we say |bad_input(#)| in order to display the contents of the |buf| array before terminating with an error message. @d print_buf== {print contents of |buf| array} begin buf_ptr:=0; repeat incr(buf_ptr); print(buf[buf_ptr]); until buf_ptr=max_buf_len; print_ln(' '); end @d bad_input(#)==begin print_buf; error(#); end @ The |translate| file may specify the values of \.{\\lefthyphenmin} and \.{\\righthyphenmin} as well as the external representation and collating sequence of the `letters' used by the language. In addition replacements may be specified for the characters |'-'|, |'*'|, and |'.'| representing hyphens in the word list. If the |translate| file is empty (or does not exist) default values will be used. @p procedure read_translate; label done; var c: text_char; @!n: integer; @!j: ASCII_code; @!bad: boolean; @!lower: boolean; @!i: dot_type; @!s, @!t: trie_pointer; begin imax:=edge_of_word; reset(translate); if eof(translate) then @ else begin read_buf(translate); @; cmax:=last_ASCII_code-1; while not eof(translate) do @; end; close_in(translate); print_ln('left_hyphen_min = ',left_hyphen_min:1, ', right_hyphen_min = ',right_hyphen_min:1, ', ',imax-edge_of_word:1,' letters'); cmax:=imax; end; @ @= @!imax: internal_code; {largest |internal_code| assigned so far} @!left_hyphen_min, @!right_hyphen_min: dot_type; @ @= begin left_hyphen_min:=2; right_hyphen_min:=3; for j:="A" to "Z" do begin incr(imax); c:=xchr[j+"a"-"A"]; xclass[c]:=letter_class; xint[c]:=imax; xext[imax]:=c; c:=xchr[j]; xclass[c]:=letter_class; xint[c]:=imax; end; end @ The first line of the |translate| file must contain the values of \.{\\lefthyphenmin} and \.{\\righthyphenmin} in columns 1--2 and 3--4. In addition columns~5, 6, and~7 may (optionally) contain replacements for the default characters |'.'|, |'-'|, and |'*'| respectively, representing hyphens in the word list. If the values specified for \.{\\lefthyphenmin} and \.{\\righthyphenmin} are invalid (e.g., blank) new values are read from the terminal. @= bad:=false; n:=0; if buf[1]=' ' then do_nothing else if xclass[buf[1]]=digit_class then n:=xint[buf[1]]@+ else bad:=true; if xclass[buf[2]]=digit_class then n:=10*n+xint[buf[2]]@+ else bad:=true; if (n>=1)and(n=1)and(n=1)and(n1=1)and(n20; end; for j:=err_hyf to found_hyf do begin if buf[j+4]<>' ' then xhyf[j]:=buf[j+4]; if xclass[xhyf[j]]=invalid_class then xclass[xhyf[j]]:=hyf_class@+ else bad:=true; end; xclass['.']:=hyf_class; {in case the default has been changed} if bad then bad_input('Bad hyphenation data') @.Bad hyphenation data@> @ Each following line is either a comment or specifies the external representations for one `letter' used by the language. Comment lines start with two equal characters (e.g., are blank) and are ignored. Other lines contain the external representation of the lower case version and an arbitrary number of `upper case versions' of a letter preceded and separated by a delimiter and followed by two consecutive delimiters; the delimiter may be any character not occurring in either version. @= begin read_buf(translate); buf_ptr:=1; lower:=true; while not bad do {lower and then upper case version} begin pat_len:=0; repeat if buf_ptrinvalid_class then bad:=true; xclass[c]:=letter_class; xint[c]:=imax; end else @; end else if pat_len=max_dot then bad:=true else begin incr(pat_len); pat[pat_len]:=get_ASCII(buf[buf_ptr]); end; until (buf[buf_ptr]=buf[1])or bad; lower:=false; end; done: if bad then bad_input('Bad representation'); @.Bad representation@> end @ When the (lower or upper case) external representation of a letter consists of more than one character and the corresponding |ASCII_code| values have been placed into the |pat| array we store them in the pattern trie. [See |insert_pattern|.] Since this `external subtrie' starts at |trie_link(trie_root)| it does not interfere with normal patterns. The output field of leaf nodes contains the |internal_code| and the link field distinguishes between lower and upper case letters. @= begin if xclass[c]=invalid_class then xclass[c]:=escape_class; if xclass[c]<>escape_class then bad:=true; i:=0; s:=trie_root; t:=trie_link(s); while (t>trie_root) and (ipat[i] then @ else if trie_outp(t)>0 then bad:=true; s:=t; t:=trie_link(s); end; if t>trie_root then bad:=true; q_link(1):=0; q_outp(1):=0; qmax:=1; while ixord[c] then bad_input('Bad representation'); @.Bad representation@> if trie_outp(t)<>0 then begin #:=trie_outp(t); goto done; end; if buf_ptr=max_buf_len then c:=' ' else begin incr(buf_ptr); c:=buf[buf_ptr]; end; end; done: end @ In order to prepare for the output phase we store all but the last of the \\{ASCII\_codes} of the external representation of each `lower case letter' in the pattern count trie which is no longer used at that time. The recursive |find_letters| procedure traverses the `external subtrie'. @p procedure find_letters(@!b: trie_pointer; @!i: dot_type);@/ {traverse subtries of family |b|; |i| is current depth in trie} var c: ASCII_code; {a local variable that must be saved on recursive calls} @!a: trie_pointer; {does not need to be saved} @!j: dot_type; {loop index} @!l: triec_pointer; begin if i=1 then init_count_trie; for c:=cmin to last_ASCII_code do {find transitions belonging to this family} begin a:=b+c; if so(trie_char(a))=c then {found one} begin pat[i]:=c; if trie_outp(a)=0 then find_letters(trie_link(a),i+1) else if trie_link(a)=0 then {this is a lower case letter} @; end; end; end; @ Starting from |triec_root+trie_outp(a)| we proceed through link fields and store all \\{ASCII\_codes} except the last one in the count trie; the last character has already been stored in the |xext| array. @= begin l:=triec_root+trie_outp(a); for j:=1 to i-1 do begin if triec_max=triec_size then overflow(triec_size:1,' count trie nodes'); incr(triec_max); triec_link(l):=triec_max; l:=triec_max; triec_char(l):=si(pat[j]); end; triec_link(l):=0; end @ During the output phase we will say |write_letter(i)(f)| and |write(f,xext[i])| to write the lower case external representation of the letter with internal code |i| to file |f|: |xext[i]| is the last character of the external representation whereas the \.{WEB} macro |write_letter| defined here writes all preceding characters (if any). @d write_letter_end(#)==while l>0 do begin write(#,xchr[so(triec_char(l))]); l:=triec_link(l); end @d write_letter(#)==l:=triec_link(triec_root+#); write_letter_end @* Routines for traversing pattern tries. At the end of a pass, we traverse the count trie using the following recursive procedure, selecting good and bad patterns and inserting them into the pattern trie. @p procedure traverse_count_trie(@!b: triec_pointer; @!i: dot_type);@/ {traverse subtries of family |b|; |i| is current depth in trie} var c: internal_code; {a local variable that must be saved on recursive calls} @!a: triec_pointer; {does not need to be saved} begin for c:=cmin to cmax do {find transitions belonging to this family} begin a:=b+c; if so(triec_char(a))=c then {found one} begin pat[i]:=c; if i; end; end; end; @ When we have come to the end of a pattern, |triec_good(a)| and |triec_bad(a)| contain the number of times this pattern helps or hinders the cause. We use the counts to determine if this pattern should be selected, or if it is hopeless, or if we can't decide yet. In the latter case, we set |more_to_come| true to indicate that there might still be good patterns extending the current type of patterns. @= if good_wt*triec_good(a)=thresh then {good pattern} begin insert_pattern(hyph_level,pat_dot); incr(good_pat_count); Incr(good_count)(triec_good(a)); Incr(bad_count)(triec_bad(a)); end else more_to_come:=true @ Some global variables are used to accumulate statistics about the performance of a pass. @= @!good_pat_count, @!bad_pat_count: integer; {number of patterns added at end of pass} @!good_count, @!bad_count, @!miss_count: integer; {hyphen counts} @!level_pattern_count: integer; {number of good patterns at level} @!more_to_come: boolean; @ The recursion in |traverse_count_trie| is initiated by the following procedure, which also prints some statistics about the patterns chosen. The ``efficiency'' is an estimate of pattern effectiveness. @d bad_eff==(thresh/good_wt) @p procedure collect_count_trie; begin good_pat_count:=0; bad_pat_count:=0; good_count:=0; bad_count:=0; more_to_come:=false; traverse_count_trie(triec_root,1); @/ print(good_pat_count:1,' good and ', bad_pat_count:1,' bad patterns added'); Incr(level_pattern_count)(good_pat_count); if more_to_come then print_ln(' (more to come)') @+else print_ln(' '); print('finding ',good_count:1,' good and ',bad_count:1,' bad hyphens'); if good_pat_count>0 then print_ln(', efficiency = ', good_count/(good_pat_count+bad_count/bad_eff):1:2) else print_ln(' '); print_ln('pattern trie has ',trie_count:1,' nodes, ',@| 'trie_max = ',trie_max:1,', ',op_count:1,' outputs'); end; @ At the end of a level, we traverse the pattern trie and delete bad patterns by removing their outputs. If no output remains, the node is also deleted. @p function delete_patterns(@!s: trie_pointer): trie_pointer;@/ {delete bad patterns in subtrie |s|, return 0 if entire subtrie freed, otherwise |s|} var c: internal_code; @!t: trie_pointer; @!all_freed: boolean; {must be saved on recursive calls} @!h, @!n: op_type; {do not need to be saved} begin all_freed:=true; for c:=cmin to cmax do {find transitions belonging to this family} begin t:=s+c; if so(trie_char(t))=c then begin @; if trie_link(t)>0 then trie_link(t):=delete_patterns(trie_link(t)); if (trie_link(t)>0) or (trie_outp(t)>0) or (s=trie_root) then all_freed:=false else @; end; end; if all_freed then {entire state is freed} begin trie_base_used(s):=false; s:=0; end; delete_patterns:=s; end; @ @= begin h:=0; hyf_nxt(0):=trie_outp(t); n:=hyf_nxt(0); while n>0 do begin if hyf_val(n)=max_val then hyf_nxt(h):=hyf_nxt(n) else h:=n; n:=hyf_nxt(h); end; trie_outp(t):=hyf_nxt(0); end @ Cells freed by |delete_patterns| are put at the end of the free list. @= begin trie_link(trie_back(trie_max+1)):=t; trie_back(t):=trie_back(trie_max+1); trie_link(t):=trie_max+1; trie_back(trie_max+1):=t; trie_char(t):=min_packed;@/ decr(trie_count); end @ The recursion in |delete_patterns| is initiated by the following procedure, which also prints statistics about the number of nodes deleted, and zeros bad outputs in the hash table. Note that the hash table may become somewhat disorganized when more levels are added, but this defect isn't serious. @p procedure delete_bad_patterns; var old_op_count: op_type; @!old_trie_count: trie_pointer; @!t: trie_pointer; @!h: op_type; begin old_op_count:=op_count; old_trie_count:=trie_count;@/ t:=delete_patterns(trie_root); for h:=1 to max_ops do if hyf_val(h)=max_val then begin hyf_val(h):=0; decr(op_count); end; print_ln(old_trie_count-trie_count:1,' nodes and ',@| old_op_count-op_count:1,' outputs deleted'); qmax_thresh:=7; {pattern trie will be sparser because of deleted patterns} end; @ After all patterns have been generated, we will traverse the pattern trie and output all patterns. Note that if a pattern appears more than once, only the maximum value at each position will be output. @p procedure output_patterns(@!s: trie_pointer; @!pat_len: dot_type);@/ {output patterns in subtrie |s|; |pat_len| is current depth in trie} var c: internal_code; {must be saved on recursive calls} @!t: trie_pointer; @!h: op_type; @!d: dot_type;@/ @!l: triec_pointer; {for |write_letter|} begin for c:=cmin to cmax do begin t:=s+c; if so(trie_char(t))=c then begin pat[pat_len]:=c; h:=trie_outp(t); if h>0 then @; if trie_link(t)>0 then output_patterns(trie_link(t),pat_len+1); end; end; end; @ @= begin for d:=0 to pat_len do hval[d]:=0; repeat d:=hyf_dot(h); if hval[d]0 then write(patout,xdig[hval[0]]); for d:=1 to pat_len do begin write_letter(pat[d])(patout); write(patout,xext[pat[d]]); if hval[d]>0 then write(patout,xdig[hval[d]]); end; write_ln(patout); end @* Dictionary processing routines. The procedures in this section are the ``inner loop'' of the pattern generation process. To speed the program up, key parts of these routines could be coded in machine language. @^inner loop@> @= @!word: array[word_index] of internal_code; {current word} @!dots: array[word_index] of hyf_type; {current hyphens} @!dotw: array[word_index] of digit; {dot weights} @!hval: array[word_index] of val_type; {hyphenation values} @!no_more: array[word_index] of boolean; {positions `knocked out'} @!wlen: word_index; {length of current word} @!word_wt: digit; {global word weight} @!wt_chg: boolean; {indicates |word_wt| has changed} @ The words in the |dictionary| consist of the `letters' used by the language. ``Dots'' between letters can be one of four possibilities: |'-'| indicating a hyphen, |'*'| indicating a found hyphen, |'.'| indicating an error, or nothing; these are represented internally by the four values |is_hyf|, |found_hyf|, |err_hyf|, and |no_hyf| respectively. When reading a word we will, however, convert |err_hyf| into |no_hyf| and |found_hyf| into |is_hyf| thus ignoring whether a hyphen has or has not been found by a previous set of patterns. @= xclass['.']:=invalid_class; {in case the default has been changed} xclass[xhyf[err_hyf]]:=hyf_class; xint[xhyf[err_hyf]]:=no_hyf; xclass[xhyf[is_hyf]]:=hyf_class; xint[xhyf[is_hyf]]:=is_hyf; xclass[xhyf[found_hyf]]:=hyf_class; xint[xhyf[found_hyf]]:=is_hyf; @ Furthermore single-digit word weights are allowed. A digit at the beginning of a word indicates a global word weight that is to be applied to all following words (until the next global word weight). A digit at some intercharacter position indicates a weight for that position only. The |read_word| procedure scans a line of input representing a word, and places the letters into the array |word|, with |word[1]=word[wlen]= edge_of_word|. The dot appearing between |word[dpos]| and |word[dpos+1]| is placed in |dots[dpos]|, and the corresponding dot weight in |dotw[dpos]|. @p procedure read_word; label done, found; var c: text_char; @!t: trie_pointer; begin read_buf(dictionary); word[1]:=edge_of_word; wlen:=1; buf_ptr:=0; repeat incr(buf_ptr); c:=buf[buf_ptr]; case xclass[c] of space_class: goto found; digit_class: if wlen=1 then {global word weight} begin if xint[c]<>word_wt then wt_chg:=true; word_wt:=xint[c]; end else dotw[wlen]:=xint[c]; {dot weight} hyf_class: dots[wlen]:=xint[c]; {record the dot |c|} letter_class: {record the letter |c|} begin incr(wlen); if wlen=max_len then begin print_buf; overflow('word length=',max_len:1); end; word[wlen]:=xint[c]; dots[wlen]:=no_hyf; dotw[wlen]:=word_wt; end; escape_class: {record a multi-character sequence starting with |c|} begin incr(wlen); if wlen=max_len then begin print_buf; overflow('word length=',max_len:1); end; get_letter(word[wlen]); dots[wlen]:=no_hyf; dotw[wlen]:=word_wt; end; invalid_class: bad_input('Bad character'); @.Bad character@> end; until buf_ptr=max_buf_len; found: incr(wlen); word[wlen]:=edge_of_word; end; @ Here is a procedure that uses the existing patterns to hyphenate the current word. The hyphenation value applying between the characters |word[dpos]| and |word[dpos+1]| is stored in |hval[dpos]|. In addition, |no_more[dpos]| is set to |true| if this position is ``knocked out'' by either a good or bad pattern at this level. That is, if the pattern with current length and hyphen position is a superstring of either a good or bad pattern at this level, then we don't need to collect count statistics for the pattern because it can't possibly be chosen in this pass. Thus we don't even need to insert such patterns into the count trie, which saves a good deal of space. @p procedure hyphenate; label done; var spos, @!dpos, @!fpos: word_index; @!t: trie_pointer; @!h: op_type; @!v: val_type; begin for spos:=wlen-hyf_max downto 0 do begin no_more[spos]:=false; hval[spos]:=0; fpos:=spos+1; t:=trie_root+word[fpos]; repeat h:=trie_outp(t); while h>0 do @; t:=trie_link(t); if t=0 then goto done; incr(fpos); Incr(t)(word[fpos]); until so(trie_char(t))<>word[fpos]; done: end; end; @ In order to avoid unnecessary test (and range check violations) the globals |hyf_min|, |hyf_max|, and |hyf_len| are set up such that only positions from |hyf_min| up to |wlen-hyf_max| of the |word| array need to be checked, and that words with |wlen= @!hyf_min, @!hyf_max, @!hyf_len: word_index; {limits for legal hyphens} @ @= hyf_min:=left_hyphen_min+1; hyf_max:=right_hyphen_min+1; hyf_len:=hyf_min+hyf_max; @ @= begin dpos:=spos+hyf_dot(h); v:=hyf_val(h); if (v=hyph_level) then {check if position knocked out} if ((fpos-pat_len)<=(dpos-pat_dot))and((dpos-pat_dot)<=spos) then no_more[dpos]:=true; h:=hyf_nxt(h); end @ The |change_dots| procedure updates the |dots| array representing the printing values of the hyphens. Initially, hyphens (and correctly found hyphens) in the word list are represented by |is_hyf| whereas non-hyphen positions (and erroneous hyphens) are represented by |no_hyf|. Here these values are increased by one for each hyphen found by the current patterns, thus changing |no_hyf| into |err_hyf| and |is_hyf| into |found_hyf|. The routine also collects statistics about the number of good, bad, and missed hyphens. @d incr_wt(#)==Incr(#)(dotw[dpos]) @p procedure change_dots; var dpos: word_index; begin for dpos:=wlen-hyf_max downto hyf_min do begin if odd(hval[dpos]) then incr(dots[dpos]); if dots[dpos]=found_hyf then incr_wt(good_count) else if dots[dpos]=err_hyf then incr_wt(bad_count) else if dots[dpos]=is_hyf then incr_wt(miss_count); end; end; @ The following procedure outputs the word as hyphenated by the current patterns, including any word weights. Hyphens inhibited by the values of \.{\\lefthyphenmin} and \.{\\righthyphenmin} are output as well. @p procedure output_hyphenated_word; var dpos: word_index;@/ @!l: triec_pointer; {for |write_letter|} begin if wt_chg then {output global word weight} begin write(pattmp,xdig[word_wt]); wt_chg:=false end; for dpos:=2 to wlen-2 do begin write_letter(word[dpos])(pattmp); write(pattmp,xext[word[dpos]]); if dots[dpos]<>no_hyf then write(pattmp,xhyf[dots[dpos]]); if dotw[dpos]<>word_wt then write(pattmp,xdig[dotw[dpos]]); end; write_letter(word[wlen-1])(pattmp); write_ln(pattmp,xext[word[wlen-1]]); end; @ For each dot position in the current word, the |do_word| routine first checks to see if we need to consider it. It might be knocked out or a dot we don't care about. That is, when considering hyphenating patterns, for example, we don't need to count hyphens already found. If a relevant dot is found, we increment the count in the count trie for the corresponding pattern, inserting it first if necessary. At this point of the program range check violations may occur if these counts are incremented beyond |triec_max|; it would, however, be too expensive to prevent this. @^range check violations@> @p procedure do_word; label continue, done; var spos, @!dpos, @!fpos: word_index; @!a: triec_pointer; @!goodp: boolean; begin for dpos:=wlen-dot_max downto dot_min do begin spos:=dpos-pat_dot; fpos:=spos+pat_len; @; incr(spos); a:=triec_root+word[spos]; while sposword[spos] then begin {insert new count pattern} a:=insertc_pat(fpos); goto done; end; end; done: if goodp then incr_wt(triec_good(a)) @+else incr_wt(triec_bad(a)); continue: end; end; @ The globals |good_dot| and |bad_dot| will be set to |is_hyf| and |no_hyf|, or |err_hyf| and |found_hyf|, depending on whether the current level is odd or even, respectively. The globals |dot_min|, |dot_max|, and |dot_len| are analogous to |hyf_min|, |hyf_max|, and |hyf_len| defined earlier. @= @!good_dot, @!bad_dot: hyf_type; {good and bad hyphens at current level} @!dot_min, @!dot_max, @!dot_len: word_index; {limits for legal dots} @ @= if procesp then begin dot_min:=pat_dot; dot_max:=pat_len-pat_dot; if dot_min= if no_more[dpos] then goto continue; if dots[dpos]=good_dot then goodp:=true else if dots[dpos]=bad_dot then goodp:=false else goto continue; @ If |hyphp| is set to |true|, |do_dictionary| will write out a copy of the dictionary as hyphenated by the current set of patterns. If |procesp| is set to |true|, |do_dictionary| will collect pattern statistics for patterns with length |pat_len| and hyphen position |pat_dot|, at level |hyph_level|. @= @!procesp, @!hyphp: boolean; @!pat_dot: dot_type; {hyphen position, measured from beginning of pattern} @!hyph_level: val_type; {hyphenation level} @!filnam: packed array[1..8] of char; {for |pattmp|} @ The following procedure makes a pass through the word list, and also prints out statistics about number of hyphens found and storage used by the count trie. @p procedure do_dictionary; begin good_count:=0; bad_count:=0; miss_count:=0; word_wt:=1; wt_chg:=false; reset(dictionary);@/ @@;@/ if procesp then begin init_count_trie; print_ln('processing dictionary with pat_len = ',pat_len:1, ', pat_dot = ',pat_dot:1); end; if hyphp then begin filnam:='pattmp. '; filnam[8]:=xdig[hyph_level]; rewrite(pattmp,filnam); print_ln('writing pattmp.', xdig[hyph_level]); end; @;@/ close_in(dictionary);@/ print_ln(' '); print_ln(good_count:1,' good, ',bad_count:1,' bad, ', miss_count:1,' missed'); if (good_count+miss_count)>0 then print_ln((100*good_count/(good_count+miss_count)):1:2,' %, ', (100*bad_count/(good_count+miss_count)):1:2,' %, ', (100*miss_count/(good_count+miss_count)):1:2,' %'); if procesp then print_ln(pat_count:1,' patterns, ',triec_count:1, ' nodes in count trie, ','triec_max = ',triec_max:1); if hyphp then close_out(pattmp); end; @ @= while not eof(dictionary) do begin read_word; if wlen>=hyf_len then {short words are never hyphenated} begin hyphenate; change_dots; end; if hyphp then if wlen>2 then output_hyphenated_word; {empty words are ignored} if procesp then if wlen>=dot_len then do_word; end @* Reading patterns. Before beginning a run, we can read in a file of existing patterns. This is useful for extending a previous pattern selection run to get some more levels. (Since these runs are quite time-consuming, it is convenient to choose patterns one level at a time, pausing to look at the results of the previous level, and possibly amending the dictionary.) @p procedure read_patterns; label done, found; var c: text_char; @!d: digit; @!i: dot_type; @!t: trie_pointer; begin xclass['.']:=letter_class; xint['.']:=edge_of_word; level_pattern_count:=0; max_pat:=0; reset(patterns); while not eof(patterns) do begin read_buf(patterns); incr(level_pattern_count);@/ @; found: @; end; close_in(patterns); print_ln(level_pattern_count:1,' patterns read in');@/ print_ln('pattern trie has ',trie_count:1,' nodes, ',@| 'trie_max = ',trie_max:1,', ',op_count:1,' outputs'); end; @ The global variable |max_pat| keeps track of the largest hyphenation value found in any pattern. @= @!max_pat: val_type; @ When a new pattern has been input into |buf|, we extract the letters of the pattern, and insert the hyphenation values (digits) into the |hval| array. @= pat_len:=0; buf_ptr:=0; hval[0]:=0; repeat incr(buf_ptr); c:=buf[buf_ptr]; case xclass[c] of space_class: goto found; digit_class: begin d:=xint[c]; if d>=max_val then bad_input('Bad hyphenation value'); @.Bad hyphenation value@> if d>max_pat then max_pat:=d; hval[pat_len]:=d; end; letter_class: begin incr(pat_len); hval[pat_len]:=0; pat[pat_len]:=xint[c]; end; escape_class: {record a multi-character sequence starting with |c|} begin incr(pat_len); hval[pat_len]:=0; get_letter(pat[pat_len]); end; hyf_class, invalid_class: bad_input('Bad character'); @.Bad character@> end; until buf_ptr=max_buf_len @ Then we insert the pattern for each non-vanishing hyphenation value. In addition we check that |edge_of_word| (i.e., |'.'|) occurs only as first or last character; otherwise we would have to perform a time consuming test for the end of a word in the |hyphenate| procedure. @= if pat_len>0 then {avoid spurious patterns} for i:=0 to pat_len do begin if hval[i]<>0 then insert_pattern(hval[i],i); if i>1 then if i end @* The main program. This is where \.{PATGEN} actually starts. We initialize the pattern trie, get |hyph_level| and |pat_len| limits from the terminal, and generate patterns. @p begin initialize; init_pattern_trie; read_translate; read_patterns; procesp:=true; hyphp:=false;@/ repeat print('hyph_start, hyph_finish: '); get_input(n1,n2);@/ if (n1>=1)and(n1=1)and(n20; hyph_level:=max_pat; {in case |hyph_finishhyph_start then print_ln(' ') else if hyph_start<=max_pat then print_ln('Largest hyphenation value ',max_pat:1, ' in patterns should be less than hyph_start');@/ @.Largest hyphenation value@> repeat print('pat_start, pat_finish: '); get_input(n1,n2);@/ if (n1>=1)and(n1<=n2)and(n2<=max_dot) then begin pat_start:=n1; pat_finish:=n2; end else begin n1:=0; print_ln('Specify 1<=pat_start<=pat_finish<=',max_dot:1,' !'); end; until n1>0; repeat print('good weight, bad weight, threshold: '); get_input(n1,n2,n3);@/ if (n1>=1)and(n2>=1)and(n3>=1) then begin good_wt:=n1; bad_wt:=n2; thresh:=n3; end else begin n1:=0; print_ln('Specify good weight, bad weight, threshold>=1 !'); end; until n1>0; @; delete_bad_patterns; print_ln('total of ',level_pattern_count:1, ' patterns at hyph_level ',hyph_level:1); end; find_letters(trie_link(trie_root),1); {prepare for output} rewrite(patout); output_patterns(trie_root,1); close_out(patout);@/ @; end_of_PATGEN: end. @ The patterns of a given length (at a given level) are chosen with dot positions ordered in an ``organ-pipe'' fashion. For example, for |pat_len=4| we choose patterns for different dot positions in the order 2, 1, 3, 0, 4. The variable |dot1| controls this iteration in a clever manner. @= @!n1, @!n2, @!n3: integer; {accumulators} @!i: val_type; {loop index: hyphenation level} @!j: dot_type; {loop index: pattern length} @!k: dot_type; {loop index: hyphen position} @!dot1: dot_type; @!more_this_level: array[dot_type] of boolean; @ The array |more_this_level| remembers which positions are permanently ``knocked out''. That is, if there aren't any possible good patterns remaining at a certain dot position, we don't need to consider longer patterns at this level containing that position. @= for k:=0 to max_dot do more_this_level[k]:=true; for j:=pat_start to pat_finish do begin pat_len:=j; pat_dot:=pat_len div 2; dot1:=pat_dot*2; repeat pat_dot:=dot1-pat_dot; dot1:=pat_len*2-dot1-1; if more_this_level[pat_dot] then begin do_dictionary; collect_count_trie; more_this_level[pat_dot]:=more_to_come; end; until pat_dot=pat_len; for k:=max_dot downto 1 do if not more_this_level[k-1] then more_this_level[k]:=false; end @ When all patterns have been found, the user has a chance to see what they do. The resulting \.{pattmp} file can be used as the new `dictionary' if we want to continue pattern generation from this point. @= procesp:=false; hyphp:=true;@/ print('hyphenate word list? '); get_input_ln(buf[1]); if (buf[1]='Y') or (buf[1]='y') then do_dictionary @* System-dependent changes. This section should be replaced, if necessary, by changes to the program that are necessary to make \.{PATGEN} work at a particular installation. It is usually best to design your change file so that all changes to previous sections preserve the section numbering; then everybody's version will be consistent with the printed program. More extensive changes, which introduce new sections, can be inserted here; then only the index itself will get a new section number. @^system dependencies@> @* Index. Pointers to error messages appear here together with the section numbers where each ident\-i\-fier is used.