Author: | Steffen (Daode) Nurpmeso |
---|---|
Contact: | steffen@sdaoden.eu |
Copyright: | ISC license |
Date: | 1997 - 2005, 2010, 2012 - 2020 |
Version: | 0.9.3 |
Status: | Red and Hot! |
S-Web42 is one more option to manage your website. It offers the possibility to expand a directory hierarchy of input files, some of which may consist of a mixture of PIs, (uninterpreted) (X)HTML or XML markup and normal text. In those which do, PIs can be defined and undefined, and they may be defined to take arguments (think format strings), which can be, e.g., a handy way to create complex tables; perl(1) code can be evaluated; files can be included, also recursively; simple control statements can be used to conditionalize file content; and some simple MarkLo tags help make editing life a bit easier, too.
The s-web42 converter is a perl(1) script, i.e., it needs an installed Perl, version 5.8.1 or above. It also requires the author's S-SymObj module -- you may install it as a regular part of your perl(1) installation by issuing the command $ cpan S-SymObj. Alternatively you may also get it from http://www.CPAN.org or clone the git repository from https://git.sdaoden.eu/scm/s-symobj.git (browse it at https://git.sdaoden.eu/browse/s-symobj.git).
    s-web42 [-v[v]]
    s-web42 [-v[v]] [--no-rc] --no-update-cache (or: --nuc)
    s-web42 [-v[v]] [--no-rc] --force-rebuild [--nuc]
    s-web42 [-v[v]] [--no-rc] --expand-one [FILE] (or: --eo [FILE])

    perl -C -I/PATH/TO/SymObj.pm s-web42

    PERL5OPT=-C PERL5LIB=/PATH/TO/SymObj.pm
    export PERL5OPT PERL5LIB
    s-web42
Use --no-rc to suppress reading of an existing config.rc file. The --no-update-cache option can be used to suppress an update of the cache database -- i.e., rerunning s-web42 will generate the very same output. With the --force-rebuild option an existing cache database can be ignored so that effectively everything is rebuilt from scratch; this mode may be combined with --no-update-cache.
The --expand-one option will read the program's standard input or FILE, if one was specified, expand and filter it just as described below, and write the resulting content to the program's standard output. This is an isolated and special mode in that none of the described website management actions are performed, except for reading in the optional configuration file, as described below.
The -v option can be used to gain some verbosity, using it twice will be even more verbose; if used in conjunction with --expand-one these messages will go to standard error instead. The last two examples show how to extend the perl(1) @INC path so that the required S-SymObj module will be found without installing it in a regular place, a task that often requires administrator privileges. And please be aware that no effort is put into parsing command line arguments: one needs to use the very format shown above.
Note
S-Web42 is charset agnostic. It reads and writes files, and simply reuses the actual character encoding that perl(1) has chosen to use. It is therefore highly desirable to either use the -C command line option of perl(1) or to set the PERL5OPT environment variable to this value, because only then does perl(1) try to reflect the user's locale settings with respect to Unicode-awareness of its input and output! Please see the perlrun(1) manual for more documentation, and the above examples on how to do that in a POSIX-compatible, Bourne or Korn shell.
If S-Web42 detects that perl(1) does not use UTF-8 I/O even though the user's locale is UTF-8 aware, it will complain noticeably.
S-Web42 consists of only one file: s-web42. In the repository there is also the script test.sh, which is the unit test of S-Web42, and header, footer, template.html, hook and config.rc, but these form a usage example only (somewhat primitive, and rather identical to the author's very own website); they are not required for operation.
Once invoked, the converter script s-web42 requires the subdirectory site to exist in the current working directory, since that is used as the input tree. It will also assume it can use the filenames cache.dat (the cache database), cache.old (database state before last run), and, temporarily, *.tmp in there. The generated shell archive will always be w42-update.sh, and it will not have any executable bits set.
If the file config.rc is found in the working directory, and the --no-rc command line option has not been used, it will be read -- the Assignments of PI variables seen there will form the outermost context and will thus be inherited by all files under site (PIs can be overridden and undefined for anything deeper in the hierarchy only). Here, and only here, is it possible to specify some very Special PI variables.
The presence of a file hook is also recorded, and it will be used as a per-directory fallback hook in all those site subdirectories which do not provide their own. The optional per-directory hooks can be used to create and/or modify the contents of the directory (and its subdirectories only) on the fly. It is a fatal error if such a hook is not an executable program or if it does not exit successfully. Hooks will be given one command line argument, and that is the current working directory from within which the hook is run.
Then the converter will enter the site directory and recursively parse the tree therein. An existing per-directory hook is run at that point, before deeper levels of the hierarchy are entered. Note that the directory entries SCCS, CVS and anything which starts with a dot (thus also .git and .RCS) are completely ignored by S-Web42, except that files may be included when placed in the WHITELIST variable of config.rc. Also, it is not possible to safely create directories on the fly, except in deeper, not yet parsed hierarchy levels below the current directory.
Note
S-Web42 will not handle paths with embedded quotation marks, i.e., neither " nor ' characters may be used for directory- and filenames. You should possibly be conservative about filenames in general, mostly in respect to embedded whitespace, as the implemented quoting rules are primitive, yet sufficient.
After the entire tree has been traversed like that once, the content of all directories will be reread, and the resulting tree will be compared against the version recorded in the cache database. Anything which is missing or which seems to be modified will be rebuilt, either as a bitwise copy of the source or as the result of S-Web42 filtering, as below. Filtering will be applied to all files whose name includes the string -w42, or, to be exact, whose name ends with a string that matches (.*?)(-w42(?:-(x|[icewpatsm]+))?(-new)?)$.
S-Web42 assumes that a file is modified if its modification timestamp is different from that recorded in the cache database (but see IGNORE_MODTIME), and if its MD5 checksum is different from the recorded one. One can force rebuilding of S-Web42-filtered files on a per-file basis by using a filename that includes the suffix -new, as in index.html-w42-new -- neither modification time nor input checksums will matter for such files, they will always be rebuilt. Input checksum of S-Web42-filtered files means that the configuration and the Assignments part have been fully expanded, but anything thereafter is treated bitwise.
Note
Files included via the <?include?> PI are not checked, i.e., no dependency tracking is performed. The Assignments directive ?include? will however be resolved and therefore affects the input checksum that may cause file rebuilding.
Any file rebuilt non-bitwise is then checksummed again, and will only be part of the result if the MD5 checksum of the rebuilt target is different from the checksum stored in the cache database, but not otherwise. (The only exception to this rule is if the user uses the --force-rebuild command line option, as that will rebuild everything from scratch.)
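The input-side change detection described above can be sketched as follows. This is a minimal Python illustration, not the converter's Perl code: the cache layout (a dict mapping a path to its recorded timestamp and MD5 checksum) and the names md5_hex and is_modified are assumptions made for this example, and IGNORE_MODTIME is left out.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 checksum of data as a hex string."""
    return hashlib.md5(data).hexdigest()


def is_modified(path, mtime, data, cache, forced=False):
    """Decide whether an input file needs to be rebuilt.

    cache is a hypothetical dict: path -> {"mtime": int, "md5": str}.
    forced models the -new filename suffix, which always rebuilds.
    """
    if forced:
        return True
    record = cache.get(path)
    if record is None:
        # Not recorded in the cache database: treat as missing.
        return True
    if record["mtime"] == mtime:
        # Same timestamp: considered unchanged, checksum is not consulted.
        return False
    # Timestamp differs: only a differing checksum counts as modified.
    return record["md5"] != md5_hex(data)
```

A file that was merely touched (new timestamp, same content) is thus not rebuilt in this sketch, which mirrors the "timestamp and checksum" wording of the text.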
So, what will the result look like? One may be astonished, but the result will be generated as a shell archive with uuencode(1)d (and, optionally, COMPRESSed) members. These archives can either be used to update a local mirror ($ sh w42-update.sh --local TARGET-PATH) or as a sftp(1) batchfile ($ sh w42-update.sh --sftp TMP-PATH | sftp -b - user@host[:dir]; rm -rf TMP-PATH). In the latter case TMP-PATH must be some temporary path that can be used by the archive to unpack its contents therein (it will issue a mkdir TMP-PATH, i.e., create that directory as necessary).
During operation, the generated shell archive will try to ignore any errors, continuing its operation until all commands have been issued. E.g., in sftp(1) batch mode, the termination-on-error is suppressed by prefixing commands with a hyphen -. Any error, and some other informational messages will be logged to standard error, however.
File and directory removals will also be properly handled. However, if the cache database (a textfile) is lost, then the only solution is to delete the target directory manually and to rebuild it with a new from-scratch archive.
S-Web42 does not look into files unless their name ends with a special suffix. If that is not seen, files will be treated as bins of undefined binary content and handled bitwise. If, on the other hand, the filename ends with the suffix (.*?)(-w42(?:-(x|[icewpatsm]+))?(-new)?)$, then it will be subject to content filtering, as described in the rest of this document. Let us inspect that cryptic expression:
If, after the ignored part, the string -w42 is seen, as in index.html-w42, then this file is flagged as being subject to S-Web42 content filtering.
Note
The trailing S-Web42 specific part will be stripped from the filename so that, e.g., an input file index.html-w42-new will produce an output file index.html.
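To see what the suffix expression captures, it can be tried out in isolation. The following Python sketch uses the expression exactly as quoted in the text; the function name classify and the shape of its return value are assumptions for illustration only.

```python
import re

# The suffix expression quoted in the text, used unchanged.
SUFFIX = re.compile(r"(.*?)(-w42(?:-(x|[icewpatsm]+))?(-new)?)$")


def classify(filename):
    """Return (output_name, mode_flags, forced), or None for bitwise files."""
    m = SUFFIX.match(filename)
    if m is None:
        # No -w42 suffix: the file would be copied bitwise.
        return None
    base, _suffix, mode, new = m.groups()
    return base, mode, new is not None


for name in ("index.html-w42", "index.html-w42-new",
             "page.html-w42-x", "page.html-w42-ip-new", "logo.png"):
    print(name, "->", classify(name))
```

The first capture group is the output filename, the third holds the optional mode configuration, and the fourth is set only when the -new suffix is present.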
Moreover, all files that are subject to S-Web42 content filtering (and config.rc, but this is a special case in that S-Web42 will complain if it contains anything else but the Assignments part) have to comply with a very specific content layout scheme, henceforth called an S-Web42 context:
These filter operations will be performed on the Assignments part of all contexts which are subject to filtering. They will also be applied to the Content part unless the mode configuration suffix was -x or otherwise excluded these actions:
These filter operations will be performed on the Content part unless the mode configuration excluded them:
These filter operations will be performed on the Content part unless the mode configuration suffix was -x or otherwise excluded these actions:
If a textblock is surrounded by empty lines it will be enclosed in a <p></p> pair unless the block seems to be enclosed in a tag, or unless so-called "mode-switching" PIs are used within it. I.e., no automatic paragraph will be provided for an otherwise perfectly legal textblock if an <?include?> directive is contained therein, or <?perl?> or even a <?pre?>. Automatic paragraphs are meant for human friendly editing of, well, paragraphs, not for fancy markup. By starting such a paragraph with a special trigger character sequence several different kinds of markup can be generated automatically:
Example:
    == It is not a Wiki!

    You would not believe what i saw:

    * Cats
    * Mice
    * Birds

    -------

    _ Wow!
    _ Or Wuff-Wuff!
    =>
    <h2>It is not a Wiki!</h2>
    <p>You would not believe what i saw:</p>
    <ul><li><p>Cats</p></li><li><p>Mice</p></li><li><p>Birds</p></li></ul>
    <hr />
    <blockquote><p>Wow!</p><p>Or Wuff-Wuff!</p></blockquote>
The DOM standard (http://www.w3.org/TR/DOM-Level-3-Core) is used (by browsers and such poor software) to transform site content to DOM objects. Unfortunately code like:
    </p>
    <p>
may result in a useless DOM object covering the newline in between the two tags. To circumvent that, S-Web42 tries to join tags like these together.
Some MarkLo markers will be converted to markup. Expanded content is reevaluated until no more expansion is possible. MarkLo detection and expansion is neither performed across newline boundaries nor PI occurrences, and there is no possibility to escape MarkLo expansion (via backslash escaping for example) except by turning it off entirely. It is possible to embed a closing brace by escaping it like that, however:
    \c{CONTENT} -> <tt>CONTENT</tt>
    \i{CONTENT} -> <em>CONTENT</em>
    \b{CONTENT} -> <strong>CONTENT</strong>
    \u{CONTENT} -> <u>CONTENT</u>
    # (These are a bit special, but nice to use)
    \a{NAME} -> <a name="NAME"></a>
    \l{LINK} -> <a href="LINK">LINK</a>

    \i{I \b{really \u{{love\}} you}, baby!}
    -> <em>I <strong>really <u>{love}</u> you</strong>, baby!</em>
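The nesting and escaping behaviour of these markers can be imitated with a small Python sketch. It is not the recursive regular expression that the converter uses; it takes a simpler route and expands the innermost marker first, repeating until nothing is left to expand. The names marklo, INNER and ESC are assumptions for this example, and only the six markers listed above are covered.

```python
import re

TAGS = {"c": "tt", "i": "em", "b": "strong", "u": "u"}

# An innermost marker: its content holds neither a closing brace
# nor the start of another marker.
INNER = re.compile(r"\\([cibual])\{((?:(?!\\[cibual]\{)[^}])*)\}")

# Placeholder that protects escaped closing braces during expansion.
ESC = "\x00"


def _render(match):
    kind, body = match.group(1), match.group(2)
    if kind == "a":
        return f'<a name="{body}"></a>'
    if kind == "l":
        return f'<a href="{body}">{body}</a>'
    return f"<{TAGS[kind]}>{body}</{TAGS[kind]}>"


def marklo(line):
    """Expand MarkLo markers on a single line, innermost first."""
    line = line.replace("\\}", ESC)
    while True:
        line, count = INNER.subn(_render, line)
        if count == 0:
            break
    return line.replace(ESC, "}")


print(marklo(r"\i{I \b{really \u{{love\}} you}, baby!}"))
```

Escaped closing braces are hidden behind a placeholder until all markers are expanded, so that a restored brace cannot end an outer marker early.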
Finally a -w42 file content example:
    WHO = Ziggy
    <?begin?>
    Yo S-Web42.

    <?WHO?>.
    # This is a comment line.
        # Yet another comment line
    This LN \
    use\
    s \i{NL} escaping.\\\\
    <?end?>
    ->
    <p>Yo S-Web42.</p><p>Ziggy. This LN uses <em>NL</em> escaping.\\</p>
But -- feel free:
<?begin?>\i{Hello}, S-Web42.<?end?> -> <em>Hello</em>, S-Web42.
The content is prefixed by the (PI) variable assignment block.
    var1 = content of var1
    var2 ?= conditional assignment (if not yet defined)
    var3 += content (assigned or) added to var3
    var4 @= var4 will be an array, this is the first value
    var4 @= the var4 array gains another value
    var5 ?@= conditional array assignment (if not yet defined)
    var5 @= append a member to the now anyway existent var5 array
    var6 = <em>markup</em><?def foo<>stupid example?><?foo?><?undef foo?>
    ?include? = path
All PI variables defined like that may contain any content, including complete PIs and markup. There is no technical difference in between these and stuff defined via def and defa (and defx) with the exceptional possibility to include the PI start and close tags <? and ?>, respectively (as shown in the var6 example). S-Web42 PI and variable names may consist of alphabetical characters, digit characters and the hyphen (-). Note that the case matters, just as usual for XML processing.
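How the five assignment operators differ can be shown with a short Python sketch. It is an illustration only: the environment is a plain dict, the name pattern is simplified to "anything but whitespace and operator characters", += is assumed to concatenate without a separator, and the ?include? directive is not handled. The names ASSIGN and apply_assignment are hypothetical.

```python
import re

# Longer operators come first so that "?@=" is not read as "?=" or "=".
ASSIGN = re.compile(r"^\s*([^\s=?+@]+)\s*(\?@=|\?=|\+=|@=|=)\s*(.*)$")


def apply_assignment(env, line):
    """Apply one assignment line to env (a plain dict) and return env."""
    m = ASSIGN.match(line)
    if m is None:
        raise ValueError(f"not an assignment: {line!r}")
    name, op, value = m.groups()
    if op == "=":
        env[name] = value
    elif op == "?=":
        # Conditional: only if not yet defined.
        env.setdefault(name, value)
    elif op == "+=":
        # Assumption: plain concatenation without a separator.
        env[name] = env.get(name, "") + value
    elif op == "@=":
        # Array append, creating the array on first use.
        env.setdefault(name, []).append(value)
    elif op == "?@=":
        # Conditional array assignment: only if not yet defined.
        if name not in env:
            env[name] = [value]
    return env
```

Feeding the lines of the example block above through apply_assignment, one after another, leaves var4 and var5 as two-element lists and all other variables as strings.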
The ?include? directive can be used to include assignment directives from other files, also recursively. It is not allowed to start the Content section whilst doing so.
Note
If the path starts with a tilde ~, that will be replaced by the value of the environment variable HOME. Likewise, if the path starts with a plus sign +, then that will be replaced with the content of the environment variable WEB42INC. Note it is really that simple.
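The prefix rule can be written down in a few lines of Python. This sketch assumes a plain textual replacement of the first character, with no path separator added, which is how "really that simple" is read here; the function name resolve_path is hypothetical.

```python
def resolve_path(path, environ):
    """Replace a leading ~ or + by the value of HOME or WEB42INC."""
    if path.startswith("~"):
        return environ.get("HOME", "") + path[1:]
    if path.startswith("+"):
        # Assumption: WEB42INC is used verbatim, so it needs its own
        # trailing slash if one is wanted.
        return environ.get("WEB42INC", "") + path[1:]
    return path


env = {"HOME": "/home/me", "WEB42INC": "/srv/inc/"}
print(resolve_path("~/site/header", env))
print(resolve_path("+header", env))
```

In real use the second argument would be os.environ; a plain dict is passed here to keep the example independent of the local environment.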
Note
By definition there is really no notion of a "chroot". One may leave site absolutely or relatively, just as desired.
Everything in between the <?begin?> and <?end?> PIs is expanded and will thus produce real output. Line content is recursively expanded until no further expansion is possible.
Any content after the <?end?> PI is not parsed at all.
There are a few PI variables which are treated in a special way, either because they can only be set in config.rc, or they are readonly PIs that are provided automatically, or because they are used by Processing Instructions (PIs) as content-injection hooks.
Readonly. Here _A stands for "array" and _S for "string". These are readonly PIs that correspond to the modification time of the currently processed (outermost) file, in UTC and LOCAL time, respectively. The order of the arrays is: 0=year, 1=month, 2=day, 3=hour, 4=minute, 5=second. The entry at index 6 is the string "UTC" for the UTC versions and the offset from UTC in the ISO 8601:2000 standard format (+hhmm or -hhmm) otherwise. The strings use the format "YYYY-MM-DD HH:MM:SS" and, again, the UTC version appends the string " UTC" whereas the local version appends the string " +-ZONEOFFSET".
Note
If the Time::Piece perl(1) module cannot be loaded (it is believed to be a standard module since Perl version 5.10), then S-Web42 will only provide UTC values, even for the local versions. I.e., the local ones are only aliases, then.
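The layout of the array and string forms can be illustrated with a short Python sketch. It only mirrors the format described above; whether the array members are padded strings or plain numbers is not stated, so integers are assumed here, and the function name modtime_values is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def modtime_values(epoch_seconds, tz=timezone.utc):
    """Return (array, string) for a modification time in the given zone."""
    dt = datetime.fromtimestamp(epoch_seconds, tz=tz)
    # Index 6: "UTC" for the UTC version, otherwise the +hhmm/-hhmm offset.
    zone = "UTC" if tz is timezone.utc else dt.strftime("%z")
    array = [dt.year, dt.month, dt.day, dt.hour, dt.minute, dt.second, zone]
    string = dt.strftime("%Y-%m-%d %H:%M:%S") + " " + zone
    return array, string


print(modtime_values(0))
print(modtime_values(0, timezone(timedelta(hours=2))))
```

The second call uses a fixed two-hour offset in place of the real local zone so that the result does not depend on the machine it runs on.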
The list of predefined processing instructions follows. PIs marked "paired" in the list below are special in that they need a closing end tag -- to, e.g., end the paired PI <?perl?> the PI <?perl end?> is necessary.
For def, defa (also via implicit array vivification, as below) and defx the same name restrictions apply to the introduced PI variable name as have been documented for Assignments. Note that the content that is assigned to PI variables created by those PIs may not contain PIs themselves, of course. It is however possible to use the pseudo-tags <^ and ^> as aliases there, which will be automatically converted to <? and ?>, respectively, via a simple regular expression, once the PI variable is expanded; to create an empty tag <>, the alias <^> has been provided for completeness' sake. See def and defx for examples.
Define a PI which expands to its value part when used.
    <?def var1<>varcontent, may contain <em>tagsoup</em>?>
    <?def var2<>any <> content but PI start and end tags?>
    <?def var3<><^def var4<>es^><^var4^><^undef var4^>?>
    ...
    <?var1?>
    <?var2?>
    T<?var3?><?pi-if var4?>T
    ->
    varcontent, may contain <em>tagsoup</em>
    any <> content but PI start and end tags
    TesT
Define or extend a PI that serves as an array. Separate members are indicated by placing the empty tag <> in the value content (which is different to def, which would simply expand the empty tag). Individual array members may be accessed by "calling" the array with the desired member index (starting at 0) as an argument:
    <?defa arrnam<>m 1?>
    <?defa arrnam<>m 2?>
    <?defa arrnam<>m 3<>m 4?>
    ...
    <p><?arrnam 0?><?arrnam 1?>\
    <?arrnam 2?><?arrnam 3?></p>
    ->
    <p>m 1m 2m 3m 4</p>
One may loop over an entire array by giving loop as the first argument. In this mode two additional, optional arguments may be given; the first will be injected before the value, the second after the value. E.g.:
    <p><?arrnam loop<><b><></b>?></p>
    ->
    <p><b>m 1</b><b>m 2</b><b>m 3</b><b>m 4</b></p>
If the first argument is instead one of unshift and push, then the remaining arguments are joined to a single one and inserted at the front/the back of the array, respectively, auto-vivifying the array as necessary. Likewise, an argument of one of shift and pop removes the first/the last member of the array, causing a log message in verbose mode if the array does not have any members. And an argument undef-empty will undef the array if it is empty. E.g.:
    <?test push<>entry 1?>
    <?test unshift<>entry 0?>
    <?test push<>entry 2?>
    <?test loop<><> ?><br />
    <?test pop?>
    <?test shift?>
    <?test loop<><> ?><br />
    <?test pop?>
    <?test loop<><> ?><br />
    <?test undef-empty?>
    ->
    entry 0 entry 1 entry 2 <br />entry 1 <br /><br />
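The array "calls" shown above can be modelled with a small Python function. This is a sketch of the described behaviour, not the converter's code: the environment is a plain dict of lists, the remaining arguments of push and unshift are assumed to be joined without a separator, and the verbose log message for an empty array is omitted. The name call_array is hypothetical.

```python
def call_array(env, name, arg_string=""):
    """Evaluate one array call and return the text it expands to."""
    args = arg_string.split("<>") if arg_string else []
    op = args[0] if args else ""
    if op in ("push", "unshift"):
        # Assumption: remaining arguments are joined without a separator.
        value = "".join(args[1:])
        array = env.setdefault(name, [])  # auto-vivify as necessary
        if op == "push":
            array.append(value)
        else:
            array.insert(0, value)
        return ""
    if op in ("pop", "shift"):
        array = env.get(name, [])
        if array:
            array.pop() if op == "pop" else array.pop(0)
        return ""
    if op == "undef-empty":
        if not env.get(name):
            env.pop(name, None)
        return ""
    if op == "loop":
        before = args[1] if len(args) > 1 else ""
        after = args[2] if len(args) > 2 else ""
        return "".join(before + v + after for v in env.get(name, []))
    # Otherwise the argument is a member index, starting at 0.
    return env[name][int(op)]
```

Replaying the calls of the example above in order, and appending "<br />" after each loop, yields the output line shown there.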
Define a PI which takes arguments (think format strings). Arguments are indicated by placing the empty tag <> in the value content -- those tags will be replaced by the corresponding user-supplied argument when used (which is different to def, which would simply expand the empty tag).
    <?defx var2<>Expanded <> var <> content <>?>
    <?defx note<><em><></em>: <>.?>
    <?defx subscription<><^note For subscribers only<^><>^>?>
    ...
    <?var2 arg1<>arg2<>arg3?>
    <?subscription nonexistent list?>
    ->
    Expanded arg1 var arg2 content arg3
    <em>For subscribers only</em>: nonexistent list.
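The format-string idea behind defx boils down to filling each empty tag with the next argument. A minimal Python sketch of just that substitution step follows; what happens when fewer arguments than placeholders are given is not documented here, so an empty string is assumed, and the name expand_defx is hypothetical.

```python
def expand_defx(template, arg_string):
    """Fill each <> in template with the next <>-separated argument."""
    args = arg_string.split("<>")
    parts = template.split("<>")
    out = [parts[0]]
    for index, part in enumerate(parts[1:]):
        # Assumption: a missing argument expands to nothing.
        out.append(args[index] if index < len(args) else "")
        out.append(part)
    return "".join(out)


print(expand_defx("Expanded <> var <> content <>", "arg1<>arg2<>arg3"))
print(expand_defx("<em><></em>: <>.",
                  "For subscribers only<>nonexistent list"))
```

The second call corresponds to what the subscription PI of the example passes on to note once its <^ ... ^> aliases have been converted.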
These pure convenience PIs expand to (X)HTML hyperlinks; the first two should be used to create links which leave the site (see WWW_PREFIX and WWW_SUFFIX), the other two for site-local ones. The versions without the t suffix take one argument, the others will also create a title attribute and thus require two arguments.
    <?href http://www.netbsd.org?>
    -> <a href="http://www.netbsd.org">www.netbsd.org</a>
    <?hreft http://www.opensource.org<><em>Lots</em> of licenses?>
    -> <a href="http://www.opensource.org" title="Lots of licenses"
       ><em>Lots</em> of licenses</a>
Simple conditional control statements which test for (non)existence of a variable (PI) and process the enclosed block only if the condition is true. (You may also "test" for 0, which evaluates as not defined.)
    <p>
    <?ifndef HOMEPAGE?>
    <?lreft index.html<>[HOME]?>
    <?else?>
    Jo-ho hooo -- welcome on my homepage, dude!
    <?fi?>
    </p>
The PI <?include?> can be used to include a file that itself will be subject to the same expansion and filtering that is in use for the including file, i.e., it must form a valid S-Web42 context. Paths are interpreted relative to the source directory of the file which uses the include directive.
<?xinclude?> is similar to <?include?>, except that the included file is expected to (implicitly) consist of Content only, so that its inclusion may modify the PI environment of the current context.
The <?raw_include?> PI will simply include the given file raw and without any S-Web42 processing. <?frank_include?> will also include the given file raw, but it will process the lines and escape &, < and > characters, so that HTML parsers will be able to display the (rather) raw content as desired.
    HOME = ../index.html
    <?begin?>
    <?include ../../header?>
    Hello World, v2.
    <pre>
    <?frank_include /etc/passwd?>
    </pre>
    <?include ../../footer?>
    <?end?>
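The escaping that distinguishes <?frank_include?> from <?raw_include?> is the usual one for showing arbitrary text inside HTML. As an illustration in Python, the standard html.escape function with quote=False covers exactly the three characters named above; the function names frank_escape and frank_include are hypothetical, and UTF-8 input is an assumption of this sketch.

```python
import html


def frank_escape(text):
    """Escape &, < and > so that HTML parsers display the text as is."""
    return html.escape(text, quote=False)


def frank_include(path):
    """Read a file line by line and return its escaped content."""
    with open(path, encoding="utf-8") as handle:
        return "".join(frank_escape(line) for line in handle)


print(frank_escape("if (a < b && c > d) puts(\"ok\");"))
```

Quotation marks are left alone on purpose, since the text only mentions &, < and >.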
Note
If the path starts with a tilde ~, that will be replaced by the value of the environment variable HOME. Likewise, if the path starts with a plus sign +, then that will be replaced with the content of the environment variable WEB42INC. Note it is really that simple.
Note
By definition there is really no notion of a "chroot". One may leave site absolutely or relatively, just as desired.
These are paired statements which can be used to embed perl(1) or sh(1) code, respectively. The code will be executed in a subprocess, with its working directory set to that of the topmost context, i.e., the file for which output is currently being produced.
The output that the subprocess produces on its standard output channel will be subject to the same expansion and filtering that is in use for the surrounding context. In fact the output is expected to form a valid S-Web42 context except for the x-versions, which are expected to produce Content output (implicitly) only, and which thus modify the PI environment of the current context.
The code is subject to the normal S-Web42 processing before it is passed through to the subprocess that runs the interpreter. I.e., PI expansion can be used to "pass arguments" through. This raises the question how a valid S-Web42 context can be produced if PIs are expanded during code evaluation. Well, it turns out that S-Web42 injects four variables automatically:
PIS | <? |
PIE | ?> |
BEGIN | <?BEGIN?> |
END | <?END?> |
(Since it is not necessary to quote the closing ?> of a PI, $PIE is provided only for completeness' sake.) So -- unfortunately one has to perform the uncomfortable task of PI building via string concatenation to avoid unwanted data expansions. E.g.:
    <?perl?>
    my $gmd = gmtime;
    my $lmd = localtime;
    my $user = fetch_user();
    print <<__EOT__;
    USER = $user
    ${BEGIN}
    ${PIS}include +header?>
    <h1>Hi ${PIS}USER${PIE}!</h1>
    It is ${lmd}. That is ${gmd} UTC!
    ${PIS}include +footer?>
    ${END}
    __EOT__
    <?perl end?>

    # This works:
    $ echo '<?begin?>START<?sh?>' \
    > 'printf "${PIS}BEGIN?>IN${PIS}END?>" | tr [:upper:] [:lower:]' \
    > '<?sh end?>END<?end?>' | s-web42 --no-rc --eo

    # An <?eval PERL-EXPR?> may at some time be a regular PI..
    $ echo '<?begin?><?defx eval<><^xperl^><><^xperl end^>?>\
    > <?eval my $t=gmtime; print $t?><?end?>' | s-web42 --no-rc --eo
Note
The entire -- expanded -- script is read into memory before being passed to the interpreter that runs in the subprocess and will eval the script. These PIs cannot be nested (especially not with each other).
The "preformatted" paired PIs. While these PIs are active none of the filters that have been documented in About.Filter are active except for the PI expansion filter (indeed they create a new context for their content). Since that is mostly of interest inside of HTML <pre> tags, the expansion of the <?xpre?> PI contains this tag already. And the <?xcdata?> PI enwraps the content in a CDATA section in addition.
Note
Leading whitespace on the line of the <?xpre end?> will of course be copied through to the output. This may look odd in an otherwise beautifully indented source.
Undefine a user-defined array or PI.
Note
This undefines in the current context only! I.e., if file1 defines the PI DOIT and then includes file2, then an <?undef DOIT?> from within file2 will not affect file1, but only file2 and deeper contexts.
S-Web42 no longer needs external programs for creating the shell archive. (The performance increase i saw was pretty dramatic.)
The <?mode?> PI has been rethought: using it for adjustments no longer works on a global basis, i.e., changes will not affect outer contexts.
Before that rewrite a long standing bug of <?mode?> had been fixed which mattered for false use cases where the reverting <?mode %?> had been forgotten.
Thanks to Ian Abbott, who alluded to the missing <?xcdata?> and <?frank_include?> PIs (at least indirectly through statements on the tz AT iana DOT org mailing list).
Thanks to Dave Mitchell from perl-porters for giving the hint on backtracking that made the fancy recursive regular expression that is used for MarkLo proof against user syntax mistakes. (I never use and do not really understand these super-fancy extended regular expressions beyond assertions, but remembered the example from the documentation and was too lazy to write the hand-driven recursive parser for that, so i tried it.)
Add a Markdown compatible syntax block.
With automatic paragraphs too many flushes occur, and tagsoup joining is not yet implemented.
The conversion phase can be parallelized without much (in fact almost no) effort; add a special MAXJOBS or similar special config.rc PI variable to control this, then.
An <?eval PERL-EXPR?> could at some time become a regular PI that simply evaluates in the current context. Maybe lazily start a concurrent child and communicate with that for this purpose (pay once, when needed). If parallelized, one such evaluator for each worker thread/process.
Copyright (c) 1997 - 2024, Steffen Nurpmeso <steffen@sdaoden.eu>
@(#)code-web42.html-w42 1.38 2022-12-22T18:41:43+0000