type THtmlTemplateParser = class(TObject)
This is the template processor class which can apply a template to one or more html documents.
You can use it by calling the methods parseTemplate and parseHTML. parseTemplate loads a template, and parseHTML matches that template against an html/xml file.
A template file is just like an html file with special commands. The parser then matches every text node and tag of the template to a text node or tag in the html file, while ignoring any additional data in the latter file. If no match is possible, an exception is raised.
The template can extract values from the html file into variables, and you can access these variables through the properties variables and variableChangeLog. The former contains only the final value of each variable; the latter records every assignment made during the matching of the template.
Getting started
Creating a template to analyze an xml file or webpage:
First, you should remove all things from the webpage that are uninteresting, dynamically generated or invalid xml (or alternatively start with an empty file as template).
Then, you should replace all parts that you want to extract with <t:s>yourVariableName:=text()</t:s>.
This will write the value of the text node that contains the t:s tag into the variable yourVariableName.
Instead of the t:s tag, you can also use the short notation {yourVariableName:=text()}. Instead of text(), which reads the text node, you can also use @attrib to read an attribute, or an arbitrarily complex xpath/xquery expression.
At that point the template is finished, at least for the trivial cases.
If you want to read several elements like table rows, you need to surround the matching tags with template:loop, e.g. <template:loop><tr>..</tr></template:loop>, and everything between the loop tags is repeated as often as possible. You can also use the short notation by adding a star, like <tr>..</tr>* .
Using the template from Pascal:
First, create a new THtmlTemplateParser: parser := THtmlTemplateParser.create()
Load the template with parser.parseTemplate('..template..') or parser.parseTemplateFile('template-file')
Process the webpage with parser.parseHTML('..html..') or parser.parseHTMLFile('html-file')
Read the result of variable yourVariableName through parser.variables.values['yourVariableName']
If you used loops, only the last value of the variable is available in the variables property; the previous values can be enumerated through variableChangeLog.
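The steps above can be sketched as a small Free Pascal program. This is a minimal sketch under assumptions: the unit name extendedhtmlparser is inferred from the test file name mentioned in this documentation, and the value returned by variables.values[...] is assumed to be directly printable. Depending on the library version, an explicit conversion to string may be needed.

```pascal
program TemplateDemo;
{$mode objfpc}{$H+}

uses
  extendedhtmlparser; // assumed unit name, inferred from tests/extendedhtmlparser_tests.pas

var
  parser: THtmlTemplateParser;
begin
  parser := THtmlTemplateParser.create;
  try
    // Load a template that stores the text of the <b> element in the variable "test".
    parser.parseTemplate('<b><t:s>test:=text()</t:s></b>');
    // Match the previously loaded template against a document.
    parser.parseHTML('<b>Hello World!</b>');
    // Read the final value of the variable.
    // Assumption: the value is string-compatible in your library version.
    writeln(parser.variables.values['test']);
  finally
    parser.free; // release the parser even if matching raised an exception
  end;
end.
```

The try..finally block matters here because, by default, a failed match raises an exception.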
Template examples
Template: <b><template:read var="test" source="text()"/></b>
Html-File: <b>Hello World!</b>
This will set the variable test to "Hello World!"
Template: <b><t:s>test:=.</t:s></b>
Html-File: <b>Hello World!</b>
This will also set the variable test to "Hello World!"
Template: <b>{test:=.}</b>
Html-File: <b>Hello World!</b>
This will also set the variable test to "Hello World!"
Template: <template:loop><b><t:s>test:=.</t:s></b></template:loop>
Html-File: <b>Hello World!</b>
This will also set the variable test to "Hello World!"
Template: <table> <template:loop> <tr> <td> <template:read var="readField()" source="text()"/> </td> </tr> </template:loop> </table>
Html-File: <table> <tr> <td> row-cell 1 </td> </tr> <tr> <td> row-cell 2 </td> </tr> ... <tr> <td> row-cell n </td> </tr> </table>
This will read row after row, and will write each first field to the change log of the variable readField() .
Template: <table> <template:loop> <tr> <td> <template:read var="readField1()" source="text()"/> </td> <td> <template:read var="readField2()" source="text()"/> </td> <td> <template:read var="readField3()" source="text()"/> </td> ... </tr> </template:loop> </table>
Html-File: <table> <tr> <td> a </td> <td> b </td> <td> c </td> </tr> ... </table>
This will read readField1()=a, readField2()=b, readField3()=c...
Of course, you can use your own names instead of readFieldX(), and they are independent of the html file. Such templates can therefore convert several pages with different structures to the same internal data layout of your application.
Template: <template:loop> <tr> <template:read var="readAnotherRow()" source="deep-text(',')"/> </tr> </template:loop>
Html-File: ... <tr> <td> a </td> <td> b </td> <td> c </td> </tr> <tr> <td> foo </td> <td> bar </td> </tr> ...
This will read all rows, and write lines like a,b,c and foo,bar to the changelog.
Template: <li template:condition="filter(text(), '1*:') != filter(text(), '^1?:|^(11+?)\1+:')"><template:read var="prime" source="text()"/></li>
Html-File: ... <li>1111: this is 4</li><li>1:1 is no prime</li><li>1111111: here is 7</li><li>11111111: 8</li> ...
This will return "1111111: here is 7", because 1111111 is the first prime in that list.
Template:
<form>
<template:switch>
<input t:condition="(@type = ('checkbox', 'radio') and exists(@checked)) or (@type = ('hidden', 'password', 'text'))">{post:=concat(@name,'=',@value)}</input>
<select>{temp:=@name}<option t:condition="exists(@selected)">{post:=concat($temp,'=',@value)}</option>?</select>
<textarea>{post:=concat(@name,'=',text())}</textarea>
</template:switch>*
</form>
Html-File: any form
This example will extract from each relevant element in the form the name and value pair that is sent to the webserver. It is very general and will work with all forms, independent of things like nesting depth. Therefore it is a little bit ugly; but if you create a template for a specific page, you usually know which elements you will find there, so the template becomes much simpler in practical cases.
See the unit tests in tests/extendedhtmlparser_tests.pas for more examples.
Template reference
Basically the template file is a html file, and the parser tries to match the structure of the template html file to the html file.
A tag of the html file is considered as equal to a tag of the template file, if the tag names are equal, all attributes are the same (regardless of their order) and every child node of the tag in the template is also equal to a child node of the tag in the html file (in the same order and nesting).
Text nodes are considered equal if the text in the html file starts with the whitespace-trimmed text of the template file. All comparisons are case-insensitive.
The matching occurs with backtracking, so it will always find the first and longest match.
The following template commands can be used:
<template:read var="??" source="??" [regex="??" [submatch="??"]]/>
The XPath-expression in source is evaluated and stored in variable of var.
If a regex is given, only the matching part is saved. If submatch is given, only the submatch-th match of the regex is returned (e.g. b is the 2nd match of "(a)(b)(c)"). However, you should use the xq-function filter instead of the regex/submatch attributes, because the former is more elegant.
<template:s>var:=source</template:s>
Short form of template:read. The expression in source is evaluated and assigned to the variable var.
You can also set several variables like a:=1,b:=2,c:=3 (Remark: The := is actually part of the expression syntax, so you can use much more complex expressions.)
<template:if test="??"> .. </template:if>
Everything inside this tag is used only if the XPath-expression in test evaluates to true.
<template:else [test="??"]> .. </template:else>
Everything inside this tag is used only if the immediately preceding if/else block was not executed.
You can chain several else blocks that have test attributes after a starting if to create an if-else chain, in which only one if or else block is used.
E.g.: <template:if test="$condition">..</template:if><template:else test="$condition2">..</template:else><template:else>..</template:else>
<template:loop [min="?"] [max="?"]> .. </template:loop>
Everything inside this tag is repeated between [min,max] times. (default min=0, max=infinity)
E.g. if you write <template:loop> X </template:loop> , it has the same effect as XXXXX with the largest possible count of X <= max for a given html file.
If min=0 and there is no possible match for the loop interior, the loop is completely ignored.
If there are more possible matches than max, they are ignored.
<template:switch [value="??"]> ... </template:switch> This command can be used to match only one of several possibilities. It has two different forms:
Case 1: All direct child elements are template commands:
Then the switch statement will choose the first child command whose attribute test evaluates to true.
Additionally, if one of the child elements has a value attribute, the expressions of the switch and the child value attribute are evaluated, and the command is only chosen if both expressions are equal.
An element that has neither a value nor a test attribute is always chosen (if no element before it is chosen).
If no child can be chosen at the current position in the html file, the complete switch statement will be skipped.
Case 2: All direct child elements are normal html tags:
This tag is matched to an html tag, iff one of its direct children can be matched to that html tag.
For example <template:switch><a>..</a> <b>..</b></template:switch> will match either <a>..</a> or <b>..</b>, but not both. If there is an <a> and a <b> tag in the html file, only the first one will be matched (if there is no loop around the switch tag). These switch constructs are mainly used within a loop to collect the values of different tags, or to combine two different templates.
If no child can be matched at the current position in the html file, the matching will be tried again at the next position (different to case 1).
<template:switch-prioritized> ... </template:switch-prioritized> Another version of a case 2 switch statement that only may contain normal html tags.
The switch-prioritized prefers earlier child elements to later child elements, while the normal switch matches all child elements equally. So a normal switch containing <a> and <b> will match <a> or <b>, whichever appears first in the html file. The switch-prioritized, by contrast, would match <a> if there is any <a>, and <b> only if there is no <a> in the html file.
Therefore <template:switch-prioritized [value="??"]> <a>..</a> <b>..</b> .. </template:switch-prioritized> is identical to <a template:optional="true">..<t:s>found:=true()</t:s></a> <b template:optional="true" template:test="not($found)">..<t:s>found:=true()</t:s></b> ... (and this command is somewhat redundant, so it might be removed in later versions).
<template:match-text [regex=".."] [starts-with=".."] [ends-with=".."] [contains=".."] [is=".."] [case-sensitive=".."] [list-contains=".."]/>
Matches a text node and is more versatile than just including the text in the template.
regex matches an arbitrary regular expression against the text node.
starts-with/ends-with/contains/is check the text verbatim against the text node, in the obvious way.
list-contains treats the text of the node as a comma-separated list and tests if that list contains the attribute value.
case-sensitive enables case-sensitive comparisons.
<template:meta [default-text-matching="??"] [default-case-sensitive="??"]/>
Specifies meta information to change the template semantic:
default-text-matching: specifies how text nodes in the template are matched against html text nodes. You can set it to the allowed attributes of match-text. (default is "starts-with")
default-text-case-sensitive: specifies if text nodes are matched case-sensitively.
Each of these commands can also have an attribute test="{xpath condition}", and the tag is ignored if the condition does not evaluate to true (so <template:tag test="{condition}">..</template:tag> is shorthand for <template:if test="{condition}"><template:tag>..</template:tag></template:if>).
There are two special attributes allowed for html or matching tags in the template file:
template:optional="true"
If this is set, the file is read successfully even if the tag doesn't exist.
You should never have an optional element as a direct child of a loop, because the loop has lower priority than the optional element, so the parser will skip loop iterations if it can find a later match for the optional element. But it is fine to use optional tags that have a non-optional parent tag within the loop.
template:condition="xpath"
If this is given, a tag is accepted as matching only if the given xpath-expression returns true (powerful, but slow).
(condition is not the same as test: if test evaluates to false, the template tag is ignored; if condition evaluates to false, the html tag is not accepted as a match)
The default prefixes for template commands are "template:" and "t:". You can change that with the templateNamespaces property or by defining a new namespace in the template like xmlns:yournamespace="http://www.benibela.de/2011/templateparser" (only the xmlns:prefix form is supported, not xmlns without prefix).
Short notation
Commonly used commands can be abbreviated as textual symbols instead of xml tags. To avoid conflicts with text node matching, this short notation is only allowed at the beginning of template text nodes.
The short read tag <t:s>foo:=..</t:s> to read something in variable foo can be abbreviated as {foo:=..}. Similarly {} can be written within attributes to read the attribute, e.g. <a href="{$dest := .}"/>.
Also the trailing := . can be omitted if only one variable assignment occurs, e.g. {$foo} is equivalent to foo := . and $foo := .
Optional and repeated elements can be marked with ?, *, +, {min, max}; like <a>?...</a> or, equivalent, <a>..</a>?.
An element marked with ? becomes optional, which has the same effect as adding the template:optional="true" attribute.
An element marked with * can be repeated any number of times, which has the same effect as surrounding it with a template:loop element.
An element marked with + has to be repeated at least once, which has the same effect as surrounding it with a template:loop element with attribute min=1.
An element marked with {min,max} has to be repeated at least min-times and at most max-times (just like in a t:loop) (remember that additional data/elements are always ignored).
An element marked with {count} has to be repeated exactly count-times (just like in a t:loop) (remember that additional data/elements are always ignored).
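The short notation can be tried out from Pascal with a sketch like the following. The unit name extendedhtmlparser and the input document are assumptions made for illustration, and the value read from variables.values[...] is assumed to be directly printable.

```pascal
program ShortNotationDemo;
{$mode objfpc}{$H+}

uses
  extendedhtmlparser; // assumed unit name, inferred from tests/extendedhtmlparser_tests.pas

const
  // Short notation: {cell:=text()} reads the text node, the trailing * repeats the <tr> element.
  TemplateText = '<table><tr><td>{cell:=text()}</td></tr>*</table>';
  // Hypothetical input document, chosen only for this illustration.
  HtmlText = '<table><tr><td>first</td></tr><tr><td>second</td></tr></table>';

var
  parser: THtmlTemplateParser;
begin
  parser := THtmlTemplateParser.create;
  try
    // Documented default is true; set explicitly here for clarity.
    parser.AllowVeryShortNotation := true;
    parser.parseTemplate(TemplateText);
    parser.parseHTML(HtmlText);
    // Only the last assignment is kept in variables;
    // earlier assignments are recorded in variableChangeLog.
    writeln(parser.variables.values['cell']);
  finally
    parser.free;
  end;
end.
```

Writing the loop as <template:loop><tr>..</tr></template:loop> instead of the trailing star should behave the same way, as described above.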
Breaking changes from previous versions:
As was announced in planned changes, the meaning of {$x} and {6} was changed
As was announced in planned changes, the meaning of <x value="{$x}"/> was changed
Adding the short notation breaks all templates that match text nodes starting with *, +, ? or {
The default template prefix was changed to template: (from htmlparser:). You can add the old prefix to the templateNamespaces property if you want to continue to use it.
All changes mentioned in pseudoxpath.
Also text() doesn't match the next text element anymore, but the next text element of the current node. Use .//text() for the old behaviour
All variable names in the pxp are now case-sensitive in the default mode. You can set variableChangeLog.caseSensitive to change it to the old behaviour (however, variables defined within the expression by for/some/every (but not by := ) remain case-sensitive).
There was always some confusion, if the old variable changelog should be deleted or merged with the new one, if you process several html documents. Therefore the old merging option was removed and replaced by the KeepPreviousVariables property.
Planned breaking changes:
Avoid unmatched parenthesis and pipes within text nodes:
Currently there is no short notation to read alternatives with the template:switch command, like <template:switch><a>..</a><b>..</b><c>..</c></template:switch>.
In future this might be the same as (<a>..</a>|<b>..</b>|<c>..</c>).
constructor create;
destructor destroy; override;
procedure parseTemplate(template: string; templateName: string = '<unknown>');
procedure parseTemplateFile(templatefilename: string);
function parseHTML(html: string; htmlFileName: string = ''; contentType: string = ''):boolean;
function parseHTMLFile(htmlfilename: string):boolean;
function replaceVarsOld(s:string;customReplace: TReplaceFunction=nil):string; deprecated;
function replaceEnclosedExpressions(str:string):string;
function debugMatchings(const width: integer): string;
function parseQuery(const expression: string): IXQuery;
property variables: TXQVariableChangeLog read GetVariables;
property variableChangeLog: TXQVariableChangeLog read FVariableLog;
property oldVariableChangeLog: TXQVariableChangeLog read FOldVariableLog;
property VariableChangeLogCondensed: TXQVariableChangeLog read GetVariableLogCondensed;
property templateNamespaces: TNamespaceList read GetTemplateNamespace;
property ParsingExceptions: boolean read FParsingExceptions write FParsingExceptions;
property OutputEncoding: TEncoding read FOutputEncoding write FOutputEncoding;
property KeepPreviousVariables: TKeepPreviousVariables read FKeepOldVariables write FKeepOldVariables;
property trimTextNodes: TTrimTextNodes read FTrimTextNodes write FTrimTextNodes;
property UnnamedVariableName: string read FUnnamedVariableName write FUnnamedVariableName;
property AllowVeryShortNotation: boolean read FVeryShortNotation write FVeryShortNotation;
property AllowPropertyDotNotation: boolean read FObjects write FObjects;
property SingleQueryModule: boolean read FSingleQueryModule write FSingleQueryModule;
property hasRealVariableDefinitions: boolean read GetTemplateHasRealVariableDefinitions;
property TemplateTree: TTreeNode read getTemplateTree;
property HTMLTree: TTreeNode read getHTMLTree;
property TemplateParser: TTreeParser read FTemplate;
property HTMLParser: TTreeParser read FHTML;
property QueryEngine: TXQueryEngine read FQueryEngine;
constructor create;
destructor destroy; override;
procedure parseTemplate(template: string; templateName: string = '<unknown>');
Loads the given template and stores templateName for debugging purposes.
procedure parseTemplateFile(templatefilename: string);
Loads a template from a file.
function parseHTMLFile(htmlfilename: string):boolean;
Parses the given file by applying a previously loaded template.
function replaceVarsOld(s:string;customReplace: TReplaceFunction=nil):string; deprecated;
Warning: this symbol is deprecated. This replaces every $variable; in s with variables.values['variable'] or the value returned by customReplace (should not be used anymore).
function replaceEnclosedExpressions(str:string):string;
This treats str as an extended string and evaluates the pxquery expression x"str".
function debugMatchings(const width: integer): string;
function parseQuery(const expression: string): IXQuery;
Returns an IXQuery that accesses the variable storage of the template engine. Mostly intended for internal use, but you might find it useful to evaluate external XPath expressions which are not part of the template.
property variables: TXQVariableChangeLog read GetVariables;
List of all variables.
property variableChangeLog: TXQVariableChangeLog read FVariableLog;
All assignments to variables during the matching of the template. You can use TStrings.GetNameValue to get the variable/value in a certain line.
property oldVariableChangeLog: TXQVariableChangeLog read FOldVariableLog;
All assignments to a variable during the matching of previous templates (see TKeepPreviousVariables).
property VariableChangeLogCondensed: TXQVariableChangeLog read GetVariableLogCondensed;
VariableChangeLog with duplicated objects removed (i.e. if you have obj := object(), obj.a := 1, obj.b := 2, obj := object(); the normal change log will contain 4 objects (like {}, {a:1}, {a:1,b:2}, {}), but the condensed log only two: {a:1,b:2}, {}).
property templateNamespaces: TNamespaceList read GetTemplateNamespace;
Global namespaces to set the commands that will be recognized as template commands. Default prefixes are template: and t:
property ParsingExceptions: boolean read FParsingExceptions write FParsingExceptions;
If this is true (default), it will raise an exception if the matching fails.
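A failed match can also be handled without exceptions. The sketch below rests on two assumptions that this documentation does not state explicitly: that the unit is named extendedhtmlparser, and that parseHTML returns false when ParsingExceptions is false and the template cannot be matched.

```pascal
program NoExceptionDemo;
{$mode objfpc}{$H+}

uses
  extendedhtmlparser; // assumed unit name, inferred from tests/extendedhtmlparser_tests.pas

var
  parser: THtmlTemplateParser;
begin
  parser := THtmlTemplateParser.create;
  try
    // Do not raise an exception when the matching fails.
    parser.ParsingExceptions := false;
    parser.parseTemplate('<b>{test:=.}</b>');
    // Assumption: the boolean result reports whether the template matched.
    if parser.parseHTML('<i>no bold tag here</i>') then
      writeln('matched')
    else
      writeln('no match');
  finally
    parser.free;
  end;
end.
```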
property OutputEncoding: TEncoding read FOutputEncoding write FOutputEncoding;
Output encoding, i.e. the encoding of the read variables. Html document and template are automatically converted to it.
property KeepPreviousVariables: TKeepPreviousVariables read FKeepOldVariables write FKeepOldVariables;
Controls if old variables are deleted when processing a new document (see TKeepPreviousVariables).
property trimTextNodes: TTrimTextNodes read FTrimTextNodes write FTrimTextNodes;
How to trim text nodes (default ttnAfterReading). There is also pseudoxpath.XQGlobalTrimNodes, which controls how the values are returned.
property AllowVeryShortNotation: boolean read FVeryShortNotation write FVeryShortNotation;
Enables the very short notation (e.g. {a:=text()}, <a>*) (default: true).
property TemplateTree: TTreeNode read getTemplateTree;
A tree representation of the current template.
property HTMLTree: TTreeNode read getHTMLTree;
A tree representation of the processed html file.
property TemplateParser: TTreeParser read FTemplate;
X/HTML parser used to read the templates (public so you can change the parsing behaviour, if you really need it).
property HTMLParser: TTreeParser read FHTML;
X/HTML parser used to read the pages (public so you can change the parsing behaviour, if you really need it).
property QueryEngine: TXQueryEngine read FQueryEngine;
XQuery engine used for evaluating query expressions contained in the template.