Class THtmlTemplateParser

Unit

Declaration

type THtmlTemplateParser = class(TObject)

Description

This is the template processor class, which can apply a template to one or more HTML documents.

You can use it by calling the methods parseTemplate and parseHTML. parseTemplate loads a template, and parseHTML matches the template against an HTML/XML file.
A template file is just like an HTML file with special commands. The parser then matches every text node and tag of the template to a text node or tag in the HTML file, while ignoring any additional data in the latter. If no match is possible, an exception is raised.
The template can extract values from the HTML file into variables, which you can access through the properties variables and variableChangeLog. The former contains only the final value of each variable; the latter records every assignment made during the matching of the template.

Getting started

Creating a template to analyze an XML file or webpage:

  1. First, remove everything from the webpage that is uninteresting, dynamically generated, or invalid XML (or alternatively, start with an empty file as the template).

  2. Then, replace every part that you want to extract with <t:s>yourVariableName:=text()</t:s>.
    This writes the value of the text node that contains the t:s tag to the variable yourVariableName.

    Instead of the t:s tag, you can also use the short notation {yourVariableName:=text()}. Instead of text(), which reads the text node, you can use @attrib to read an attribute, or an arbitrarily complex XPath/XQuery expression.

  3. The template is then finished, at least for the simple cases.

If you want to read several elements, such as table rows, you need to surround the matching tags with template:loop, e.g. <template:loop><tr>..</tr></template:loop>. Everything between the loop tags is then repeated as often as possible. You can also use the short notation by adding a star, like <tr>..</tr>* .

Using the template from Pascal:

  1. First, create a new THtmlTemplateParser: parser := THtmlTemplateParser.create()

  2. Load the template with parser.parseTemplate('..template..') or parser.parseTemplateFile('template-file')

  3. Process the webpage with parser.parseHTML('..html..') or parser.parseHTMLFile('html-file')

  4. Read the result of variable yourVariableName through parser.variables.values['yourVariableName']

If you used loops, only the last value of the variable is available in the variables property; the previous values can be enumerated through variableChangeLog.
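The four steps above can be sketched as a small Free Pascal program. This is a minimal sketch rather than reference usage: the unit names extendedhtmlparser and xquery, and the toString conversion applied to the value returned by values[], are assumptions that are not stated on this page. The template and HTML are the ones used in the template examples.

```pascal
program ReadFirstBold;
{$mode objfpc}{$H+}

uses
  extendedhtmlparser, // assumed name of the unit declaring THtmlTemplateParser
  xquery;             // assumed name of the unit declaring TXQVariableChangeLog

var
  parser: THtmlTemplateParser;
begin
  // Step 1: create the parser.
  parser := THtmlTemplateParser.create();
  try
    // Step 2: load the template (very short notation).
    parser.parseTemplate('<b>{test:=.}</b>');
    // Step 3: match the template against the document.
    parser.parseHTML('<b>Hello World!</b>');
    // Step 4: read the final value of the variable.
    // Assumption: values[] yields a value object that offers toString.
    writeln(parser.variables.values['test'].toString);
  finally
    parser.free;
  end;
end.
```

Freeing the parser in a finally block keeps the sketch leak-free even when matching raises an exception.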

Template examples

Example, how to read the first <b>-tag:

Template: <b><template:read var="test" source="text()"></b>
Html-File: <b>Hello World!</b>

This will set the variable test to "Hello World!"

Example, how to read the first <b>-tag using the short template notation:

Template: <b><t:s>test:=.</t:s></b>
Html-File: <b>Hello World!</b>

This will also set the variable test to "Hello World!"

Example, how to read the first <b>-tag using the very short template notation:

Template: <b>{test:=.}</b>
Html-File: <b>Hello World!</b>

This will also set the variable test to "Hello World!"

Example, how to read all <b>-tags:

Template: <template:loop><b><t:s>test:=.</t:s></b></template:loop>
Html-File: <b>Hello World!</b>

This will also set the variable test to "Hello World!"

Example, how to read the first field of every row of a table:

Template: <table> <template:loop> <tr> <td> <template:read var="readField()" source="text()"> </td> </tr> </template:loop> </table>
Html-File: <table> <tr> <td> row-cell 1 </td> </tr> <tr> <td> row-cell 2 </td> </tr> ... <tr> <td> row-cell n </td> </tr> </table>

This will read row after row, and will write each first field to the change log of the variable readField() .
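Reading the collected rows back from Pascal might look like the following sketch. The variable name cell is hypothetical, and the members count, getName and get of TXQVariableChangeLog, as well as toString and the unit names, are assumptions about the library that this page does not document; check them against the sources of your version.

```pascal
program ReadFirstColumn;
{$mode objfpc}{$H+}

uses
  extendedhtmlparser, xquery; // assumed unit names

var
  parser: THtmlTemplateParser;
  log: TXQVariableChangeLog;
  i: integer;
begin
  parser := THtmlTemplateParser.create();
  try
    // Same structure as the template above; "cell" is a hypothetical name.
    parser.parseTemplate(
      '<table><template:loop><tr><td>' +
      '<template:read var="cell" source="text()"/>' +
      '</td></tr></template:loop></table>');
    parser.parseHTML(
      '<table><tr><td>row-cell 1</td></tr>' +
      '<tr><td>row-cell 2</td></tr></table>');

    // The change log records every assignment, one entry per matched row.
    // Assumption: count/getName/get enumerate the entries by index.
    log := parser.variableChangeLog;
    for i := 0 to log.count - 1 do
      writeln(log.getName(i), ' = ', log.get(i).toString);
  finally
    parser.free;
  end;
end.
```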

Example, how to read several fields of every row of a table:

Template: <table> <template:loop> <tr> <td> <template:read var="readField1()" source="text()"> </td> <td> <template:read var="readField2()" source="text()"> </td> <td> <template:read var="readField3()" source="text()"> </td> ... </tr> </template:loop> </table>
Html-File: <table> <tr> <td> a </td> <td> b </td> <td> c </td> </tr> ... </table>

This will read readField1()=a, readField2()=b, readField3()=c...
Of course, you can use your own names instead of readFieldX(), and they are independent of the HTML file. Such templates can therefore convert several pages with different structures to the same internal data layout of your application.

Example, how to read all rows of every table in a CSV-like format:

Template: <template:loop> <tr> <template:read var="readAnotherRow()" source="deep-text(',')"> </tr> </template:loop>
Html-File: ... <tr> <td> a </td> <td> b </td> <td> c </td> </tr> <tr> <td> foo </td> <td> bar </td> </tr> ...

This will read all rows, and write lines like a,b,c and foo,bar to the changelog.

Example, how to read the first list item starting with a unary prime number:

Template: <li template:condition="filter(text(), '1*:') != filter(text(), '^1?:|^(11+?)\1+:')"><template:read var="prime" source="text()"/></li>
Html-File: ... <li>1111: this is 4</li><li>1:1 is no prime</li><li>1111111: here is 7</li><li>11111111: 8</li> ...

This will return "1111111: here is 7", because 1111111 is the first prime in that list.

Example, how to extract all elements of an HTML form:

Template:

<form>
  <template:switch>
    <input t:condition="(@type = ('checkbox', 'radio') and exists(@checked)) or (@type = ('hidden', 'password', 'text'))">{post:=concat(@name,'=',@value)}</input>
    <select>{temp:=@name}<option t:condition="exists(@selected)">{post:=concat($temp,'=',@value)}</option>?</select>
    <textarea>{post:=concat(@name,'=',text())}</textarea>
  </template:switch>*
</form>

Html-File: any form

This example extracts, from each relevant element in the form, the name/value pair that is sent to the webserver. It is very general and works with all forms, independent of things like nesting depth. It is therefore a little bit ugly; but if you create a template for a specific page, you usually know which elements you will find there, so the template becomes much simpler in practice.

See the unit tests in tests/extendedhtmlparser_tests.pas for more examples.

Template reference

Basically, the template file is an HTML file, and the parser tries to match the structure of the template to the structure of the HTML file.
A tag of the HTML file is considered equal to a tag of the template file if the tag names are equal, all attributes are the same (regardless of their order), and every child node of the tag in the template is also equal to a child node of the tag in the HTML file (in the same order and nesting).
Text nodes are considered equal if the text in the HTML file starts with the whitespace-trimmed text of the template file. All comparisons are performed case-insensitively.
The matching occurs with backtracking, so it will always find the first and longest match.

The following template commands can be used:


Each of these commands can also have an attribute test="{xpath condition}", and the tag is ignored if the condition does not evaluate to true (so <template:tag test="{condition}">..</template:tag> is shorthand for <template:if test="{condition}"><template:tag>..</template:tag></template:if>).

There are two special attributes allowed for html or matching tags in the template file:

The default prefixes for template commands are "template:" and "t:". You can change that with the templateNamespaces property or by defining a new namespace in the template, like xmlns:yournamespace="http://www.benibela.de/2011/templateparser" . (Only the xmlns:prefix form is supported, not xmlns without a prefix.)

Short notation

Commonly used commands can be abbreviated as textual symbols instead of xml tags. To avoid conflicts with text node matching, this short notation is only allowed at the beginning of template text nodes.

The short read tag <t:s>foo:=..</t:s>, which reads something into the variable foo, can be abbreviated as {foo:=..}. Similarly, {} can be written within attributes to read the attribute, e.g. <a href="{$dest := .}"/>.
Also, the trailing := . can be omitted if only one variable assignment occurs; e.g. {$foo} is equivalent to foo := . and to $foo := .

Optional and repeated elements can be marked with ?, *, +, {min, max}; like <a>?...</a> or, equivalent, <a>..</a>?.
An element marked with ? becomes optional, which has the same effect as adding the template:optional="true" attribute.
An element marked with * can be repeated any number of times, which has the same effect as surrounding it with a template:loop element.
An element marked with + has to be repeated at least once, which has the same effect as surrounding it with a template:loop element with attribute min=1.
An element marked with {min,max} has to be repeated at least min-times and at most max-times (just like in a t:loop) (remember that additional data/elements are always ignored).
An element marked with {count} has to be repeated exactly count-times (just like in a t:loop) (remember that additional data/elements are always ignored).
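To illustrate the equivalence between the long and the short notation, the following sketch applies a template:loop template and its starred short form to the same document. The helper applyTemplate and the constant names are hypothetical; the unit names and the toString conversion are assumptions not confirmed by this page. Since the variables property only keeps the last assignment, both calls should report the same final value.

```pascal
program CompareNotations;
{$mode objfpc}{$H+}

uses
  extendedhtmlparser, xquery; // assumed unit names

const
  SampleHtml = '<div><b>one</b><b>two</b></div>';
  // Long notation: explicit loop and t:s read command.
  LongForm   = '<template:loop><b><t:s>test:=.</t:s></b></template:loop>';
  // Short notation: {..} read command and * repetition marker.
  ShortForm  = '<b>{test:=.}</b>*';

// Hypothetical helper: applies one template and prints the final value of test.
procedure applyTemplate(const templateText: string);
var
  parser: THtmlTemplateParser;
begin
  parser := THtmlTemplateParser.create();
  try
    parser.parseTemplate(templateText);
    parser.parseHTML(SampleHtml);
    // Assumption: values[] yields a value object that offers toString.
    writeln(templateText, ' -> ', parser.variables.values['test'].toString);
  finally
    parser.free;
  end;
end;

begin
  applyTemplate(LongForm);
  applyTemplate(ShortForm);
end.
```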

Breaking changes from previous versions:

Planned breaking changes:

Hierarchy

Overview

Methods

Public constructor create;
Public destructor destroy; override;
Public procedure parseTemplate(template: string; templateName: string = '<unknown>');
Public procedure parseTemplateFile(templatefilename: string);
Public function parseHTML(html: string; htmlFileName: string = ''; contentType: string = ''):boolean;
Public function parseHTMLFile(htmlfilename: string):boolean;
Public function replaceVarsOld(s:string;customReplace: TReplaceFunction=nil):string; deprecated;
Public function replaceEnclosedExpressions(str:string):string;
Public function debugMatchings(const width: integer): string;
Public function parseQuery(const expression: string): IXQuery;

Properties

Public property variables: TXQVariableChangeLog read GetVariables;
Public property variableChangeLog: TXQVariableChangeLog read FVariableLog;
Public property oldVariableChangeLog: TXQVariableChangeLog read FOldVariableLog;
Public property VariableChangeLogCondensed: TXQVariableChangeLog read GetVariableLogCondensed;
Public property templateNamespaces: TNamespaceList read GetTemplateNamespace;
Public property ParsingExceptions: boolean read FParsingExceptions write FParsingExceptions;
Public property OutputEncoding: TEncoding read FOutputEncoding write FOutputEncoding;
Public property KeepPreviousVariables: TKeepPreviousVariables read FKeepOldVariables write FKeepOldVariables;
Public property trimTextNodes: TTrimTextNodes read FTrimTextNodes write FTrimTextNodes;
Public property UnnamedVariableName: string read FUnnamedVariableName write FUnnamedVariableName;
Public property AllowVeryShortNotation: boolean read FVeryShortNotation write FVeryShortNotation;
Public property AllowPropertyDotNotation: boolean read FObjects write FObjects;
Public property SingleQueryModule: boolean read FSingleQueryModule write FSingleQueryModule;
Public property hasRealVariableDefinitions: boolean read GetTemplateHasRealVariableDefinitions;
Public property TemplateTree: TTreeNode read getTemplateTree;
Public property HTMLTree: TTreeNode read getHTMLTree;
Public property TemplateParser: TTreeParser read FTemplate;
Public property HTMLParser: TTreeParser read FHTML;
Public property QueryEngine: TXQueryEngine read FQueryEngine;

Description

Methods

Public constructor create;
 
Public destructor destroy; override;
 
Public procedure parseTemplate(template: string; templateName: string = '<unknown>');

Loads the given template. templateName is stored only for debugging purposes.

Public procedure parseTemplateFile(templatefilename: string);

Loads a template from a file.

Public function parseHTML(html: string; htmlFileName: string = ''; contentType: string = ''):boolean;

Parses the given data by applying a previously loaded template. htmlFileName is used only for debugging purposes.

Public function parseHTMLFile(htmlfilename: string):boolean;

Parses the given file by applying a previously loaded template.

Public function replaceVarsOld(s:string;customReplace: TReplaceFunction=nil):string; deprecated;

Warning: this symbol is deprecated.

This replaces every $variable; in s with variables.values['variable'] or the value returned by customReplace (should not be used anymore)

Public function replaceEnclosedExpressions(str:string):string;

This treats str as an extended string and evaluates the pxquery expression x"str".

Public function debugMatchings(const width: integer): string;
 
Public function parseQuery(const expression: string): IXQuery;

Returns an IXQuery that accesses the variable storage of the template engine. It is mostly intended for internal use, but you might find it useful for evaluating external XPath expressions that are not part of the template.

Properties

Public property variables: TXQVariableChangeLog read GetVariables;

List of all variables (variableChangeLog is usually faster)

Public property variableChangeLog: TXQVariableChangeLog read FVariableLog;

All assignments to variables during the matching of the template. You can use TStrings.GetNameValue to get the variable/value in a certain line.

Public property oldVariableChangeLog: TXQVariableChangeLog read FOldVariableLog;

All assignments to a variable during the matching of previous templates. (see TKeepPreviousVariables)

Public property VariableChangeLogCondensed: TXQVariableChangeLog read GetVariableLogCondensed;

VariableChangeLog with duplicated objects removed (i.e. if you have obj := object(), obj.a := 1, obj.b := 2, obj := object(); the normal change log will contain 4 objects (like {}, {a:1}, {a:1,b:2}, {}), but the condensed log only two {a:1,b:2}, {})

Public property templateNamespaces: TNamespaceList read GetTemplateNamespace;

Global namespaces to set the commands that will be recognized as template commands. Default prefixes are template: and t:
Namespaces can also be defined in a template with the xmlns: notation and the namespace url 'http://www.benibela.de/2011/templateparser'

Public property ParsingExceptions: boolean read FParsingExceptions write FParsingExceptions;

If this is true (default) it will raise an exception if the matching fails.
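The two failure-handling styles can be sketched as follows. It is an assumption, inferred from the boolean return type of parseHTML and not stated on this page, that parseHTML returns false on a failed match once ParsingExceptions is disabled; the unit name extendedhtmlparser is likewise assumed, and the generic Exception class is caught because the specific exception class is not documented here.

```pascal
program HandleMismatch;
{$mode objfpc}{$H+}

uses
  SysUtils,           // provides the Exception base class
  extendedhtmlparser; // assumed name of the unit declaring THtmlTemplateParser

var
  parser: THtmlTemplateParser;
begin
  parser := THtmlTemplateParser.create();
  try
    parser.parseTemplate('<b>{test:=.}</b>');

    // Default behaviour: a failed match raises an exception.
    try
      parser.parseHTML('<i>no bold tag here</i>');
    except
      on e: Exception do
        writeln('matching failed: ', e.Message);
    end;

    // With exceptions disabled, rely on the boolean result instead.
    // Assumption: parseHTML returns false when the template does not match.
    parser.ParsingExceptions := false;
    if not parser.parseHTML('<i>no bold tag here</i>') then
      writeln('matching failed, no exception raised');
  finally
    parser.free;
  end;
end.
```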

Public property OutputEncoding: TEncoding read FOutputEncoding write FOutputEncoding;

Output encoding, i.e. the encoding of the read variables. Html document and template are automatically converted to it

Public property KeepPreviousVariables: TKeepPreviousVariables read FKeepOldVariables write FKeepOldVariables;

Controls if old variables are deleted when processing a new document (see TKeepPreviousVariables)

Public property trimTextNodes: TTrimTextNodes read FTrimTextNodes write FTrimTextNodes;

How to trim text nodes (default ttnAfterReading). There is also pseudoxpath.XQGlobalTrimNodes, which controls how the values are returned.

Public property UnnamedVariableName: string read FUnnamedVariableName write FUnnamedVariableName;

Default variable name. If something is read from the document but not assigned to a variable, it is assigned to this one. (Default: _result)

Public property AllowVeryShortNotation: boolean read FVeryShortNotation write FVeryShortNotation;

Enables the very short notation (e.g. {a:=text()}, <a>*) (default: true)

Public property AllowPropertyDotNotation: boolean read FObjects write FObjects;

Whether object properties can be accessed with $object.propertyname (e.g. object(("a", 1, "b", 2)).a would become 1). When objects are enabled, variable names cannot contain dots. (default true)

Public property SingleQueryModule: boolean read FSingleQueryModule write FSingleQueryModule;

Whether all XPath/XQuery expressions in the templates are kept in the same module. Only if true are declared XQuery variables/functions accessible in other read commands. (Declarations must be preceded by xquery version "1.0"; and followed by an expression, even if it is only ().) Global variables, declared with a simple $x := value, are always accessible everywhere. (default true)

Public property hasRealVariableDefinitions: boolean read GetTemplateHasRealVariableDefinitions;

Whether the currently loaded template contains := variable definitions (as opposed to assigning values to the default variable with {.} ) (CAN ONLY BE USED AFTER the template has been applied!)

Public property TemplateTree: TTreeNode read getTemplateTree;

A tree representation of the current template

Public property HTMLTree: TTreeNode read getHTMLTree;

A tree representation of the processed html file

Public property TemplateParser: TTreeParser read FTemplate;

X/HTML parser used to read the templates (public so you can change the parsing behaviour, if you really need it)

Public property HTMLParser: TTreeParser read FHTML;

X/HTML parser used to read the pages (public so you can change the parsing behaviour, if you really need it)

Public property QueryEngine: TXQueryEngine read FQueryEngine;

XQuery engine used for evaluating query expressions contained in the template


Generated by PasDoc 0.11.0 on 2013-07-13 02:13:21