Class THtmlTemplateParser

Unit

Declaration

type THtmlTemplateParser = class(TObject)

Description

This is the pattern matching processor class which can apply a pattern to one or more HTML documents.

You can use it by calling the methods parseTemplate and parseHTML. parseTemplate loads a certain pattern and parseHTML matches the pattern to an HTML/XML file.
A pattern file is just like an HTML file with special commands (it used to be called a template file). The parser then matches every text node and tag of the pattern to a text node or tag in the HTML file, ignoring any additional data in the latter. If no match is possible, an exception is raised.
The pattern can extract certain values from the HTML file into variables, which you can access through the properties variables and variableChangeLog. The former contains only the final value of each variable; the latter records every assignment made during the matching of the pattern.

Getting started

Creating a template to analyze an XML-file/webpage:

If you want to read several elements, such as table rows, surround the matching tags with template:loop, e.g. <template:loop><tr>..</tr></template:loop>; everything between the loop tags is repeated as often as possible. You can also use the short notation by appending a star, as in <tr>..</tr>* .

Using the templates from Pascal:

  1. First, create a new THtmlTemplateParser: parser := THtmlTemplateParser.create()

  2. Load the template with parser.parseTemplate('..template..') or parser.parseTemplateFile('template-file')

  3. Process the webpage with parser.parseHTML('..html..') or parser.parseHTMLFile('html-file')

  4. Read the result of variable yourVariableName through parser.variables.values['yourVariableName']

If you used loops, only the last value of each variable is available in the variables property; the previous values can be enumerated through variableChangeLog.
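The four steps above can be sketched as a small Free Pascal program. This is a minimal sketch, not the library's official example: the unit name extendedhtmlparser is an assumption (the Unit field of this page is empty), and it assumes that values[...] returns a value with a toString method.

```pascal
program TemplateDemo;
{$mode objfpc}{$H+}

uses
  extendedhtmlparser; // assumed unit name, not stated on this page

var
  parser: THtmlTemplateParser;
begin
  // Step 1: create the parser
  parser := THtmlTemplateParser.create;
  try
    // Step 2: load the template (taken from the examples on this page)
    parser.parseTemplate('<b>{$test}</b>');
    // Step 3: match the template against the document
    parser.parseHTML('<b>Hello World!</b>');
    // Step 4: read the variable; toString is assumed to exist on the result
    writeln(parser.variables.values['test'].toString);
  finally
    parser.free;
  end;
end.
```

The try/finally block ensures the parser is freed even if matching raises an exception, which is the default behaviour when no match is possible.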

Template examples

Example, how to read first <b>-tag:

Html-File: <b>Hello World!</b>
Template: <b>{.}</b>

This will set the default variable _result to "Hello World!"

Example, how to read the first <b>-tag into an explicitly named variable:

Html-File: <b>Hello World!</b>
Template: <b>{$test}</b>

This will set the variable test to "Hello World!".
Some alternative forms are <b>{$test := .}</b>, <b><t:s>test := .</t:s></b>, <b><template:s>test := text()</template:s></b> or <b><t:read var="test" source="text()"></b>.

Example, how to read all <b>-tags:

Html-File: <b>Hello </b><b>World!</b>
Template: <b>{.}</b>*

This will change the value of the variable _result twice, to "Hello " and "World!". Both values are available in the variable changelog.
Some alternative forms are: <t:loop><b>{.}</b></t:loop>, <template:loop><b>{.}</b></template:loop>, <template:loop><b>{_result := text()}</b></template:loop>, ...
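Reading both values back from Pascal might look like the following sketch. The members count, getName and get on the change log, the toString method on the returned values, and the unit name extendedhtmlparser are assumptions that this page does not confirm; check them against the TXQVariableChangeLog documentation of your version.

```pascal
program ChangeLogDemo;
{$mode objfpc}{$H+}

uses
  extendedhtmlparser; // assumed unit name, not stated on this page

var
  parser: THtmlTemplateParser;
  i: integer;
begin
  parser := THtmlTemplateParser.create;
  try
    parser.parseTemplate('<b>{.}</b>*');
    parser.parseHTML('<b>Hello </b><b>World!</b>');
    // Walk over every recorded assignment, not just the final values.
    // count, getName and get are assumed members of TXQVariableChangeLog.
    for i := 0 to parser.variableChangeLog.count - 1 do
      writeln(parser.variableChangeLog.getName(i), ' = ',
              parser.variableChangeLog.get(i).toString);
  finally
    parser.free;
  end;
end.
```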

Example, how to read the first field of every row of a table:

HTML-File: <table> <tr> <td> row-cell 1 </td> </tr> <tr> <td> row-cell 2 </td> </tr> ... <tr> <td> row-cell n </td> </tr> </table>
Template: <table> <template:loop> <tr> <td> {$field} </td> </tr> </template:loop> </table>

This will read row after row, and will write each first field to the change log of the variable field.

Example, how to read several fields of every row of a table:

HTML-File: <table> <tr> <td> a </td> <td> b </td> <td> c </td> </tr> ... </tr> </table>
Template: <table> <template:loop> <tr> <td> {$field1} </td> <td> {$field2} </td> <td> {$field3} </td> ... </tr> </template:loop> </table>

This will read $field1=a, $field2=b, $field3=c...
If you now want to process multiple pages which have a similar, but slightly different, table/data layout, you can create a template for each of them, and the Pascal side of the application stays independent of the source pages. It is then even possible for the user of the application to add new pages.

Example, how to read all elements between two elements:

HTML-File:

<h1>Start</h1>
  <b>Text 1</b>
  <b>Text 2</b>
<h1>End</h1>


Template:

<h1>Start</h1>
  <b>{.}</b>*
<h1>End</h1>


This will read all b elements between the two headers.

Example, how to read the first list item starting with a unary prime number:

HTML-File: ... <li>1111: this is 4</li><li>1:1 is no prime</li><li>1111111: here is 7</li><li>11111111: 8</li> ...
Template: <li template:condition="filter(text(), '1*:') != filter(text(), '^1?:|^(11+?)\1+:')">{$prime}</li>

This will return "1111111: here is 7", because 1111111 is the first prime in that list.

See the unit tests in tests/extendedhtmlparser_tests.pas for more examples.

Why not XPath/CSS-Selectors?

You might wonder, why you should use templates, if you already know XPath or CSS Selectors.

The answer is that, although XPath/CSS works fine for single values, it is not powerful enough to read multiple values or data from multiple sources, because:

That said, it is obviously also possible to use XPath or CSS with the templates:

<html>{//your/xpath/expression}</html> or <html>{css("your.css#expression")}</html>

In fact there exists no other modern XPath/CSS interpreter for FreePascal.

Template reference

Basically the template file is an HTML file, and the parser tries to match the structure of the template html file to the html file.
A tag of the html file is considered as equal to a tag of the template file, if the tag names are equal, all attributes are the same (regardless of their order) and every child node of the tag in the template is also equal to a child node of the tag in the html file (in the same order and nesting).
Text nodes are considered equal if the text in the html file starts with the whitespace-trimmed text of the template file. All comparisons are case-insensitive.
The matching occurs with backtracking, so it will always find the first and longest match.

The following template commands can be used:


These template attributes can be used on any template element:


On HTML/matching tags also these matching modifying attributes can be used:

The default prefixes for template commands are "template:" and "t:". You can change them with the templateNamespaces property or by defining a new namespace in the template, like xmlns:yournamespace="http://www.benibela.de/2011/templateparser" (only the xmlns:prefix form is supported, not xmlns without a prefix).

Short notation

Commonly used commands can be abbreviated as textual symbols instead of xml tags. To avoid conflicts with text node matching, this short notation is only allowed at the beginning of template text nodes.

The short read tag <t:s>foo:=..</t:s> to read something in variable foo can be abbreviated as {foo:=..}. Similarly {} can be written within attributes to read the attribute, e.g. <a href="{$dest := .}"/>.
The trailing := . can also be omitted if only one variable assignment occurs, e.g. {$foo} is equivalent to {foo := .} and {$foo := .}.

Optional and repeated elements can be marked with ?, *, +, {min, max}; like <a>?...</a> or, equivalent, <a>..</a>?.
An element marked with ? becomes optional, which has the same effect as adding the template:optional="true" attribute.
An element marked with * can be repeated any number of times, which has the same effect as surrounding it with a template:loop element.
An element marked with + has to be repeated at least once, which has the same effect as surrounding it with a template:loop element with attribute min=1.
An element marked with {min,max} has to be repeated at least min-times and at most max-times (just like in a t:loop) (remember that additional data/elements are always ignored).
An element marked with {count} has to be repeated exactly count-times (just like in a t:loop) (remember that additional data/elements are always ignored).

Breaking changes from previous versions:

Planned breaking changes:

Hierarchy

Overview

Nested Types

Published TDebugMatchingPrintNode = function (node: TTreeNode): string of object;

Methods

Public procedure parseHTMLSimple(html, uri, contenttype: string);
Public function matchLastTrees: Boolean;
Public constructor create;
Public destructor destroy; override;
Public procedure parseTemplate(template: string; templateName: string = '<unknown>');
Public procedure parseTemplateFile(templatefilename: string);
Public function parseHTML(html: string; htmlFileName: string = ''; contentType: string = ''):boolean;
Public function parseHTMLFile(htmlfilename: string):boolean;
Public function replaceVarsOld(s:string;customReplace: TReplaceFunction=nil):string; deprecated;
Public function replaceEnclosedExpressions(str:string):string;
Public function debugMatchings(const width: integer; htmlToString: TDebugMatchingPrintNode = nil ): string;
Public function parseQuery(const expression: string): IXQuery;

Properties

Public property variables: TXQVariableChangeLog read GetVariables;
Public property variableChangeLog: TXQVariableChangeLog read FVariableLog;
Public property oldVariableChangeLog: TXQVariableChangeLog read FOldVariableLog;
Public property VariableChangeLogCondensed: TXQVariableChangeLog read GetVariableLogCondensed;
Public property templateNamespaces: TNamespaceList read GetTemplateNamespace;
Public property ParsingExceptions: boolean read FParsingExceptions write FParsingExceptions;
Public property OutputEncoding: TSystemCodePage read FOutputEncoding write FOutputEncoding;
Public property KeepPreviousVariables: TKeepPreviousVariables read FKeepOldVariables write FKeepOldVariables;
Public property trimTextNodes: TTrimTextNodes read FTrimTextNodes write FTrimTextNodes;
Public property UnnamedVariableName: string read FUnnamedVariableName write FUnnamedVariableName;
Public property AllowVeryShortNotation: boolean read FVeryShortNotation write FVeryShortNotation;
Public property SingleQueryModule: boolean read FSingleQueryModule write FSingleQueryModule;
Public property hasRealVariableDefinitions: boolean read GetTemplateHasRealVariableDefinitions;
Public property TemplateTree: TTreeNode read getTemplateTree;
Public property HTMLTree: TTreeNode read getHTMLTree;
Public property TemplateParser: TTreeParser read FTemplate;
Public property HTMLParser: TTreeParser read FHTML;
Public property QueryEngine: TXQueryEngine read FQueryEngine;
Public property QueryContext: TXQEvaluationContext read FQueryContext write FQueryContext;

Description

Nested Types

Published TDebugMatchingPrintNode = function (node: TTreeNode): string of object;
 

Methods

Public procedure parseHTMLSimple(html, uri, contenttype: string);

Parses an HTML file without performing matching. For internal use.

Public function matchLastTrees: Boolean;
 
Public constructor create;
 
Public destructor destroy; override;
 
Public procedure parseTemplate(template: string; templateName: string = '<unknown>');

Loads the given template; templateName is stored for debugging purposes.

Public procedure parseTemplateFile(templatefilename: string);

Loads a template from a file.

Public function parseHTML(html: string; htmlFileName: string = ''; contentType: string = ''):boolean;

Parses the given data by applying a previously loaded template. htmlFileName is used only for debugging purposes.

Public function parseHTMLFile(htmlfilename: string):boolean;

Parses the given file by applying a previously loaded template.

Public function replaceVarsOld(s:string;customReplace: TReplaceFunction=nil):string; deprecated;

Warning: this symbol is deprecated.

This replaces every $variable; in s with variables.values['variable'] or the value returned by customReplace (should not be used anymore)

Public function replaceEnclosedExpressions(str:string):string;

This treats str as extended string and evaluates the pxquery expression x"str"

Public function debugMatchings(const width: integer; htmlToString: TDebugMatchingPrintNode = nil ): string;
 
Public function parseQuery(const expression: string): IXQuery;

Returns an IXQuery that accesses the variable storage of the template engine. Mostly intended for internal use, but you might find it useful for evaluating external XPath expressions which are not part of the template.

Properties

Public property variables: TXQVariableChangeLog read GetVariables;

List of all variables (variableChangeLog is usually faster)

Public property variableChangeLog: TXQVariableChangeLog read FVariableLog;

All assignments to variables during the matching of the template. You can use TStrings.GetNameValue to get the variable/value in a certain line.

Public property oldVariableChangeLog: TXQVariableChangeLog read FOldVariableLog;

All assignments to a variable during the matching of previous templates. (see TKeepPreviousVariables)

Public property VariableChangeLogCondensed: TXQVariableChangeLog read GetVariableLogCondensed;

VariableChangeLog with duplicated objects removed (i.e. if you have obj := object(), obj.a := 1, obj.b := 2, obj := object(); the normal change log will contain 4 objects (like {}, {a:1}, {a:1,b:2}, {}), but the condensed log only two {a:1,b:2}, {})

Public property templateNamespaces: TNamespaceList read GetTemplateNamespace;

Global namespaces to set the commands that will be recognized as template commands. Default prefixes are template: and t:
Namespaces can also be defined in a template with the xmlns: notation and the namespace url 'http://www.benibela.de/2011/templateparser'

Public property ParsingExceptions: boolean read FParsingExceptions write FParsingExceptions;

If this is true (default) it will raise an exception if the matching fails.
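If you prefer a return value over an exception, the property can be switched off before matching. The sketch below assumes that the boolean result of parseHTML then reports whether the template matched (this page does not state what the result means), and that the unit is named extendedhtmlparser.

```pascal
program NoExceptionDemo;
{$mode objfpc}{$H+}

uses
  extendedhtmlparser; // assumed unit name, not stated on this page

var
  parser: THtmlTemplateParser;
begin
  parser := THtmlTemplateParser.create;
  try
    // Do not raise an exception when the template cannot be matched
    parser.ParsingExceptions := false;
    parser.parseTemplate('<b>{$test}</b>');
    // Assumption: the result is true on a successful match, false otherwise
    if parser.parseHTML('<i>no bold tag here</i>') then
      writeln('template matched')
    else
      writeln('template did not match');
  finally
    parser.free;
  end;
end.
```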

Public property OutputEncoding: TSystemCodePage read FOutputEncoding write FOutputEncoding;

Output encoding, i.e. the encoding of the read variables. Html document and template are automatically converted to it

Public property KeepPreviousVariables: TKeepPreviousVariables read FKeepOldVariables write FKeepOldVariables;

Controls if old variables are deleted when processing a new document (see TKeepPreviousVariables)

Public property trimTextNodes: TTrimTextNodes read FTrimTextNodes write FTrimTextNodes;

How to trim text nodes (default ttnAfterReading). There is also pseudoxpath.XQGlobalTrimNodes, which controls how the values are returned.

Public property UnnamedVariableName: string read FUnnamedVariableName write FUnnamedVariableName;

Default variable name. If something is read from the document but not assigned to a variable, it is assigned to this one. (Default: _result)

Public property AllowVeryShortNotation: boolean read FVeryShortNotation write FVeryShortNotation;

Enables the very short notation (e.g. {a:=text()}, <a>*) (default: true)

Public property SingleQueryModule: boolean read FSingleQueryModule write FSingleQueryModule;

Whether all XPath/XQuery expressions in the templates are kept in the same module. Only if true are declared XQuery variables/functions accessible in other read commands. (Declarations must be preceded by xquery version "1.0"; and followed by an expression, even if only ().) Global variables, declared with a simple $x := value, are always accessible everywhere. (default true)

Public property hasRealVariableDefinitions: boolean read GetTemplateHasRealVariableDefinitions;

Whether the currently loaded template contains := variable definitions (as opposed to assigning values to the default variable with {.} ) (CAN ONLY BE USED AFTER the template has been applied!)

Public property TemplateTree: TTreeNode read getTemplateTree;

A tree representation of the current template

Public property HTMLTree: TTreeNode read getHTMLTree;

A tree representation of the processed html file

Public property TemplateParser: TTreeParser read FTemplate;

X/HTML parser used to read the templates (public so you can change the parsing behaviour, if you really need it)

Public property HTMLParser: TTreeParser read FHTML;

X/HTML parser used to read the pages (public so you can change the parsing behaviour, if you really need it)

Public property QueryEngine: TXQueryEngine read FQueryEngine;

XQuery engine used for evaluating query expressions contained in the template

Public property QueryContext: TXQEvaluationContext read FQueryContext write FQueryContext;

Context used to evaluate XQuery expressions. For internal use.


Generated by PasDoc 0.16.0.