From 35a201cc8ef0c3f5b2df88d2e528aabee1048348 Mon Sep 17 00:00:00 2001 From: Wojtek Kosior Date: Fri, 30 Apr 2021 18:47:09 +0200 Subject: Initial/Final commit --- libxml2-2.9.10/doc/xmlreader.html | 475 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 475 insertions(+) create mode 100644 libxml2-2.9.10/doc/xmlreader.html (limited to 'libxml2-2.9.10/doc/xmlreader.html') diff --git a/libxml2-2.9.10/doc/xmlreader.html b/libxml2-2.9.10/doc/xmlreader.html new file mode 100644 index 0000000..354e5bd --- /dev/null +++ b/libxml2-2.9.10/doc/xmlreader.html @@ -0,0 +1,475 @@ + + + + + + + Libxml2 XmlTextReader Interface tutorial + + + +

Libxml2 XmlTextReader Interface tutorial


This document describes the use of the XmlTextReader streaming API added to libxml2 in version 2.5.0. This API is closely modeled after the XmlTextReader and XmlReader classes of the C# language.


This tutorial presents the key points of this API, with working examples using both C and the Python bindings:


Table of contents:

Introduction: why a new API
Walking a simple tree
Extracting information for the current node
Extracting information for the attributes
Validating a document
Entities substitution
Relax-NG Validation
Mixing the reader and tree or XPath operations


Introduction: why a new API


The main libxml2 API is tree based: the parsing operation results in a document loaded completely in memory and exposed as a tree of nodes that are all available at the same time. This is very simple and quite powerful, but has the major limitation that the size of the document that can be handled is limited by the size of the available memory. Libxml2 also provides a SAX based API, but that version was designed upon one of the early expat versions of SAX, and SAX is not formally defined for C. SAX basically works by registering callbacks which are called directly by the parser as it progresses through the document stream. The problem is that this programming model is relatively complex, is not well standardized, cannot provide validation directly, and makes entity, namespace and base processing relatively hard.


The XmlTextReader API from C# provides a far simpler programming model. The API acts as a cursor going forward on the document stream and stopping at each node on the way. The user's code keeps control of the progress and simply calls a Read() function repeatedly to progress to each node in sequence in document order. There is direct support for namespaces, xml:base and entity handling, and adding DTD validation on top of it was relatively simple. This API is really close to the DOM Core specification. This provides a far more standard, easy to use and powerful API than the existing SAX. Moreover, integrating extension features based on the tree seems relatively easy.


In a nutshell, the XmlTextReader API provides a simpler, more standard and more extensible interface to handle large documents than the existing SAX version.


Walking a simple tree


Basically the XmlTextReader API is a forward-only tree walking interface. The basic steps are:

prepare a reader context operating on some input
run a loop iterating over all nodes in the document
free up the reader context


Here is a basic C sample doing this:

#include <stdio.h>
#include <libxml/xmlreader.h>

void processNode(xmlTextReaderPtr reader) {
    /* handling of a node in the tree */
}

/* returns 0 on success and a negative value on failure */
int streamFile(const char *filename) {
    xmlTextReaderPtr reader;
    int ret;

    reader = xmlNewTextReaderFilename(filename);
    if (reader != NULL) {
        ret = xmlTextReaderRead(reader);
        while (ret == 1) {
            processNode(reader);
            ret = xmlTextReaderRead(reader);
        }
        xmlFreeTextReader(reader);
        if (ret != 0) {
            printf("%s : failed to parse\n", filename);
        }
    } else {
        printf("Unable to open %s\n", filename);
        ret = -1;
    }
    return ret;
}


A few things to notice:

the include file needed: libxml/xmlreader.h
the creation of the reader using a filename
the repeated call to xmlTextReaderRead() and how any return value different from 1 should stop the loop
that a negative return means a parsing error
how xmlFreeTextReader() should be used to free up the resources used by the reader.


Here is similar code in Python for exactly the same processing:

import libxml2

def processNode(reader):
    pass

def streamFile(filename):
    try:
        reader = libxml2.newTextReaderFilename(filename)
    except Exception:
        print("unable to open %s" % (filename))
        return

    ret = reader.Read()
    while ret == 1:
        processNode(reader)
        ret = reader.Read()

    if ret != 0:
        print("%s : failed to parse" % (filename))


The only things worth adding are that the xmlTextReader is abstracted as a class, as in C#, with the same method names (but the properties are currently accessed with methods), and that one doesn't need to free the reader at the end of the processing. It will get garbage collected once all references have disappeared.
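
The read loop can also be written once and reused. The following is only a sketch of one way to do that, not something provided by the libxml2 bindings: iter_nodes and ReaderError are hypothetical names, and the only assumption made about the reader is that its Read() method returns 1, 0 or a negative value as described above. It is written for Python 3.

```python
class ReaderError(Exception):
    """Hypothetical exception raised when Read() reports a parsing error."""


def iter_nodes(reader):
    """Yield the reader once per node, in document order.

    Works with any object whose Read() method returns 1 while a node is
    available, 0 at the end of the document and a negative value on error.
    """
    ret = reader.Read()
    while ret == 1:
        yield reader          # always the same object, repositioned on the next node
        ret = reader.Read()
    if ret != 0:
        raise ReaderError("failed to parse (Read() returned %d)" % ret)
```

With such a helper, the loop in streamFile() could be reduced to "for node in iter_nodes(reader): processNode(node)".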


Extracting information for the current node


So far the example code did not indicate how information was extracted from the reader. It was abstracted as a call to the processNode() routine, with the reader as the argument. At each invocation, the parser is stopped on a given node and the reader can be used to query that node's properties. Each Property is available at the C level as a function named xmlTextReaderProperty that takes a single xmlTextReaderPtr argument. If the return type is an xmlChar * string, then it must be deallocated with xmlFree() to avoid leaks. For the Python interface, there is a Property method on the reader class that can be called on the instance. The list of the properties is based on the C# XmlTextReader class set of properties and methods:

NodeType: the node type, 1 for start element, 15 for end of element, 2 for attributes, 3 for text nodes, 4 for CData sections, 5 for entity references, 6 for entity declarations, 7 for PIs, 8 for comments, 9 for the document nodes, 10 for DTD/Doctype nodes, 11 for document fragment and 12 for notation nodes.
Name: the qualified name of the node, equal to (Prefix:)LocalName.
LocalName: the local name of the node.
Prefix: a shorthand reference to the namespace associated with the node.
NamespaceUri: the URI defining the namespace associated with the node.
BaseUri: the base URI of the node. See the XML Base W3C specification.
Depth: the depth of the node in the tree, starts at 0 for the root node.
HasAttributes: whether the node has attributes.
HasValue: whether the node can have a text value.
Value: provides the text value of the node if present.
IsDefault: whether an Attribute node was generated from the default value defined in the DTD or schema (not yet supported).
XmlLang: the xml:lang scope within which the node resides.
IsEmptyElement: checks whether the current node is empty. This is a bit bizarre, in the sense that <a/> will be considered empty while <a></a> will not.
AttributeCount: provides the number of attributes of the current node.
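
When debugging, the numeric NodeType values are easier to read once mapped to labels. The sketch below covers only the values listed above; NODE_TYPE_NAMES and node_type_name are hypothetical names, and the labels are this sketch's own wording rather than identifiers from libxml2.

```python
# Numeric node types as listed above; the labels are illustrative only.
NODE_TYPE_NAMES = {
    1: "element start",
    2: "attribute",
    3: "text",
    4: "CDATA section",
    5: "entity reference",
    6: "entity declaration",
    7: "processing instruction",
    8: "comment",
    9: "document",
    10: "DTD/Doctype",
    11: "document fragment",
    12: "notation",
    15: "element end",
}


def node_type_name(node_type):
    """Return a readable label for a NodeType() value."""
    return NODE_TYPE_NAMES.get(node_type, "unknown (%d)" % node_type)
```

A processNode() routine could then print node_type_name(reader.NodeType()) instead of the raw number.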


Let's first look at a small example to put this into practice by redefining the processNode() function in the Python example:

def processNode(reader):
    print("%d %d %s %d" % (reader.Depth(), reader.NodeType(),
                           reader.Name(), reader.IsEmptyElement()))


and look at the result of calling streamFile("tst.xml") for various contents of the XML test file.


For the minimal document "<doc/>" we get:

0 1 doc 1


Only one node is found: its depth is 0, type 1 indicates an element start, its name is "doc" and it is empty. Trying now with "<doc></doc>" instead leads to:

0 1 doc 0
0 15 doc 0


The document root node is not flagged as empty anymore and both a start and an end of element are detected. The following document shows how character data are reported:

<doc><a/><b>some text</b>
<c/></doc>


We modify the processNode() function to also report the node Value:

def processNode(reader):
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))


The result of the test is:

0 1 doc 0 None
1 1 a 1 None
1 1 b 0 None
2 3 #text 0 some text
1 15 b 0 None
1 3 #text 0

1 1 c 1 None
0 15 doc 0 None


There are a few things to note:

the increase of the depth value (first column) as child nodes are explored
the text node child of the b element, of type 3, and its content
the text node containing the line return between elements b and c
that elements have the Value None (or NULL in C)
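
Whitespace-only text nodes, such as the line return between b and c, are often noise for the application. A minimal sketch of a filter follows; is_blank_text is a hypothetical helper, not a reader method. It assumes that, besides type 3, such nodes may be reported with the whitespace node types 13 and 14 of the C# XmlNodeType enumeration, which are not part of the list given earlier.

```python
TEXT = 3                      # NodeType value for text nodes
WHITESPACE_TYPES = (13, 14)   # assumption: Whitespace and SignificantWhitespace


def is_blank_text(reader):
    """Return True if the current node holds only whitespace."""
    node_type = reader.NodeType()
    if node_type in WHITESPACE_TYPES:
        return True
    if node_type != TEXT:
        return False
    value = reader.Value()
    return value is None or value.strip() == ""
```

A processNode() routine could start with "if is_blank_text(reader): return" to ignore those nodes.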


The equivalent routine for processNode() as used by xmllint --stream --debug is the following and can be found in the xmllint.c module in the source distribution:

static void processNode(xmlTextReaderPtr reader) {
    xmlChar *name, *value;

    name = xmlTextReaderName(reader);
    if (name == NULL)
        name = xmlStrdup(BAD_CAST "--");
    value = xmlTextReaderValue(reader);

    printf("%d %d %s %d",
            xmlTextReaderDepth(reader),
            xmlTextReaderNodeType(reader),
            name,
            xmlTextReaderIsEmptyElement(reader));
    xmlFree(name);
    if (value == NULL)
        printf("\n");
    else {
        printf(" %s\n", value);
        xmlFree(value);
    }
}


Extracting information for the attributes


The previous examples don't indicate how attributes are processed. The simple test "<doc a="b"/>" provides the following result:

0 1 doc 1 None


This proves that attribute nodes are not traversed by default. The HasAttributes property allows detecting their presence. To check their content the API has special instructions. Basically two kinds of operations are possible:

to move the reader to the attribute nodes of the current element, in which case the cursor is positioned on the attribute node
to directly query the element node for the attribute value


In both cases the attribute can be designated either by its position in the list of attributes (MoveToAttributeNo or GetAttributeNo) or by its name (and namespace):

GetAttributeNo(no): provides the value of the attribute with the specified index no relative to the containing element.
GetAttribute(name): provides the value of the attribute with the specified qualified name.
GetAttributeNs(localName, namespaceURI): provides the value of the attribute with the specified local name and namespace URI.
MoveToAttributeNo(no): moves the position of the current instance to the attribute with the specified index relative to the containing element.
MoveToAttribute(name): moves the position of the current instance to the attribute with the specified qualified name.
MoveToAttributeNs(localName, namespaceURI): moves the position of the current instance to the attribute with the specified local name and namespace URI.
MoveToFirstAttribute: moves the position of the current instance to the first attribute associated with the current node.
MoveToNextAttribute: moves the position of the current instance to the next attribute associated with the current node.
MoveToElement: moves the position of the current instance to the node that contains the current Attribute node.
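
The cursor-moving operations can be combined to collect all attributes of an element and then return to it. The sketch below is illustrative only: attributes_of is a hypothetical helper, and it assumes that MoveToNextAttribute() returns 1 as long as it could move to an attribute and that, when called on the element itself, it moves to the first attribute.

```python
ELEMENT = 1   # NodeType value for an element start


def attributes_of(reader):
    """Return the attributes of the current element as a {name: value} dict."""
    attrs = {}
    if reader.NodeType() != ELEMENT or reader.HasAttributes() != 1:
        return attrs
    while reader.MoveToNextAttribute() == 1:
        # the cursor now sits on an attribute node
        attrs[reader.Name()] = reader.Value()
    reader.MoveToElement()    # put the cursor back on the owning element
    return attrs
```

Calling MoveToElement() at the end keeps later queries such as Name() or IsEmptyElement() referring to the element rather than to its last attribute.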


After modifying the processNode() function to show attributes:

def processNode(reader):
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))
    if reader.NodeType() == 1: # Element
        while reader.MoveToNextAttribute():
            print("-- %d %d (%s) [%s]" % (reader.Depth(), reader.NodeType(),
                                          reader.Name(), reader.Value()))


The output for the same input document reflects the attribute:

0 1 doc 1 None
-- 1 2 (a) [b]


There are a couple of things to note on the attribute processing:

Their depth is the one of the carrying element plus one.
Namespace declarations are seen as attributes, as in DOM.


Validating a document


The libxml2 implementation adds some extra features on top of the XmlTextReader API. The main one is the ability to validate the parsed document against its DTD progressively. This is simply the activation of the associated feature of the parser used by the reader structure. There are a few options available, defined as the enum xmlParserProperties in the libxml/xmlreader.h header file:

XML_PARSER_LOADDTD: force loading the DTD (without validating)
XML_PARSER_DEFAULTATTRS: force attribute defaulting (this also implies loading the DTD)
XML_PARSER_VALIDATE: activate DTD validation (this also implies loading the DTD)
XML_PARSER_SUBST_ENTITIES: substitute entities on the fly; entity reference nodes are not generated and are replaced by their expanded content.
more settings might be added; those were the ones available at the 2.5.0 release...


The GetParserProp() and SetParserProp() methods can then be used to get and set the values of those parser properties of the reader. For example:

def parseAndValidate(file):
    reader = libxml2.newTextReaderFilename(file)
    reader.SetParserProp(libxml2.PARSER_VALIDATE, 1)
    ret = reader.Read()
    while ret == 1:
        ret = reader.Read()
    if ret != 0:
        print("Error parsing and validating %s" % (file))


This routine will parse and validate the file. Error messages can be captured by registering an error handler. See python/tests/reader2.py for more complete Python examples. At the C level the equivalent call to activate the validation feature is just:

ret = xmlTextReaderSetParserProp(reader, XML_PARSER_VALIDATE, 1);


and a return value of 0 indicates success.
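
Since a return value of 0 indicates success, several properties can be activated in one go while keeping track of those that were refused. This is only a sketch under the assumption that the Python SetParserProp() method returns the same status as the C function; set_parser_props is a hypothetical helper.

```python
def set_parser_props(reader, props):
    """Set each property in props to 1 and return the ones that were refused."""
    refused = []
    for prop in props:
        # a return value of 0 indicates success
        if reader.SetParserProp(prop, 1) != 0:
            refused.append(prop)
    return refused
```

It could be called as set_parser_props(reader, [libxml2.PARSER_LOADDTD, libxml2.PARSER_VALIDATE]), with an empty list returned when every property was accepted.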


Entities substitution


By default the xmlReader will report entities as such and not replace them with their content. This default behaviour can however be overridden using:


reader.SetParserProp(libxml2.PARSER_SUBST_ENTITIES,1)


Relax-NG Validation


Introduced in version 2.5.7


Libxml2 can now validate the document being read with the xmlReader against Relax-NG schemas. While the Relax NG validator can't always work in a streamable mode, only the subsets which cannot be reduced to regular expressions need to have their subtree expanded for validation. In practice this means that, unless the schema for the top-level element content cannot be expressed as a regexp, only chunks of the document need to be parsed while validating.


The steps to do so are:

create a reader working on a document as usual
before any call to Read(), associate it with a Relax NG schema, either the preparsed schema or the URL of the schema to use
errors will be reported the usual way, and the validity status can be obtained using the IsValid() interface of the reader, as for DTDs.


Example, assuming the reader has already been created and that the schema string contains the Relax-NG schema:

rngp = libxml2.relaxNGNewMemParserCtxt(schema, len(schema))
rngs = rngp.relaxNGParse()
reader.RelaxNGSetSchema(rngs)
ret = reader.Read()
while ret == 1:
    ret = reader.Read()
if ret != 0:
    print("Error parsing the document")
if reader.IsValid() != 1:
    print("Document failed to validate")


See reader6.py in the sources or documentation for a complete example.


Mixing the reader and tree or XPath operations


Introduced in version 2.5.7


While the reader is a streaming interface, its underlying implementation is based on the DOM builder of libxml2. As a result it is relatively simple to mix operations based on both models, under some constraints. To do so the reader has an Expand() operation which grows the subtree under the current node. It returns a pointer to a standard node which can be manipulated in the usual ways. The node will get all its ancestors and the full subtree available. Usual operations like XPath queries can be used on that reduced view of the document. Here is an example extracted from reader5.py in the sources which extracts and prints the bibliography for the "Dragon" compiler book from the XML 1.0 recommendation:

f = open('../../test/valid/REC-xml-19980210.xml')
input = libxml2.inputBuffer(f)
reader = input.newTextReader("REC")
res = ""
while reader.Read() == 1:
    while reader.Name() == 'bibl':
        node = reader.Expand()            # expand the subtree
        if node.xpathEval("@id = 'Aho'"): # use XPath on it
            res = res + node.serialize()
        if reader.Next() != 1:            # skip the subtree
            break


Note, however, that the node instance returned by the Expand() call is only valid until the next Read() operation. The Expand() operation does not affect the Read() ones; however, once processed the full subtree is usually not useful anymore, and the Next() operation allows skipping it completely and proceeding to the successor, or returns 0 if the end of the document is reached.
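
The Expand() then Next() pattern can be written as a single loop in which Read() and Next() both feed the same status variable. The sketch below is one possible arrangement, not the code of reader5.py: serialize_elements is a hypothetical helper, and it assumes that after Next() the reader is already positioned on the successor node, so no extra Read() is needed before testing it. The subtree is serialized immediately because the expanded node does not remain valid once the reader moves on.

```python
ELEMENT = 1   # NodeType value for an element start


def serialize_elements(reader, name):
    """Return the serialized subtree of every element called name."""
    found = []
    ret = reader.Read()
    while ret == 1:
        if reader.NodeType() == ELEMENT and reader.Name() == name:
            node = reader.Expand()           # build the subtree under the cursor
            found.append(node.serialize())   # copy the data out before moving on
            ret = reader.Next()              # skip the subtree just processed
        else:
            ret = reader.Read()
    if ret != 0:
        raise ValueError("failed to parse the document")
    return found
```

Elements with the same name nested inside a matched element are not reported separately, since Next() skips the whole subtree.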


Daniel Veillard


$Id$

