From 35a201cc8ef0c3f5b2df88d2e528aabee1048348 Mon Sep 17 00:00:00 2001 From: Wojtek Kosior Date: Fri, 30 Apr 2021 18:47:09 +0200 Subject: Initial/Final commit --- libxml2-2.9.10/doc/xmlreader.html | 475 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 475 insertions(+) create mode 100644 libxml2-2.9.10/doc/xmlreader.html (limited to 'libxml2-2.9.10/doc/xmlreader.html') diff --git a/libxml2-2.9.10/doc/xmlreader.html b/libxml2-2.9.10/doc/xmlreader.html new file mode 100644 index 0000000..354e5bd --- /dev/null +++ b/libxml2-2.9.10/doc/xmlreader.html @@ -0,0 +1,475 @@ + + + + + + + Libxml2 XmlTextReader Interface tutorial + + + +

Libxml2 XmlTextReader Interface tutorial

+ +

+ +

This document describes the use of the XmlTextReader streaming API added to libxml2 in version 2.5.0. This API is closely modeled after the XmlTextReader and XmlReader classes of the C# language.

+ +

This tutorial will present the key points of this API, and working +examples using both C and the Python bindings:

+ +

Table of contents:

+ + +

+ +

Introduction: why a new API

+ +

The main libxml2 API is tree based: the parsing operation loads the document completely in memory and exposes it as a tree of nodes that are all available at the same time. This is very simple and quite powerful, but has the major limitation that the size of the document that can be handled is limited by the amount of memory available. Libxml2 also provides a SAX based API, but that version was designed upon one of the early expat versions of SAX, and SAX is not formally defined for C. SAX basically works by registering callbacks which are called directly by the parser as it progresses through the document stream. The problem is that this programming model is relatively complex, is not well standardized, cannot provide validation directly, and makes entity, namespace and base processing relatively hard.

+ +

The XmlTextReader API from C# provides a far simpler programming model. The API acts as a cursor going forward on the document stream and stopping at each node on the way. The user's code keeps control of the progress and simply calls a Read() function repeatedly to progress to each node in sequence in document order. There is direct support for namespaces, xml:base and entity handling, and adding DTD validation on top of it was relatively simple. This API is really close to the DOM Core specification. This provides a far more standard, easy to use and powerful API than the existing SAX. Moreover, integrating extension features based on the tree seems relatively easy.

+ +

In a nutshell the XmlTextReader API provides a simpler, more standard and +more extensible interface to handle large documents than the existing SAX +version.

+ +

Walking a simple tree

+ +

Basically the XmlTextReader API is a forward only tree walking interface. +The basic steps are:

+
    +
  1. prepare a reader context operating on some input
  2. +
  3. run a loop iterating over all nodes in the document
  4. +
  5. free up the reader context
  6. +
+ +

Here is a basic C sample doing this:

+
#include <stdio.h>
+#include <libxml/xmlreader.h>
+
+void processNode(xmlTextReaderPtr reader) {
+    /* handling of a node in the tree */
+}
+
+int streamFile(char *filename) {
+    xmlTextReaderPtr reader;
+    int ret;
+
+    reader = xmlNewTextReaderFilename(filename);
+    if (reader != NULL) {
+        ret = xmlTextReaderRead(reader);
+        while (ret == 1) {
+            processNode(reader);
+            ret = xmlTextReaderRead(reader);
+        }
+        xmlFreeTextReader(reader);
+        if (ret != 0) {
+            printf("%s : failed to parse\n", filename);
+        }
+    } else {
+        printf("Unable to open %s\n", filename);
+        ret = -1;
+    }
+    return ret; /* 0 on success, non-zero on failure */
+}
+ +

A few things to notice:

+ + +

Here is similar code in Python for exactly the same processing:

+
import libxml2
+
+def processNode(reader):
+    pass
+
+def streamFile(filename):
+    try:
+        reader = libxml2.newTextReaderFilename(filename)
+    except Exception:
+        print("unable to open %s" % (filename))
+        return
+
+    ret = reader.Read()
+    while ret == 1:
+        processNode(reader)
+        ret = reader.Read()
+
+    if ret != 0:
+        print("%s : failed to parse" % (filename))
+ +

The only things worth adding are that the xmlTextReader +is abstracted as a class like in C# with the same method names (but the +properties are currently accessed with methods) and that one doesn't need to +free the reader at the end of the processing. It will get garbage collected +once all references have disappeared.
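The read loop above can also be wrapped in a generator so that callers iterate with a plain for loop. The following is a minimal sketch, not part of libxml2: the helper name iter_nodes is hypothetical, and it assumes only what the examples above show, namely that Read() returns 1 when a node is available, 0 at the end of the document, and some other value on error.

```python
def iter_nodes(reader):
    """Yield the reader once for every node it stops on, in document order."""
    ret = reader.Read()
    while ret == 1:
        yield reader          # the same cursor object, now positioned on a new node
        ret = reader.Read()
    if ret != 0:              # anything other than 0 means the parse failed
        raise ValueError("failed to parse")
```

With the Python bindings installed, a loop such as `for node in iter_nodes(libxml2.newTextReaderFilename("tst.xml")): processNode(node)` would replace the explicit while loop. Because the same reader object is yielded each time, any value of interest has to be read from it before the loop advances.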

+ +

Extracting information for the current node

+ +

So far the example code did not indicate how information was extracted from the reader. It was abstracted as a call to the processNode() routine, with the reader as the argument. At each invocation, the parser is stopped on a given node and the reader can be used to query that node's properties. Each Property is available at the C level as a function named xmlTextReaderProperty that takes a single xmlTextReaderPtr argument. If the return type is an xmlChar * string, it must be deallocated with xmlFree() to avoid leaks. For the Python interface, there is a Property method on the reader class that can be called on the instance. The list of the properties is based on the C# XmlTextReader class set of properties and methods:

+ + +

Let's look first at a small example to get this in practice by redefining +the processNode() function in the Python example:

+
def processNode(reader):
+    print("%d %d %s %d" % (reader.Depth(), reader.NodeType(),
+                           reader.Name(), reader.IsEmptyElement()))
+ +

and look at the result of calling streamFile("tst.xml") for various +content of the XML test file.

+ +

For the minimal document "<doc/>" we get:

+
0 1 doc 1
+ +

Only one node is found: its depth is 0, type 1 indicates an element start, its name is "doc" and it is empty. Trying now with "<doc></doc>" instead leads to:

+
0 1 doc 0
+0 15 doc 0
+ +

The document root node is not flagged as empty anymore and both a start +and an end of element are detected. The following document shows how +character data are reported:

+
<doc><a/><b>some text</b>
+<c/></doc>
+ +

We modify the processNode() function to also report the node Value:

+
def processNode(reader):
+    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
+                              reader.Name(), reader.IsEmptyElement(),
+                              reader.Value()))
+ +

The result of the test is:

+
0 1 doc 0 None
+1 1 a 1 None
+1 1 b 0 None
+2 3 #text 0 some text
+1 15 b 0 None
+1 3 #text 0
+
+1 1 c 1 None
+0 15 doc 0 None
+ +

There are a few things to note:

+ + +
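The numeric node types in the listings above are easier to read with symbolic names. The sketch below is an illustration rather than part of the bindings: NODE_TYPE_NAMES and describe_node are hypothetical names, and the table lists only a handful of the values of the xmlReaderTypes enum in libxml/xmlreader.h. It assumes a reader exposing the Depth(), NodeType(), Name(), IsEmptyElement() and Value() methods used above.

```python
# Subset of the xmlReaderTypes values; unknown types fall back to the number.
NODE_TYPE_NAMES = {
    1: "Element",
    2: "Attribute",
    3: "Text",
    4: "CDATA",
    8: "Comment",
    14: "SignificantWhitespace",
    15: "EndElement",
}

def describe_node(reader):
    """Return one line of text describing the node the reader is stopped on."""
    node_type = reader.NodeType()
    type_name = NODE_TYPE_NAMES.get(node_type, str(node_type))
    line = "%d %s %s" % (reader.Depth(), type_name, reader.Name())
    if reader.IsEmptyElement():
        line += " (empty)"
    value = reader.Value()
    if value is not None and value.strip():   # hide missing or whitespace-only values
        line += " [%s]" % value
    return line
```

Calling describe_node(reader) from processNode() and printing the result gives the same information as the numeric listing, with whitespace-only text nodes shown without their value.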

The equivalent routine for processNode() as used by +xmllint --stream --debug is the following and can be found in +the xmllint.c module in the source distribution:

+
static void processNode(xmlTextReaderPtr reader) {
+    xmlChar *name, *value;
+
+    name = xmlTextReaderName(reader);
+    if (name == NULL)
+        name = xmlStrdup(BAD_CAST "--");
+    value = xmlTextReaderValue(reader);
+
+    printf("%d %d %s %d",
+            xmlTextReaderDepth(reader),
+            xmlTextReaderNodeType(reader),
+            name,
+            xmlTextReaderIsEmptyElement(reader));
+    xmlFree(name);
+    if (value == NULL)
+        printf("\n");
+    else {
+        printf(" %s\n", value);
+        xmlFree(value);
+    }
+}
+ +

Extracting information for the attributes

+ +

The previous examples don't indicate how attributes are processed. The +simple test "<doc a="b"/>" provides the following +result:

+
0 1 doc 1 None
+ +

This proves that attribute nodes are not traversed by default. The HasAttributes property allows detecting their presence. To check their content the API has special instructions. Basically two kinds of operations are possible:

+
    +
  1. to move the reader to the attribute nodes of the current element, in + that case the cursor is positioned on the attribute node
  2. +
  3. to directly query the element node for the attribute value
  4. +
+ +

In both cases the attribute can be designated either by its position in the list of attributes (MoveToAttributeNo or GetAttributeNo) or by its name (and namespace):

+ + +

After modifying the processNode() function to show attributes:

+
def processNode(reader):
+    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
+                              reader.Name(), reader.IsEmptyElement(),
+                              reader.Value()))
+    if reader.NodeType() == 1: # Element
+        while reader.MoveToNextAttribute():
+            print("-- %d %d (%s) [%s]" % (reader.Depth(), reader.NodeType(),
+                                          reader.Name(), reader.Value()))
+ +

The output for the same input document reflects the attribute:

+
0 1 doc 1 None
+-- 1 2 (a) [b]
+ +

There are a couple of things to note on the attribute processing:

+ + +
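The cursor-moving style of attribute access can be packaged in a small helper. This is a sketch under stated assumptions, not code from the libxml2 sources: collect_attributes is a hypothetical name, and it relies on MoveToNextAttribute() returning 1 while an attribute is found and on MoveToElement() moving the cursor back to the element that owns the attributes.

```python
def collect_attributes(reader):
    """Return the attributes of the current element as a list of (name, value)."""
    attributes = []
    if reader.NodeType() != 1:             # 1 is the element node type
        return attributes
    while reader.MoveToNextAttribute() == 1:
        attributes.append((reader.Name(), reader.Value()))
    if attributes:
        reader.MoveToElement()             # put the cursor back on the owning element
    return attributes
```

Moving back to the element means that later queries such as Name() or Depth() describe the element again rather than its last attribute. When only one known attribute is needed, querying the element directly by name or position avoids moving the cursor at all.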

Validating a document

+ +

The libxml2 implementation adds some extra features on top of the XmlTextReader API. The main one is the ability to validate the parsed document against its DTD progressively. This is simply the activation of the associated feature of the parser used by the reader structure. There are a few options available, defined as the enum xmlParserProperties in the libxml/xmlreader.h header file:

+ + +

The GetParserProp() and SetParserProp() methods can then be used to get and set the values of those parser properties of the reader. For example:

+
def parseAndValidate(file):
+    reader = libxml2.newTextReaderFilename(file)
+    reader.SetParserProp(libxml2.PARSER_VALIDATE, 1)
+    ret = reader.Read()
+    while ret == 1:
+        ret = reader.Read()
+    if ret != 0:
+        print("Error parsing and validating %s" % (file))
+ +

This routine will parse and validate the file. Error messages can be captured by registering an error handler. See python/tests/reader2.py for more complete Python examples. At the C level the equivalent call to activate the validation feature is just:

+
ret = xmlTextReaderSetParserProp(reader, XML_PARSER_VALIDATE, 1);
+ +

and a return value of 0 indicates success.
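Capturing the error messages while validating could look like the sketch below. It is an assumption-laden illustration, not the code of reader2.py: the function names are hypothetical, it uses the module-level libxml2.registerErrorHandler(f, ctx) call, whose handler is invoked as f(ctx, message), and it relies on IsValid() to report the validity status once the document has been read.

```python
def read_to_end(reader):
    """Advance the reader through the whole document; return the last Read() status."""
    ret = reader.Read()
    while ret == 1:
        ret = reader.Read()
    return ret

def parse_and_validate(filename):
    """Return (ok, messages) for a DTD-validating parse of filename."""
    import libxml2  # imported here so read_to_end() stays usable without the bindings

    messages = []

    def on_error(ctx, msg):
        # ctx is the list passed at registration; one error may arrive in several pieces
        ctx.append(msg)

    libxml2.registerErrorHandler(on_error, messages)
    reader = libxml2.newTextReaderFilename(filename)
    reader.SetParserProp(libxml2.PARSER_VALIDATE, 1)
    ret = read_to_end(reader)
    ok = (ret == 0 and reader.IsValid() == 1)
    return ok, "".join(messages)
```

The error handler registered this way is global to the process, so a program that parses several documents would typically register it once rather than on every call.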

+ +

Entities substitution

+ +

By default the xmlReader will report entities as such and not replace them +with their content. This default behaviour can however be overridden using:

+ +

reader.SetParserProp(libxml2.PARSER_SUBST_ENTITIES,1)

+ +

Relax-NG Validation

+ +

Introduced in version 2.5.7

+ +

Libxml2 can now validate the document being read with the xmlReader against Relax-NG schemas. While the Relax-NG validator can't always work in a streamable mode, only the subsets which cannot be reduced to regular expressions need to have their subtree expanded for validation. In practice it means that, unless the schema for the top-level element content is not expressible as a regexp, only chunks of the document need to be expanded at a time while validating.

+ +

The steps to do so are:

+ + +

Example, assuming the reader has already been created and that the schema string contains the Relax-NG schema:

+
rngp = libxml2.relaxNGNewMemParserCtxt(schema, len(schema))
+rngs = rngp.relaxNGParse()
+reader.RelaxNGSetSchema(rngs)
+ret = reader.Read()
+while ret == 1:
+    ret = reader.Read()
+if ret != 0:
+    print("Error parsing the document")
+if reader.IsValid() != 1:
+    print("Document failed to validate")

+
+ +

See reader6.py in the sources or documentation for a complete +example.

+ +

Mixing the reader and tree or XPath operations

+ +

Introduced in version 2.5.7

+ +

While the reader is a streaming interface, its underlying implementation is based on the DOM builder of libxml2. As a result it is relatively simple to mix operations based on both models under some constraints. To do so the reader has an Expand() operation which grows the subtree under the current node. It returns a pointer to a standard node which can be manipulated in the usual ways. The node will get all its ancestors and the full subtree available. Usual operations like XPath queries can be used on that reduced view of the document. Here is an example extracted from reader5.py in the sources which extracts and prints the bibliography for the "Dragon" compiler book from the XML 1.0 recommendation:

+
f = open('../../test/valid/REC-xml-19980210.xml')
+input = libxml2.inputBuffer(f)
+reader = input.newTextReader("REC")
+res=""
+while reader.Read():
+    while reader.Name() == 'bibl':
+        node = reader.Expand()            # expand the subtree
+        if node.xpathEval("@id = 'Aho'"): # use XPath on it
+            res = res + node.serialize()
+        if reader.Next() != 1:            # skip the subtree
+            break
+ +

Note, however, that the node instance returned by the Expand() call is only valid until the next Read() operation. The Expand() operation does not affect the Read() ones. However, once processed, the full subtree is usually not useful anymore, and the Next() operation allows skipping it completely and proceeding to the successor, or returns 0 if the end of the document is reached.
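The Expand() and Next() pattern can be factored into a reusable loop. The following is a sketch, with collect_subtrees and its keep callback as hypothetical names: each matching element is expanded, tested, serialized while the node is still valid, and then skipped with Next(), which already leaves the reader on the successor so no extra Read() is issued for it.

```python
def collect_subtrees(reader, element_name, keep):
    """Serialize every element named element_name for which keep(node) is true."""
    found = []
    ret = reader.Read()
    while ret == 1:
        if reader.NodeType() == 1 and reader.Name() == element_name:
            node = reader.Expand()          # valid only until the reader moves on
            if keep(node):
                found.append(node.serialize())
            ret = reader.Next()             # skip the subtree that was just handled
        else:
            ret = reader.Read()
    if ret != 0:                            # neither a node nor a clean end of document
        raise ValueError("failed to parse")
    return found
```

For the bibliography example this would be called as `collect_subtrees(reader, 'bibl', lambda node: node.xpathEval("@id = 'Aho'"))`. The serialized strings are copied out immediately, so nothing in the result depends on a node that has since become invalid.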

+ +

Daniel Veillard

+ +

$Id$

+ +

+ + -- cgit v1.2.3