From 35a201cc8ef0c3f5b2df88d2e528aabee1048348 Mon Sep 17 00:00:00 2001 From: Wojtek Kosior Date: Fri, 30 Apr 2021 18:47:09 +0200 Subject: Initial/Final commit --- libxml2-2.9.10/doc/encoding.html | 217 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 217 insertions(+) create mode 100644 libxml2-2.9.10/doc/encoding.html (limited to 'libxml2-2.9.10/doc/encoding.html') diff --git a/libxml2-2.9.10/doc/encoding.html b/libxml2-2.9.10/doc/encoding.html new file mode 100644 index 0000000..42f3396 --- /dev/null +++ b/libxml2-2.9.10/doc/encoding.html @@ -0,0 +1,217 @@ + + +Encodings support
Action against software patentsGnome2 LogoW3C LogoRed Hat Logo
Made with Libxml2 Logo

The XML C parser and toolkit of Gnome

Encodings support

Main Menu
Related links

If you are not really familiar with Internationalization (usual shortcut +is I18N) , Unicode, characters and glyphs, I suggest you read a presentation +by Tim Bray on Unicode and why you should care about it.

If you don't understand why it does not make sense to have a string +without knowing what encoding it uses, then as Joel Spolsky said please do not +write another line of code until you finish reading that article.. It is +a prerequisite to understand this page, and avoid a lot of problems with +libxml2, XML or text processing in general.

Table of Content:

    +
  1. What does internationalization support + mean ?
  2. +
  3. The internal encoding, how and + why
  4. +
  5. How is it implemented ?
  6. +
  7. Default supported encodings
  8. +
  9. How to extend the existing + support
  10. +

What does internationalization support mean ?

XML was designed from the start to allow the support of any character set +by using Unicode. Any conformant XML parser has to support the UTF-8 and +UTF-16 default encodings which can both express the full unicode ranges. UTF8 +is a variable length encoding whose greatest points are to reuse the same +encoding for ASCII and to save space for Western encodings, but it is a bit +more complex to handle in practice. UTF-16 use 2 bytes per character (and +sometimes combines two pairs), it makes implementation easier, but looks a +bit overkill for Western languages encoding. Moreover the XML specification +allows the document to be encoded in other encodings at the condition that +they are clearly labeled as such. For example the following is a wellformed +XML document encoded in ISO-8859-1 and using accentuated letters that we +French like for both markup and content:

<?xml version="1.0" encoding="ISO-8859-1"?>
+<très>là </très>

Having internationalization support in libxml2 means the following:

    +
  • the document is properly parsed
  • +
  • information about it's encoding is saved
  • +
  • it can be modified
  • +
  • it can be saved in its original encoding
  • +
  • it can also be saved in another encoding supported by libxml2 (for + example straight UTF8 or even an ASCII form)
  • +

Another very important point is that the whole libxml2 API, with the +exception of a few routines to read with a specific encoding or save to a +specific encoding, is completely agnostic about the original encoding of the +document.

It should be noted too that the HTML parser embedded in libxml2 now obey +the same rules too, the following document will be (as of 2.2.2) handled in +an internationalized fashion by libxml2 too:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
+                      "http://www.w3.org/TR/REC-html40/loose.dtd">
+<html lang="fr">
+<head>
+  <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
+</head>
+<body>
+<p>W3C crée des standards pour le Web.</body>
+</html>

The internal encoding, how and why

One of the core decisions was to force all documents to be converted to a +default internal encoding, and that encoding to be UTF-8, here are the +rationales for those choices:

    +
  • keeping the native encoding in the internal form would force the libxml + users (or the code associated) to be fully aware of the encoding of the + original document, for examples when adding a text node to a document, + the content would have to be provided in the document encoding, i.e. the + client code would have to check it before hand, make sure it's conformant + to the encoding, etc ... Very hard in practice, though in some specific + cases this may make sense.
  • +
  • the second decision was which encoding. From the XML spec only UTF8 and + UTF16 really makes sense as being the two only encodings for which there + is mandatory support. UCS-4 (32 bits fixed size encoding) could be + considered an intelligent choice too since it's a direct Unicode mapping + support. I selected UTF-8 on the basis of efficiency and compatibility + with surrounding software: +
      +
    • UTF-8 while a bit more complex to convert from/to (i.e. slightly + more costly to import and export CPU wise) is also far more compact + than UTF-16 (and UCS-4) for a majority of the documents I see it used + for right now (RPM RDF catalogs, advogato data, various configuration + file formats, etc.) and the key point for today's computer + architecture is efficient uses of caches. If one nearly double the + memory requirement to store the same amount of data, this will trash + caches (main memory/external caches/internal caches) and my take is + that this harms the system far more than the CPU requirements needed + for the conversion to UTF-8
    • +
    • Most of libxml2 version 1 users were using it with straight ASCII + most of the time, doing the conversion with an internal encoding + requiring all their code to be rewritten was a serious show-stopper + for using UTF-16 or UCS-4.
    • +
    • UTF-8 is being used as the de-facto internal encoding standard for + related code like the pango + upcoming Gnome text widget, and a lot of Unix code (yet another place + where Unix programmer base takes a different approach from Microsoft + - they are using UTF-16)
    • +
    +
  • +

What does this mean in practice for the libxml2 user:

    +
  • xmlChar, the libxml2 data type is a byte, those bytes must be assembled + as UTF-8 valid strings. The proper way to terminate an xmlChar * string + is simply to append 0 byte, as usual.
  • +
  • One just need to make sure that when using chars outside the ASCII set, + the values has been properly converted to UTF-8
  • +

How is it implemented ?

Let's describe how all this works within libxml, basically the I18N +(internationalization) support get triggered only during I/O operation, i.e. +when reading a document or saving one. Let's look first at the reading +sequence:

    +
  1. when a document is processed, we usually don't know the encoding, a + simple heuristic allows to detect UTF-16 and UCS-4 from encodings where + the ASCII range (0-0x7F) maps with ASCII
  2. +
  3. the xml declaration if available is parsed, including the encoding + declaration. At that point, if the autodetected encoding is different + from the one declared a call to xmlSwitchEncoding() is issued.
  4. +
  5. If there is no encoding declaration, then the input has to be in either + UTF-8 or UTF-16, if it is not then at some point when processing the + input, the converter/checker of UTF-8 form will raise an encoding error. + You may end-up with a garbled document, or no document at all ! Example: +
    ~/XML -> ./xmllint err.xml 
    +err.xml:1: error: Input is not proper UTF-8, indicate encoding !
    +<très>là </très>
    +   ^
    +err.xml:1: error: Bytes: 0xE8 0x73 0x3E 0x6C
    +<très>là </très>
    +   ^
    +
  6. +
  7. xmlSwitchEncoding() does an encoding name lookup, canonicalize it, and + then search the default registered encoding converters for that encoding. + If it's not within the default set and iconv() support has been compiled + it, it will ask iconv for such an encoder. If this fails then the parser + will report an error and stops processing: +
    ~/XML -> ./xmllint err2.xml 
    +err2.xml:1: error: Unsupported encoding UnsupportedEnc
    +<?xml version="1.0" encoding="UnsupportedEnc"?>
    +                                             ^
    +
  8. +
  9. From that point the encoder processes progressively the input (it is + plugged as a front-end to the I/O module) for that entity. It captures + and converts on-the-fly the document to be parsed to UTF-8. The parser + itself just does UTF-8 checking of this input and process it + transparently. The only difference is that the encoding information has + been added to the parsing context (more precisely to the input + corresponding to this entity).
  10. +
  11. The result (when using DOM) is an internal form completely in UTF-8 + with just an encoding information on the document node.
  12. +

Ok then what happens when saving the document (assuming you +collected/built an xmlDoc DOM like structure) ? It depends on the function +called, xmlSaveFile() will just try to save in the original encoding, while +xmlSaveFileTo() and xmlSaveFileEnc() can optionally save to a given +encoding:

    +
  1. if no encoding is given, libxml2 will look for an encoding value + associated to the document and if it exists will try to save to that + encoding, +

    otherwise everything is written in the internal form, i.e. UTF-8

    +
  2. +
  3. so if an encoding was specified, either at the API level or on the + document, libxml2 will again canonicalize the encoding name, lookup for a + converter in the registered set or through iconv. If not found the + function will return an error code
  4. +
  5. the converter is placed before the I/O buffer layer, as another kind of + buffer, then libxml2 will simply push the UTF-8 serialization to through + that buffer, which will then progressively be converted and pushed onto + the I/O layer.
  6. +
  7. It is possible that the converter code fails on some input, for example + trying to push an UTF-8 encoded Chinese character through the UTF-8 to + ISO-8859-1 converter won't work. Since the encoders are progressive they + will just report the error and the number of bytes converted, at that + point libxml2 will decode the offending character, remove it from the + buffer and replace it with the associated charRef encoding &#123; and + resume the conversion. This guarantees that any document will be saved + without losses (except for markup names where this is not legal, this is + a problem in the current version, in practice avoid using non-ascii + characters for tag or attribute names). A special "ascii" encoding name + is used to save documents to a pure ascii form can be used when + portability is really crucial
  8. +

Here are a few examples based on the same test document and assumin a +terminal using ISO-8859-1 as the text encoding:

~/XML -> ./xmllint isolat1 
+<?xml version="1.0" encoding="ISO-8859-1"?>
+<très>là</très>
+~/XML -> ./xmllint --encode UTF-8 isolat1 
+<?xml version="1.0" encoding="UTF-8"?>
+<très>là  </très>
+~/XML -> 

The same processing is applied (and reuse most of the code) for HTML I18N +processing. Looking up and modifying the content encoding is a bit more +difficult since it is located in a <meta> tag under the <head>, +so a couple of functions htmlGetMetaEncoding() and htmlSetMetaEncoding() have +been provided. The parser also attempts to switch encoding on the fly when +detecting such a tag on input. Except for that the processing is the same +(and again reuses the same code).

Default supported encodings

libxml2 has a set of default converters for the following encodings +(located in encoding.c):

    +
  1. UTF-8 is supported by default (null handlers)
  2. +
  3. UTF-16, both little and big endian
  4. +
  5. ISO-Latin-1 (ISO-8859-1) covering most western languages
  6. +
  7. ASCII, useful mostly for saving
  8. +
  9. HTML, a specific handler for the conversion of UTF-8 to ASCII with HTML + predefined entities like &copy; for the Copyright sign.
  10. +

More over when compiled on an Unix platform with iconv support the full +set of encodings supported by iconv can be instantly be used by libxml. On a +linux machine with glibc-2.1 the list of supported encodings and aliases fill +3 full pages, and include UCS-4, the full set of ISO-Latin encodings, and the +various Japanese ones.

To convert from the UTF-8 values returned from the API to another encoding +then it is possible to use the function provided from the encoding module like UTF8Toisolat1, or use the +POSIX iconv() +API directly.

Encoding aliases

From 2.2.3, libxml2 has support to register encoding names aliases. The +goal is to be able to parse document whose encoding is supported but where +the name differs (for example from the default set of names accepted by +iconv). The following functions allow to register and handle new aliases for +existing encodings. Once registered libxml2 will automatically lookup the +aliases when handling a document:

    +
  • int xmlAddEncodingAlias(const char *name, const char *alias);
  • +
  • int xmlDelEncodingAlias(const char *alias);
  • +
  • const char * xmlGetEncodingAlias(const char *alias);
  • +
  • void xmlCleanupEncodingAliases(void);
  • +

How to extend the existing support

Well adding support for new encoding, or overriding one of the encoders +(assuming it is buggy) should not be hard, just write input and output +conversion routines to/from UTF-8, and register them using +xmlNewCharEncodingHandler(name, xxxToUTF8, UTF8Toxxx), and they will be +called automatically if the parser(s) encounter such an encoding name +(register it uppercase, this will help). The description of the encoders, +their arguments and expected return values are described in the encoding.h +header.

Daniel Veillard

-- cgit v1.2.3