From 35a201cc8ef0c3f5b2df88d2e528aabee1048348 Mon Sep 17 00:00:00 2001 From: Wojtek Kosior Date: Fri, 30 Apr 2021 18:47:09 +0200 Subject: Initial/Final commit --- libxml2-2.9.10/doc/tutorial/ar01s09.html | 63 ++++++++++++++++++++++++++++++++ 1 file changed, 63 insertions(+) create mode 100644 libxml2-2.9.10/doc/tutorial/ar01s09.html (limited to 'libxml2-2.9.10/doc/tutorial/ar01s09.html') diff --git a/libxml2-2.9.10/doc/tutorial/ar01s09.html b/libxml2-2.9.10/doc/tutorial/ar01s09.html new file mode 100644 index 0000000..2987add --- /dev/null +++ b/libxml2-2.9.10/doc/tutorial/ar01s09.html @@ -0,0 +1,63 @@ +Encoding Conversion

Encoding Conversion

Data encoding compatibility problems are among the most common difficulties encountered by programmers new to XML in general and libxml in particular. Thinking through the design of your application in light of this issue will help avoid difficulties later. Internally, libxml stores and manipulates data in UTF-8. Data that your program holds in another encoding, such as the commonly used ISO-8859-1, must be converted to UTF-8 before it is passed to libxml functions. If you want your program's output in an encoding other than UTF-8, you must also convert it on the way out.
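To make the conversion requirement concrete, here is a minimal sketch of an ISO-8859-1 to UTF-8 conversion written by hand, without libxml. The function name latin1_to_utf8 is made up for this illustration and is not part of libxml. It relies on the fact that every ISO-8859-1 byte maps to the Unicode code point of the same value, so each input byte becomes one or two UTF-8 bytes.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not a libxml function): convert a NUL-terminated
 * ISO-8859-1 string to a newly allocated UTF-8 string.
 * Returns NULL on allocation failure; the caller frees the result.
 */
char *latin1_to_utf8(const char *in)
{
    size_t len = strlen(in);
    /* Worst case: every input byte expands to two bytes, plus the NUL. */
    unsigned char *out = malloc(len * 2 + 1);
    unsigned char *p = out;
    size_t i;

    if (out == NULL)
        return NULL;

    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)in[i];
        if (c < 0x80) {
            /* ASCII range: identical in both encodings. */
            *p++ = c;
        } else {
            /* 0x80..0xFF: two-byte UTF-8 sequence 110xxxxx 10xxxxxx. */
            *p++ = (unsigned char)(0xC0 | (c >> 6));
            *p++ = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    *p = '\0';
    return (char *)out;
}
```

For example, the ISO-8859-1 byte 0xE9 (é) becomes the two UTF-8 bytes 0xC3 0xA9. In a real program you would normally let libxml or iconv do this work; the sketch only shows what the conversion amounts to.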

Libxml uses iconv, if it is available, to convert data. Without iconv, only UTF-8, UTF-16 and ISO-8859-1 can be used as external formats. With iconv, any format can be used provided iconv is able to convert it to and from UTF-8. Currently iconv supports about 150 different character formats and can convert from any of them to any other. The exact number of supported formats varies between implementations, but the commonly used encodings are supported by practically every iconv implementation.
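For readers who have not used iconv directly, the following sketch shows the kind of call sequence involved: iconv_open, iconv, and iconv_close from the POSIX iconv interface. The function name to_utf8_with_iconv is made up for this illustration. It assumes the input is a NUL-terminated string in a byte-oriented encoding with no embedded NUL bytes, and it sizes the output at four bytes per input byte, which is a generous assumption for conversions to UTF-8. On some platforms iconv lives in a separate library and needs -liconv at link time.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: convert `in` from the encoding named `from`
 * to UTF-8 using iconv. Returns a malloc'd string, or NULL if the
 * encoding is unknown or the conversion fails.
 */
char *to_utf8_with_iconv(const char *in, const char *from)
{
    iconv_t cd;
    size_t in_left, out_size, out_left, rc;
    char *out, *in_ptr, *out_ptr;

    /* iconv_open takes the target encoding first, then the source. */
    cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        return NULL;

    in_left = strlen(in);
    out_size = in_left * 4 + 1;      /* assumed upper bound, plus NUL */
    out_left = out_size - 1;
    out = malloc(out_size);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    in_ptr = (char *)in;             /* iconv does not modify the input */
    out_ptr = out;
    rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1) {
        free(out);
        return NULL;
    }
    *out_ptr = '\0';                 /* out_ptr now points past the output */
    return out;
}
```

Calling to_utf8_with_iconv("caf\xe9", "ISO-8859-1") should yield the UTF-8 bytes for "café", and an encoding name that iconv does not recognize yields NULL.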

Warning

A common mistake is to use different encodings for the internal data in different parts of one's code. The most common case is an application that assumes ISO-8859-1 as its internal data format, combined with libxml, which assumes UTF-8. The result is an application that treats internal data differently depending on which section of code is executing, so one part of the code or the other will inevitably misinterpret the data.
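A small illustration of this kind of misinterpretation: the same bytes give different answers depending on which encoding the code assumes. The helper utf8_length below is made up for this sketch. It counts characters in a UTF-8 string by skipping continuation bytes, whereas strlen, which is what code written for ISO-8859-1 would typically use, counts bytes.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: count the characters (code points) in a
 * UTF-8 string. Continuation bytes have the bit pattern 10xxxxxx
 * and do not start a new character, so they are skipped.
 * Assumes the input is well-formed UTF-8.
 */
size_t utf8_length(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the UTF-8 string "caf\xc3\xa9" (café), strlen reports 5 while utf8_length reports 4. Code that assumes one byte per character would treat the two bytes of é as two separate characters.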

This example constructs a simple document, then adds content provided at the command line to the document's root element and outputs the results to stdout in the proper encoding. For this example, we use ISO-8859-1 encoding. The string given on the command line is converted from ISO-8859-1 to UTF-8. Full code: Appendix H, Code for Encoding Conversion Example

The conversion, encapsulated in the example code in the convert function, uses libxml's xmlFindCharEncodingHandler function:

    xmlCharEncodingHandlerPtr handler;               /* (1) */
    size = (int)strlen(in)+1;                        /* (2) */
    out_size = size*2-1;
    out = malloc((size_t)out_size);

    …
    handler = xmlFindCharEncodingHandler(encoding);  /* (3) */
    …
    handler->input(out, &out_size, in, &temp);       /* (4) */
    …
    xmlSaveFormatFileEnc("-", doc, encoding, 1);     /* (5) */

1

handler is declared as an xmlCharEncodingHandlerPtr, that is, a pointer to an xmlCharEncodingHandler structure, which holds the conversion functions for one encoding.

2

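The handler's conversion function needs to be given the sizes of the input and output buffers, which are calculated here for the strings in and out. Because each ISO-8859-1 byte becomes at most two bytes in UTF-8, an output buffer of twice the input length plus one terminating byte is sufficient.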

3

xmlFindCharEncodingHandler takes as its argument the name of the data's initial encoding and searches libxml's built-in set of conversion handlers, returning a pointer to the matching handler or NULL if none is found.

4

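The conversion function identified by handler, called here through its input member, takes as its arguments pointers to the output and input strings, along with pointers to the length of each. The lengths must be determined separately by the application. On return, the two length variables hold the number of bytes written and the number of bytes consumed.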

5

To output in a specified encoding rather than UTF-8, we use xmlSaveFormatFileEnc, specifying the encoding. The file name "-" directs the output to stdout.
