Encoding Conversion


Data encoding compatibility problems are one of the most common difficulties encountered by programmers new to XML in general and libxml in particular. Thinking through the design of your application in light of this issue will help avoid difficulties later. Internally, libxml stores and manipulates data in the UTF-8 format. Data used by your program in other formats, such as the commonly used ISO-8859-1 encoding, must be converted to UTF-8 before passing it to libxml functions. If you want your program's output in an encoding other than UTF-8, you also must convert it.

Libxml uses iconv, if it is available, to convert data. Without iconv, only UTF-8, UTF-16 and ISO-8859-1 can be used as external formats. With iconv, any format can be used provided iconv is able to convert it to and from UTF-8. Currently iconv supports about 150 different character formats, with the ability to convert from any to any. The actual number of supported formats varies between implementations, but nearly every iconv implementation supports the encodings in common use.
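For readers who have not used iconv directly, the following is a minimal sketch of a conversion through the POSIX iconv interface. It is not how libxml calls iconv internally; the helper name convert_with_iconv and the four-bytes-per-input-byte output estimate are assumptions made for illustration.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert the NUL-terminated string `in` from the
 * encoding `from` to the encoding `to`. Returns a malloc'd, NUL-terminated
 * buffer that the caller must free, or NULL on any failure (unknown
 * encoding, invalid input, or an output buffer that is too small). */
char *convert_with_iconv(const char *in, const char *from, const char *to)
{
    iconv_t cd;
    size_t in_left, out_cap, out_left, rc;
    char *out, *in_ptr, *out_ptr;

    cd = iconv_open(to, from);          /* note the order: target, then source */
    if (cd == (iconv_t)-1)
        return NULL;

    in_left = strlen(in);
    out_cap = in_left * 4 + 1;          /* assumed generous estimate for a UTF-8 target */
    out = malloc(out_cap);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    in_ptr = (char *)in;                /* iconv's prototype takes char ** */
    out_ptr = out;
    out_left = out_cap - 1;             /* keep one byte for the terminator */

    rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    if (rc != (size_t)-1)               /* flush any pending shift state */
        rc = iconv(cd, NULL, NULL, &out_ptr, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1) {
        free(out);
        return NULL;
    }
    *out_ptr = '\0';
    return out;
}
```

Calling convert_with_iconv("caf\xE9", "ISO-8859-1", "UTF-8") should yield the five-byte UTF-8 string for "café", provided the iconv implementation knows both encoding names. On some platforms iconv lives in a separate library and the program must be linked with -liconv.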

	Warning
	A common mistake is to use different formats for the internal data in different parts of one's code. The most common case is an application that assumes ISO-8859-1 to be the internal data format, combined with libxml, which assumes UTF-8 to be the internal data format. The result is an application that treats internal data differently, depending on which code section is executing. One part of the code or the other will then, naturally, misinterpret the data.
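The effect of mixed assumptions can be illustrated with a small sketch that counts the characters in the same bytes under each interpretation. The function names are hypothetical, and the UTF-8 counter assumes its input is well-formed UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Number of characters if every byte is one ISO-8859-1 character. */
size_t count_chars_as_latin1(const char *s)
{
    return strlen(s);
}

/* Number of characters if the bytes are well-formed UTF-8: count every
 * byte that is not a continuation byte (bit pattern 10xxxxxx). */
size_t count_chars_as_utf8(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the UTF-8 bytes of "café" (0x63 0x61 0x66 0xC3 0xA9), the first function reports five characters and the second reports four. Code that assumes ISO-8859-1 would see two stray characters where libxml sees a single "é".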

This example constructs a simple document, then adds content provided at the command line to the document's root element and outputs the results to stdout in the proper encoding. For this example, we use ISO-8859-1 encoding. The encoding of the string input at the command line is converted from ISO-8859-1 to UTF-8. Full code: Appendix H, Code for Encoding Conversion Example

The conversion, encapsulated in the example code in the convert function, uses libxml's xmlFindCharEncodingHandler function:

	xmlCharEncodingHandlerPtr handler;
	size = (int)strlen(in)+1;
	out_size = size*2-1;
	out = malloc((size_t)out_size);

…
	handler = xmlFindCharEncodingHandler(encoding);
…
	handler->input(out, &out_size, in, &temp);
…
	xmlSaveFormatFileEnc("-", doc, encoding, 1);

	`handler` is declared as a pointer to an `xmlCharEncodingHandler` structure, which holds the conversion functions for one encoding.
	The conversion function needs to be given the sizes of the input and output strings, which are calculated here for strings `in` and `out`. The output buffer allows two bytes per input byte because an ISO-8859-1 character occupies at most two bytes in UTF-8.
	`xmlFindCharEncodingHandler` takes as its argument the data's initial encoding and searches libxml's built-in set of conversion handlers, returning a pointer to the handler or NULL if none is found.
	The conversion function identified by `handler` requires as its arguments pointers to the input and output strings, along with pointers to the length of each. The lengths must be determined separately by the application.
	To output in a specified encoding rather than UTF-8, we use `xmlSaveFormatFileEnc`, specifying the encoding.
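The buffer sizing used above can be checked against a hand-written conversion. The following is a minimal sketch of an ISO-8859-1 to UTF-8 converter using only the C standard library; it is not libxml's implementation, and the name latin1_to_utf8 is hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert a NUL-terminated ISO-8859-1 string to
 * UTF-8. Returns a malloc'd buffer the caller must free, or NULL if
 * allocation fails. */
char *latin1_to_utf8(const char *in)
{
    size_t len = strlen(in);
    size_t i, j = 0;
    /* Worst case: every input byte becomes two output bytes, plus the NUL.
     * This is the same bound as out_size = size*2-1 with size = len+1. */
    unsigned char *out = malloc(len * 2 + 1);

    if (out == NULL)
        return NULL;

    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)in[i];

        if (c < 0x80) {
            out[j++] = c;                                   /* ASCII is unchanged */
        } else {
            out[j++] = (unsigned char)(0xC0 | (c >> 6));    /* lead byte */
            out[j++] = (unsigned char)(0x80 | (c & 0x3F));  /* continuation byte */
        }
    }
    out[j] = '\0';
    return (char *)out;
}
```

With the input "caf\xE9" (four bytes), the result is the five-byte string 0x63 0x61 0x66 0xC3 0xA9. Only the non-ASCII byte doubles in size, so the worst-case buffer is rarely filled.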
