This particular gotcha is a good example of a class of problems I call encoding mismatches, and it is an easy and rather nasty trap to fall into for a number of reasons:
- The Java API for XML processing lets you get away with it so effortlessly.
- Novice Java programmers learn to use Reader classes to read in textual data and, "since XML is textual data", this seems an obvious and helpful thing to do when reading in XML.
- English speakers don't usually notice anything is wrong until it's too late!
I have certainly made this mistake in the past myself, as have many people I've worked with over the years, so I think it is definitely something that every Java programmer who uses XML should at least think about, even if you're already doing things correctly.
Before we even look at XML though, we need to take a few steps back and briefly remind ourselves of some of the things we should know about when communicating textual data.
Unicode and encodings
Every programmer needs to know at least a little bit about encodings, which are algorithms specifying how textual data should be represented as binary data for storage and transmission. Java and XML both support the Unicode standard, which defines well over 100000 characters and symbols in use throughout the world. In order to communicate all of these characters digitally, they need to be packed into bytes and, with a single byte capable of representing only 256 possible characters, this is clearly not a trivial task. One arguably old-fashioned approach is to throw away most of Unicode and select a subset that can be mapped into a single byte. This gives us popular encodings such as the 128-character ASCII encoding and the various Latin encodings, which flesh ASCII out with characters used in various European countries. More flexible encodings that allow you to use all of the Unicode character range necessarily use more than one byte to represent certain characters. Perhaps the most common example of these encodings is UTF-8, which uses between 1 and 4 bytes to encode each Unicode character and is very efficient for representing English text as it encodes ASCII characters in exactly the same way as the ASCII and Latin encodings.
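To make the "between 1 and 4 bytes" point concrete, here is a minimal sketch that counts the bytes UTF-8 needs for a few sample characters. The class and method names (Utf8Lengths, utf8Length, sameBytesInAllThree) are made up for illustration, and the non-ASCII characters are written as \u escapes so that the source file's own encoding can't interfere.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of any library.
class Utf8Lengths {
    // Number of bytes UTF-8 uses to encode the given text.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if the text has identical bytes in ASCII, Latin-1 and UTF-8.
    static boolean sameBytesInAllThree(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(ascii, latin1) && Arrays.equals(latin1, utf8);
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // ASCII letter: 1
        System.out.println(utf8Length("\u00fc"));       // u with umlaut: 2
        System.out.println(utf8Length("\u20ac"));       // euro sign: 3
        System.out.println(utf8Length("\uD83D\uDE00")); // emoji outside the BMP: 4
        System.out.println(sameBytesInAllThree("Hello")); // true
    }
}
```

The last line shows why English text is so forgiving: plain ASCII text produces the same bytes in all three encodings.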
Reading encoded text data
In order to read in textual data represented as bytes, you need to know the encoding that was used so that it can be decoded correctly. This simple fact often surprises novice programmers, who all too easily rely on things like a "default" encoding for reading and writing textual data. Default encodings are very convenient if you are the sole producer and consumer of your data, but they are useless if you're using data you've obtained from someone or somewhere else. For this reason, most "protocols" that communicate textual data have a mechanism for telling you which encoding is being used so that you (or your software) can correctly decode it.
Encoding mismatches
If you read in textual data using the wrong encoding then you will find erroneous characters introduced into the decoded text and your users will start complaining about "funny symbols". You may also get an error report if you're lucky, depending on the API and the way the encoding algorithm works.
Noticing these mismatched encoding bugs can sometimes be harder than you imagine, especially when dealing with English text in the common encodings. As a Brit, the most common encodings I encounter are UTF-8 and Latin-1. Since ASCII characters are encoded into the same bytes by the ASCII, Latin and UTF-8 encodings, mismatches only become evident once non-ASCII characters such as accented European characters or mathematical symbols turn up. It's not uncommon for software to reach production without anyone having tried such characters, which leads to funny symbol reports from confused users. For example, a German ü character is encoded as 2 bytes in UTF-8. If these bytes are then (incorrectly) decoded using Latin-1, you'll end up with 2 (wrong) characters instead of the ü. Fun!
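The ü example can be reproduced in a few lines. This is only a sketch, with a made-up class and method name (MojibakeDemo, encodeUtf8DecodeLatin1); it encodes a string as UTF-8 and then deliberately decodes the bytes as Latin-1.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class MojibakeDemo {
    // Encode as UTF-8, then (incorrectly) decode the bytes as Latin-1.
    static String encodeUtf8DecodeLatin1(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String garbled = encodeUtf8DecodeLatin1("\u00fc"); // u with umlaut
        // The two UTF-8 bytes 0xC3 0xBC become two Latin-1 characters.
        System.out.println(garbled.length());                 // 2
        System.out.println(garbled.equals("\u00c3\u00bc"));   // true
        // Pure ASCII text survives the mismatch untouched.
        System.out.println(encodeUtf8DecodeLatin1("Hello"));  // Hello
    }
}
```

The two wrong characters are U+00C3 and U+00BC (an A with a tilde followed by a "one quarter" sign), which is the classic funny-symbol pair that German users tend to report.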
Reading in text using Java
The simplest, traditional way of reading in textual data in Java is to use a subclass of Reader. You can correctly and consistently read in a UTF-8-encoded text file with the rather verbose:
InputStreamReader reader = new InputStreamReader(new FileInputStream(new File("myfile.txt")), "UTF-8");
Less experienced programmers might opt for the shorter:
FileReader reader = new FileReader(new File("myfile.txt"));
The problem with this second form is that the encoding is not specified anywhere, so Java will use the "platform default" encoding, which may or may not be the correct one and will be specific to the computer the code is running on. (So, in particular, this form should never be used in "server-side" code.)
If you look at the java.io package in the Java API documentation, you'll see that many Reader constructors let you specify the encoding that should be used, whereas many others take no encoding and use the platform default. This can be OK if you're reading and writing out text files locally, but you should only use these default encodings if you are 100% sure that the default encoding is the correct one, otherwise the character data will be decoded wrongly. Also, the Reader classes don't report decoding errors by default (malformed bytes are silently replaced), so it's hard to detect when things go wrong.
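If you do want to hear about decoding errors, one option is to drop down to java.nio.charset and ask a CharsetDecoder to report problems rather than paper over them. The sketch below is illustrative only: the class and method names (StrictDecoding, decodeLeniently, decodeStrictly, isValidUtf8) are my own, and it decodes a byte array rather than a stream to keep things short.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class StrictDecoding {
    // Lenient decoding: malformed bytes are silently replaced with U+FFFD.
    static String decodeLeniently(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Strict decoding: malformed or unmappable input raises an exception.
    static String decodeStrictly(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Convenience check built on the strict decoder.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            decodeStrictly(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Latin-1 bytes for a word containing a u with umlaut.
        byte[] latin1 = "f\u00fcr".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decodeLeniently(latin1).contains("\uFFFD")); // true
        System.out.println(isValidUtf8(latin1));                         // false
        System.out.println(isValidUtf8("f\u00fcr".getBytes(StandardCharsets.UTF_8))); // true
    }
}
```

The same kind of decoder can be handed to the InputStreamReader constructor that takes a CharsetDecoder, in which case a bad byte surfaces as an IOException while reading instead of as a replacement character.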
The XML specification is clever here and allows you to specify the encoding within the (normally optional) XML declaration at the start of the file. If no encoding is declared, parsers assume UTF-8 (or UTF-16 when the file starts with a byte order mark). Here's an example:
<?xml version="1.0" encoding="ISO-8859-1"?>
This says that the encoded binary representation of this XML file uses the ISO-8859-1 (a.k.a. Latin-1) encoding.
When you tell an XML parser to parse a binary stream, it looks at the first few bytes to work out which encoding should be used. It then decodes the stream using this encoding and parses the resulting textual data, hopefully correctly. Your XML parser is actually doing a lot of work for you here, which you should be thankful for. You should also let it do this work, as it's much more likely to do it correctly than you are!
If, on the other hand, you decide to decode your XML first (e.g. using a Java Reader class), then you need to know the correct encoding in advance. You'll then be passing character data to your XML parser and it will correctly ignore the encoding specified in the XML declaration since you have already decoded the text. If you decoded using the wrong encoding, then funny symbols will no doubt ensue.
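Here is a rough sketch of the difference, using the DOM parser that ships with the JDK. The class and method names (XmlDecodingDemo, latin1Document, parseFromBytes, parseFromReader) are invented for this example. The same Latin-1 bytes are parsed twice: once as a raw stream, and once through a Reader that wrongly assumes UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of any library.
class XmlDecodingDemo {
    // A small document declared as, and really encoded in, ISO-8859-1.
    static byte[] latin1Document() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><word>f\u00fcr</word>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Good: the parser sees raw bytes and honours the XML declaration.
    static String parseFromBytes(byte[] xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(xml))
                .getDocumentElement().getTextContent();
    }

    // Risky: we decode first, so the parser ignores the declared encoding.
    static String parseFromReader(byte[] xml, Charset assumed) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Reader reader = new InputStreamReader(new ByteArrayInputStream(xml), assumed);
        return builder.parse(new InputSource(reader))
                .getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = latin1Document();
        System.out.println(parseFromBytes(xml).equals("f\u00fcr"));  // true
        // Wrong guess: the text no longer matches the original word.
        System.out.println(
                parseFromReader(xml, StandardCharsets.UTF_8).equals("f\u00fcr")); // false
    }
}
```

Note that the second call doesn't fail. It quietly hands back damaged text, which is exactly what makes this mistake so easy to miss.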
Reading XML with Java
The Java XML APIs generally come with a number of overloaded methods for parsing, transforming and doing other exciting things to XML sources. Based on what you've read so far, you'll now generally know to avoid using ones that take Reader, favouring InputStream or File instead.
Here's an example of parsing a File with a SAX Parser:
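import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;
public static void parseXMLGood(File file, DefaultHandler handler) throws Exception {
    SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
    saxParser.parse(file, handler); // the parser works out the encoding from the raw bytes
}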
This is actually nice and simple in this case since the API helpfully provides a parse() method taking a File. In other cases, you might need to obtain a FileInputStream first.
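As an illustration of one of those other cases, the StAX factory method used below accepts an InputStream but has no File overload, so you open the stream yourself. Treat this as a sketch under assumptions: the class and method names (ParseFromStream, rootElementText) are hypothetical, and it assumes the root element contains nothing but text.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class, not part of any library.
class ParseFromStream {
    // Returns the text of the root element, assuming it contains only text.
    static String rootElementText(File file) throws Exception {
        // Hand the parser raw bytes; it picks the encoding from the stream itself.
        try (InputStream in = new FileInputStream(file)) {
            XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(in);
            try {
                xml.nextTag();               // advance to the root start tag
                return xml.getElementText(); // read its text content
            } finally {
                xml.close();                 // does not close the underlying stream
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(rootElementText(new File(args[0])));
    }
}
```

The try-with-resources block matters here because closing the XMLStreamReader does not close the FileInputStream underneath it.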
Of course, as with all "rules", there are valid cases for breaking them. For example, if you've built up some XML programmatically as a big String, then using a StringReader is of course the right approach.
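For completeness, here is a minimal sketch of that String case using the same SAX API as above. The class and method names (ParseXmlString, parseXmlString) are made up for illustration. Since a String is already character data, there are no bytes left to decode and no encoding to get wrong.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of any library.
class ParseXmlString {
    // Parses XML that already exists as characters in memory.
    static void parseXmlString(String xml, DefaultHandler handler) throws Exception {
        SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
        saxParser.parse(new InputSource(new StringReader(xml)), handler);
    }

    public static void main(String[] args) throws Exception {
        final StringBuilder text = new StringBuilder();
        parseXmlString("<word>f\u00fcr</word>", new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length); // collect element text
            }
        });
        System.out.println(text.toString().equals("f\u00fcr")); // true
    }
}
```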
Conclusion... and moral of the story
Unless you have reason to do otherwise - and know what you're doing - you should always:
- Pass raw binary streams (e.g. InputStream, File) to your XML parser
- Let your XML parser do the decoding for you
- Only use the Reader constructors that specify an explicit encoding