With the widespread adoption of internet-connected devices, today's application developers must ensure that their software is accessible to users across the world if they hope to achieve the highest possible adoption rates. This is especially true if your application reads data from outside sources (e.g. files, user input, APIs), as it likely will not be long before your program encounters text that includes foreign-language characters and symbols. If not handled gracefully, non-English characters can cause all sorts of headaches for developers and, perhaps most importantly, lead to application downtime.
The computing industry’s current de facto standard for handling and representing characters and symbols from all of the major languages (and some not so major) is Unicode. In Unicode, characters are mapped one-to-one to numeric values known as Unicode code points, which are conventionally written in hexadecimal. For example, the capital letter “A” is represented by the code point U+0041, where U+ indicates Unicode and 0041 is the code point’s hexadecimal value. A code point’s hexadecimal value should not be confused with the byte values used to represent the Unicode character in computer memory. Rather, the code point’s hexadecimal value is just a numerical way to uniquely identify a specific Unicode character. The byte string used to represent a Unicode character in memory is solely dependent upon the chosen character encoding.
In order to understand the purpose of character encodings, it is important to remember that computers store bytes, not characters. When computer programs read bytes from memory, they must know how to interpret them. Character encodings were developed as a way to represent text characters as bytes on a computer, and conversely, to read and interpret bytes from a computer and display them as text on a screen. The process of converting a string of characters into bytes is referred to as encoding, while converting a string of bytes back into characters is referred to as decoding. A defined method of encoding and decoding is collectively referred to as a codec. There are a variety of different encodings available to use, each with its own advantages and disadvantages. The most widely used codec on the internet is UTF-8.
Now let's dive into some code! The code samples below all use Python, as it is the programming language I use most from day to day. Furthermore, the examples focus on Python 2.7, as changes were made in Python 3 that make dealing with Unicode characters much more intuitive.
Probably the best place to start is to outline the built-in data types Python provides for representing character strings. In Python 2, the two built-in types are str and unicode. A str object is created with single or double quotes, pretty standard stuff.
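For example (a minimal sketch; the variable names here are just illustrative):

```python
# In Python 2, plain quoted literals create str (byte string) objects.
s = 'Hello, world!'
t = "Hello, world!"  # single and double quotes are interchangeable

assert s == t
```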
A unicode object is created by placing a lowercase “u” in front of single or double quotes, with Unicode codepoints being stored between the quotation marks. This syntax is shorthand for the unicode object constructor. Codepoints are written in Python with a backslash escape followed by the codepoint’s hexadecimal value: a lowercase “\u” is used for four-digit values, while an uppercase “\U” is used for eight-digit values. Each escape requires exactly four or exactly eight hexadecimal digits, so shorter values should be left-padded with zeros to meet that length. Lastly, ASCII characters may be stored in a Python unicode object with their character literals, and therefore do not need to be converted to their codepoint representations.
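A quick sketch of all three forms, which produce the same character:

```python
# u'' creates a unicode object; \u takes exactly four hex digits,
# \U takes exactly eight (left-pad with zeros as needed).
a = u'\u0041'           # capital letter "A" (U+0041)
also_a = u'\U00000041'  # the same character, eight-digit form
literal_a = u'A'        # ASCII characters can be written literally

assert a == also_a == literal_a
```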
And of course, you can store multiple Unicode characters in the same unicode object.
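For instance, mixing escapes and literal ASCII in one object:

```python
# Several codepoints (and literal ASCII characters) in one unicode object.
word = u'\u0048\u0065\u006C\u006C\u006F'  # "Hello"
mixed = u'Hello, \u4E16\u754C'            # "Hello, 世界"

assert word == u'Hello'
assert len(mixed) == 9  # seven ASCII characters plus two CJK codepoints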
The biggest distinction between Python’s str and unicode objects is that str objects store bytes while unicode objects store codepoints. Although it is convenient to use Unicode codepoints for processing data within our programs, all computer I/O is done with bytes. Therefore, you should use Python str objects to read and write data into and out of your program, but use Python unicode objects while processing data within your program. The encode and decode methods are used to convert between str and unicode objects.
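The distinction is easy to see by comparing lengths (a small sketch):

```python
# unicode objects count codepoints; encoded str objects count bytes.
e_acute = u'\u00e9'                   # é: a single codepoint
utf8_bytes = e_acute.encode('utf-8')  # the same character as UTF-8 bytes

assert len(e_acute) == 1     # one codepoint
assert len(utf8_bytes) == 2  # two bytes in UTF-8 (0xC3 0xA9)
```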
Encode is a method called on a unicode object that will convert it into a str object, using the codec provided as a string argument. In other words, encode converts codepoints to bytes.
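For example, encoding the same character with two different codecs yields two different byte strings (the codec names are Python's standard codec identifiers):

```python
# encode: unicode -> str, turning codepoints into bytes via the named codec.
snowman = u'\u2603'  # SNOWMAN (U+2603)

assert snowman.encode('utf-8') == b'\xe2\x98\x83'  # three bytes in UTF-8
assert snowman.encode('utf-16-be') == b'\x26\x03'  # two bytes in UTF-16-BE
```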
Conversely, decode is a method called on a str object (i.e. a byte string) that converts it into a unicode object. The decode method takes one argument, which is the codec that will be used to decode the string. Therefore, in order to properly decode the string you must know the codec that was used to encode the string!
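A minimal round trip in the other direction:

```python
# decode: str -> unicode, turning bytes back into codepoints using the
# codec the bytes were originally encoded with.
data = b'\xc3\xa9'  # the UTF-8 byte string for é

assert data.decode('utf-8') == u'\u00e9'
```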
If you decode a byte string with a codec different from the one used to encode it, you will get back garbled characters, or an outright UnicodeDecodeError. This is why when you are reading in data (i.e. bytes) from an outside source (e.g. a file, web socket) you must always know the encoding!
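A sketch of both failure modes, using UTF-8 bytes misread with other codecs:

```python
# The same two bytes, decoded with the right and the wrong codec.
data = u'\u00e9'.encode('utf-8')  # b'\xc3\xa9'

assert data.decode('utf-8') == u'\u00e9'          # correct: é
assert data.decode('latin-1') == u'\u00c3\u00a9'  # mojibake: Ã©

# Some byte sequences are simply invalid in a given codec:
try:
    data.decode('ascii')
except UnicodeDecodeError:
    pass  # raised, because these bytes are not valid ASCII
```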
- Read data into str objects and decode them into unicode objects as soon as possible
- You must know the source encoding in order to properly decode data
- Before writing data, encode with the desired codec
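This pattern is sometimes called the “unicode sandwich”: decode at the boundary, process as unicode, encode on the way out. A minimal sketch, assuming UTF-8 on both ends:

```python
# Decode at the boundary, work in unicode, encode just before writing.
raw = b'caf\xc3\xa9'         # bytes as read from an outside source
text = raw.decode('utf-8')   # 1. decode as soon as possible
text = text.upper()          # 2. process as a unicode object
out = text.encode('utf-8')   # 3. encode with the desired codec before writing

assert text == u'CAF\u00c9'  # CAFÉ
assert out == b'CAF\xc3\x89'
```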
Originally published on May 09, 2016
I hope my first blog post was helpful! Be sure to reach out if you have any questions.