Encodings
Unicode 6.0.0
Unicode 6.0.0 was released on October 11, 2010. Mainstream journalists didn’t take much notice, if the results of a search for "Unicode Consortium" in the Google News Archive are any indication. There was a bit of an exception in India: "Typing the Rupee symbol set to get easier". Since a significant number of the items retrieved were in languages that I don’t understand, and since I didn’t search at all for translations of "Unicode Consortium", I can’t say for certain what other exceptions there may have been. The Consortium itself noted the following highlights:
- over 1,000 additional symbols—chief among them the additional emoji symbols, which are especially important for mobile phones
- the new official Indian currency symbol: the Indian Rupee Sign
- 222 additional CJK Unified Ideographs in common use in China, Taiwan, and Japan
- 603 additional characters for African language support, including extensions to the Tifinagh, Ethiopic, and Bamum scripts
- three additional scripts: Mandaic, Batak, and Brahmi
It may be, 19 years after the release of version 1.0.0, that Unicode is getting a little bit long in the tooth: "History of Unicode Release and Publication Dates". Not much growing left to do. Still, the Unicode story has been an important one, and there appear to be no serious challenges to Unicode’s current dominance.
The character code for the new Rupee symbol is "U+20B9". It’s part of the "Currency Symbols" block, 20A0-20CF, as one might expect. For historical reasons, our dollar sign isn’t part of that block. Instead, our dollar sign is part of the "C0 Controls and Basic Latin" block: 0000-007F. Its character code is "U+0024".
20B9 and 0024 are hexadecimal (base-16) numbers. 20B9 is 2×16³ + 0×16² + 11×16¹ + 9×16⁰, or 8377, in decimal (base-10) notation. 24 in hexadecimal translates to 36 in decimal, and 100100, or 1×2⁵ + 0×2⁴ + 0×2³ + 1×2² + 0×2¹ + 0×2⁰, in binary (base-2). (There is a handy "Hex To Decimal and Binary Converter" on easycalculation.com.)
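If you would rather let a computer do the arithmetic, Python will oblige. A minimal sketch, with the values worked out above noted in the comments:

    print(int("20B9", 16))     # 8377: the decimal value of hexadecimal 20B9
    print(int("24", 16))       # 36: the decimal value of hexadecimal 24
    print(bin(0x24))           # 0b100100: the same number in binary
    print("\u20B9", "\u0024")  # ₹ $ : the characters at those code points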
ASCII
The "C0 Controls and Basic Latin" block of Unicode is said to be "backwards-compatible" with ASCII: American Standard Code for Information Interchange, ANSI X3.4-1986, American National Standards Institute, Inc., March 26, 1986. ASCII was a 7-bit code. With 7 bits, you get just 2⁷, or 128, codes: 0 to 7F (hexadecimal), 0 to 127 (decimal), and 0 to 1111111 (binary). That was fine for encoding U.S. English, but there was no room for the accented characters of other languages, like Canadian French.
The first edition of ASCII was published June 17, 1963 by the American Standards Association. In 1964, IBM introduced the System/360 mainframe. It was a very successful product. The System/360 line didn’t use ASCII though. Instead, it used an 8-bit code, defined on Code Page 37, and known as "Extended Binary Coded Decimal Interchange Code" or "EBCDIC". (See the information document and code page.) Since you can have twice as many codes with 8 bits as you can have with 7 bits, the inclusion of additional codes in Code Page 37 made it adequate for the representation of most Western European languages. Variants of EBCDIC were created to serve other markets. (See IBM’s "Code page identifiers.")
ASCII and EBCDIC were incompatible, though of course it was possible to translate between them. The survival of ASCII was assured, however, when the U.S. government adopted it, in 1968, as Federal Information Processing Standard 1 (FIPS-1). [See Martha M. Gray, "Code for Information Interchange: ASCII".] The Digital Equipment Corporation was a major competitor of IBM in the 1960s and 1970s, especially in the minicomputer market. The highly influential C programming language and Unix operating system were developed on DEC's PDP line, and later carried over to the VAX. Unlike IBM with EBCDIC, DEC used ASCII for the 0 to 7F range of its Multinational Character Set. The additional characters required for internationalization all went into the 80 to FF range provided by an eighth bit.
This proved to be the more influential approach in the long run. Microsoft's DOS followed it with the 8-bit Code Page 437 on the first IBM PCs, though its assignments in the upper range were different. (See the information document and code page.) So did Apple, with Mac OS Roman. Unix and Unix-like operating systems, such as GNU/Linux and Mac OS X, naturally followed it as well.
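Python still ships codecs for several of these code pages, which makes the incompatibilities easy to see. In this sketch, cp037 is IBM's EBCDIC Code Page 37, cp437 is the code page of the first IBM PCs, and mac_roman is Apple's. Even the plain letter "A" gets a different byte under EBCDIC, and "é" lands on a different byte in each:

    for codec in ("cp037", "cp437", "mac_roman"):
        print(codec, "A".encode(codec).hex(), "é".encode(codec).hex())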
International Standardization
The European Computer Manufacturers Association produced the first edition of ECMA-6 in 1965. It was a standard for 7-bit encodings, and was nearly identical to ASCII. The International Organization for Standardization's ISO/IEC-646, the first edition of which was published in 1971, was nearly identical to ECMA-6.
An 8-bit encoding standard, ECMA-94, was published in its first edition in March 1985. It was very similar to DEC’s Multinational Character Set, with variants suited to different language groups. The different parts of ISO-8859, which began to be published in 1987, were nearly identical to the corresponding ECMA standards. 15 of the 16 parts originally projected for ISO 8859 were eventually published. Part 11, for example, ISO/IEC-8859-11, provided an 8-bit encoding of a Latin/Thai alphabet. Part 6, ISO/IEC-8859-6, provided an 8-bit encoding of a Latin/Arabic alphabet. Part 1, ISO/IEC-8859-1, provided an 8-bit encoding for most Western European languages, and was popularly known as "Latin-1".
Microsoft mostly adopted ISO/IEC-8859-1. Its Windows-1252 code page was nearly identical to it.
The "C1 Controls and Latin Supplement" block, 0080-00FF, of Unicode are backwards-compatible with ISO/IEC-8859-1.
Unicode
The first draft proposal for Unicode was "Unicode 88" by Joseph D. Becker of Xerox, published in 1988. The basic idea of Unicode was to put an end to the use of local variants, i.e., the one-to-many mappings of code points to characters. Unicode was to be all-inclusive. Making Unicode backwards compatible with ISO/IEC-8859-1 compromised the principle of one-to-one mappings somewhat, but this was probably a price that had to be paid for adoption.
The Unicode Consortium and the ISO/IEC have worked closely together almost from the beginning. The encoded characters of The Unicode Standard, Version 1.1 and ISO/IEC-10646-1:1993 were the same. (See "Appendix C: Relationship to ISO/IEC 10646".)
Obviously, more than 8 bits were going to be needed. Becker’s initial proposal was for 16 bits per character, i.e. two octets (8-bit bytes), or four hexadecimal digits. That allows for 2¹⁶ or 65,536 characters.
In version 2.0.0 (July 1996), it was accepted that obsolete characters should also be encoded. This meant that 65,536 would not be enough. The first 65,536 characters are now said to be in the "Basic Multilingual Plane" (BMP), or Plane 0. The standard allows for an additional 16 planes, or 1,114,112 characters in total. It is hoped this will be more than enough. Plane 1, the "Supplementary Multilingual Plane" (SMP), contains things like Linear B and Egyptian hieroglyphs. Plane 2, the "Supplementary Ideographic Plane" (SIP), contains the ideographs required for Han unification. There are no definite plans for most of the other planes.
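The arithmetic, and a character from Plane 1, in a short Python sketch (U+10000, LINEAR B SYLLABLE B008 A, is the first code point of the SMP):

    print(17 * 2**16)         # 1114112 code points: 17 planes of 65,536 each
    print(hex(0x10FFFF + 1))  # 0x110000 -- one past the last code point, the same number
    linear_b = "\U00010000"   # LINEAR B SYLLABLE B008 A
    print(linear_b, ord(linear_b) > 0xFFFF)  # True: it lies beyond the BMP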
Does this mean that Unicode now uses 24 or 32 bits per character? It turns out that, for mundane, practical reasons, the answer is complicated. So first, a bit of background.
Input and Output
It’s a good thing to have a system for encoding character data. We can’t just look at a USB stick, though, and read it like a book, whether the data on it is Unicode or otherwise. That’s why there are monitors, printers and the like. But there are practical challenges involved in making and using such things. The people who design actual hardware and software will inevitably encounter constraints and make compromises.
It’s nice to know, for example, that the Phoenician letter ALF is encoded as U+10900, but can you find a screen or printer font for it? If there were a font with a representation of every Unicode character, would your smartphone have enough memory to store it?
Similarly, we need practical ways to put encoded character data into computer memory. That’s why we have keyboards, mice, touch-screens, and so on. Most of us have had the experience of texting with a cellphone, and understand that there’s a price to be paid for compactness. Similarly, most of us in English Canada have had to figure out how to type an "é" using a keyboard that doesn’t have that key. You may once have used DOS alt codes to input such characters. If that brings back fond memories, no doubt you’ll also enjoy the Wikipedia pages on "Unicode input" and "Chinese input methods for computers".
We humans have an easier time working with decimal or hexadecimal digits than we do with binary ones. We work better with alphabets and even ideographs than we do with their hexadecimal encodings. The designers of input/output devices have done what they could to deal with this reality. Each change in technology, however, results in a new set of compromises, both for manufacturers and consumers. The old Hollerith cards, for example, allowed for 80 characters per card, one in each column. The popular IBM model 029 keypunch used the 12 rows in each column to encode just 64 distinct characters: the 10 digits, 28 punctuation marks and symbols, and the 26 capital letters. (See Douglas W. Jones, "Punched Card Codes".) It was always possible to punch additional codes by hand. People, however, would usually just put up with the all-caps data that was relatively easy to input with their keypunches. That’s just how we are. But we were also happy to abandon the keypunch machines when terminal input became possible.
Text and Terminals
The idea of "text" (contrasted with "binary") has developed over time, but owes much to the characters that could be typed and displayed on early terminals. With terminals came the shells and the text editors. There was no such thing as a graphical user interface (GUI) in the early days. Instead, there was a text user interface or command-line interface between machine and human.
Teletype was a trade name, and the TTY or teleprinter predated the general-purpose computer. The ASR-33 Teletype, introduced in 1963, was the first to use the ASCII character set. There was a shift key, but only the capital letters were printed. It was the same with early dot matrix printers, like Digital Equipment Corporation’s DECwriter LA30, introduced in 1970. (ASCII art flourished, notwithstanding.) Daisy wheel printers later became available for fine printing, but there were still just 96 different characters on a wheel.
So too with early video display terminals. DEC’s VT05 display terminal, for example, also introduced in 1970, could only display upper case. The first model to display lower case letters was the VT52, introduced in 1975. 1983’s VT220 was the first to display the Multinational Character Set mentioned earlier.
On teletypes, text editing was necessarily line oriented. An early and influential line editor was QED (Quick Editor), written for the release, in 1966, of the first "on-line system", the SDS 940. For the early video display terminals, there were visual editors. Another early and influential editor, TECO (Text Editor and Corrector), bridged the gap. It was developed at MIT in 1963, then used with a PDP-6 and video display at a demonstration in 1964. The original Emacs was a set of "Editor MACroS" for TECO. Even today, if your favourite text editor won’t do the job, GNU Emacs probably will. (This column was written with gedit on an Ubuntu system. Gedit is also available for Windows and Mac OS X.)
The difference between a text editor and a word processor isn’t quite as easy to define as it used to be. The word processing file will still be much larger than a corresponding text file. That’s because the common word processors (Microsoft Word, OpenOffice, WordPerfect and iWork) are necessarily concerned with the styling of a document as well as the bare text. Historically, if you opened up a word processing file in a text editor, there would be strangeness, because of all the non-ASCII bytes in the file. Today, however, you’re as likely to see ordinary text with XML markup. That’s an impact that the web has had.
Serialization and the Internet
It isn’t just text that gets encoded. In the history of telecommunications, voice came before data. When people decided that they wanted to transmit data over phone lines, they needed an encoding for sound waves. What they came up with was pulse code modulation (PCM):
In telephony, a standard audio signal for a single phone call is encoded as 8,000 analog samples per second, of 8 bits each, giving a 64 kbit/s digital signal known as DS0.
It was noted earlier that IBM’s System/360 used an 8-bit character code; the System/360 also did much to popularize the 8-bit byte. Not surprisingly, it used an 8-bit bus too. Such a bus allowed bits to be transmitted a byte at a time, in parallel, rather than one by one, or serially.
Older readers may recall when computers would have parallel ports for printers. Now, we mostly have Universal Serial Bus (USB) ports instead. Serial communications are now fast enough that we don’t notice the difference in speed.
The original version of the RS-232 standard for serial connectors was issued in 1962 by the EIA (Electronic Industries Association, now Alliance). Cables had male and female D-shaped heads: one row of 13 pins (or sockets), and another row of 12. Of the 25 pins, only two were for data: one for sending, the other for receiving. It was the job of a universal asynchronous receiver/transmitter (UART) to translate serial to parallel and vice versa, i.e. to know how many bits made a byte.
By 1969, most computers and networks were using 8-bit bytes (or octets). Thus it happened that when the internet that the dinosaurs used was being developed, Vint Cerf wrote, in "ASCII format for Network Interchange" (RFC20): "For concreteness, we suggest the use of standard 7-bit ASCII embedded in an 8 bit byte whose high order bit is always 0."
In August 1982, RFC821 defined the "Simple Mail Transfer Protocol". (The current version, RFC5321, dates from October 2008.) RFC822 defined the "Standard for the Format of ARPA Internet Text Messages". Section 3.1 said:
A message consists of header fields and, optionally, a body. The body is simply a sequence of lines containing ASCII characters. It is separated from the headers by a null line (i.e., a line with nothing preceding the CRLF).
Section 2.1 of the current (2008) standard (RFC5322) says pretty much the same thing:
At the most basic level, a message is a series of characters. A message that is conformant with this specification is composed of characters with values in the range of 1 through 127 and interpreted as US-ASCII [ANSI.X3-4.1986] characters. …
As might be expected, the requirement that all e-mail be sent in 7-bit ASCII came to be seen as a constraint. One workaround appeared in 1992: "MIME (Multipurpose Internet Mail Extensions)" (RFC1341, later replaced by RFC1521, and subsequently by RFC2045). The introduction explained the problem:
RFC 822 was intended to specify a format for text messages. As such, non-text messages, such as multimedia messages that might include audio or images, are simply not mentioned. Even in the case of text, however, RFC 822 is inadequate for the needs of mail users whose languages require the use of character sets richer than US ASCII [US-ASCII]. Since RFC 822 does not specify mechanisms for mail containing audio, video, Asian language text, or even text in most European languages, additional specifications are needed.
RFC 2045 (November 1996) defines two procedures by which files containing 8-bit extended-ASCII text and binary data, respectively, can be encoded as 7-bit ASCII text: "quoted-printable" and "base64". Section 6.7 says:
The Quoted-Printable encoding is intended to represent data that largely consists of octets that correspond to printable characters in the US-ASCII character set. … Any octet, except a CR or LF that is part of a CRLF line break of the canonical (standard) form of the data being encoded, may be represented by an “=” followed by a two digit hexadecimal representation of the octet’s value. … Octets with decimal values of 33 through 60 inclusive, and 62 through 126, inclusive, MAY be represented as the US-ASCII characters which correspond to those octets …
The characters referred to are the ordinary numerals and letters of the English alphabet, and some of the punctuation marks. Section 6.8 says:
The Base64 Content-Transfer-Encoding is designed to represent arbitrary sequences of octets in a form that need not be humanly readable. … The encoding process represents 24-bit groups of input bits as output strings of 4 encoded characters. Proceeding from left to right, a 24-bit input group is formed by concatenating 3 8bit input groups. These 24 bits are then treated as 4 concatenated 6-bit groups, each of which is translated into a single digit in the base64 alphabet. [There is a table] … The encoded output stream must be represented in lines of no more than 76 characters each. All line breaks or other characters not found in Table 1 must be ignored by decoding software.
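Python's standard library happens to include both encodings, which makes them easy to try out. A sketch; the sample text is UTF-8, so each accented character occupies two octets, and they show up in quoted-printable as pairs of =XX escapes:

    import base64, quopri

    text = "Montréal, résumé".encode("utf-8")    # some text with 8-bit octets in it

    qp = quopri.encodestring(text)
    print(qp)                                    # pure ASCII: each é appears as =C3=A9
    print(quopri.decodestring(qp).decode("utf-8"))

    b64 = base64.encodebytes(text)
    print(b64)                                   # pure ASCII: 4 output characters per 3 input octets
    print(base64.decodebytes(b64).decode("utf-8"))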
For some reason, "lines of no more than 76 characters" makes me think of Hollerith cards.
The other thing that RFC2045 does is to specify "that Content Types, Content Subtypes, Character Sets, Access Types, and conversion values for MIME mail will be assigned and listed by the IANA." IANA is the Internet Assigned Numbers Authority, which maintains a registry of MIME types. One such content type is "text", the subtypes of which include "plain", "html" and "xml". IANA’s list of character sets includes "US-ASCII", "ISO-8859-1" and "UTF-8".
The better approach, at least for the longer term, was simply to move to an 8-bit-clean transport. Since it was important for older mail transfer agents to continue to be able to exchange mail with the newer ones, this was done, in July 1994, by way of an extension: "SMTP Service Extension for 8bit-MIMEtransport" (RFC1652). One MTA asks another if it speaks 8BITMIME. If both do, no quoted-printable or base64 encoding is needed. If one or the other doesn’t, then a 7-bit encoding takes place.
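You can watch the negotiation from Python's smtplib, which knows how to ask. A sketch only; "mail.example.com" is a placeholder, not a real server:

    import smtplib

    # EHLO asks the server to list its extensions; has_extn() checks the reply.
    with smtplib.SMTP("mail.example.com") as smtp:
        smtp.ehlo()
        print(smtp.has_extn("8bitmime"))   # True if the server advertises 8BITMIME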
The World Wide Web
The foundation of the web is the "Hypertext Transfer Protocol" (HTTP). The first version was 0.9 (1991):
The client sends a document request consisting of a line of ASCII characters terminated by a CR LF (carriage return, line feed) pair. A well-behaved server will not require the carriage return character. …
The response to a simple GET request is a message in hypertext mark-up language (HTML). This is a byte stream of ASCII characters.
Lines shall be delimited by an optional carriage return followed by a mandatory line feed character. The client should not assume that the carriage return will be present. Lines may be of any length. Well-behaved servers should restrict line length to 80 characters excluding the CR LF pair.
The current standard (HTTP 1.1) is defined by RFC2616 (June 1999). Section 1.1 provides a bit of history:
The first version of HTTP, referred to as HTTP/0.9, was a simple protocol for raw data transfer across the Internet. HTTP/1.0, as defined by RFC 1945, improved the protocol by allowing messages to be in the format of MIME-like messages, containing metainformation about the data transferred and modifiers on the request/response semantics. … Messages are passed in a format similar to that used by Internet mail as defined by the Multipurpose Internet Mail Extensions (MIME).
Section 1.4 describes the overall operation:
The HTTP protocol is a request/response protocol. A client sends a request to the server in the form of a request method, URI, and protocol version, followed by a MIME-like message containing request modifiers, client information, and possible body content over a connection with a server. The server responds with a status line, including the message’s protocol version and a success or error code, followed by a MIME-like message containing server information, entity metainformation, and possible entity-body content. The relationship between HTTP and MIME is described in appendix 19.4.
Appendix section 19.4.4 says:
HTTP does not use the Content-Transfer-Encoding (CTE) field of MIME. Proxies and gateways from MIME-compliant protocols to HTTP MUST remove any non-identity CTE ("quoted-printable" or "base64") encoding prior to delivering the response message to an HTTP client.
Instead, as Section 3.6 notes, HTTP uses transfer codings:
Transfer codings are analogous to the Content-Transfer-Encoding values of MIME, which were designed to enable safe transport of binary data over a 7-bit transport service. However, safe transport has a different focus for an 8bit-clean transfer protocol. In HTTP, the only unsafe characteristic of message-bodies is the difficulty in determining the exact body length (section 7.2.2), or the desire to encrypt data over a shared transport.
So the web is always "8-bit-clean", unlike e-mail.
Finally, section 3.7.1 deals with text defaults:
When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value.
Since ISO-8859-1 is an 8-bit encoding, this only makes sense.
Escapes and Percent Encoding
Two characters have special uses in XML: & (ampersand) and < (less than). To use these characters in a non-special way, we need instead to write "&amp;" and "&lt;" respectively (named); "&#38;" and "&#60;" respectively (decimal); or "&#x26;" and "&#x3C;" respectively (hexadecimal). There are similar problems sometimes with > (greater than), ' (apostrophe) and " (quote), as indicated in section 2.4 of the specification. Comparable restrictions on the contents of elements are set out in section 8.1.2 of the HTML5 specification; escapes and character references are discussed in section 8.1.4.
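Most languages' standard libraries will do the escaping for you; in Python, for example, a quick sketch:

    import html
    from xml.sax.saxutils import escape

    s = 'Fish & Chips < $5 for "regulars"'
    print(escape(s))                   # the XML treatment: & and < become &amp; and &lt;
    print(html.escape(s, quote=True))  # the HTML treatment also escapes the quotation marks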
Another odd little thing that people who make web pages should know about is percent encoding in URIs. The details are set out in section 2.1 of RFC3986: "Uniform Resource Identifier (URI): Generic Syntax" (January 2005). A whole range of characters have special meanings in a URI. That means that if you want them to have their ordinary meaning within a URI, they must be encoded. For example, the general anchor format is <a href="URI"></a>. The double quote signals the end. If you need it in a URI for some other purpose, it can be encoded as "%22", 22 being the two hex digits representing the double quotation mark.
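Python's urllib.parse shows the mechanics; a sketch:

    from urllib.parse import quote, unquote

    print(quote('a "quoted" path'))   # a%20%22quoted%22%20path -- space is %20, " is %22
    print(unquote("%22"))             # back to the plain double quotation mark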
Unicode Transformation Formats
I wrote above that the answer to the question of whether Unicode now uses 24 or 32 bits per character is complicated. Attentive readers, recalling the struggle to get even as far as 8-bit-clean in SMTP and HTTP, may already have guessed that the magic number is actually 8, as in octet.
Section 2.5 of Unicode 6.0.0 deals generally with "Encoding Forms". Section 3.9 provides the definitions. There are three. Listed in order of importance, they are UTF-8, UTF-16, and UTF-32. Figure 2.11 of Unicode 6.0.0 provides a nice illustration of the differences between them.
UTF-32, with 32 bits, is the easiest to understand. Anyone who wants to use it though is going to face challenges. For example, the HTML5 specification says bluntly:
Authors should not use UTF-32, as the encoding detection algorithms described in this specification intentionally do not distinguish it from UTF-16.
Instead:
Authors are encouraged to use UTF-8. … Using non-UTF-8 encodings can have unexpected results on form submission and URL encodings, which use the document’s character encoding by default.
RFC2277 sets out the "IETF Policy on Character Sets and Languages" (January 1998). Section 3.1 says: "Protocols MUST be able to use the UTF-8 charset."
UTF-8 uses 1, 2, 3 or 4 octets, depending on the range within which the character’s code point falls. RFC3629 is "UTF-8, a transformation format of ISO 10646" (November 2003). I can’t improve upon its description of the UTF-8 encoding:
Char. number range (hexadecimal) | UTF-8 octet sequence (binary)
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
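Or, to watch it happen, a short Python sketch using characters that have already come up in this column, each printed with its code point, its UTF-8 octets in hex, and the octet count:

    for ch in ("$", "é", "₹", "\U00010900"):   # U+10900 is the Phoenician letter ALF
        octets = ch.encode("utf-8")
        print(f"U+{ord(ch):04X}", octets.hex(), len(octets), "octet(s)")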
Just for fun, I’ve put the four characters from Figure 2.11 above in a little web page called slawjunk201101.html. Check it and find out how many of the four characters your browser has fonts for. If you save that little page to your own computer, then upload it to en.Webhex.net (a hex editor in the cloud), you can confirm that the UTF-8 encodings are as described.
In the early days of the web, HTML had to be ASCII. If you wanted a double dagger on your page, you would have to use a string of ASCII characters, such as "&Dagger;", "&#8225;" or "&#x2021;". Now, however, you have the simpler option of just defining your encoding as UTF-8, and inserting the character like so: ‡. (It’s more often things like WordPress that make my life complicated now. Have a look at the source for this paragraph to see why.)
UTF-8 is the clear choice for those using Western European languages like English, since the most commonly used characters require only one octet.
UTF-16, which uses either two bytes (16 bits) or four bytes (32 bits) per character, is sometimes the better choice for files with a lot of CJK content. It is important, however, not to confuse UTF-16 with UCS-2. UCS-2 was the 2-byte (i.e. 16-bit) implementation of the universal character set used from 1991 to 1995 (i.e. up to Unicode 1.1). For the reasons noted above, it could only represent the BMP; unlike UTF-16, it has no way of reaching the supplementary planes. UCS-2 is, in fact, obsolete and, as the Unicode FAQ indicates, "This term should now be avoided." Since UCS-2 had the same conceptual simplicity that UTF-32 has (and which UTF-16 lacks), it has had a legacy impact on some programming languages and operating systems.
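The difference is easy to demonstrate: a character beyond the BMP takes a pair of 16-bit units (a "surrogate pair") in UTF-16, something UCS-2 simply could not express. A Python sketch:

    alf = "\U00010900"                     # Phoenician ALF again, outside the BMP
    print(alf.encode("utf-16-be").hex())   # d802dd00 -- two 16-bit units, a surrogate pair
    print(alf.encode("utf-32-be").hex())   # 00010900 -- the code point itself
    print("₹".encode("utf-16-be").hex())   # 20b9 -- a BMP character needs just one unit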
Section 4.3.3 of the XML 1.0 specification says:
All XML processors MUST be able to read entities in both the UTF-8 and UTF-16 encodings. … Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin with the Byte Order Mark described by Annex H of [ISO/IEC 10646:2000], section 16.8 of [Unicode] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors MUST be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents. … In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration.
In other words, UTF-8 is the default encoding for XML.
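The Byte Order Mark signatures themselves are easy to inspect; a Python sketch:

    import codecs

    print(codecs.BOM_UTF8.hex())       # efbbbf -- U+FEFF as encoded in UTF-8
    print(codecs.BOM_UTF16_BE.hex())   # feff   -- big-endian UTF-16
    print(codecs.BOM_UTF16_LE.hex())   # fffe   -- little-endian UTF-16

    # Prefixing a string with U+FEFF produces exactly those signatures.
    print("\ufeffhello".encode("utf-8").hex())      # efbbbf68656c6c6f
    print("\ufeffhello".encode("utf-16-be").hex())  # feff00680065006c006c006f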
Note that the XML specification makes no mention of UTF-32. UTF-32 is a registered IANA character set; it just isn’t much used for transport or even storage. As the Unicode programming issues FAQ notes, however, the conceptual simplicity of UTF-32 can sometimes make string handling in programs easier. UTF-32 is a subset of the obsolete UCS-4.
For those who want to know more about the practical considerations, the Wikipedia article, "Comparison of Unicode Encodings", provides a useful survey. The Unicode "UTF-8, UTF-16, UTF-32 & BOM" FAQ is also worth a look.
Conclusion
According to Wikipedia:
UTF-8 is … increasingly being used as the default character encoding in operating systems, programming languages, APIs, and software applications.
This isn’t surprising. It’s another prime example, like XML, of the huge impact that the web has had upon all computing.