There’s rhyme and reason to character codes, but it’s not poetry. It’s the evolution of ancient machine controls.
Knowing how rudimentary character codes work gives you control over what an application visually presents to the user. There are good historical reasons behind them, rooted in localization and language differences. Character codes define what’s read on every screen you write for a user application. They let you rapidly switch localization contexts among character sets to internationalize your dialogs, perform rudimentary screen control, and offer personalization options in user controls.
On the surface, character codes – anything used to store and communicate human or mathematical notation – are murky and contentious. Underneath these codes, however, is historical rhyme and reason, not to mention continuing contention.
Various audible codes, such as Morse code, were refined into primitive machine control functions. These begat Émile Baudot’s investigations, which led to his character set predating EBCDIC and ASCII. Baudot invented a character set using just five bits, keyed as left-hand and right-hand signals on an ancient, pre-typewriter keyboard that evolved from the Morse key. Baudot inserted several non-printing codes, too, including BEL, which rang a bell at a downstream office to call attention to (or wake up) the operator. The BEL code still works with some operating systems. Try typing it at a command-line prompt and see what happens.
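If hunting for the right key combination is a chore, a quick Python sketch triggers the same code (assuming a terminal that still honors it):

```python
# BEL survives as code point 7 in ASCII and its descendants.
# Writing it to a terminal that honors the code sounds the bell
# (or flashes the window); some terminals stay silent.
bel = chr(7)        # the same character as the escape "\a"
print(bel, end="")  # listen for the beep
```

Whether anything audible happens depends entirely on the terminal emulator's settings, not on the code itself.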
The era of the American Standard Code for Information Interchange (or just plain ASCII) arrived when Baudot’s primitive machine functionality was coupled with the mindset of what typewriters do, and the signal was sent via ancient wireless means using actual vacuum tubes (radiotelegraphy). Muddy things up by adding all varieties of proprietary signals relevant to both IBM hardware and IBM customer programming, and you get EBCDIC.
Add non-English symbols, and character codes need international standards bodies to step in. There are variations of ASCII to accommodate the myriad differing languages, then variations upon variations of Cyrillic character sets. Add Chinese, Korean, Katakana/Hiragana, Sinhala, and Twi into the mix, and things become still more complex.
Here’s the evolution of these codes.
International Morse Code
Samuel Morse invented a code for telegraphy, but it has been refined through many iterations. There are extensions, abbreviations specific to International Morse, and additions that amateur radio operators still use thousands of times every day. The number of significant positions starts here at just five: five elements are the most a character occupies, drawn from three types of signals, dots, dashes, and spaces. Five positions are not enough to represent lowercase characters, much punctuation, or other symbols.
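A tiny lookup table shows how few elements Morse needs. The letters below are genuine International Morse, though the dictionary is only an illustrative fragment, not the full table:

```python
# A fragment of the International Morse table: dots and dashes,
# with spaces separating letters when a message is keyed out.
MORSE = {
    "S": "...",
    "O": "---",
    "E": ".",   # the shortest letter: a single dot
    "T": "-",
}

def encode(text):
    """Encode uppercase text, joining letters with single spaces."""
    return " ".join(MORSE[ch] for ch in text)

print(encode("SOS"))  # ... --- ...
```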
Baudot’s clever machine allowed multiplexing the data, with right-shift and left-shift keying of numbers and symbols. Murray, in turn, evolved Baudot’s design so that it worked with a mechanical printer that could punch paper tape and read the tape back once encoded. Stock market tickers evolved. There were pre-facsimile inventions that attempted to form characters from dot matrices. More important, the tickers gave way to radio telegraphers, international variations of the code, and the need to do simple things like make teletypes print lowercase letters. Thus the Shift key arrived. (And now it’s in our way on every keyboard.)
In the early 1960s, more bits for more characters, punctuation, and machine control evolved into the code standardized by the American Standards Association (ASA, the predecessor of ANSI), and a teletype service from AT&T (the old one) called TWX arrived. The American National Standards Institute then guarded and extended what became known as ANSI ASCII, or US-ASCII. International variations and flavors were evolved by international standards bodies, including the ISO and ITU. Now we’re using 7 bits to represent characters, but soon we’ll run out of bit positions to uniquely identify characters or machine control sequences.
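The squeeze is easy to demonstrate: seven bits give exactly 128 positions, split between controls and printable characters. A short Python sketch:

```python
# 7 bits yield 2**7 = 128 code positions: 0-31 are machine
# controls (BEL, CR, LF, ...), 32-126 are printable, 127 is DEL.
print(2 ** 7)              # 128
print(ord("A"), ord("a"))  # 65 97: both cases fit, but little room remains
print(repr(chr(7)))        # '\x07', the BEL control inherited from telegraphy
```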
Adding an eighth bit doubles the number of character, glyph, and machine control positions. Many variations exist among character sets built on 8-bit representations, only some of which are “standards-based.” IBM invented blocks of codes based on 8-bit ASCII for different markets, calling these combinations “code pages,” and the name still sticks. Atari and others did the same sort of thing, making graphics characters for games and GUIs that could be programmed fairly easily using ASCII variations. Compatibility among computer systems vendors reached new lows, but there was much pressure to systematize how characters and codes were represented.
The International Organization for Standardization (ISO) released what amounted to code pages for differing language use under the ISO 8859 standard. Now you could write code and be confused about localization and language internationally. Eight-bit codes were out of control, so to speak, and exchanging data among operating systems, computer families, and applications became more difficult. The drill was: declare the code page and character set, and with luck your application might actually show the user the correct and desired view without blowing up screen-positioning geometry, while preserving the intended localization of the characters and how they appeared on a page or screen.
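The confusion is easy to reproduce today, because Python still ships the old code-page tables. The same single byte decodes to three different characters under three different tables:

```python
# One byte, three interpretations: without knowing the code page,
# the receiver cannot tell which character the sender meant.
raw = b"\xd0"
for codec in ("cp437", "iso-8859-1", "iso-8859-5"):
    print(codec, raw.decode(codec))
# IBM's PC code page, Western European, and Cyrillic each
# yield a different single character for the identical byte.
```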
The Unicode Consortium arrived on the scene as a multi-vendor, multi-platform group that set out to fix the problem of character-set interoperability. It aimed to encompass even the strangest and most unusual characters, ranging from Braille to seldom-used Chinese characters, and all plausible variations.
The first widely agreed-upon standard that remained backward-compatible with ASCII was UTF-8. UTF stands for Unicode Transformation Format, and UTF-8 is a variable-length code: each character occupies one to four octets, enough to represent an enormous number of symbols, while any plain 7-bit ASCII text is already valid UTF-8 byte for byte.
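The variable length is easy to see in Python: ASCII stays a single byte, while other scripts and symbols grow to two, three, or four octets.

```python
# UTF-8 spends one to four octets per character; plain ASCII text
# is byte-for-byte unchanged, which is the compatibility trick.
for ch in ("A", "é", "€", "🙂"):
    encoded = ch.encode("utf-8")
    print(ch, len(encoded), encoded)
# Lengths printed: 1, 2, 3, 4.
```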
Whether built on US-ASCII, UTF-8, or EBCDIC variants, along came the HyperText Markup Language, governed by World Wide Web and Internet Engineering Task Force (IETF) standards, the latter stated in the form of Requests for Comments (RFCs). The initial RFCs covering communications characters relied on basic 7-bit ASCII, but of course this was not enough to represent languages, localizations, and additional glyphs or extensions. Even today, much of the Internet’s core protocol text is based on 7-bit characters, which can, by command or convention, be strung together to form pictures or signal the start of things like video streams, whose framing is still carried in those same 7-bit building blocks.
The original HTML 2 format for character handling arrived in IETF RFC 1866. Then, as Unicode met the International Organization for Standardization and the International Electrotechnical Commission, RFC 2070 followed, embracing the ISO/IEC 10646 standard as an all-encompassing, end-of-the-earth-lasting character set.
Web pages now use a declaration to help sort through the muddiness; a typical declaration is:
Content-Type: text/html; charset=ISO-8859-1
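Pulling the charset out of such a header is worth sketching. The function below is a simplified illustration for well-formed headers like the one above, not a full RFC-grade parser (it ignores quoting and parameter ordering):

```python
# Naive extraction of the charset parameter from a Content-Type
# header. Real-world headers may add quoting, casing quirks, and
# extra parameters that a production parser must handle.
header = "Content-Type: text/html; charset=ISO-8859-1"

def charset_of(value):
    """Return the charset parameter's value, or None if absent."""
    for part in value.split(";"):
        part = part.strip()
        if part.lower().startswith("charset="):
            return part.split("=", 1)[1]
    return None

print(charset_of(header))  # ISO-8859-1
```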
When character sets can’t be understood, or are incorrectly cited, odd results appear for the viewer of the page. Charset and browser sniffing, a technique used to ferret out the correct choice from among many possibilities when character-set ambiguity occurs, is well known but beyond the scope of this article.
I’ve never tried specifying a charset as Morse. Shards of characters might line my screen.
And now: our handy chart. I’ve listed the codes so they can be cross-referenced. At the bottom is a chart of the international codes, the bodies that promulgated them, and links to where they can be found.