Alphabets, Ciphers and Codes

A look at some interesting and useful lore


The Morse Code and the ASCII Code are not codes, strictly speaking. There's nothing at all wrong with using these terms, but for technical purposes it is good to be more precise. The term alphabet for such things was used in the early days, but seems to have dropped out of use. I'll use it here with its early significance, as a correspondence between letters, numbers and punctuation ("characters") and the elementary states of a signalling system. In telegraphy, these elementary states are two, for example mark (circuit closed) and space (circuit open), or + and - polarity. In other systems, there may be more elementary states, but there must be at least two. The Morse alphabet uses marks and spaces of different lengths in small groups, while the ASCII alphabet uses a mark or space in a group of 7 time intervals. These are just two popular examples of a large number of signalling alphabets.

An alphabet is used to transmit a message over a communications link. The actual, readable message is called plain text. It is easy to transmit pictures by fax or digital file, and these may show a written message, just as if it were on paper. This is a somewhat different thing than what we are discussing, so we will not consider it here. The plain text message may be replaced by another with the same significance for a number of reasons. One may wish to economize on time of transmission or message length; or one may want to discover when errors are made, and even to correct them; or, in many instances, to prevent third parties from reading the message. The latter reason has been important throughout history and in military operations, but became specially important with wire, radio and telephone communications that were either broadcast or subject to surreptitious listening.

A cipher is a rearrangement or replacement of the characters of a message, almost always with the aim of secrecy. It differs from an alphabet, which is indeed a replacement, but a standard, constant one for all messages, and not for secrecy, but rather for clarity and efficiency. Actually, ciphers and alphabets are not confused at all. Of course, the Morse Code may be used for secrecy around individuals who cannot understand it. The enciphered plain text is called a cryptogram, and the process of reading, or breaking, someone else's ciphers is called cryptanalysis, which is carried out by cryptanalysts. The science of secret writing is called cryptography, and has a long history, chiefly in connection with the military. A cipher can be changed quickly, by merely changing a key word or equivalent, which is a great advantage in military applications.

The term code comes from codex, Latin for "book." The code book is the essential element of a code. A two-part code book is like a bilingual dictionary, giving equivalents in alphabetical (or some other appropriate) order. Five-letter code words are popular. For example KOLAM might correspond to "A threat has been made against your safety. Take all necessary security measures." Sometimes pronounceable code words are desired, or words with only one letter different or with any two adjoining letters transposed mean the same thing. The incorrect reception of one character or the transposition of two adjacent characters are the most common errors in manual transmission and reception. Also, alternatives are given for common words to avoid establishing patterns. Code words can be put into the code books that are never used, except by third parties creating false messages. Codes can be made very secure, since they must be broken word by word, and only after a large number of messages have been intercepted. Security depends on the security of the code books. It is difficult to change a code, since the new code books must be printed and distributed in order to do this. Commercial and telegraph codes are not (usually) used for secrecy, but for economy. For example, RDTN3 might mean "Please reserve a room for two with twin beds for three nights," and would certainly save telegram charges. Of course, there is always the danger that RTDN3 would be received, meaning "Please reserve a table for dinner, non-smoking, for three persons." Codes lack the redundancy that protects plain text from misinterpretation.

We have now defined the three terms alphabet, cipher and code more or less precisely, and can consider some aspects of each. The Morse Alphabet was designed by Alfred Vail sometime around 1844, and was, apparently, used on the first long-distance telegraph in the US between Washington and Baltimore. The tape appears to have been preserved and, if it is genuine, is a witness to the fact, together with the comments in Vail's 1845 book. However, there is always the possibility that such a tape was run off later when required for an historical exhibit. Such frauds are common in American technical history, so care is necessary. Vail's alphabet was always used for American land-line manual telegraphy.

The most commonly-used letters in English plain text are ETOANIRSH, in that order. Vail assigned them the shortest groups: E (.), T (-), O (. .), A (.-), N (-.), I (..), R (. ..), S (...) and H (....). Here, . represents a "dot" or the basic time interval. A "dash" is represented by - , three dots in length. Vail used three lengths of dashes, or marking states, and here only the shortest is used. The O is a spaced letter consisting of two dots separated by a double space. It is not difficult to recognize the difference between the characters I, O, and E E, especially when receiving the alphabet by ear. R is also a spaced letter. Designing an alphabet so that the most commonly used characters are sent in the shortest time is a rather advanced science with many applications today.

When the Morse system was adopted by the Austro-German Telegraph Union around 1851, the Vail code came with it, but was not really appropriate for the German language. There were the umlauts to consider, as well as the common digraph ch, and other matters. The spaced letters were also considered confusing. A revised alphabet was soon designed, which was adapted to German, but could still be used for other languages. All the letters of the Vail alphabet given above were retained, but the O now corresponded to the much longer - - -. This is not a mistake at all. The most frequent letters in German are ENRISTUDA, and it will be noted that O is not among them. Similarly, the C, .. . in the Vail alphabet (it is the twelfth most common letter in English) became - . - ., a good deal longer. On the other hand, C is not used alone in German (now almost never), while CH, a common combination, got its own character (----). R, a frequent letter, got .-., a short group. For a more detailed discussion of this, look at Morse Code.

Impressing information on a signal is called modulation, and the way it is done is called coding, which we know is really making an alphabet, but coding is an easy word. There is much to this science, including the maximum information flow through a channel of given bandwidth, error rates and the tradeoff between error rates and speed of transmission, and the adaptive coding of which the Morse Code is an example. See Reference 1, or any other book on Communication Systems, for the details.

Now let's talk about cryptography. The idea is to transmit a message from sender to receiver in such a way that a third party can extract no information from it, aside from the mere fact that a message was sent. Also, the receiver may want to be sure that the message came from the apparent sender, not from another source; that is, that the message was not a "spoof." A method that is secure against casual eavesdropping may not be secure when attacked by a trained cryptanalyst. There is also a time factor: a message may succumb to cryptanalysis, but only after a time when the information becomes useless. A single message, encrypted by an unknown method, dealing with an unknown subject and between unknown sender and receiver, is extremely difficult to decrypt, and may resist even expert efforts. Repeated messages where the subjects can be guessed approximately, encrypted by a roughly known method or machine, but without knowing keys, may eventually be decrypted. An example was the breaking of the Enigma cipher in World War II. Ease and security of use, the time factor and the nature of the antagonists are important considerations in selecting any encryption scheme.

If one wants to keep a message safe from only casual eavesdropping, not an expert attack by cryptanalysts, many classic methods are available. A method easily available and quite secure is to use a dictionary as a code book. Any dictionary with a sufficient number of words arranged on numbered pages will do; this can be a bilingual dictionary as well as an English dictionary, and best an obscure one. Each party must have the same dictionary, and it can be hidden in view among other books. A word can be expressed as the page number plus the ordinal number of the word on the page. For example, 43-10 is "and" in my Concise Oxford Dictionary, 4th edition. If the antagonists have no access to the books of either party, this code is quite secure, especially for a limited number of messages. Even if the antagonists do have access to the books, it will require some effort to see which one is the code book. It is easy to think up ways to make this a very annoying process.

Most casual secret writing uses ciphers, however. The simplest type is a rearrangement cipher that mixes up the letters of the message by some easy procedure. An easy way is to write the letters as a rectangular matrix following one order (perhaps by rows) and reading them off in another (perhaps by columns). For example, CWAOA NMTTE SYHOO ENURI DEWLP. Decipher this cryptogram by writing the letters in columns of three letters each, then reading them off by rows (answer below, but try it first!). There are some things to be noted here. First, the encrypted text is written in uniform groups of five letters. The decrypted text appears without spaces between the words, like ancient inscriptions, but these are easily filled in. There are some odd letters not part of the message, but inserted so that things will fit. These letters, here P (to fill out the last group) and D, L (to fill out the 3 x N matrix) are called nulls. Groups of other lengths can be used as well, such as 4 or 6 characters.
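The procedure just described can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names and the trailing_nulls and null parameters are invented here, and the nulls are simply the letter X. Running it on the example prints the answer, so try the puzzle by hand first.

```python
def decipher_by_columns(cryptogram, rows=3, trailing_nulls=0):
    """Write the letters in columns of `rows` letters each, then read by rows."""
    letters = [c for c in cryptogram.upper() if c.isalpha()]
    if trailing_nulls:
        letters = letters[:-trailing_nulls]  # drop filler from the last group
    if len(letters) % rows:
        raise ValueError("letters do not fill a whole number of columns")
    # Letter i sits in column i // rows and row i % rows, so row r is
    # every `rows`-th letter starting from position r.
    return "".join("".join(letters[r::rows]) for r in range(rows))


def encipher_by_rows(plain, rows=3, group=5, null="X"):
    """Write the message by rows, read it off by columns, group the result."""
    letters = [c for c in plain.upper() if c.isalpha()]
    letters += [null] * (-len(letters) % rows)   # nulls to fill out the matrix
    width = len(letters) // rows
    # Row r holds letters[r*width:(r+1)*width]; read down each column in turn.
    out = [letters[r * width + c] for c in range(width) for r in range(rows)]
    out += [null] * (-len(out) % group)          # nulls to fill the last group
    text = "".join(out)
    return " ".join(text[i:i + group] for i in range(0, len(text), group))


if __name__ == "__main__":
    # The example has one null (P) added only to complete the last group.
    print(decipher_by_columns("CWAOA NMTTE SYHOO ENURI DEWLP", trailing_nulls=1))
```

The receiver has to know both the number of rows and how many letters were added merely to complete the last group; here that is one.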

Our example cryptogram contained 3 E's, 3 O's and 2 A's, in fact 10 vowels in 25 letters, when on a random choice we would expect only 5 or 6. This, and the prevalence of the commonly-used letters, are a giveaway that a rearrangement cipher has been used. Now, it is only a puzzle to find out what rearrangement was used and to decipher the message. Since the letter frequencies are those of plain text, the method is given away at once, and a long message, or a number of messages, quickly succumbs. You get very little security with such a cipher, but it is very easy to use and does prevent casual interception.
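The vowel count is easy to check by machine. The sketch below is a hypothetical helper, not a standard tool; it counts A, E, I, O and U only, and takes "random choice" to mean that each of the 26 letters is equally likely, which is my reading of the remark above.

```python
from collections import Counter


def vowel_report(text):
    """Return (letter counts, number of vowels, vowels expected at random)."""
    letters = [c for c in text.upper() if c.isalpha()]
    counts = Counter(letters)
    vowels = sum(counts[v] for v in "AEIOU")
    # With 26 equally likely letters, 5 in every 26 would be vowels.
    expected = len(letters) * 5 / 26
    return counts, vowels, expected


if __name__ == "__main__":
    counts, vowels, expected = vowel_report("CWAOA NMTTE SYHOO ENURI DEWLP")
    print(counts.most_common(5))
    print(vowels, "vowels found; about", round(expected, 1), "expected at random")
```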

More security can be obtained at the cost of a little more work with a substitution cipher. A rearranged alphabet is written above the normal alphabet, and this table is used to encode and decode. If you take the 26 letters of the alphabet and 9 digits (let O do for zero), there are 35! possible permutations of the alphabet. Consider the cryptogram U6R9S 9T9QG 816YZ QGY85 6C2K4. It's obviously not a rearrangement, since there is one vowel, U, and lots of numbers. The key word (I can tell you) is WYOMING. Write this down, and put a small number beneath each letter giving its order in the alphabet (6753241). Now write the letters A-Z and numbers 1-9 in 5 rows of 7 letters each under the key word. Then, rearrange the columns according to the numbers under the letters of the key. Finally, read off the characters in columns: G, N, U, 2, 9, E, L and so on, and put each beside its letter in the normal alphabet: A = G, B = N, C = U and so on. This is then the key for enciphering and deciphering. This cipher is more secure than the rearrangement cipher, but will still succumb to frequency analysis if a lot of text is available. The enciphered text can also be enciphered again by rearrangement, which will introduce a little more bother for the antagonists. This cipher is easily changed by using a new key word, which can be transmitted by other means or arranged in advance.
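The construction of the cipher alphabet from the key word can be sketched as follows. The names keyed_alphabet and substitute are my own, and the treatment of a key word with repeated letters (ties taken from left to right) is an assumption, since the text only shows a key with no repeats.

```python
import string

# 26 letters plus the digits 1-9; the letter O does duty for zero.
NORMAL = string.ascii_uppercase + "123456789"


def keyed_alphabet(word):
    """Build the cipher alphabet from a key word, column by column."""
    word = word.upper()
    width = len(word)
    # Write the normal alphabet in rows under the key word.
    rows = [NORMAL[i:i + width] for i in range(0, len(NORMAL), width)]
    # Take the columns in the alphabetical order of the key letters
    # (for WYOMING the ranks are 6753241, so the G column comes first).
    order = sorted(range(width), key=lambda i: word[i])
    # Read down each column in that order.
    return "".join(row[i] for i in order for row in rows if i < len(row))


def substitute(text, word, decrypt=False):
    """Encipher, or with decrypt=True decipher, using the keyed alphabet."""
    cipher = keyed_alphabet(word)
    if decrypt:
        table = str.maketrans(cipher, NORMAL)
    else:
        table = str.maketrans(NORMAL, cipher)
    return text.upper().translate(table)


if __name__ == "__main__":
    print(keyed_alphabet("WYOMING")[:7])   # the text gives G, N, U, 2, 9, E, L
    print(substitute("U6R9S 9T9QG 816YZ QGY85 6C2K4", "WYOMING", decrypt=True))
```

Changing the key word changes the whole table, which is what makes this cipher so easy to change.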

This cipher is an example of a simple, or monoalphabetic, substitution cipher. An even simpler one does not rearrange the alphabet, but merely slides it along a certain number of places. Caesar's famous cipher used the third letter along: SAHOODHYZIRUPRYD (remember to use the Latin alphabet with 22 letters). This is not a very secure cipher at all, but is pretty good against someone looking over your shoulder. Caesar was not up against cryptanalysts. You can be a little more evasive by using different shifts, but the fact that the alphabet is not rearranged is a fatal defect. It does not take much time to try 25 (or 34) alphabets to see which one works, and this can be aided by a sliding strip. If you do use a monoalphabetic cipher, make sure that the alphabet is randomized.
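A sliding alphabet, and the trial of every possible shift, might be sketched like this. The 22-letter alphabet LATIN22 is purely my assumption: the text does not list the letters, and this particular set was chosen because it reproduces the example cryptogram. The function names are likewise invented for the illustration.

```python
import string

# Assumed 22-letter Latin alphabet (no J, V, W or X); not given in the text.
LATIN22 = "ABCDEFGHIKLMNOPQRSTUYZ"


def shift(text, places, alphabet=string.ascii_uppercase):
    """Slide every letter `places` positions along the alphabet, wrapping round."""
    n = len(alphabet)
    return "".join(
        alphabet[(alphabet.index(c) + places) % n] if c in alphabet else c
        for c in text.upper()
    )


def all_shifts(cryptogram, alphabet=string.ascii_uppercase):
    """Try every shift but zero: 25 trials for a 26-letter alphabet."""
    return [(k, shift(cryptogram, -k, alphabet)) for k in range(1, len(alphabet))]


if __name__ == "__main__":
    for k, guess in all_shifts("SAHOODHYZIRUPRYD", LATIN22):
        print(k, guess)
```

One of the printed lines is readable Latin; a human eye picks it out at once, which is the whole weakness of the method.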

A polyalphabetic cipher gives much improved security. Ideally, a different alphabet can be used for each letter. This is very tedious if done by hand, but a machine or computer can do it with ease. A key is used to set up the machine or computer. Then, for each plain text letter entered, an encrypted letter is obtained from a new alphabet determined by the key. On a computer, a random number generator can be used. The numbers are not actually random, of course, since if one starts from a given number, the "seed," the same sequence of numbers is obtained. These numbers, however, behave statistically very much like random numbers, and can be used to encipher and decipher the characters of the text. A 64-bit random number generator gives a sequence of more than 10^19 random numbers, each of which can be used to select 10 or more characters. If it took one second to check each possible key, it would require more time than the present age of the universe.
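As a rough sketch of the idea, the seed can serve as the key, and each number drawn can pick a shifted alphabet for one letter. Everything here is an assumption made for illustration: the function name, the use of shifted alphabets only, and Python's built-in generator, which is not designed for secrecy and should not be relied on for real messages.

```python
import random
import string

ALPHABET = string.ascii_uppercase


def keystream_cipher(text, seed, decrypt=False):
    """Shift each letter by an amount drawn from a seeded generator."""
    rng = random.Random(seed)          # same seed, same sequence of numbers
    sign = -1 if decrypt else 1
    out = []
    for c in text.upper():
        if c not in ALPHABET:
            continue                   # drop spaces so word lengths are hidden
        k = rng.randrange(len(ALPHABET))   # a new alphabet for every letter
        out.append(ALPHABET[(ALPHABET.index(c) + sign * k) % len(ALPHABET)])
    return "".join(out)


if __name__ == "__main__":
    secret = keystream_cipher("COME HERE WATSON I WANT YOU", seed=2001)
    print(secret)
    print(keystream_cipher(secret, seed=2001, decrypt=True))
```

Both parties need only agree on the seed; the receiver regenerates the same numbers and subtracts them.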

If no one knew what random number generator you used, a polyalphabetic cipher based on it would be quite secure, especially for a limited amount of text available and an unknown key. When one is considering code breaking, however, it is usually assumed that the antagonists know the method used, and sometimes even have limited knowledge of the keys. The ciphers also are publicly used for a large amount of material. Under these conditions, the time required to break a cipher can be worryingly short.

Each of the example ciphers above reads COME HERE WATSON I WANT YOU. Caesar says PUELLA EST FORMOSA.

In the American Civil War (1861-1865) the telegraph was used widely for the first time in warfare, and it was necessary to encrypt important messages. The Confederates used a polyalphabetic substitution cipher that was broken by the cryptographers of the United States Military Telegraph early in the war, in the sense that messages were easy to read with a little work that revealed the key words used to encrypt them. The United States army used a strange half-code half-rearrangement cipher that was never broken by the Confederates. General Forrest used to capture code books with dismaying regularity, but every time the code was immediately changed, and messages remained secure.

Here is an actual message in the Confederate cipher:

To Gen. J. E. Johnston, Jackson From Vicksburg Dec. 26, 1862
I prefer oaavvr, it has reference to xhvkjqchffabpzelreqpzwnyk to prevent anuzeyxswstpjw at that point, raeelpsghvelvtzfaut lihasltlhifnaigtsmmlfgecajd.
J. C. Pemberton, Lt. Gen. Comd'g

The Confederate cipher used a 26 x 26 matrix of letters constructed from 26 alphabets shifted one letter for each line. The letter to be enciphered was looked up in the first column, while a letter from the key was looked up in the first row. The letter to be substituted was at the intersection of the row and column. One began with the first letters of the message and the key, and then used the key as many times as necessary. The key for the message above is Manchester Bluff, and the decrypted message reads:

To Gen. J. E. Johnston, Jackson From Vicksburg Dec. 26, 1862
I prefer Canton, it has reference to fortifications at Yazoo City to prevent passage of river at that point. Force landed about three thousand, above mouth of river.
J. C. Pemberton, Lt. Gen. Comd'g.

The first letter of the message is C, the first letter of the key is M, which gives O as the cipher text. Note that not all the words of the message are encrypted. If the cipher text is not run together, it is rather easy to guess a word or two, and the cipher is broken.
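Looking up a letter in the 26 x 26 matrix amounts to adding the positions of the message letter and the key letter and wrapping round at 26, so the matrix need not be written out. The sketch below rests on that equivalence; the function name and the start parameter are my own. In the message above the key appears to run on from one enciphered group to the next, which is what start allows for.

```python
import string

A = string.ascii_uppercase


def confederate(text, key, decrypt=False, start=0):
    """Encipher or decipher with a repeating key word; `start` is the
    position in the key at which this piece of text begins."""
    shifts = [A.index(k) for k in key.upper() if k in A]
    sign = -1 if decrypt else 1
    out = []
    i = start
    for c in text.upper():
        if c in A:
            # Row = message letter, column = key letter, entry = their sum mod 26.
            out.append(A[(A.index(c) + sign * shifts[i % len(shifts)]) % 26])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    key = "Manchester Bluff"
    print(confederate("Canton", key))                 # first enciphered word
    print(confederate("oaavvr", key, decrypt=True))   # and back again
```

Because only the key word changes from message to message, recovering it from one guessed word unlocks everything sent with it.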

The Confederates also had a mail cipher, in which five arbitrary symbols were available for each letter. This made it very easy to guess words, since the lengths of words were not hidden. In one message, "reaches you" was guessed, and then the preceding words "before this." With these 10 letters known the rest followed quickly.

Anson Stager, general manager of Western Union, who was appointed Superintendent of Military Telegraphs, created a strange crypto procedure that was half code, half rearrangement. A code book gave arbitrary names to numbers that stated the number of lines in the message. This was the first word in the message. Then the message was written out in lines of six words. The seventh word was chosen from a list of check words, which were meaningless. Then, the columns were read off in an agreed order to give the text that was transmitted. Geographical names, officers, certain plans and phrases were given code words. For example, General Grant was Arabia, General McClelland Egypt, St Louis was Ham, Paducah was Darby. The messages seemed to contain random words in random order. This "six-column" code was the first used, and later codes were similar, but much more elaborate. Different words meant different routes in reading the columns, as well as different ways of writing out the lines. These codes were never broken by the enemy.

President Lincoln had a way of writing words so that they had the intended meaning when read backwards, while looking like nonsense when read normally. This secret writing was very easy to interpret, however, like Caesar's. A contemporary way to hide a message was also shown in the following: "John, 1/4. Papers do not come promptly. To-night I am sure dear papa will be disappointed. At home, all read the blessed Journal. Susie." The 1/4 means that the first word, and every fourth one thereafter, are to be omitted. With this clue, find out what Susie is trying to warn John about.

The ASCII (American Standard Code for Information Interchange) illustrates some other considerations. It is a 7-bit code, meaning that the state of the line is either marking (1) or spacing (0) in each of 7 time intervals, each interval representing a "bit." The ASCII Code is a descendant of the 5-bit Baudot Code used for teletype. The number 0, for example, is represented by the bits 0110000, transmitted from left to right, or hexadecimal 30. Computers work with bytes, groups of 8 bits, not 7, however. The eighth bit, not part of the ASCII Code, is intended to be a parity bit. It is set according to the number of 1's in the code. If you choose even parity, then it is set so that there is always an even number of 1's. Our number 0 remains 8-bit 30, the parity bit being zero. The number 1 is represented by 0110001, however, so with even parity, it would be sent as 10110001, with four 1's. The receiver simply checks that the number of 1 bits is even in the whole received byte, and if so, fine. If not, there has been an error: an odd number of the 8 bits, most likely just one, has been flipped. This raises an error flag, and the sender can be requested to repeat the character. The case when two bits are in error is neglected, since it is much rarer than the misinterpretation of one bit. In practice, such isolated errors rarely occur: either the transmission is error-free, or no transmission is possible.
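The parity rule is simple enough to sketch directly. The function names are hypothetical, and the parity bit is placed in the leading position to match the example 10110001 above.

```python
def add_even_parity(code7):
    """Return the 8-bit value whose leading bit makes the count of 1's even."""
    code7 &= 0x7F                        # keep only the 7 ASCII bits
    ones = bin(code7).count("1")
    return code7 | ((ones % 2) << 7)     # set the parity bit if the count is odd


def parity_ok(byte):
    """True if the received byte contains an even number of 1 bits."""
    return bin(byte & 0xFF).count("1") % 2 == 0


if __name__ == "__main__":
    for ch in "01":
        print(ch, format(add_even_parity(ord(ch)), "08b"))
    sent = add_even_parity(ord("1"))
    print(parity_ok(sent))               # received as sent
    print(parity_ok(sent ^ 0b00000100))  # one bit flipped in transit
    print(parity_ok(sent ^ 0b00000110))  # two bits flipped in transit
```

The last case shows the limitation mentioned above: two flipped bits leave the count of 1's even, so the error passes unnoticed.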

The idea of error detection can be carried further. Many transactions depend on a certain number, and if this number is erroneous, even in one digit, the outcome will be false. Think of your bank account number. It is good to know that a number has been transmitted correctly, without the common errors of transposition of two digits or a wrong digit. To do this, one can calculate from the digits a check number, which will be different if some error has been made. For example, consider the number 103001. To get a check number, add together the digits, multiplying alternate digits by 2. In this case, 1x1 + 0x2 + 3x1 + 0x2 + 0x1 + 1x2 = 6. The check digit is the difference between this number and the next higher multiple of 10, here 10 - 6 = 4. What if we received 100301? Sum = 9, 10 - 9 = 1! What if we received 103000? Sum = 4, 10 - 4 = 6! Therefore, if we send 103001-4, we can check on the accuracy of transmission. There are many ways to do it; this is only one example. You will see these check numbers in many places. Often, they are just part of the number. It is possible not only to detect errors, but even to correct them! The redundancy of plain text allows the recognition of errors and corrections (but not always!).
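The arithmetic of this example can be sketched as follows. The function names are mine, and taking the check digit to be 0 when the sum is already a multiple of 10 is an assumption, since the text does not cover that case.

```python
def check_digit(number):
    """Weight the digits 1, 2, 1, 2, ... from the left, add them up, and
    return the distance from the sum to the next multiple of 10."""
    total = sum(int(d) * (2 if i % 2 else 1) for i, d in enumerate(number))
    return (10 - total % 10) % 10        # assumed: 0 when the sum ends in 0


def verify(message):
    """Check a string of the form 'digits-check', such as '103001-4'."""
    number, digit = message.split("-")
    return check_digit(number) == int(digit)


if __name__ == "__main__":
    for n in ("103001", "100301", "103000"):
        print(n, check_digit(n))
    print(verify("103001-4"), verify("100301-4"))
```

With these weights, exchanging two unequal adjacent digits always changes the check digit, but a wrong digit in a doubled position is missed when it differs from the right one by 5, which is one reason practical schemes vary the recipe.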

References

  1. S. Haykin, Communication Systems, 3rd ed. (New York: John Wiley & Sons, 1994).
  2. L. D. Smith, Cryptography (New York: Dover, 1955).


Composed by J. B. Calvert
Created 7 March 2001
Last revised 29 March 2001