Talking in Riddles II

R. Ramanujam, The Institute of Mathematical Sciences, Chennai

In the last issue, we saw how we can use substitution ciphers to code up text. This is a method by which we replace each letter by another letter. For instance, "ADD" becomes "XJJ". It is difficult to remember the code for each of the 26 letters, so we also saw how to use key-words. For instance, when the key-word is SMILEY, "ADD" becomes "SLL".
(Recall: Map the letters A to F to SMILEY, and from G onwards, use the first letter that is still available. Of course, some letters may get mapped to themselves but that is alright.)

Breaking the code
Now comes a really interesting question. Suppose that someone gives you a coded text, and you know some substitution cipher (like the above) has been used to generate it. Can you break the code? That is, can you somehow find the inverse map and figure out the original text?
For instance, suppose the coded message that we are handed looks like this:
CJ UYU SKO CJ UYU IBGUQ RDU WDKHU WKTHP AHEJP.
Can one make any headway on this?

Frequency analysis
Actually, substitution ciphers ruled for a long time. During the first millenium, this was the way most armies sent their secrets, and it was believed that nobody would be able to break them.
Help came eventually from unexpected quarters. In the ninth century CE, Yaqub ibn Ishaq al-Kindi, a great Islamic scholar from Baghdad, was analyzing the holy Quran. Al-Kindi was famous for his role in introducing the Indian numerals to the Islamic and Christian world. He studied how often each letter of the Arabic alphabet occurred in the Quran. This is now called frequency analysis.
But Al-Kindi did more, he immediately saw how this helps in breaking ciphers. He wrote a text called "On Deciphering Cryptographic Messages" which started the field of mathematical study known today as cryptanalysis.

How it works
How does Al-Kindi's idea work? First find out the frequency of occurrence of letters in the language. Then list how often each letter occurs in the coded text. So, if in the code X occurs most frequently, and in the language E is most frequent, then there is a very good chance that X codes E.
This is of course not completely sure, but has a good chance of working. Then we can use other knowledge. For instance, in English, a one-letter word is perhaps "A", a three-letter word ending with "e" is perhaps "the", and so on.
Let us try this on our give code. First we should know the frequency of letters for English. According to Fletcher Pratt, the table is as follows:

Frequency of Occurrence of Letters in English, Fletcher Pratt, Secret
and Urgent: the Story of Codes and Ciphers 1939 
E T A O N R I S H D L F C M U G Y P W B V K X J Q Z
That is, "E" occurs most often (most frequently) in the English language, followed by "T" and so on, while "Z" is the least used letter.
Now, let us count the frequency of letters in the code above:
U occurs 7 times; J, K and H occur 3 times; C, Y, D, W and P occur 2 times; S, O, I, B, G, Q, R, T, A and E occur once; the other 7 letters do not occur.
Now we can make an educated guess that U codes E. If we substitute in the cipher, we find:
CJ UYU SKO CJ UYU IBGUQ RDU WDKHU WKTHP AHEJP.
** E*E *** ** E*E ***E* **E ****E ***** *****.
Which English word can you think of that is of the form E*E?
EYE, of course.
So we get:
** EYE *** ** EYE ***E* **E ****E ***** *****.
Now notice that before EYE we have the pattern CJ each time. Can you
guess what it is? Yes, it is very likely to be "AN".
Thus we have:
AN EYE *** AN EYE ***E* **E ****E ***** ***N*.
Now it is not difficult to see that the intermediate word must be FOR.
We then get:
AN EYE FOR AN EYE ***E* **E **O*E *O*** ***N*.
We can proceed this way. Skipping the intermediate steps, we discover the text:
AN EYE FOR AN EYE MAKES THE WHOLE WORLD BLIND.
This is a statement attributed to Mahatma Gandhi.
Will you now get your friends to make up ciphers and then try and break them?