Index of Coincidence
The index of coincidence (IC) measures how likely it is that, when two texts are compared letter by letter, the two letters at a given position are the same.
The value of the index of coincidence is calculated from the probability that a given letter occurs in the first text and the probability that it is compared with the same letter in the second text (which is determined by the probability of that letter occurring in the second text). For a text of N letters and an alphabet of c different letters (for English, c = 26), the index of coincidence IC obtained by comparing the text with a copy of itself shifted by a random number of letters can be written as:
IC = (n_1(n_1 - 1) + ... + n_c(n_c - 1)) / (N(N - 1) / c)
where n_i is the number of occurrences of the i-th letter of the alphabet in the whole text.
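The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name index_of_coincidence and the default lowercase English alphabet are assumptions made for the example, and characters outside the alphabet are simply ignored.

```python
from collections import Counter
import string


def index_of_coincidence(text, alphabet=string.ascii_lowercase):
    """Normalized IC of a text, counting only characters found in `alphabet`.

    Assumption for this sketch: `alphabet` is a lowercase string, so the
    text is lowercased before counting.
    """
    letters = [ch for ch in text.lower() if ch in alphabet]
    n = len(letters)  # N in the formula
    if n < 2:
        raise ValueError("at least two letters are needed")
    c = len(alphabet)
    counts = Counter(letters)  # n_i for every letter that occurs
    numerator = sum(k * (k - 1) for k in counts.values())
    return numerator / (n * (n - 1) / c)


if __name__ == "__main__":
    sample = "the index of coincidence depends on letter frequencies"
    print(index_of_coincidence(sample))
```

Letters that never occur contribute nothing to the sum, so only the letters actually present in the text need to be counted.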
In particular, given the letter frequencies of a language (f_i), one can calculate the expected value of the index of coincidence for that language, that is, the value expected when comparing two texts written in that language:
IC_expected = (f_1^2 + ... + f_c^2) / (1/c)
It is easy to see that if all letters in a language were equally frequent, the expected value would be equal to 1. In all existing languages, however, letters occur with different frequencies, so the indexes of coincidence of different languages differ from each other. For English the expected value is 1.73.
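As a rough sketch of the expected-value calculation, the Python function below takes a list of letter frequencies and returns the normalized expected IC. The name expected_ic is hypothetical, and the skewed frequency list is a toy example rather than measured data for any language.

```python
def expected_ic(frequencies):
    """Normalized expected IC for a list of letter frequencies.

    The frequencies are rescaled to sum to 1, so raw counts or
    percentages can be passed as well.
    """
    c = len(frequencies)
    total = sum(frequencies)
    if c == 0 or total <= 0:
        raise ValueError("frequencies must be non-empty and positive")
    probabilities = [f / total for f in frequencies]
    return sum(p * p for p in probabilities) / (1 / c)


if __name__ == "__main__":
    uniform = [1 / 26] * 26
    print(expected_ic(uniform))  # very close to 1 (up to rounding error)

    # Toy three-letter alphabet with one dominant letter (not real data).
    skewed = [0.5, 0.25, 0.25]
    print(expected_ic(skewed))   # greater than 1
```

The more uneven the frequencies are, the larger the result becomes, which is why natural languages score above 1.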
The index of coincidence calculated for two texts written in two different languages is usually noticeably smaller than the expected indexes of coincidence of those languages. This is because letters that are common in the first text (in the first language) may be less common in the second text (written in the second language), so the probability of finding the same letter at the same position in both texts is smaller.
Using IC in cryptography
The index of coincidence is used in cryptography for breaking substitution ciphers with a repeating key (such as the Vigenère cipher) and simple XOR ciphers.
IC can be used to determine the length of the secret key when a message has been encrypted with one of those ciphers. This is done by comparing (letter by letter or byte by byte) the encrypted text with a copy of itself shifted by a number of characters equal to the key size currently being tested. For each candidate key size (starting from 1 and continuing until the solution is found), one calculates the value of IC and records it.
When the tested offset is equal to the length of the secret key, the confusion introduced by the key disappears:
- in the case of a substitution cipher, the letters at corresponding positions in both texts are shifted by the same number of characters, or
- in the case of an XOR cipher, the same bits are flipped in corresponding bytes.
At the correct shift, all compared characters in the first and the second text (although they are not known) behave as if they belonged to the same language, so their index of coincidence will be close to the expected value for that language and will differ markedly from the values previously calculated for wrong shifts.
When two texts are compared at a wrong offset, the letters (bytes) in the first text have been changed differently from those in the second text. The two texts can therefore be regarded as written in different languages, with different letter frequencies in the first and the second text.
A significantly larger value of IC will be obtained for every shift equal to the key length or to a multiple of it (because the same key is repeated periodically).
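The key-length search described above can be sketched as follows for a repeating-key XOR cipher. This is an illustration under stated assumptions: the helper names shifted_coincidence, rank_key_sizes and xor_encrypt are hypothetical, the sample plaintext and key are made up, and the score is the raw fraction of matching positions (multiply it by the alphabet size to obtain the normalized value used elsewhere in this text).

```python
def shifted_coincidence(data: bytes, shift: int) -> float:
    """Fraction of positions i where data[i] == data[i + shift]."""
    pairs = len(data) - shift
    if shift < 1 or pairs < 1:
        raise ValueError("shift must be between 1 and len(data) - 1")
    matches = sum(1 for i in range(pairs) if data[i] == data[i + shift])
    return matches / pairs


def rank_key_sizes(data: bytes, max_size: int) -> list:
    """Candidate key sizes 1..max_size, highest coincidence first."""
    scores = {size: shifted_coincidence(data, size)
              for size in range(1, max_size + 1)}
    return sorted(scores, key=scores.get, reverse=True)


def xor_encrypt(plaintext: bytes, key: bytes) -> bytes:
    """Repeating-key XOR, used here only to produce test ciphertext."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(plaintext))


if __name__ == "__main__":
    # Made-up sample; real attacks need a reasonably long ciphertext.
    plaintext = (b"the index of coincidence helps to recover the length "
                 b"of a repeating key because shifted copies line up ") * 4
    ciphertext = xor_encrypt(plaintext, b"secret")
    for size in rank_key_sizes(ciphertext, 12)[:3]:
        print(size, shifted_coincidence(ciphertext, size))
```

Because multiples of the key length score equally well, the smallest high-scoring shift is usually the one to try first. On short ciphertexts the scores are noisy, so several top candidates may need to be checked.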
Expected values for some languages
Indexes of coincidence can be calculated for different languages. They depend on the average frequencies of letters. The frequencies can be determined only approximately, because they differ slightly between different kinds of texts (scientific, historical, fiction).
- English - 1.73
- Russian - 1.76
- Spanish - 1.94
- Portuguese - 1.94
- Italian - 1.94
- French - 2.02
- German - 2.05
Sometimes the values of the index of coincidence are presented without normalization (the normalized value depends on the number of letters in the alphabet). For example, for the English language, the expected IC value without normalization is equal to:
1.73 / 26 ≈ 0.067
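The conversion between the two conventions is a single division or multiplication by the alphabet size. A small sketch, with hypothetical helper names, using only the English value quoted above:

```python
def unnormalize_ic(normalized_ic: float, alphabet_size: int) -> float:
    """Convert a normalized IC (uniform text = 1) to the raw probability."""
    return normalized_ic / alphabet_size


def normalize_ic(raw_ic: float, alphabet_size: int) -> float:
    """Convert a raw coincidence probability to the normalized IC."""
    return raw_ic * alphabet_size


if __name__ == "__main__":
    # English: normalized value 1.73, alphabet of 26 letters.
    print(round(unnormalize_ic(1.73, 26), 3))
```

When comparing values from different sources, it helps to check which convention is used: normalized values for natural languages lie above 1, whereas raw values are small fractions.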