Monday, May 4, 2009

Balancing the frequencies

Some might find useful to assign +1 or –1 weights to the letters in the English alphabet so that the weighted sum of the frequencies of the letters in the "generic" plain English text is, in absolute value, as small as possible. A possible weight assignment is given in the list below. The number listed at the beginning of every row is 100,000 times the proportion of the corresponding letter (based on the letter frequencies list displayed in Wikipedia, in reference to the "Relative Frequencies of Letters in General English Plain text", from Cryptographical Mathematics, by Robert Edward Lewand). In this weight assignment, the weighted sum is not zero - indeed it is 1 (this is because the original list of frequencies is not perfectly normalized - they add up to 99.999%) and thus optimal, since the sum of the numbers initializing the rows below is odd (which rules out a partition in two subsets with equal sums) . This corresponds to a frequency "defect" of 0.001%. This simple example may be seen in relation to the general "subset sum problem". We can use this "weight assignment" to generate +1/–1 "random walks" out of string characters.

12702(e)–1
9056(t)–1
8167(a)+1
7507(o)–1
6966(i)+1
6749(n)+1
6327(s)–1
6094(h)–1
5987(r)+1
4253(d)+1
4025(l)+1
2782(c)–1
2758(u)+1
2406(m)+1
2360(w)+1
2228(f)+1
2015(g)–1
1974(y)+1
1929(p)–1
1492(b)–1
978(v)+1
772(k)+1
153(j)+1
150(x)+1
95(q)–1
74(z)+1