By Jacob Strieb.
Published on October 9, 2022.
Recently, in the course of performing cryptanalysis, I needed the relative frequencies with which each character—symbols and punctuation included—occurs in English text. Web searches seemed to only return results for letter frequencies, with no symbols and no distinction between uppercase and lowercase letters, so I resolved to sample the frequencies myself.
I chose to sample text from books hosted by Project Gutenberg since they are free, legal, and easy to access. Even better, they’re available as plain text files.
Since this was a low-stakes, one-off task, I decided the ceremony of a full-blown program was overkill, and opted for a shell script instead. Plus text wrangling is the shell’s bread and butter.
The first major step was to get links to books to download. The snippet below scrapes ebook links from the top-100 page, deduplicates them, rewrites them into plain-text download URLs, and randomly selects a handful:
NUM_BOOKS=16
curl "https://gutenberg.org/browse/scores/top" \
| grep -o 'href="/ebooks/[0-9][0-9]*"' \
| sort \
| uniq \
| sed 's/href="\/ebooks\/\([0-9][0-9]*\)"/https:\/\/gutenberg.org\/ebooks\/\1.txt.utf-8/' \
| shuf \
| head -n "${NUM_BOOKS}" \
# ...
The number of books to sample can be tuned for greater accuracy at the expense of longer running time. Since some of the Project Gutenberg works may be short or may not be representative of the distribution of symbols in typical English text, increasing this number reduces the risk of a skewed sample. Similarly, the shuf command can be removed for deterministic output.
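The sed rewrite above can be sanity-checked in isolation. Given one of the matched href attributes (the book ID below is arbitrary), it should emit the corresponding plain-text download URL:

```shell
# Rewrite an ebook link from the top-100 page into a .txt download URL
echo 'href="/ebooks/1342"' \
| sed 's/href="\/ebooks\/\([0-9][0-9]*\)"/https:\/\/gutenberg.org\/ebooks\/\1.txt.utf-8/'
# https://gutenberg.org/ebooks/1342.txt.utf-8
```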
After getting book text URLs, the next step was to fetch the text from each of the links, concatenate it, and clean it up. The snippet below downloads up to four books in parallel, saves a raw copy to text.txt, strips non-ASCII and non-printable characters, replaces newlines with spaces, and collapses runs of whitespace into single spaces:
# ...
| xargs -L 1 -P 4 curl -L --compressed \
| tee text.txt \
| iconv -c -t ascii \
| tr -c -d '[:print:]\n' \
| tr '\n' ' ' \
| sed 's/[[:space:]][[:space:]]*/ /g' \
# ...
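The cleanup stages can be exercised on a short sample string of my own; iconv -c silently drops characters that have no ASCII equivalent, and the final sed collapses the leftover runs of whitespace:

```shell
# Accented letters and the em dash are dropped; newlines become single spaces
printf 'café —\nnaïve\n' \
| iconv -c -t ascii \
| tr -c -d '[:print:]\n' \
| tr '\n' ' ' \
| sed 's/[[:space:]][[:space:]]*/ /g'
```

With GNU iconv this prints caf nave (plus a trailing space), since the accented characters and the dash are simply removed rather than transliterated.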
Next, each character was split onto its own line so the characters could be sorted and tallied. The snippet below saves the raw counts to frequency.txt:
# ...
| sed 's/\(.\)/\1\n/g' \
| sort \
| uniq -c \
| sort -n \
| tee frequency.txt \
# ...
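The splitting and counting stages can be tried on a tiny input. uniq -c prefixes each distinct line with its count, and sort -n orders characters from least to most frequent; note that the \n in the sed replacement relies on GNU sed:

```shell
# Split a string into one character per line, then tally each character
printf 'aab' \
| sed 's/\(.\)/\1\n/g' \
| sort \
| uniq -c \
| sort -n
```

This prints a count of 1 for b and 2 for a (the exact column widths of uniq -c vary between implementations).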
Finally, I got lazy and decided to pass the rest to a small Python script, since I’ve never been fond of doing math in the shell. The snippet below parses the counts produced by uniq -c, handles the final line specially because its character is a space that split would otherwise discard, and normalizes the counts into relative frequencies:
# ...
| python3 -c 'import sys, json;
lines = sys.stdin.readlines();
data = {l.strip(): int(n.strip()) for n, l in map(lambda x: x.split(), lines[:-1])};
data[" "] = int(lines[-1].strip());
total = sum(data.values());
data = {k: v / total for k, v in data.items()};
print(json.dumps(data, indent=2))' \
| tee frequency.json
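Feeding the Python stage a few hand-written counts in the uniq -c format shows the normalization; the final line carries the count for the space character:

```shell
# Counts of 1, 2, and 3 (the last for a space) normalize to 1/6, 1/3, and 1/2
printf '      1 b\n      2 a\n      3 \n' \
| python3 -c 'import sys, json;
lines = sys.stdin.readlines();
data = {l.strip(): int(n.strip()) for n, l in map(lambda x: x.split(), lines[:-1])};
data[" "] = int(lines[-1].strip());
total = sum(data.values());
data = {k: v / total for k, v in data.items()};
print(json.dumps(data, indent=2))'
```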
Though this hardly counts as a one-liner, here is the whole pipeline in its full glory:
#!/bin/bash
# Calculate symbol frequency histogram from some of the top 100 Project
# Gutenberg books.
NUM_BOOKS=32
curl "https://gutenberg.org/browse/scores/top" \
| grep -o 'href="/ebooks/[0-9][0-9]*"' \
| sort \
| uniq \
| sed 's/href="\/ebooks\/\([0-9][0-9]*\)"/https:\/\/gutenberg.org\/ebooks\/\1.txt.utf-8/' \
| shuf \
| head -n "${NUM_BOOKS}" \
| xargs -L 1 -P 4 curl -L --compressed \
| tee text.txt \
| iconv -c -t ascii \
| tr -c -d '[:print:]\n' \
| tr '\n' ' ' \
| sed 's/[[:space:]][[:space:]]*/ /g' \
| sed 's/\(.\)/\1\n/g' \
| sort \
| uniq -c \
| sort -n \
| tee frequency.txt \
| python3 -c 'import sys, json;
lines = sys.stdin.readlines();
data = {l.strip(): int(n.strip()) for n, l in map(lambda x: x.split(), lines[:-1])};
data[" "] = int(lines[-1].strip());
total = sum(data.values());
data = {k: v / total for k, v in data.items()};
print(json.dumps(data, indent=2))' \
| tee frequency.json
The results of one of the runs can be found below. Due to the nondeterministic sampling of book URLs, the results will be slightly different after every run.
{
"^": 5.196361403817819e-08,
"%": 1.6108720351835239e-06,
"{": 1.922653719412593e-06,
"#": 2.234435403641662e-06,
"@": 2.3903262457561967e-06,
"}": 2.494253473832553e-06,
"$": 3.2737076844052257e-06,
"+": 7.794542105726728e-06,
"|": 8.31417824610851e-06,
"<": 1.2003594842819162e-05,
">": 1.2003594842819162e-05,
"=": 1.2211449298971873e-05,
"&": 1.4238030246460823e-05,
"Q": 4.0323764493626275e-05,
"[": 4.084340063400805e-05,
"]": 4.136303677438984e-05,
"9": 4.6663325406284015e-05,
"/": 4.785848852916211e-05,
"6": 4.957328779242199e-05,
"7": 5.393823137162896e-05,
"Z": 5.752372074026325e-05,
"4": 6.064153758255394e-05,
"5": 6.178473709139387e-05,
"*": 6.822822523212796e-05,
"8": 7.332065940786942e-05,
"0": 7.732185768880914e-05,
"3": 7.778953021515274e-05,
"X": 8.402516389973413e-05,
"2": 0.0001022124288130965,
"U": 0.00020135900439794048,
"K": 0.00020281398559100945,
"(": 0.00020951729180193444,
")": 0.0002103487096265453,
"1": 0.00024303382285655937,
"V": 0.00033511334693221115,
"_": 0.00037621656563641007,
"J": 0.000394559721391887,
"z": 0.0004528628963427229,
"Y": 0.00048269001080063716,
"F": 0.0006709541844609568,
"D": 0.0007090954771649795,
"O": 0.0007477044423953459,
"G": 0.0007549793483606909,
"R": 0.0007651642167121738,
"N": 0.0007676065065719682,
":": 0.0007801297375551691,
"L": 0.0009273946197393661,
"P": 0.000931447781634344,
"E": 0.0009454779574246522,
"!": 0.0009535323176005697,
"j": 0.0009649123490749307,
"q": 0.0009655359124433889,
"W": 0.0009925569917432415,
"?": 0.0010007152791472356,
"C": 0.001008146075954695,
"B": 0.0010093932026916112,
"x": 0.0012159485684933695,
"H": 0.0013806732249943944,
"M": 0.0014142417196630576,
"S": 0.0014376773095942758,
"'": 0.0015881119722348017,
";": 0.0017668148409120965,
"A": 0.001867260506847895,
"T": 0.0024605810519358136,
"-": 0.0027239326478813003,
"I": 0.00429868997130829,
"\"": 0.004835422140708633,
"k": 0.005644599538511144,
"v": 0.008026143933494888,
".": 0.009327364792624908,
"b": 0.010927168578018301,
"p": 0.012489298743234012,
",": 0.014844705440356553,
"g": 0.015086180354791968,
"y": 0.015144847275041071,
"f": 0.01664914193783229,
"w": 0.017037569952767673,
"c": 0.01866974706970685,
"m": 0.01892473252379219,
"u": 0.0228628989409036,
"l": 0.03078002321214639,
"d": 0.0334997468072906,
"r": 0.0453264575079238,
"s": 0.048719941322687026,
"h": 0.049453875407362256,
"i": 0.051417476454636936,
"n": 0.05463236132795094,
"o": 0.05965417695221452,
"a": 0.06139542569501983,
"t": 0.07050059407401749,
"e": 0.0993860239183319,
" ": 0.16554292753128405
}
Note that the most common letters form the sequence etaonihsrdlu, which is pretty close to the infamous etaoin shrdlu. This seems like a good sign that our sample was fairly representative, and that the shell script works!