By Jacob Strieb.
Published on October 9, 2022.
Recently, in the course of performing cryptanalysis, I needed the relative frequencies with which each character—symbols and punctuation included—occurs in English text. Web searches seemed to only return results for letter frequencies, with no symbols and no distinction between uppercase and lowercase letters, so I resolved to sample the frequencies myself.
I chose to sample text from books hosted by Project Gutenberg since they are free, legal, and easy to access. Even better, they’re available as plain text files.
Since this was a low-stakes, one-off task, I decided the ceremony of a full-blown program was overkill, and opted for a shell script instead. Plus text wrangling is the shell’s bread and butter.
The first major step was to get links to books to download. The snippet below scrapes ebook links from the top-100 page, deduplicates them, rewrites them into plain-text download URLs, and randomly selects a handful:
NUM_BOOKS=16
curl "https://gutenberg.org/browse/scores/top" \
| grep -o 'href="/ebooks/[0-9][0-9]*"' \
| sort \
| uniq \
| sed 's/href="\/ebooks\/\([0-9][0-9]*\)"/https:\/\/gutenberg.org\/ebooks\/\1.txt.utf-8/' \
| shuf \
| head -n "${NUM_BOOKS}" \
# ...
The number of books to sample can be tuned for greater accuracy at the expense of longer running time. Since some of the Project Gutenberg works may be short or may not be representative of the distribution of symbols in typical English text, increasing this number reduces the risk of a skewed sample. Similarly, the shuf command can be removed for deterministic output.
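The sed rewrite above can be sanity-checked in isolation. Given one of the matched href attributes (the book ID below is arbitrary), it should emit the corresponding plain-text download URL:

```shell
# Rewrite an ebook link from the top-100 page into a .txt download URL
echo 'href="/ebooks/1342"' \
| sed 's/href="\/ebooks\/\([0-9][0-9]*\)"/https:\/\/gutenberg.org\/ebooks\/\1.txt.utf-8/'
# https://gutenberg.org/ebooks/1342.txt.utf-8
```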
After getting book text URLs, the next step was to fetch the text from each of the links, concatenate it, and clean it up. The snippet below downloads up to four books in parallel, saves a raw copy to text.txt, strips non-ASCII and non-printable characters, replaces newlines with spaces, and collapses runs of whitespace into single spaces:
# ...
| xargs -L 1 -P 4 curl -L --compressed \
| tee text.txt \
| iconv -c -t ascii \
| tr -c -d '[:print:]\n' \
| tr '\n' ' ' \
| sed 's/[[:space:]][[:space:]]*/ /g' \
# ...
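The cleanup stages can be exercised on a short sample string of my own; iconv -c silently drops characters that have no ASCII equivalent, and the final sed collapses the leftover runs of whitespace:

```shell
# Accented letters and the em dash are dropped; newlines become single spaces
printf 'café —\nnaïve\n' \
| iconv -c -t ascii \
| tr -c -d '[:print:]\n' \
| tr '\n' ' ' \
| sed 's/[[:space:]][[:space:]]*/ /g'
```

With GNU iconv this prints caf nave (plus a trailing space), since the accented characters and the dash are simply removed rather than transliterated.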
Next, each character was split onto its own line so the characters could be sorted and tallied. The snippet below saves the raw counts to frequency.txt:
# ...
| sed 's/\(.\)/\1\n/g' \
| sort \
| uniq -c \
| sort -n \
| tee frequency.txt \
# ...
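The splitting and counting stages can be tried on a tiny input. uniq -c prefixes each distinct line with its count, and sort -n orders characters from least to most frequent; note that the \n in the sed replacement relies on GNU sed:

```shell
# Split a string into one character per line, then tally each character
printf 'aab' \
| sed 's/\(.\)/\1\n/g' \
| sort \
| uniq -c \
| sort -n
```

This prints a count of 1 for b and 2 for a (the exact column widths of uniq -c vary between implementations).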
Finally, I got lazy and decided to pass the rest to a small Python script, since I’ve never been fond of doing math in the shell. The snippet below parses the counts produced by uniq -c, handles the final line specially because its character is a space that split would otherwise discard, and normalizes the counts into relative frequencies:
# ...
| python3 -c 'import sys, json;
lines = sys.stdin.readlines();
data = {l.strip(): int(n.strip()) for n, l in map(lambda x: x.split(), lines[:-1])};
data[" "] = int(lines[-1].strip());
total = sum(data.values());
data = {k: v / total for k, v in data.items()};
print(json.dumps(data, indent=2))' \
| tee frequency.json
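Feeding the Python stage a few hand-written counts in the uniq -c format shows the normalization; the final line carries the count for the space character:

```shell
# Counts of 1, 2, and 3 (the last for a space) normalize to 1/6, 1/3, and 1/2
printf '      1 b\n      2 a\n      3 \n' \
| python3 -c 'import sys, json;
lines = sys.stdin.readlines();
data = {l.strip(): int(n.strip()) for n, l in map(lambda x: x.split(), lines[:-1])};
data[" "] = int(lines[-1].strip());
total = sum(data.values());
data = {k: v / total for k, v in data.items()};
print(json.dumps(data, indent=2))'
```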
Though this hardly counts as a one-liner, here is the whole pipeline in its full glory:
#!/bin/bash
# Calculate symbol frequency histogram from some of the top 100 Project
# Gutenberg books.
NUM_BOOKS=32
curl "https://gutenberg.org/browse/scores/top" \
| grep -o 'href="/ebooks/[0-9][0-9]*"' \
| sort \
| uniq \
| sed 's/href="\/ebooks\/\([0-9][0-9]*\)"/https:\/\/gutenberg.org\/ebooks\/\1.txt.utf-8/' \
| shuf \
| head -n "${NUM_BOOKS}" \
| xargs -L 1 -P 4 curl -L --compressed \
| tee text.txt \
| iconv -c -t ascii \
| tr -c -d '[:print:]\n' \
| tr '\n' ' ' \
| sed 's/[[:space:]][[:space:]]*/ /g' \
| sed 's/\(.\)/\1\n/g' \
| sort \
| uniq -c \
| sort -n \
| tee frequency.txt \
| python3 -c 'import sys, json;
lines = sys.stdin.readlines();
data = {l.strip(): int(n.strip()) for n, l in map(lambda x: x.split(), lines[:-1])};
data[" "] = int(lines[-1].strip());
total = sum(data.values());
data = {k: v / total for k, v in data.items()};
print(json.dumps(data, indent=2))' \
| tee frequency.json
The results of one of the runs can be found below. Due to the nondeterministic sampling of book URLs, the results will be slightly different after every run.
{
"^": 5.196361403817819e-08,
"%": 1.6108720351835239e-06,
"{": 1.922653719412593e-06,
"#": 2.234435403641662e-06,
"@": 2.3903262457561967e-06,
"}": 2.494253473832553e-06,
"$": 3.2737076844052257e-06,
"+": 7.794542105726728e-06,
"|": 8.31417824610851e-06,
"<": 1.2003594842819162e-05,
">": 1.2003594842819162e-05,
"=": 1.2211449298971873e-05,
"&": 1.4238030246460823e-05,
"Q": 4.0323764493626275e-05,
"[": 4.084340063400805e-05,
"]": 4.136303677438984e-05,
"9": 4.6663325406284015e-05,
"/": 4.785848852916211e-05,
"6": 4.957328779242199e-05,
"7": 5.393823137162896e-05,
"Z": 5.752372074026325e-05,
"4": 6.064153758255394e-05,
"5": 6.178473709139387e-05,
"*": 6.822822523212796e-05,
"8": 7.332065940786942e-05,
"0": 7.732185768880914e-05,
"3": 7.778953021515274e-05,
"X": 8.402516389973413e-05,
"2": 0.0001022124288130965,
"U": 0.00020135900439794048,
"K": 0.00020281398559100945,
"(": 0.00020951729180193444,
")": 0.0002103487096265453,
"1": 0.00024303382285655937,
"V": 0.00033511334693221115,
"_": 0.00037621656563641007,
"J": 0.000394559721391887,
"z": 0.0004528628963427229,
"Y": 0.00048269001080063716,
"F": 0.0006709541844609568,
"D": 0.0007090954771649795,
"O": 0.0007477044423953459,
"G": 0.0007549793483606909,
"R": 0.0007651642167121738,
"N": 0.0007676065065719682,
":": 0.0007801297375551691,
"L": 0.0009273946197393661,
"P": 0.000931447781634344,
"E": 0.0009454779574246522,
"!": 0.0009535323176005697,
"j": 0.0009649123490749307,
"q": 0.0009655359124433889,
"W": 0.0009925569917432415,
"?": 0.0010007152791472356,
"C": 0.001008146075954695,
"B": 0.0010093932026916112,
"x": 0.0012159485684933695,
"H": 0.0013806732249943944,
"M": 0.0014142417196630576,
"S": 0.0014376773095942758,
"'": 0.0015881119722348017,
";": 0.0017668148409120965,
"A": 0.001867260506847895,
"T": 0.0024605810519358136,
"-": 0.0027239326478813003,
"I": 0.00429868997130829,
"\"": 0.004835422140708633,
"k": 0.005644599538511144,
"v": 0.008026143933494888,
".": 0.009327364792624908,
"b": 0.010927168578018301,
"p": 0.012489298743234012,
",": 0.014844705440356553,
"g": 0.015086180354791968,
"y": 0.015144847275041071,
"f": 0.01664914193783229,
"w": 0.017037569952767673,
"c": 0.01866974706970685,
"m": 0.01892473252379219,
"u": 0.0228628989409036,
"l": 0.03078002321214639,
"d": 0.0334997468072906,
"r": 0.0453264575079238,
"s": 0.048719941322687026,
"h": 0.049453875407362256,
"i": 0.051417476454636936,
"n": 0.05463236132795094,
"o": 0.05965417695221452,
"a": 0.06139542569501983,
"t": 0.07050059407401749,
"e": 0.0993860239183319,
" ": 0.16554292753128405
}
Note that the most common letters form the sequence etaonihsrdlu, which is pretty close to the infamous etaoin shrdlu. This seems like a good sign that our sample was fairly representative, and that the shell script works!