Frequency Analysis From the Shell

By Jacob Strieb.

Published on October 9, 2022.


Recently, in the course of performing cryptanalysis, I needed the relative frequencies with which each character—symbols and punctuation included—occurs in English text. Web searches seemed to only return results for letter frequencies, with no symbols and no distinction between uppercase and lowercase letters, so I resolved to sample the frequencies myself.

I chose to sample text from books hosted by Project Gutenberg since they are free, legal, and easy to access. Even better, they’re available as plain text files.

Since this was a low-stakes, one-off task, I decided the ceremony of a full-blown program was overkill and opted for a shell script instead. Plus, text wrangling is the shell’s bread and butter.

The first major step was to get links to books to download. The snippet below:

  1. Pulls the list of top books
  2. Extracts links to books matching a known pattern
  3. Ensures that no book links are repeated
  4. Converts each relative book page path to an absolute link pointing to the respective ebook text file
  5. Randomly shuffles the order of the book URLs
  6. Takes the first NUM_BOOKS book URLs from the shuffled list (16 in this snippet)
NUM_BOOKS=16

curl "https://gutenberg.org/browse/scores/top" \
  | grep -o 'href="/ebooks/[0-9][0-9]*"' \
  | sort \
  | uniq \
  | sed 's/href="\/ebooks\/\([0-9][0-9]*\)"/https:\/\/gutenberg.org\/ebooks\/\1.txt.utf-8/' \
  | shuf \
  | head -n "${NUM_BOOKS}" \
  # ...

The number of books to sample can be tuned for greater accuracy at the expense of a longer running time. Since some Project Gutenberg works may be short, or may not be representative of the distribution of symbols in typical English text, increasing this number reduces the risk of a skewed sample. Likewise, the shuf stage can be removed for deterministic output.
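
To make the link rewriting concrete, the standalone sketch below (not part of the pipeline) runs a sample matched href through the same sed substitution; the ebook number 1342 is an arbitrary example:

# Sketch: apply the same substitution to one sample match
echo 'href="/ebooks/1342"' \
  | sed 's/href="\/ebooks\/\([0-9][0-9]*\)"/https:\/\/gutenberg.org\/ebooks\/\1.txt.utf-8/'
# prints: https://gutenberg.org/ebooks/1342.txt.utf-8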

After getting book text URLs, the next step was to fetch the text from each of the links, concatenate it, and clean it up. The snippet below:

  1. Fetches the text from each of the links (up to four downloads running in parallel) and writes the text to the standard output
  2. Duplicates the concatenated text to a file called text.txt
  3. Strips all non-ASCII characters, then all remaining characters that are neither printable nor newlines
  4. Converts newlines to spaces and collapses every contiguous run of whitespace down to a single space character
  # ... 
  | xargs -L 1 -P 4 curl -L --compressed \
  | tee text.txt \
  | iconv -c -t ascii \
  | tr -c -d '[:print:]\n' \
  | tr '\n' ' ' \
  | sed 's/[[:space:]][[:space:]]*/ /g' \
  # ...
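
As a rough illustration of the cleanup stages, the standalone sketch below runs a small hand-made sample through the same commands (assuming a UTF-8 locale; the accented character and the tab are only there to exercise iconv and tr):

# Sketch: run a tiny sample through the same cleanup stages
printf 'Héllo,\n\tworld!\n' \
  | iconv -c -t ascii \
  | tr -c -d '[:print:]\n' \
  | tr '\n' ' ' \
  | sed 's/[[:space:]][[:space:]]*/ /g'
# prints "Hllo, world! " (the accent is dropped, and the tab and newlines become single spaces)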

Next, the snippet below:

  1. Puts each character on its own line
  2. Counts the occurrences of each line and returns the results sorted least to greatest
  3. Saves the raw frequency data to a file
  # ... 
  | sed 's/\(.\)/\1\n/g' \
  | sort \
  | uniq -c \
  | sort -n \
  | tee frequency.txt \
  # ...
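
Here is a minimal sketch of those counting stages applied to a toy string; it mirrors the pipeline exactly and assumes GNU sed, which interprets \n in the replacement text:

# Sketch: count character occurrences in a toy string
printf 'banana' \
  | sed 's/\(.\)/\1\n/g' \
  | sort \
  | uniq -c \
  | sort -n
# prints (roughly):
#       1 b
#       2 n
#       3 a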

Finally, I got lazy and decided to pass the rest to a small Python script, since I’ve never been fond of doing math in the shell. The snippet below:

  1. Reads all lines in from the standard input
  2. Parses the input lines into a dictionary mapping characters to counts (the last line holds the count for the space character and is handled separately, since splitting that line discards the space itself)
  3. Calculates the total number of characters
  4. Creates a new dictionary calculating the proportional occurrence of each character
  5. Dumps that last dictionary as pretty-printed JSON with two-space indents
  # ... 
  | python3 -c 'import sys, json;
lines = sys.stdin.readlines();
data = {l.strip(): int(n.strip()) for n, l in map(lambda x: x.split(), lines[:-1])};
data[" "] = int(lines[-1].strip());
total = sum(data.values());
data = {k: v / total for k, v in data.items()}; 
print(json.dumps(data, indent=2))' \
  | tee frequency.json
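
Once frequency.json exists, a single character’s relative frequency can be pulled back out, for example with jq (assuming it is installed):

# Sketch: look up the relative frequency of a single character
jq '.["e"]' frequency.json
jq '.[" "]' frequency.json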

Though this hardly counts as a one-liner, here is the whole pipeline in its full glory:

#!/bin/bash

# Calculate symbol frequency histogram from some of the top 100 Project
# Gutenberg books.

NUM_BOOKS=32

curl "https://gutenberg.org/browse/scores/top" \
  | grep -o 'href="/ebooks/[0-9][0-9]*"' \
  | sort \
  | uniq \
  | sed 's/href="\/ebooks\/\([0-9][0-9]*\)"/https:\/\/gutenberg.org\/ebooks\/\1.txt.utf-8/' \
  | shuf \
  | head -n "${NUM_BOOKS}" \
  | xargs -L 1 -P 4 curl -L --compressed \
  | tee text.txt \
  | iconv -c -t ascii \
  | tr -c -d '[:print:]\n' \
  | tr '\n' ' ' \
  | sed 's/[[:space:]][[:space:]]*/ /g' \
  | sed 's/\(.\)/\1\n/g' \
  | sort \
  | uniq -c \
  | sort -n \
  | tee frequency.txt \
  | python3 -c 'import sys, json; 
lines = sys.stdin.readlines(); 
data = {l.strip(): int(n.strip()) for n, l in map(lambda x: x.split(), lines[:-1])}; 
data[" "] = int(lines[-1].strip()); 
total = sum(data.values()); 
data = {k: v / total for k, v in data.items()};  
print(json.dumps(data, indent=2))' \
  | tee frequency.json

The results of one of the runs can be found below. Due to the nondeterministic sampling of book URLs, the results will be slightly different after every run.

{
  "^": 5.196361403817819e-08,
  "%": 1.6108720351835239e-06,
  "{": 1.922653719412593e-06,
  "#": 2.234435403641662e-06,
  "@": 2.3903262457561967e-06,
  "}": 2.494253473832553e-06,
  "$": 3.2737076844052257e-06,
  "+": 7.794542105726728e-06,
  "|": 8.31417824610851e-06,
  "<": 1.2003594842819162e-05,
  ">": 1.2003594842819162e-05,
  "=": 1.2211449298971873e-05,
  "&": 1.4238030246460823e-05,
  "Q": 4.0323764493626275e-05,
  "[": 4.084340063400805e-05,
  "]": 4.136303677438984e-05,
  "9": 4.6663325406284015e-05,
  "/": 4.785848852916211e-05,
  "6": 4.957328779242199e-05,
  "7": 5.393823137162896e-05,
  "Z": 5.752372074026325e-05,
  "4": 6.064153758255394e-05,
  "5": 6.178473709139387e-05,
  "*": 6.822822523212796e-05,
  "8": 7.332065940786942e-05,
  "0": 7.732185768880914e-05,
  "3": 7.778953021515274e-05,
  "X": 8.402516389973413e-05,
  "2": 0.0001022124288130965,
  "U": 0.00020135900439794048,
  "K": 0.00020281398559100945,
  "(": 0.00020951729180193444,
  ")": 0.0002103487096265453,
  "1": 0.00024303382285655937,
  "V": 0.00033511334693221115,
  "_": 0.00037621656563641007,
  "J": 0.000394559721391887,
  "z": 0.0004528628963427229,
  "Y": 0.00048269001080063716,
  "F": 0.0006709541844609568,
  "D": 0.0007090954771649795,
  "O": 0.0007477044423953459,
  "G": 0.0007549793483606909,
  "R": 0.0007651642167121738,
  "N": 0.0007676065065719682,
  ":": 0.0007801297375551691,
  "L": 0.0009273946197393661,
  "P": 0.000931447781634344,
  "E": 0.0009454779574246522,
  "!": 0.0009535323176005697,
  "j": 0.0009649123490749307,
  "q": 0.0009655359124433889,
  "W": 0.0009925569917432415,
  "?": 0.0010007152791472356,
  "C": 0.001008146075954695,
  "B": 0.0010093932026916112,
  "x": 0.0012159485684933695,
  "H": 0.0013806732249943944,
  "M": 0.0014142417196630576,
  "S": 0.0014376773095942758,
  "'": 0.0015881119722348017,
  ";": 0.0017668148409120965,
  "A": 0.001867260506847895,
  "T": 0.0024605810519358136,
  "-": 0.0027239326478813003,
  "I": 0.00429868997130829,
  "\"": 0.004835422140708633,
  "k": 0.005644599538511144,
  "v": 0.008026143933494888,
  ".": 0.009327364792624908,
  "b": 0.010927168578018301,
  "p": 0.012489298743234012,
  ",": 0.014844705440356553,
  "g": 0.015086180354791968,
  "y": 0.015144847275041071,
  "f": 0.01664914193783229,
  "w": 0.017037569952767673,
  "c": 0.01866974706970685,
  "m": 0.01892473252379219,
  "u": 0.0228628989409036,
  "l": 0.03078002321214639,
  "d": 0.0334997468072906,
  "r": 0.0453264575079238,
  "s": 0.048719941322687026,
  "h": 0.049453875407362256,
  "i": 0.051417476454636936,
  "n": 0.05463236132795094,
  "o": 0.05965417695221452,
  "a": 0.06139542569501983,
  "t": 0.07050059407401749,
  "e": 0.0993860239183319,
  " ": 0.16554292753128405
}

Note that the most common letters form the sequence etaonihsrdlu, which is pretty close to the infamous etaoin shrdlu. This seems like a good sign that our sample was fairly representative, and that the shell script works!
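
As one last sanity check (again assuming jq is available), the relative frequencies in frequency.json should sum to approximately 1:

# Sketch: the proportions in frequency.json should add up to (about) 1
jq '[.[]] | add' frequency.json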