Friday, December 21, 2012

Fun with dictionaries

Last night I saw a tweet from @acoyne Andrew Coyne, asking what letter is first in the most english words, and what pair and triplet of letters. That kind of question really grabs me, and as a linux geek I'm also in a position to give a definitive answer with just a few lines of code. I saw the tweet late last night and wished I could bang out the solution right there, but I was on an iPad so opening a terminal emulator to log in to a linux system, then using the virtual keyboard to type the code seemed more trouble than I could justify.

So as soon as I found myself with a free moment today, I logged in and tackled this. I'll recount the steps I took for those who want to know how to do this sort of thing. Answers are toward the end of the post for those who don't care how it was done.

Step 1: get a text file containing a list of english words. That's easy, as the linux operating system has long had such files for use in its spell-check utilities:
% dpkg -l ispell

Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version        Description
+++-==============-==============-============================================
ii  ispell         3.1.20.0-7     International Ispell (an interactive spelling


There's a whole post lurking in the decision of which dictionary to choose, but to keep this brief, I chose the Canadian english list already installed on a system we run in our department at UofT. A brief look at the file told me it includes proper nouns (the first batch of words all start with a capital letter.) I decided to skip these, on the basis that (a) they aren't legal in Scrabble(TM), and (b) it just felt right:

% ls -l /usr/share/dict/words

total 1832
-rw-r--r-- 1 root root    199 Jan 12  2011 README.select-wordlist
-rw-r--r-- 1 root root 931708 Mar 30  2009 american-english
-rw-r--r-- 1 root root 929744 Mar 30  2009 canadian-english 
[ ... ]

So I chose the Canadian-english dictionary, and filtered out words starting with a capital, as well as those with apostrophe-s at the end (included as useful to the spellchecker, but not distinct words for our purpose here):

% cd ~
% mkdir word-stuff
% cd  word-stuff
% egrep "^[a-z]" /usr/share/dict/canadian-english | grep -v "'" > words
% wc words
 63879  63879 594410 words

So that leaves us with 63,879 distinct Canadian english words that aren't proper nouns or acronyms.

Now the payoff:
% for L1 in {a..z}; do echo -n $L1 " "; grep -c "^$L1" words; done > counts-firstletter
% cat counts-firstletter

a  3591
b  3689
c  6213
d  4087
e  2583
f  2819
g  2065
h  2294
i  2659
j  568
k  459
l  1946
m  3325
n  1167
o  1565
p  5136
q  320
r  3770
s  7689
t  3262
u  1627
v  977
w  1762
x  11
y  190
z  105

So the first-letter winner is letter s with 7689 distinct words starting with s in this dictionary; second place goes to letter c with 6213, then d with 4087. Letters that start over 3000 words include a, b, m, r and t. Last place goes to x with just eleven.


Mr. Coyne's follow-up questions take us beyond what I can comfortably do on a single line at the command prompt. No worries though, as a few lines of shell script will cover this:

% vim coyne.sh
#!/bin/bash

> first-letter-counts
> two-letter-counts
for L1 in {a..z}
do
        C1=`grep -c "^$L1" words`
        echo "$C1       $L1" >> first-letter-counts
        for L2 in {a..z}
        do
                C2=`grep -c "^$L1$L2" words`
                echo "$C2       $L1$L2" >> two-letter-counts
        done
done


I'm stopping at counts for pairs of letters. The run times will show why:

% chmod +x coyne.sh
% time ./coyne.sh

real    0m13.652s
user    0m1.956s
sys     0m1.240s

Just under 14 seconds. That was less time than it took to type in the script, but still longer than it could have been. This code is reading through all 63,879 words in the dictionary for each of the 676 combinations of two letters. How many steps is that?

% dc
63879
676
*
p
43182204
q

So the two-letter test used over 43 million tests for "does this line start with these two letters"?
Running the three-letter test would have taken 26 times as long - over 5 minutes to run. As the great Bill Wasserstom said through Calvin, "Six minutes to microwave this!! Who's got that kind of time?"

Clearly we can do this with far fewer steps, but to do so neatly in /bin/bash will be daunting. So, time to switch into perl (others may prefer python - let's all just try to get along.)

% vim coyne.pl
#!/usr/bin/perl

open(WORDS, "<words");
WORD: while() {
        chomp();
        $first=substr($_,0,1);
        $one{$first}++;
        if ($one{$first} > $max_one) {
                $L1_max=$first;
                $max_one=$one{$first};
                }
        next WORD if (length($_) < 2);
        $firsttwo=substr($_,0,2);
        $two{$firsttwo}++;
        if ($two{$firsttwo}> $max_two) {
                $L2_max=$firsttwo;
                $max_two=$two{$firsttwo};
                }
        next WORD if (length($_) < 3);
        $firstthree=substr($_,0,3);
        $three{$firstthree}++;
        if ($three{$firstthree}> $max_three) {
                $L3_max=$firstthree;
                $max_three=$three{$firstthree};
                }
        }
close(WORDS);
printf "Letter %s starts %d words\n", $L1_max, $one{$L1_max};
printf "Letters %s start %d words\n", $L2_max, $two{$L2_max};
printf "Letters %s start %d words\n", $L3_max, $three{$L3_max};


% time ./coyne.pl
Letter s starts 7689 words
Letters co start 2535 words
Letters con start 969 words

real    0m0.171s
user    0m0.160s
sys     0m0.012s

So by reading through the word list just once, and storing stats on the first 1, 2 and 3 letters of that word into counters indexed by those strings of letters (using perl's amazing 'hash' feature), this script ran in just 171 ms.

Anyway, there are your results, Mr. Coyne: 's', 'co,' and 'con.' 

You can see why 'co' would be so prevalent, as it's a very useful prefix in its own right, giving us coexist, cooperate, cohabitate, etc., besides all the simple words it contributes to (cob, cobble, corn, cow, coy, etc.) On top of that, this digraph also gains by starting off both the winning trigraph 'con' - itself an important prefix for conceive, conserve, conservative, conservation, etc., as well as another important prefix 'com' (commerce, compromise, command, commit, complex, etc.)

Another few lines added to coyne.pl will allow me to show the runners up in each category. I imagine that 'com' may place in the top ten initial trigraphs, just from thinking of examples for the previous paragraph. I'll wrap up here for now, but I may post again with more wordplay stemming from this great little question.

- Jim Prall
Toronto, Canada




No comments:

Post a Comment