So as soon as I found myself with a free moment today, I logged in and tackled this. I'll recount the steps I took for those who want to know how to do this sort of thing. Answers are toward the end of the post for those who don't care how it was done.
Step 1: get a text file containing a list of English words. That's easy, as Linux distributions have long shipped such files for use in their spell-check utilities:
% dpkg -l ispell
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Description
+++-==============-==============-============================================
ii ispell 3.1.20.0-7 International Ispell (an interactive spelling
There's a whole post lurking in the decision of which dictionary to choose, but to keep this brief, I chose the Canadian English list already installed on a system we run in our department at UofT. A brief look at the file told me it includes proper nouns (the first batch of words all start with a capital letter). I decided to skip these, on the basis that (a) they aren't legal in Scrabble(TM), and (b) it just felt right:
% ls -l /usr/share/dict
total 1832
-rw-r--r-- 1 root root 199 Jan 12 2011 README.select-wordlist
-rw-r--r-- 1 root root 931708 Mar 30 2009 american-english
-rw-r--r-- 1 root root 929744 Mar 30 2009 canadian-english
[ ... ]
So I chose the canadian-english dictionary, and filtered out words starting with a capital, as well as any entries containing an apostrophe (the possessive 's forms are useful to the spellchecker, but they aren't distinct words for our purpose here):
% cd ~
% mkdir word-stuff
% cd word-stuff
% egrep "^[a-z]" /usr/share/dict/canadian-english | grep -v "'" > words
% wc words
63879 63879 594410 words
Now the payoff:
% for L1 in {a..z}; do echo -n $L1 " "; grep -c "^$L1" words; done > counts-firstletter
% cat counts-firstletter
a 3591
b 3689
c 6213
d 4087
e 2583
f 2819
g 2065
h 2294
i 2659
j 568
k 459
l 1946
m 3325
n 1167
o 1565
p 5136
q 320
r 3770
s 7689
t 3262
u 1627
v 977
w 1762
x 11
y 190
z 105
So the first-letter winner is letter s with 7689 distinct words starting with s in this dictionary; second place goes to letter c with 6213, then d with 4087. Letters that start over 3000 words include a, b, m, r and t. Last place goes to x with just eleven.
Mr. Coyne's follow-up questions take us beyond what I can comfortably do on a single line at the command prompt. No worries though, as a few lines of shell script will cover this:
% vim coyne.sh
#!/bin/bash
> first-letter-counts
> two-letter-counts
for L1 in {a..z}
do
    C1=`grep -c "^$L1" words`
    echo "$C1 $L1" >> first-letter-counts
    for L2 in {a..z}
    do
        C2=`grep -c "^$L1$L2" words`
        echo "$C2 $L1$L2" >> two-letter-counts
    done
done
I'm stopping at counts for pairs of letters. The run times will show why:
% chmod +x coyne.sh
% time ./coyne.sh
real 0m13.652s
user 0m1.956s
sys 0m1.240s
Just under 14 seconds. That was less time than it took to type in the script, but still longer than it could have been. This code is reading through all 63,879 words in the dictionary for each of the 676 combinations of two letters. How many steps is that?
% dc
63879
676
*
p
43182204
q
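(For anyone without dc handy, bash's built-in arithmetic expansion gives the same product directly:)

```shell
# Same multiplication as the dc session above, using bash arithmetic expansion:
echo $((63879 * 676))   # prints 43182204
```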
So the two-letter pass alone performed over 43 million checks of "does this line start with these two letters".
Running the three-letter test the same way would have taken 26 times as long - over 5 minutes. As the great Bill Watterson said through Calvin, "Six minutes to microwave this!! Who's got that kind of time?"
Clearly we can do this with far fewer steps, but doing so neatly in /bin/bash would be daunting. So, time to switch into Perl (others may prefer Python - let's all just try to get along).
% vim coyne.pl
#!/usr/bin/perl
open(WORDS, "<", "words") or die "can't open words: $!";
WORD: while (<WORDS>) {
    chomp();
    $first = substr($_, 0, 1);
    $one{$first}++;
    if ($one{$first} > $max_one) {
        $L1_max = $first;
        $max_one = $one{$first};
    }
    next WORD if (length($_) < 2);
    $firsttwo = substr($_, 0, 2);
    $two{$firsttwo}++;
    if ($two{$firsttwo} > $max_two) {
        $L2_max = $firsttwo;
        $max_two = $two{$firsttwo};
    }
    next WORD if (length($_) < 3);
    $firstthree = substr($_, 0, 3);
    $three{$firstthree}++;
    if ($three{$firstthree} > $max_three) {
        $L3_max = $firstthree;
        $max_three = $three{$firstthree};
    }
}
close(WORDS);
printf "Letter %s starts %d words\n", $L1_max, $one{$L1_max};
printf "Letters %s start %d words\n", $L2_max, $two{$L2_max};
printf "Letters %s start %d words\n", $L3_max, $three{$L3_max};
% time ./coyne.pl
Letter s starts 7689 words
Letters co start 2535 words
Letters con start 969 words
real 0m0.171s
user 0m0.160s
sys 0m0.012s
So by reading through the word list just once, and storing counts of the first one, two, and three letters of each word in counters indexed by those strings of letters (using Perl's amazing 'hash' feature), this script ran in just 171 ms.
Anyway, there are your results, Mr. Coyne: 's', 'co', and 'con'.
You can see why 'co' would be so prevalent, as it's a very useful prefix in its own right, giving us coexist, cooperate, cohabitate, etc., besides all the simple words it contributes to (cob, cobble, corn, cow, coy, etc.) On top of that, this digraph also gains by starting off both the winning trigraph 'con' - itself an important prefix for conceive, conserve, conservative, conservation, etc. - and another important prefix, 'com' (commerce, compromise, command, commit, complex, etc.)
Another few lines added to coyne.pl will allow me to show the runners up in each category. I imagine that 'com' may place in the top ten initial trigraphs, just from thinking of examples for the previous paragraph. I'll wrap up here for now, but I may post again with more wordplay stemming from this great little question.
- Jim Prall
Toronto, Canada