Friday, December 21, 2012

Fun with dictionaries

Last night I saw a tweet from Andrew Coyne (@acoyne), asking which letter starts the most English words, and which pair and triplet of letters. That kind of question really grabs me, and as a Linux geek I'm in a position to give a definitive answer with just a few lines of code. I wished I could bang out the solution right there, but I was on an iPad, and opening a terminal emulator to log in to a Linux system, then typing the code on the virtual keyboard, seemed more trouble than I could justify.

So as soon as I found myself with a free moment today, I logged in and tackled this. I'll recount the steps I took for those who want to know how to do this sort of thing. Answers are toward the end of the post for those who don't care how it was done.

Step 1: get a text file containing a list of English words. That's easy, as Linux systems have long shipped such files for use by their spell-check utilities:
% dpkg -l ispell

Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version        Description
+++-==============-==============-============================================
ii  ispell         3.1.20.0-7     International Ispell (an interactive spelling


There's a whole post lurking in the decision of which dictionary to choose, but to keep this brief, I chose the Canadian English list already installed on a system we run in our department at UofT. A brief look at the file told me it includes proper nouns (the first batch of words all start with a capital letter). I decided to skip these, on the basis that (a) they aren't legal in Scrabble(TM), and (b) it just felt right:

% ls -l /usr/share/dict/

total 1832
-rw-r--r-- 1 root root    199 Jan 12  2011 README.select-wordlist
-rw-r--r-- 1 root root 931708 Mar 30  2009 american-english
-rw-r--r-- 1 root root 929744 Mar 30  2009 canadian-english 
[ ... ]

So I chose the Canadian English dictionary, and filtered out words starting with a capital, as well as any containing an apostrophe (mostly the apostrophe-s forms, included as useful to the spellchecker, but not distinct words for our purpose here):

% cd ~
% mkdir word-stuff
% cd  word-stuff
% egrep "^[a-z]" /usr/share/dict/canadian-english | grep -v "'" > words
% wc words
 63879  63879 594410 words

So that leaves us with 63,879 distinct Canadian English words that aren't proper nouns or acronyms.

Now the payoff:
% for L1 in {a..z}; do echo -n $L1 " "; grep -c "^$L1" words; done > counts-firstletter
% cat counts-firstletter

a  3591
b  3689
c  6213
d  4087
e  2583
f  2819
g  2065
h  2294
i  2659
j  568
k  459
l  1946
m  3325
n  1167
o  1565
p  5136
q  320
r  3770
s  7689
t  3262
u  1627
v  977
w  1762
x  11
y  190
z  105

So the first-letter winner is s, with 7689 distinct words starting with s in this dictionary; second place goes to c with 6213, then p with 5136 and d with 4087. Letters that start over 3000 words include a, b, m, r and t. Last place goes to x with just eleven.
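Picking the winners out of an alphabetical table by eye is error-prone, so here is a minimal sketch that sorts the table by its count column instead. The function name rank_counts is made up for illustration; it assumes input lines in the "letter  count" layout of counts-firstletter.

```shell
#!/bin/bash
# Rank "letter  count" lines by the count in column 2, largest first.
# Reads lines in the layout of counts-firstletter on stdin.
rank_counts() {
    sort -k2,2nr
}

# Demo on four of the figures from the table above.
printf '%s\n' 'a  3591' 'c  6213' 's  7689' 'x  11' | rank_counts
```

Against the real file, something like `rank_counts < counts-firstletter | head -n 5` should list the five most common first letters.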


Mr. Coyne's follow-up questions take us beyond what I can comfortably do on a single line at the command prompt. No worries though, as a few lines of shell script will cover this:

% vim coyne.sh
#!/bin/bash

> first-letter-counts
> two-letter-counts
for L1 in {a..z}
do
        C1=`grep -c "^$L1" words`
        echo "$C1       $L1" >> first-letter-counts
        for L2 in {a..z}
        do
                C2=`grep -c "^$L1$L2" words`
                echo "$C2       $L1$L2" >> two-letter-counts
        done
done


I'm stopping at counts for pairs of letters. The run times will show why:

% chmod +x coyne.sh
% time ./coyne.sh

real    0m13.652s
user    0m1.956s
sys     0m1.240s

Just under 14 seconds. That was less time than it took to type in the script, but still longer than it could have been. This code is reading through all 63,879 words in the dictionary for each of the 676 combinations of two letters. How many steps is that?

% dc
63879
676
*
p
43182204
q

So the two-letter test made over 43 million checks of "does this line start with these two letters?"
Running the three-letter test would have taken 26 times as long - nearly 6 minutes. As the great Bill Watterson said through Calvin, "Six minutes to microwave this!! Who's got that kind of time?"
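The same arithmetic can be done in bash itself. This is only a rough sketch: the helper name prefix_tests is invented here, and the run-time estimate assumes the time grows in direct proportion to the number of prefixes tested.

```shell
#!/bin/bash
# Estimate the work done by the grep-per-prefix approach.
word_count=63879          # lines in the filtered word list
two_letter_secs=13.652    # measured "real" time of the two-letter run

# Line checks needed for all prefixes of length n: word_count * 26^n
prefix_tests() {
    local n=$1
    echo $(( word_count * 26 ** n ))
}

prefix_tests 2   # 43182204, the same figure dc produced
prefix_tests 3   # 1122737304

# Projected three-letter run time, assuming linear scaling (26 times the work).
awk -v t="$two_letter_secs" \
    'BEGIN { printf "%.0f seconds (%.1f minutes)\n", t * 26, t * 26 / 60 }'
```

The last line prints 355 seconds (5.9 minutes), on top of the 14 seconds already spent on the pairs.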

Clearly we can do this with far fewer steps, but to do so neatly in /bin/bash will be daunting. So, time to switch into perl (others may prefer python - let's all just try to get along.)

% vim coyne.pl
#!/usr/bin/perl

open(WORDS, "<", "words") or die "Cannot open words: $!";
WORD: while (<WORDS>) {
        chomp();
        $first=substr($_,0,1);
        $one{$first}++;
        if ($one{$first} > $max_one) {
                $L1_max=$first;
                $max_one=$one{$first};
                }
        next WORD if (length($_) < 2);
        $firsttwo=substr($_,0,2);
        $two{$firsttwo}++;
        if ($two{$firsttwo}> $max_two) {
                $L2_max=$firsttwo;
                $max_two=$two{$firsttwo};
                }
        next WORD if (length($_) < 3);
        $firstthree=substr($_,0,3);
        $three{$firstthree}++;
        if ($three{$firstthree}> $max_three) {
                $L3_max=$firstthree;
                $max_three=$three{$firstthree};
                }
        }
close(WORDS);
printf "Letter %s starts %d words\n", $L1_max, $one{$L1_max};
printf "Letters %s start %d words\n", $L2_max, $two{$L2_max};
printf "Letters %s start %d words\n", $L3_max, $three{$L3_max};


% time ./coyne.pl
Letter s starts 7689 words
Letters co start 2535 words
Letters con start 969 words

real    0m0.171s
user    0m0.160s
sys     0m0.012s

So by reading through the word list just once, and storing stats on the first 1, 2 and 3 letters of that word into counters indexed by those strings of letters (using perl's amazing 'hash' feature), this script ran in just 171 ms.
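For anyone who would rather stay in the shell, awk's associative arrays can play the role of the perl hashes. The following is a sketch of the same one-pass idea, not a port of coyne.pl; the function name top_prefixes and the five-word sample are made up for illustration.

```shell
#!/bin/bash
# One pass over a word list on stdin, tallying 1-, 2- and 3-letter prefixes
# in awk associative arrays and remembering the leader at each length.
top_prefixes() {
    awk '
    {
        for (n = 1; n <= 3 && n <= length($0); n++) {
            p = substr($0, 1, n)
            count[n, p]++
            # On a tie, the prefix that reached the count first stays the leader.
            if (count[n, p] > max[n]) { max[n] = count[n, p]; best[n] = p }
        }
    }
    END {
        for (n = 1; n <= 3; n++)
            if (n in best) printf "%s starts %d words\n", best[n], max[n]
    }'
}

# Tiny made-up sample; the real input would be the file "words".
printf '%s\n' cob con cone cat sea | top_prefixes
```

On the sample this reports c (4 words), co (3) and con (2). Run as `top_prefixes < words`, it reads the list once, just as the perl script does.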

Anyway, there are your results, Mr. Coyne: 's', 'co,' and 'con.' 

You can see why 'co' would be so prevalent, as it's a very useful prefix in its own right, giving us coexist, cooperate, cohabitate, etc., besides all the simple words it contributes to (cob, cobble, corn, cow, coy, etc.) On top of that, this digraph also gains by starting off both the winning trigraph 'con' - itself an important prefix for conceive, conserve, conservative, conservation, etc., as well as another important prefix 'com' (commerce, compromise, command, commit, complex, etc.)

Another few lines added to coyne.pl will allow me to show the runners up in each category. I imagine that 'com' may place in the top ten initial trigraphs, just from thinking of examples for the previous paragraph. I'll wrap up here for now, but I may post again with more wordplay stemming from this great little question.
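In the meantime, a shell pipeline is one possible way to get at the runners-up without touching coyne.pl. This is a sketch under assumptions: runners_up is a made-up helper name, and the sample words are invented, so the output says nothing about the real dictionary.

```shell
#!/bin/bash
# List the N most common prefixes of a given length from a word list on stdin.
# Usage: runners_up LENGTH [N]   (N defaults to 10)
runners_up() {
    local len=$1 n=${2:-10}
    # Keep only words long enough, cut each down to its prefix, then count
    # duplicates and sort by count (descending), breaking ties alphabetically.
    awk -v len="$len" 'length($0) >= len { print substr($0, 1, len) }' |
        sort | uniq -c | sort -k1,1nr -k2,2 | head -n "$n"
}

# Made-up sample: "com" appears three times, "con" twice, "cat" once.
printf '%s\n' con cone com come comb cat | runners_up 3 2
```

With the real list, `runners_up 3 10 < words` would show whether 'com' makes the top ten.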

- Jim Prall
Toronto, Canada




Sunday, December 9, 2012

Saving energy

Saving energy by shaving some watts


I recently received a handy interactive, whole-house power usage display. It's the PowerCost Monitor(TM) by Blueline Innovations of St. John's, Newfoundland, and it was provided free of charge by my local electric utility, Toronto Hydro, as part of their "PeakSaver Plus" demand-reduction program.

They installed a sender over top of my existing time-of-use "smart meter;" the indoor portion is a wireless LCD display, powered by two AA batteries (provided) and including a small backup battery to retain settings when replacing the AAs. There is a fair amount of info already programmed in (which I wouldn't want to have to re-enter with the overloaded few-button interface): times of day for the peak, shoulder and off-peak rates on weekdays (weekends are all off-peak rate), as well as the current price per kWh at each of the three rates. It also knows and displays the current time and day of the week, and the outside air temperature (hey, they were already setting up one outdoor sensor, so why not toss in one more?) It also averages your recent usage to predict your expected monthly usage total. It's tracking that pretty well, I'd say. (I don't know if it's programmed to know your billing date, or if it's just projecting thirty days worth of your recent usage.)

Thanks to all this preloaded info, the unit knows which rate is in effect at the moment, and how much that is (if the utility revises the rates or the timetable, someone will need to reprogram the display - it's not a web-enabled kiosk.) If I want, I can set it to show exactly how many cents per hour we are consuming from electricity usage. But I know the rates and times, and prefer to keep the display set to kilowatts, as that's the 'comparable' I want to focus on in isolating which devices in our home are the power hogs. By watching the kW display change as I turn items on or off, I can isolate that item's draw (barring any confounding factors like the fridge or furnace motor cycling on or off automatically - I just listen for their motors first.) The read-out is "near" real-time - there is a lag in the response of up to ten seconds or so - I guess it is gathering data from something that blinks or rotates past its optical sensor, and it needs some time to detect the change in rate.
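The cents-per-hour view is just the kW reading multiplied by the price per kWh in effect. A minimal sketch of that conversion, where the function name cents_per_hour and both numbers in the demo are placeholders rather than actual Toronto Hydro rates or readings from my display:

```shell
#!/bin/bash
# Convert a kW reading into cents per hour at a given price per kWh.
# Usage: cents_per_hour KW CENTS_PER_KWH
cents_per_hour() {
    awk -v kw="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", kw * rate }'
}

# Placeholder figures only: a 0.5 kW draw at 10 cents per kWh.
cents_per_hour 0.5 10
```

Because the rate is just a multiplier, watching kW directly gives the same ranking of power hogs without having to think about which rate period is in effect.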

The display shows kW with one decimal place, so its resolution is 100W increments. That's a lower level of resolution than offered by the Kill-A-Watt from P3, but it gives info on the whole house including built-ins (light fixtures, ceiling pot lights, oven and dishwasher) and the 220V dryer that can't be fed through a one-outlet 110V meter like Kill-A-Watt. (You could use a clamping AC ammeter, which I own, but you'd need access to a single conductor feeding that outlet - too scary!)

I had already tested several items one-by-one on the Kill-A-Watt, noting e.g. that my small Samsung b&w laser printer draws 550W during warm-up, 45W while printing, and 10W on standby. Finding that has prompted me to keep it powered off when not in use, given that I print with it less than once a week on average. Powering it up looks to equate to around 20 minutes of standby time.
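To put the 20-minute figure in perspective, here is the arithmetic worked backwards as a rough sketch. The warm-up duration is not something I timed; the script only derives how long a 550W warm-up would have to last to match 20 minutes of 10W standby. Both function names are made up for illustration.

```shell
#!/bin/bash
# Energy used at a given wattage for a given number of minutes, in watt-hours.
standby_wh() {
    awk -v w="$1" -v min="$2" 'BEGIN { printf "%.2f\n", w * min / 60 }'
}

# Seconds at a given wattage needed to use a given number of watt-hours.
equivalent_warmup_secs() {
    awk -v wh="$1" -v w="$2" 'BEGIN { printf "%.0f\n", wh * 3600 / w }'
}

wh=$(standby_wh 10 20)              # 10 W standby for 20 minutes
echo "$wh Wh"                       # 3.33 Wh
equivalent_warmup_secs "$wh" 550    # about 22 seconds of warm-up at 550 W
```

So each cold start costs about as much as 20 minutes on standby, while a full week on standby (10W for 168 hours) comes to roughly 1.7 kWh.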

With the whole-house display, I can now develop a much clearer and more inclusive sense of where to focus to cut our energy use, whether by switching off high-demand items we don't need right now (back porch light, kitchen ceiling lights) or by working out if there are places I could save by replacing/upgrading bulbs, and perhaps one day appliances (though nothing is due for replacement for a while, happily.)

The eye-catching particulars:
  • our 220V electric dryer consumes some 5 to 6 kW while running
  • The dishwasher comes in around 1 kW - I haven't isolated it exactly, plus it cycles and the draw likely varies depending on what it's doing
  • The eight halogen pot-lights and one incandescent in the kitchen ceiling together draw nearly 500W - ouch! I hadn't thought that through, but halogens are not really much more efficient than incandescents. There are more in our upstairs hall and bathroom. For now I'm focusing on these rooms to keep the lights off when unoccupied.
    These fixtures take GU10 bulbs, with two fat round notched pins. None of the halogens has burned out in the 10+ years since the reno when they were installed. About a year ago I found a compact fluorescent bulb in GU10 form-factor, but the light is dimmer and far toward the blue-purple; also I expect this CF is not dimmable, and we have dimmers on both the kitchen and bathroom switch. (Add a dimmer in your bathroom - it's great when you have to pee at 4 am and just want to avoid banging into the sink without wrecking your night vision/waking yourself up more than you have to!) Anyway we didn't make a big switch to GU10 CFs. I looked up prices for GU10 LED bulbs - these run over $25 each at the local Home Depot (ouch), but ordering from China off eBay (free shipping by post, 15 to 25 business days for delivery) they're going for as little as $6.50. I may order a few to try them out. They claim to use CREE LEDs (real or imitation? Anyone's guess.)
  • base demand in our house, with everything turned off that can be, is 0.2 kW, i.e. ~200W, when the fridge is not running (the fridge doesn't show up as a big draw, I was happy to discover.) Items I know are contributing to the base load (the first four 'comms' items should probably go on a UPS battery back up, in case Rogers can keep service going during a local power outage - something not that uncommon in our neighbourhood):
    • cable modem plus wifi router, left on 24/7 because (a) we use VOIP for our main tel. #, and (b) we may want internet access at any hour of the day or night (iPad 1, iPad mini, Macbook, iPod Touch; there's even a wifi NIC in my new stereo for streaming radio.)
    • the NetTalk Duo VOIP SIP, so our phone can ring on incoming calls, and
    • Panasonic base unit for our cordless phones
    • charging bases for the additional phone handsets (I connected one through the Kill-A-Watt, but it didn't register even 1W since the phone was displaying "fully charged" - so I listened to dialtone for a minute then tried again - the charging light came on briefly but still only drew 1W. This is just a pair of AAA NiMH batteries, so charging won't draw much even from deep discharge. Using the 'VA' mode on the Kill-A-Watt I see that topping up the charge draws 3 to 5VA, and the base unit draws 1VA of 'ghost load' even when the phone is not cradled. That's the benefit of having the VA mode - it reveals these small ghost loads where wattage is at or near zero.)
    • Scientific Atlanta PVR, always on to be able to record shows at random times of day or night
    • AppleTV
    • the 'standby' power on electronics:
      • AppleTV
      • 1 LCD and 1 LED TV set
      • stereo & subwoofer
    • clock displays on:
      • stove
      • microwave
      • 2 clock radios
    • indicator lights on humidifier
    • timer thermostat for the furnace (LCD display and clock - whatever constant load these impose is more than paid back in energy savings from not heating while we're asleep or at work)
    • and of course a raft of little wall-wart device chargers (these should be on a switched power bar or just unplugged when not in use, to avoid 'ghost loads'):
      • iPad (always charging when not on the move)
      • iPad mini
      • iPod (needs charging once or twice a week)
      • Samsung cellphone (charges in minutes, twice a week)
      • portable iPod speakers (kept charged for indoor use - not travelling with them yet as it's -1C outside tonight, with freezing rain)
  • Over and above the ca. 200W base load I can't eliminate, there are some cycling loads that turn on and off intermittently on their own if they're in use:
    • furnace - natural gas fired hot water, with circulating pumps feeding the radiators
      • electronic ignition (avoiding gas-powered pilot light) may be 1kW for ca. 10 sec. each cycle - I haven't confirmed this yet, but I was watching the meter and it jumped up 1kW briefly just now, and that would fit with the furnace cycling on. It's a cold night.
      • The circulating pumps are also thermostatically controlled, but I know neither their power draw nor their cycling pattern. I could in theory take the display down to the basement, and wait for the heat to cycle on to see what the meter does...
    • Items with a plug and/or power switch that run intermittently:
      • fridge - maybe 200W? Again, I'd need to sit with the display and listen for it to start/stop to know. It's nearly built-in, so the plug is not accessible; it's heavy and far too much trouble to pull out to read directly with the Kill-A-Watt. (Maybe the draw could be listed in the online specs?)
      • de-humidifier (used in summer in the basement) - not in use now
      • humidifier - hot water radiator heating is great, but the house gets achingly dry unless I use the humidifier. It has a large diameter fan over a tank with two evaporative pad inserts - beautifully quiet on the lowest setting. I hooked this up to the Kill-A-Watt just now, and got these VA readings:
        • speed 4: 55 VA (noisy)
        • speed 3: 27 VA
        • speed 2: 15 VA (audible, but barely)
        • speed 1: 13 VA (virtually silent)
        • speed 0: 17 VA (fan off - but the indicator panel still lit!)
So why does "off" draw more than "low"? Hunch: speed zero turns on six segments of the 7-segment green LED digit on the display, while speed 1 uses only the two segments on the right side. Whatever small draw the fan uses at speed 1 may be outdone by the ~1W per segment for the 7-segment LED. (To verify this I'd have to take the unit apart, but I can't risk breaking this essential piece of winter kit!)

Update!

I fiddled with the controls on the humidifier while watching the readout on the Kill-A-Watt, and determined that simply powering off the fan pushes the draw from 13 up to 17 VA; it's not the seven-segment LED that's doing this, because this time I started and stopped the fan by altering the set-point of the humidistat control, rather than the power setting. The humidistat also uses short green light segments (LED? EL?) to show the 'bar graph' of the desired and current humidity levels; so I was altering the number of illuminated segments this way as well, but those changes showed no effect on the VA readout. The only time I saw the jump from 13 to 17 VA was the moment the fan powered off.
I'm thinking the controller circuit drives a relay that switches the current to the fan. Perhaps the relay incurs some power cost being in the off position? At work I have an office full of electrical engineering professors and grads whom I can quiz this week to see if that hunch is plausible, or if they have a better explanation.