Thursday, December 20, 2007

Less code

I was just reading Steve Yegge's rant against code size and realized that he managed to put into words exactly the feelings that have been drawing me to python in recent years. In particular, I mostly skipped the Java step in my journey from Pascal, through assembler, up to C, and then the leap to high-level languages including perl and, more recently, python. I don't really know why, but Java never felt "right" -- for anything. To this day, I can't think of many applications for which I would say Java was the best tool for the job. On that point, I think Steve hit the nail on the head when he writes:
Java is like a variant of the game of Tetris in which none of the pieces can fill gaps created by the other pieces, so all you can do is pile them up endlessly.

Hallelujah, brother.

Anyway, I strongly agree with Steve's general points about the merits of small code bases, but I won't go so far as to say that smaller is necessarily always better. Python hits a sweet spot for me (at least for now) between compactness and comprehensiveness. Certainly a good number of problems could be expressed more succinctly in a functional language such as Erlang or Haskell, but you lose readability. In fact, as elegantly as many problems can be expressed in a functional language, they quickly start to look like line noise when the problems exceed textbook examples.

Programming language preferences aside, what I agree with most from Steve's blog post was not so much that more succinct languages are better, but that less code is better. His post is written so as to suggest that Java itself is a problem -- which may certainly be true -- but he doesn't clarify whether he thinks it is Java the language, or Java the set of libraries.

Python, for example, combines a great set of standard libraries with a language syntax that makes it easy to use those libraries. All the lines of code hidden away in libraries are effectively "free" code. You don't have to manage their complexity. Give me a language that makes leveraging as many libraries as possible painless, and I can glue them together to make great programs with low apparent complexity. In reality, the lines of code might be astronomical, but I don't have to manage the complexity of all of it -- just the part I wrote -- so it doesn't matter.
Python does a great job here, whereas Java (and C++'s STL) largely get it wrong.

In particular, I would argue that, in addition to python's straightforward syntax, the fact that so many of python's libraries are written in C is a large factor in why they are so easy to use. There may be a huge amount of complexity, and a huge number of lines of code, in the C implementation of a library. However, the API boundary between python and C acts as a sort of line of demarcation -- no complexity inherent in the implementation of the library can leak out into the python API without the programmer explicitly allowing it. That is, the complexity of libraries written in C and callable from python is necessarily encapsulated.

As a personal anecdote, in one project I work on, we use ctypes to make foreign function calls to a number of Windows APIs. One thing that really bothers me about this technique is that I find myself re-implementing a number of data structures in ctypes that are already defined in C header files. If I make a mistake, then I introduce a bug. Ironically, had I just used C to call those APIs, I could have leveraged more existing code, so there would often have been fewer lines of code and less complexity. Of course, other parts of the program would have become hugely unwieldy, but the point of this anecdote is that libraries (more specifically, being able to leverage known-good code) can be much more effective in reducing code than the implementation language.
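
To give a flavor of the duplication I mean, here is a minimal sketch; GetCursorPos() and the POINT structure are just a convenient illustration, not necessarily one of the APIs from that project:
import ctypes
from ctypes import wintypes

# POINT is already declared in the Windows headers (windef.h), but to
# call GetCursorPos() via ctypes I have to re-declare it in python and
# keep my copy in sync with the C definition by hand.
class POINT(ctypes.Structure):
    _fields_ = [("x", wintypes.LONG),
                ("y", wintypes.LONG)]

def getCursorPos():
    pt = POINT()
    ctypes.windll.user32.GetCursorPos(ctypes.byref(pt))
    return pt.x, pt.y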

So long as the implementation language isn't Java. Java just sucks. :)

Tuesday, December 11, 2007

Lies, Damn Lies, and ... Economics?

Today I'm going to venture out of any field that I have the slightest expertise in and flounder about in the field of basic macro-economics.

But before I demonstrate my utter lack of knowledge, I'm going to touch on a subject that I at least have some familiarity with -- mathematics. I'm going to share with you a math problem that I cannot solve. It looks something like this:

Given:

A / B = 111 = A' / B'

and

B = B' * 1.41

What is the ratio between A and A'?
Elementary algebra would seem to imply the answer should be 1.41:

A / B = A' / B'

A * B' = A' * B

A * B' = A' * B' * 1.41 ==> A = A' * 1.41

By itself, I would have expected this problem would be trivial.
However, what perplexes me is that the observed value for the ratio between A and A' is nowhere near 1.41, but rather approximately 1.0.

And here is where I demonstrate my lack of understanding in the field of economics...

The thing that has been puzzling me for years (even before the recent foreign exchange craze) is that the U.S. dollar has experienced fairly consistent inflation for these 13 years while the Japanese Yen has experienced almost none, yet the exchange rate remains the same.

You see, A' / B' is the average yen-per-dollar exchange rate for November 2007, and A / B is the average yen-per-dollar exchange rate for April 1994. Both ratios just happen to be approximately 111. (Source: Board of Governors of the Federal Reserve System)

The ratio B' / B is the purchasing power of the U.S. dollar in 2007 relative to dollars in 1994. That is, one 1994 dollar (B) is worth 1.41 2007 dollars (B'). (Source: Bureau of Labor Statistics)

The ratio A' / A is the purchasing power of the Japanese Yen in 2007 relative to yen in 1994. It so happens that one Japanese yen buys the same amount in 2007 that it did in 1994. (Source: Japanese Ministry of Internal Affairs and Communications)

How can this be?
How can two currencies have different inflation rates, yet maintain the same exchange rate?

I'm no economist, but I cannot help but wonder if the answer lies outside of math and in the realm of human irrationality. Or perhaps the CPI values used to calculate the purchasing power of a currency are inconsistent and/or flawed. Actually, I know that CPI calculation methods differ amongst countries, but for some reason I expected Japan to use the same calculation method that the U.S. does. I admit I haven't investigated that explanation yet.

Does anyone have a better explanation (preferably one that does not violate basic math principles)? I would love to put this puzzle to rest.

Monday, December 3, 2007

Python: asserting code runs in specific thread

My buddy JJ's post to the BayPiggies mailing list reminded me of a little snippet I wrote a while back that others might find useful as well. Personally, I avoid threads like the plague, but if you are forced to use them it is generally handy to keep accurate tabs on how you use them. In particular, as JJ suggested in his post, it is a good idea to assert that code is called in the thread context you expect it to be called in. This can go a long way toward avoiding one of many classes of hard-to-find logic bugs multi-threading invites. Anyway, on to the code...
import threading

def assertThread(*threads):
    """Decorator that asserts the wrapped function is only
    run in a given thread
    """
    # If no thread list was supplied, assume the wrapped
    # function should be run in the current thread.
    if not threads:
        threads = (threading.currentThread(), )

    def decorator(func):
        def wrapper(*args, **kw):
            currentThread = threading.currentThread()
            name = currentThread.getName()
            assert currentThread in threads or name in threads, \
                "%s erroneously called in %s thread context" % \
                (func.__name__, name)
            return func(*args, **kw)

        if __debug__:
            return wrapper
        else:
            return func
    return decorator

You can restrict execution to one or more threads, each specified by either the thread object or thread name.
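
For example, here is a quick sketch of how it can be used; the Downloader class and the "NetworkIO" thread name are made up for illustration:
import threading

class Downloader(object):
    @assertThread("NetworkIO")
    def fetch(self, url):
        pass   # network work that must only happen in the I/O thread

d = Downloader()
worker = threading.Thread(target=d.fetch, name="NetworkIO",
                          args=("http://example.com/",))
worker.start()   # OK: fetch() runs in the "NetworkIO" thread
worker.join()
d.fetch("http://example.com/")   # AssertionError: called in MainThread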

Note the trick at the end to make the decorator effectively a no-op in production. Using this decorator around your functions and methods helps you spot logic errors during development without impacting the performance of your production code. Of course, if you are of the school that assertions should never be disabled, feel free to replace the final if __debug__:/else block with an unconditional return of wrapper.

Thursday, November 22, 2007

My first OpenSearch plugin

Well, yesterday I crawled out of my cave and discovered OpenSearch, the XML-based description format for search engines. Both Firefox 2 and IE 7 can import search engine settings from OpenSearch description files and add the search engine to their respective search bars. In fact, there are already description files for just about every search engine you can think of.

There is a great guide to writing your own OpenSearch description and how to link to the description file such that Firefox or IE automatically discover it. The specification is relatively straightforward so, excited, I decided to whip up an OpenSearch plugin for Kelly & Cristin's Wine Reviews.

I only encountered two obstacles:
  1. Getting an icon for my search plugin. I didn't already have a shortcut icon for my site that I could simply re-use, so I had to first make an icon.
  2. I typed the XML document by hand; I left out the final slash in the OpenSearch XML namespace declaration. Firefox will not recognize the file as an OpenSearch document unless the namespace is exactly right. This should have been quick to diagnose except that Firefox gives you no error message to aid in debugging - it simply refuses to load the file.

The text of my OpenSearch description is here. But if you visit the web site using Firefox 2.0 or later, you'll notice that the icon next to your search bar gets a blue highlight; if you click on the icon, you'll see an option to add my Wine Review search to your search bar.

Anyway, discounting the half hour it took to figure out why Firefox wouldn't recognize my OpenSearch plugin, it only took a few minutes to add a search plugin and plugin auto-detection code (one link tag in the HTML head) to my web site. I don't seriously expect anyone to want to search my silly wine review site from their browser search bar, but it certainly was neat to be able to add it.

Tuesday, November 20, 2007

Back on-line

This is just a quick note to mention that I finally got my Internet access hooked up this past weekend. After picking up a 100V power supply in Akihabara, my little Soekris 4801 is back up and serving the posi.net domain again.

I'm still getting settled into my new routine at work, but I hope to have time to start regularly writing posts again soon...

Wednesday, November 7, 2007

Unix storage solution

I've been lax in updating my blog lately, mainly due to lack of Internet access at my apartment since we've moved to Tokyo. It is nice and all that I have a variety of high-speed Internet service providers to choose from, but it would be really great if any of them could actually hook up my service in less than a month. :(

Anyway, until I get my own Internet service, I've been "borrowing" access via an open AP with a signal just strong enough that I can associate if I stand right next to my window. Sitting is no good; I have to have the laptop pretty high in the air to get a signal. As you can imagine, I've been reluctant to get on the Internet any more than I have to.

That said, I couldn't pass up the opportunity to share a surprising Unix storage solution I discovered today. Actually, my wife picked it up unintentionally while shopping at the Don Quixote across the street...


I kid you not, these are cheap little plastic refrigerator storage boxes. I couldn't believe my eyes when she showed them to me. Now if I could only find ones that said "FreeBSD" and "Linux" I'd be set on souvenirs the next time I go back to the U.S.

Monday, October 15, 2007

posi.net going offline

My personal domain, posi.net, will be going offline for the next few days while I move. This means that any mail sent to the domain will bounce (I'm not setting up a backup MX on purpose, since 99.99% of my mail is spam). It also means that Cristin and Kelly's absolutely fabulous wine reviews, my 地球ラジオ archive, and other miscellaneous amusements will be temporarily offline too.

Friday, October 12, 2007

My last BayPiggies

Well, last night was my last BayPiggies meeting for a few years. The speakers were Mikeal Rogers & Adam Christian from the Open Source Applications Foundation, presenting their new automated web U.I. testing framework, Windmill.

It was a pretty neat demo; if I actually did any serious web development, I would consider using it (or the preexisting Selenium tool) for testing. Being that it was a python user's group rather than a product show-and-tell, I would have liked more details about how Windmill worked, especially from the python side, but I am assured it is "really really really really cool". ;) I guess I'll have to find time to read the code myself.

Anyway, afterwards I got to hang out with my buddies JJ and Zach for a while and chat over a burger. I'm afraid mine was not particularly stimulating conversation, but I'm really looking forward to finding out what exactly it is JJ has been coding up over at MultiCosmic. He assures me it will go live in a few days, and that I'll have to wait to learn what it is when everyone else does. :) I just know it involves FaceBook.

I've only been going to BayPiggies for the past couple of months, but I'm going to miss the monthly chance to learn about some new tool or technology utilizing python. More importantly, I'm going to miss the opportunity to hang out and hear the state-of-web-development address from JJ (he admitted he loves Rails, by the way :) ). Hopefully, I'll still see Zach from time to time in Tokyo.

Speaking of which, I have yet to find a web page for a Python user's group in Tokyo; pyJug doesn't appear to actually hold casual group meetings, just weekend retreats and workshop-style meetings. Those sound like fun, but not quite the casual get-togethers I've gotten used to with BAFUG and BayPiggies meetings.

Tuesday, October 9, 2007

Python: A better wx.ListCtrl

I'm not going to repeat the entire post here, but I would like to direct your attention to my friend and former coworker Zach's recent blog post regarding implementing a better ListCtrl via wxWidgets' virtual ListCtrl.

For those familiar with wxPython, what Zach has done is combine wx.ListCtrl and wx.lib.mixins.listctrl.ColumnSorterMixin into a single easy-to-use class. Except, rather than implement it as an entirely new class, he implements a function that transmutes a generic ListCtrl into his new & improved ListCtrl. The advantage here, as he points out, is that you don't need to modify any XRC files to gain the new functionality.

The post is on NTT MCL's recently-introduced company blog which, unfortunately, doesn't appear to accept comments (and says "Japan Window" for some strange reason). As such, I'll point out that Zach also occasionally posts to his own personal blog, which is also worth checking out.

Sold my car

We sold my car today. It was a 1996 Ford Escort.

I realize this isn't particularly interesting to anyone else, but since it was the first car I ever bought on my own -- heck, I even moved it with me from Virginia -- it is kind of significant to me.

Anyway, I bought it for $9800 back in 1995 and it is still running well at 154,000 miles. I sold it today for $700 (thanks to craigslist!). That means, excepting gas and maintenance costs, the depreciation was $9100 over 12 years, or approximately $760/year. Given that the car got 32 miles to the gallon new and has lately been averaging around 29, it even got good gas mileage (by today's low standards, anyway).

In fact, if I assume an average of 30 MPG over the life of the car, and an average of $2.00/gallon, that comes to about $10,300 in fuel costs. You may think $2.00 a gallon seems low, but a good number of those miles were driven in Virginia, where the price of gas fluctuated between $0.85 and $1.35 before we moved to California. The price of gas has certainly been higher here in California, but I've been riding CalTrain and/or walking to work since long before the recent surge in gas prices, so I think $2.00 a gallon may be pretty close to our average per-gallon price.

Ignoring maintenance costs, which weren't extraordinary, that is about $19,400 to own and fuel my Ford Escort for just over 12 years. I'd say it was all around a good purchase. I certainly won't badmouth Ford cars anymore. A hybrid will cost you more than that just to own, not to mention fuel costs.

It certainly didn't turn any heads, but then again, I never once worried about anyone stealing it, dinging it, or hitting it. In fact, I would say that having a dumpy car has its own value -- in peace of mind.

Tuesday, October 2, 2007

The price of spam

Like anyone, I get lots of spam. Having kept the same e-mail address since 1995 and having that e-mail address posted all over the web may contribute to the absolutely inordinate amount of spam I receive. I get over 1000 pieces of spam mail a day.

To make matters worse, I run my own mail server on a Soekris 4801 at home. I use postfix running on FreeBSD 6.2 with SpamAssassin to identify spam. My spam situation is desperate, so anything that looks remotely like spam gets immediately sent to the bit bucket. Still, over 50 spam mail a day make it to my Inbox.

My Soekris box is my mail server, file server, firewall, and PPPoE tunnel end-point (for my DSL connection). I also run a low-traffic web site off the box. Between pppd and the postfix and spamd processes, receiving and processing spam consumes almost 100% of the CPU all day, every day. My load average rarely dips below 1.0 and, at the heaviest times, the inbound mail queue grows to a few hundred messages.

Now, this isn't the fastest machine in the world. But when I started my first ISP back in 1995, we ran dozens of web sites (admittedly, mostly static content) off a single 100MHz Pentium server with 128MB of RAM. Our entire 27000+ newsgroup Usenet feed was hosted on another 100MHz Pentium server, also with 128MB of RAM. But in 2007, I'm here to tell you that it takes a 266MHz Pentium-class machine (with the same 128MB of RAM) running 24/7 just to deliver mail for two e-mail accounts.

So I can tell you personally that the price of spam in 2007 is roughly one Soekris 4801 plus disk space. It may not seem like much in comparison to today's top-of-the-line computers, but it is enough to make me sick just thinking about it.

Friday, September 21, 2007

Moving to Tokyo

If anyone reads my blog, they will have noticed that I haven't posted any good programming tidbits here in a while. The reason is that I've been busy making arrangements for my move to Japan.

Next month (October 16th to be exact), my wife and I are moving to Tokyo where I will start my new job at NTT Communications' headquarters in Hibiya. My last day here at NTT MCL will be October 5th.

I've worked at NTT MCL for over 5 and a half years, making it the longest I have ever worked at a single company. I've really enjoyed working at NTT MCL, so I'm a bit sad to leave. Luckily, my job at NTT Communications will entail further development and maintenance of the servers we developed at NTT MCL for the HOTSPOT wireless network, so it looks like I'll keep in regular contact with my current co-workers. By the way, if you find yourself in Tokyo, or just about any major city in Japan, you should purchase a 1-day pass for HOTSPOT. You get wireless Internet access at a huge number of locations for just 500 yen (about 4 dollars). Having worked on HOTSPOT for over 5 years, I was shocked when I discovered T-Mobile charges $10/day for a similar service in the U.S.

Also, I can't stress enough how nice of a place NTT MCL is to work at. There are lots of positions open, the work environment is relaxed and relatively low-stress, and you'll get to work on a variety of different projects. I know I am going to miss working here, but I simply cannot pass up this opportunity to work abroad for a few years. Hopefully, they'll take me back when I return to the U.S.

Monday, September 17, 2007

Sony getting out of Cell production

According to Asahi Shinbun, Sony will be selling their semiconductor production facilities, including those for the production of their Cell processor, to Toshiba for approximately 100 billion yen (~870 million dollars). The sale is expected to happen next spring. There is speculation that Sony does not foresee new applications for their Cell processor and is seeking to reclaim a large amount of their investment capital.

Sony's semiconductor subsidiary, Sony Semiconductor, is selling the LSI (Large Scale Integration) facilities in their Nagasaki Technology Center on the island of Kyushu. Besides being the production facilities for the Cell processor, the plant also makes image processing chips for game devices.

Sony will continue to develop the Cell processor, but they are reconsidering producing the chips themselves. Instead, they will be putting their effort into next-generation audio/video devices. In particular, they intend to put emphasis on the production of CMOS sensors like those used to record images in digital cameras.

Monday, September 3, 2007

Keepon

I saw this last night on a Japanese TV show. Apparently, a U.S. researcher and a Japanese researcher have been working on a little robot called "Keepon", experimenting with human interaction. Here is a short movie demonstrating some basic emotive interaction:


The American researcher filmed a demo video of the little robot responding to a song by an L.A. band called "Spoon" and uploaded it to YouTube:


Apparently, the video was so popular that the band caught notice and contacted the researchers about using the robot in one of their official music videos. The Japanese researcher appears in the video with the robot (the video was filmed in Tokyo):


Like probably just about everyone else who has seen the video, I checked to see if I could buy one. You can't.

Update 2007/09/04 1:30am (JST):
As with everything else you can think of, Wikipedia already has an entry about Keepon. Sometimes I wonder why I bother writing anything. You could replace my entire blog with nothing but links to Wikipedia articles. :)

Sunday, September 2, 2007

Python: Reconstructing datetimes from strings

Previously, I posted a small snippet for converting the str() representation of a python timedelta object back to its equivalent object. This time I'm going to address doing the same for datetime objects:

import re
from datetime import datetime, timedelta

def parseDateTime(s):
    """Create datetime object representing date/time
    expressed in a string

    Takes a string in the format produced by calling str()
    on a python datetime object and returns a datetime
    instance that would produce that string.

    Acceptable formats are: "YYYY-MM-DD HH:MM:SS.ssssss+HH:MM",
                            "YYYY-MM-DD HH:MM:SS.ssssss",
                            "YYYY-MM-DD HH:MM:SS+HH:MM",
                            "YYYY-MM-DD HH:MM:SS"
    Where ssssss represents fractional seconds. The timezone
    is optional and may be either positive or negative
    hours/minutes east of UTC.
    """
    if s is None:
        return None
    # Split string in the form 2007-06-18 19:39:25.3300-07:00
    # into its constituent date/time, microseconds, and
    # timezone fields where microseconds and timezone are
    # optional.
    m = re.match(r'(.*?)(?:\.(\d+))?(([-+]\d{1,2}):(\d{2}))?$',
                 str(s))
    datestr, fractional, tzname, tzhour, tzmin = m.groups()

    # Create tzinfo object representing the timezone
    # expressed in the input string. The names we give
    # for the timezones are lame: they are just the offset
    # from UTC (as it appeared in the input string). We
    # handle UTC specially since it is a very common case
    # and we know its name.
    if tzname is None:
        tz = None
    else:
        tzhour, tzmin = int(tzhour), int(tzmin)
        if tzhour == tzmin == 0:
            tzname = 'UTC'
        # FixedOffset is the fixed-offset tzinfo example class
        # from the datetime documentation (see the note below).
        tz = FixedOffset(timedelta(hours=tzhour,
                                   minutes=tzmin), tzname)

    # Convert the date/time field into a python datetime
    # object.
    x = datetime.strptime(datestr, "%Y-%m-%d %H:%M:%S")

    # Convert the fractional second portion into a count
    # of microseconds.
    if fractional is None:
        fractional = '0'
    fracpower = 6 - len(fractional)
    fractional = float(fractional) * (10 ** fracpower)

    # Return updated datetime object with microseconds and
    # timezone information.
    return x.replace(microsecond=int(fractional), tzinfo=tz)
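
Here is a quick round-trip sketch (the timezone-less formats do not need the FixedOffset class mentioned below):
from datetime import datetime

now = datetime(2007, 6, 18, 19, 39, 25, 330000)
assert parseDateTime(str(now)) == now
assert parseDateTime("2007-06-18 19:39:25") == datetime(2007, 6, 18, 19, 39, 25)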

Last time Lawrence Oluyede was kind enough to point out that the dateutil module can likely do this and a lot more. However, I'm trying to stick to modules in the base library. Of course, it wouldn't be a bad thing if dateutil were to make it into the base library....

Speaking of which, the snippet above relies on the FixedOffset tzinfo object described in the documentation for the datetime module's tzinfo class. Being that the documentation is part of the standard python distribution, I guess you could call that code part of the base library, even if you can't import it. :|

Update 2007/09/03 1:39pm (JST):
Fix example format examples in doc-string per comment from David Goodger.

Friday, August 17, 2007

Bought a Mac


My wife and I have been considering buying a new computer for a couple of months now; yesterday we bought a new 24" iMac. The odd thing is, even though my wife was comfortable on a Mac and I was drawn to the BSD underpinnings, we had decided against getting a Mac largely on the grounds of the expense (and the inability to easily uninstall software). Instead, we had settled on a small form-factor Dell. That is, until we visited the Apple Store on University Ave. Sunday afternoon.

We were in the neighborhood to pick up some prints at Wolf Camera and I wanted to stick my head in the Apple Store to check out the new Aluminum iMacs. They are gorgeous. I've never seen a screen quite like it. But the kicker was when my wife asked whether the $1799 price on the card next to the computer was for the computer we were looking at.

It wasn't. We were gawking at the 20" iMac; the placard was for the 24" iMac next to us. Now that was an impressive machine. And what was more amazing was that the $1799 price tag was on par with, if not cheaper than, the equivalent Dell desktop we had been considering! That's right, we could get the gorgeous iMac for the same price as the bland Dell we were planning to buy.

I have read claims in the past that recent-model Macs were cheaper than equivalent PCs, but there it was staring me in the face. Just to prove I'm not making this up, I just re-spec'ed the Dell machine we were going to buy to compare it to the Apple machine we did buy. As of 2007/08/17:

Component         | Dell Optiplex 745 Small Form Factor                          | Apple iMac 24"
CPU               | Intel Core 2 Duo Processor E6600 (2.40GHz, 4M, 1066MHz FSB)  | Intel Core 2 Duo Processor T7700 (2.40GHz, 4M, 800MHz FSB)
OS                | Windows Vista Ultimate, with media, 32 Edition, English      | Mac OS X v10.4.10 Tiger
Memory            | 1.0GB DDR2 Non-ECC SDRAM, 667MHz (1 DIMM)                    | 1.0GB DDR2 Non-ECC PC2-5300 SDRAM, 667MHz (1 SO-DIMM)
Hard Disk         | 250GB SATA 3.0Gb/s                                           | 320GB SATA 7200-rpm
Removable Disc    | 8X Slimline DVD+/-RW                                         | Slot-loading 8x SuperDrive (DVD±R DL/DVD±RW/CD-RW)
Video Card        | 256MB ATI Radeon X1300PRO                                    | 256MB (GDDR3) ATI Radeon HD 2600 PRO
Speakers          | Dell™ A225 Speakers                                          | Built-in stereo speakers
WiFi              | Linksys WUSB54GC Wireless-G USB Adapter                      | AirPort Extreme wireless networking (802.11n)
Price w/o Monitor | $1,198                                                       | N/A
Display           | UltraSharp 2407WFP-HC 24-inch Widescreen Flat Panel LCD ($569) | 24-inch widescreen TFT active-matrix LCD
Total             | $1,767                                                       | $1,799


Anyway, we left the Apple Store to head over to Fry's to do some shopping. But while we were at Fry's, my wife looked over at me and said, "we should get the Mac". She didn't have to say that twice. In my 15 years or so in the computer industry, I have never before seen a machine that I really wanted to buy. So we bought a copy of Parallels Desktop at Fry's and headed back over to the Apple Store.

Unfortunately, we were too late. They close at 6:00pm on Sundays. We tried again Monday night but both the University Ave. and Stanford Shopping Center stores were sold out of the 24" iMacs. We went ahead and ordered one on-line, but we were going to have to wait two weeks for it to even be shipped. So yesterday morning we made it up to the University Ave. store at 10:15am (they open at 10:00am) and bought the last 24" model in that day's shipment. That's right: sold out in 15 minutes. Needless to say, we cancelled our on-line order.

I was up until 4:00am last night setting up Parallels to run Win2k for some of our old software (namely, Microsoft Money and our favorite game: Settlers 3), installing Firefox and Thunderbird, importing our photos into iPhoto, etc. That screen is simply awesome. I don't think I've had this much fun since I was a kid. Certainly not with a computer.

Thursday, August 16, 2007

License for code posted on my blog

It occurred to me that while many people post code on their blogs, they seldom grant anyone else license to use their code. Google encourages their BlogSpot users to mark their posts with the Creative Commons License, but does not require users to do so, nor do many of the blogs I've seen follow this recommendation. Now, the poster may know they have no intent to sue anyone over the use of their snippet, but how can the reader be sure?

To address the problem, I'm hereby declaring that all code posted on my blog (kbyanc.blogspot.com) is covered by the license below unless otherwise indicated in the body of the post itself:
Copyright (c) 2007, Kelly Yancey

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

I hope that others will follow suit and clarify the license terms of use for content posted on their blogs too. Use a different license if you like (it is your code after all), but don't be fooled into thinking that simply posting code on your blog makes it open source.

And before someone writes that I should just declare the code to be public domain, I should point out that you can't do that (at least it probably won't become public domain until I'm so long dead that the idea of using the code I posted at the turn of the century seems quaint).

Wednesday, August 15, 2007

Python: Reconstructing timedeltas from strings

For some reason, the date/time objects implemented by the datetime module have no methods to construct them from their own string representations (as returned by calling str() or unicode() on the objects). It so happens that reconstructing a timedelta from its string representation can be implemented using a relatively simple regular expression:
import re
from datetime import timedelta

def parseTimeDelta(s):
    """Create timedelta object representing time delta
    expressed in a string

    Takes a string in the format produced by calling str() on
    a python timedelta object and returns a timedelta instance
    that would produce that string.

    Acceptable formats are: "X days, HH:MM:SS" or "HH:MM:SS".
    """
    if s is None:
        return None
    d = re.match(
            r'((?P<days>\d+) days, )?(?P<hours>\d+):'
            r'(?P<minutes>\d+):(?P<seconds>\d+)',
            str(s)).groupdict(0)
    return timedelta(**dict(( (key, int(value))
                              for key, value in d.items() )))
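
A quick round-trip sketch of it in action:
delta = timedelta(days=2, hours=3, minutes=4, seconds=5)
assert parseTimeDelta(str(delta)) == delta          # "2 days, 3:04:05"
assert parseTimeDelta("12:34:56") == timedelta(hours=12, minutes=34, seconds=56)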

But the other types are not quite so easy. Next time, I'll post my implementation for reconstructing datetime objects from their string representation.

Tuesday, August 7, 2007

NTTMCL hiring again

I guess it comes as no surprise since it seems like everyone is hiring these days, but NTTMCL is hiring again. Of particular interest (to me at least, since it is replacing my position) is the Unix Kernel Developer position. I know, you wouldn't know it from my blog posts, but originally I was hired as a FreeBSD kernel developer. Since I am being transferred to Tokyo in a couple of months, and they don't really have the infrastructure in the office over there for kernel development, NTTMCL is hiring a kernel developer to replace me here in the Bay Area.

I don't know who wrote the job description, but there appear to be some typographical errors in it. For example, "enhanced 3 switch" should probably read "enhanced layer 3 switch" and "stabilize the developed wireless network system" should probably be "stabilize the existing wireless network system". Did I mention that NTTMCL is a subsidiary of a Japanese company? We have lots of non-native speakers.

Anyway, besides the kernel developer position, we also have a position open in our security research group, and two positions in our IP technologies group (formerly the IPv6 group) [1, 2]. If you are really into researching and playing with the latest hardware, we have a position in our Business Development group that will get to check out all of the latest gear that NTT Communications might be interested in.

We're a relatively small research and development subsidiary of NTT; currently we have about 35 people, the majority of which are software engineers of some sort. We have offices in both San Mateo, CA and San Jose, CA. The work environment is kind of a mix of big-company and startup: we're small and work with many cutting-edge technologies, but our income is stable being that we are a wholly-owned subsidiary of one of the largest ISPs in the world. There are no stock options, but there are also no 18-hour work days.

Anyway, I really like it here. In my 5 and a half years here, I've worked on everything from FreeBSD kernel programming, low-level networking, and implementing high-performance Un*x daemons (all in C, of course), to perl and embedded perl interpreters, to web application development (both with and without AJAX), to developing GUI-based applications on Windows using python. If you really like learning and applying new technologies, this is a great place to do so without having to worry about where your next paycheck is coming from.

If that sounds good to you too, send a copy of your resume to jobs@nttmcl.com.

Update 2007/08/16 07:07pm:
I got most of the errors in the job description for the kernel programmer corrected with H.R.; I haven't checked the others, though. Speaking of which, we're up to seven open positions now. Bring some friends! :)

Monday, August 6, 2007

One-liner to crash IE6

A Japanese fellow going by the name Hamachiya2 has stumbled upon one line of HTML/CSS code that crashes IE6. The magic line is:

<style>*{position:relative}</style><table><input></table>

You can try it yourself at: http://hamachiya.com/junk/ie_crash.html.

Of course, if you are running IE6 or anything that embeds IE6 as a component, you can expect it to crash. All other browsers appear to render the code just fine.

I think I may have just found a new signature. :)

Update 2007/08/07 01:16pm:
Zeth wrote in that he noticed that IE7 pukes on the same line of HTML; it just waits until you try to visit a different page before it crashes.

Thursday, August 2, 2007

Python: Typed attributes using descriptors

A few weeks ago I saw a post by David Stanek on Planet Python regarding using python descriptors to implement typed attributes. At the time, I needed something very similar and I've been trying to find something descriptors were good for (besides re-implementing the property builtin) so I decided to give his trick a try. The only problem was, I was coding on CalTrain at the time so I couldn't access his blog to reference the code in his post. The worst part was that I struggled with problems that, as it turns out, were pointed out in the comments to his posting (specifically, attributes implemented via descriptors would erroneously share state between instances of the classes the attributes were assigned to). By the time I got to the office and could consult David's blog post, this is the implementation I had working:
from collections import defaultdict

class TypedAttr(object):
    """Descriptor implementing typed attributes, converting
    assigned values to the given type if necessary

    Constructed with three parameters: the type of the attribute,
    an initial value for the attribute, and an (optional)
    function for converting values of other types to the desired
    type.

    If the converter function is not specified, then the type
    factory is called to perform the conversion.
    """
    __slots__ = ('__type', '__converter', '__value')

    def __init__(self, type, initvalue=None, converter=None):
        if converter is None:
            converter = type
        self.__type = type
        self.__converter = converter
        initvalue = self.convert(initvalue)
        self.__value = defaultdict(lambda: initvalue)

    def convert(self, value):
        if not isinstance(value, self.__type) \
           and value is not None:
            value = self.__converter(value)
        assert isinstance(value, self.__type) \
               or value is None
        return value

    def __get__(self, instance, owner):
        if instance is None:
            return self
        return self.__value[instance]

    def __set__(self, instance, value):
        self.__value[instance] = self.convert(value)

With this, I could write my classes like so:
class Example(object):
    Name = TypedAttr(str)
    Type = TypedAttr(str, "cheezy example")
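
And a quick sketch of the conversion behavior this buys you (made-up values, of course):
ex = Example()
ex.Name = 42                          # converted via str()
assert ex.Name == '42'
assert ex.Type == "cheezy example"    # the initial value

ex2 = Example()
ex2.Name = "Bob"
assert ex.Name == '42'                # instances do not share state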

Mainly I'm using the typed attributes for data validation in objects populated from values supplied by an untrusted source. It would be really nice if I could compose descriptors to build more complex managed attributes (sort of like you can with validators in Ian Bicking's FormEncode package). Then I could make a descriptor, for example, Unsigned or NotNone and compose them with TypedAttr like so:
class CompositionExample(object):
    Name = TypedAttr(NotNone(str))
    Age = Unsigned(TypedAttr(int))

I'll admit I haven't put a whole lot of thought into it yet, but at first glance it appears that it would be impossible to compose descriptors in python. I would love to be proven wrong.

Monday, July 23, 2007

Python: Serializer benchmarks

I am working on a project in which clients will be submitting more data than my current server implementation knows what to do with. The reason the current implementation doesn't use all of the submitted data is that I don't yet know what the quality of the data will be until the client is deployed in the wild. I want to record all of the submitted data, though, in the expectation that a future implementation will be able to use it. So I was considering formats for logging the submitted data such that it would be easy to parse in the future.

Since I'm already storing a subset of the submitted data in a database, the most obvious solution is to make a table of submissions which has a column for each submitted data element. However, it turns out that this is quite slow and given that I'm not sure how much of the extra data I'll ever need or when I may update the server implementation to use it, I hate to pay a hefty price to store it now. For now, I can consider the data write-only. If and when I need to use that data, then I can write an import script that updates the server database using the saved data.

So I've been considering simply logging the submissions to a file. It is considerably faster to append to a flat file than it is to write to a database -- which makes sense since the database supports read/write access, whereas I only need write-only access for now.

The next question is what format to write the data to the log file. I have a python dictionary of the submitted data; at first I considered writing the dictionary to the log file in JSON format. The JSON format is relatively easy to convert to/from python data structures and python has quality implementations to do it. Furthermore, unlike the pickle text format, it is trivial to visually interpret the serialized data. This latter point is also important to me since I need to be able to judge the quality of the data in order to discern what portions I can use in the future.

However, to my chagrin, it turns out that the JSON module I have been using, simplejson, is slower than I had imagined. Profiling of my server implementation found that, after the database update logic, serializing the submitted data into JSON format was my second largest consumer of CPU cycles. I hate the thought of wasting so much time logging the data when it is an operation that is essentially pure overhead.

Hence I started considering other serialization formats, benchmarking them as I went. Here are the results of my benchmarks:


Serializer          | Run 1 (secs) | Run 2 (secs) | Mean (secs)
pyYAML 3.05         |     21953.18 |     25482.61 |    23717.89
pySyck 0.61.2       |      3107.06 |      2805.38 |     2956.22
pprint              |      2364.91 |      2368.42 |     2366.67
pickle              |      1509.31 |      1665.16 |     1587.23
pickle/protocol=2   |      1359.40 |      1330.71 |     1345.05
simplejson 1.7.1    |       710.78 |       604.13 |      657.46
cPickle             |       159.27 |       172.26 |      165.77
repr                |        73.50 |        77.24 |       75.37
cjson 1.0.3         |        63.94 |        74.28 |       69.11
cPickle/protocol=2  |        50.97 |        57.72 |       54.34
marshal             |        12.52 |        13.32 |       12.92

All numbers were obtained using the timeit module to serialize the dictionary created by the expression "dict([ (str(n), n) for n in range(100) ])".
The tests were run under Python 2.5 (r25:51908, Mar 3 2007, 15:40:46) built using [GCC 3.4.6 [FreeBSD] 20060305] on freebsd6. The simplejson, cjson, pyYAML, and pySyck modules were installed from their respective FreeBSD ports (I had to update the FreeBSD pySyck port to install 0.61.2 since it otherwise installs 0.55).
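
For the curious, the harness was essentially a timeit loop along these lines (shown here for simplejson; the iteration count in this sketch is arbitrary rather than the exact count behind the numbers above):
import timeit

setup = """
import simplejson
d = dict([ (str(n), n) for n in range(100) ])
"""
timer = timeit.Timer("simplejson.dumps(d)", setup)
for run in (1, 2):
    print "Run %d: %.2f secs" % (run, timer.timeit(number=1000000))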

I guess I should not have been surprised, but it turns out that simply calling repr() on the dictionary is almost 9 times faster than calling simplejson.dumps(). In fact, taking repr() as a baseline (100%), I calculated how long each of the other serializers took relative to repr():

Serializer          | Mean (secs) | Relative to Baseline
pyYAML 3.05         |    23717.89 |               31469%
pySyck 0.61.2       |     2956.22 |                3922%
pprint              |     2366.67 |                3140%
pickle              |     1587.23 |                2106%
pickle/protocol=2   |     1345.05 |                1785%
simplejson 1.7.1    |      657.46 |                 872%
cPickle             |      165.77 |                 220%
repr                |       75.37 |                 100%
cjson 1.0.3         |       69.11 |                91.7%
cPickle/protocol=2  |       54.34 |                72.1%
marshal             |       12.92 |                17.1%

The numbers in the last column show each serializer's mean time as a percentage of the repr() baseline.

So now I'm thinking of sticking with JSON as my log format, but using the cjson module rather than simplejson. cPickle's latest binary format (protocol=2) is even faster, but I would lose the ability to visually scan the log file to get a feel for the quality of the data I'm not currently using.

Now, before I get a horde of comments I should point out that I am aware that simplejson has an optional C speedups module. Unfortunately, it does not appear to be installed by default on either FreeBSD (my server) or on Windows (my current client). I wouldn't be the least bit surprised if the C version of simplejson is just as fast as the cjson module, but it doesn't matter if it isn't installed. As such, it looks like I'll be switching to cjson for my JSON serialization needs from now on.

Update 2007/07/25 07:07pm:
In response to paddy3118's comment, I added benchmarks for the python pprint module to the tables above.

Update 2007/07/27 12:26pm:
In response to David Niergarth's comment, I added benchmarks for pyYAML 3.05 and pySyck 0.61.2.

Sunday, July 22, 2007

Python: Mapping arguments to their default values

Using inspect.getargspec or the getargspec implementation in my previous post, we can build a dictionary mapping a callable's argument names to their default values:
def getargmap(obj, default=None):
    """Get dictionary mapping callable's arguments to their
    default values

    Arguments without default values in the callable's argument
    specification are mapped to the value given by the default
    argument.
    """
    spec = getargspec(obj)
    argmap = dict.fromkeys(spec[0], default)
    if spec[3]:
        # getargspec() returns None for the defaults when the
        # callable has no default arguments, so only overlay
        # defaults when there are some.
        argmap.update(zip(spec[0][-len(spec[3]):], spec[3]))
    return argmap
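
For example, given a hypothetical function like this one:
def connect(host, port=80, timeout=None):
    pass

getargmap(connect)
#   -> {'host': None, 'port': 80, 'timeout': None}
getargmap(connect, default='required')
#   -> {'host': 'required', 'port': 80, 'timeout': None}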

Python: A (more) generic getargspec()

In my last post, I presented a generic way to get the arguments passed to the current function such that you can iterate through them. This time, I present a way to get the arguments that a callable accepts/expects. Actually, the standard inspect module already has a getargspec() function that returns the argument specification of a function. However, it only works on functions and methods, not any other python callable. It turns out that there is no way to get the argument specification for built-in callables, but we can implement a version of getargspec() that can get the specification for classes and callable objects:
import inspect

def getargspec(obj):
    """Get the names and default values of a callable's
    arguments

    A tuple of four things is returned: (args, varargs,
    varkw, defaults).
    - args is a list of the argument names (it may
      contain nested lists).
    - varargs and varkw are the names of the * and
      ** arguments or None.
    - defaults is a tuple of default argument values
      or None if there are no default arguments; if
      this tuple has n elements, they correspond to
      the last n elements listed in args.

    Unlike inspect.getargspec(), can return argument
    specification for functions, methods, callable
    objects, and classes.  Does not support builtin
    functions or methods.
    """
    if not callable(obj):
        raise TypeError, "%s is not callable" % type(obj)
    try:
        if inspect.isfunction(obj):
            return inspect.getargspec(obj)
        elif hasattr(obj, 'im_func'):
            # For methods or classmethods drop the first
            # argument from the returned list because
            # python supplies that automatically for us.
            # Note that this differs from what
            # inspect.getargspec() returns for methods.
            # NB: We use im_func so we work with
            #     instancemethod objects also.
            spec = list(inspect.getargspec(obj.im_func))
            spec[0] = spec[0][1:]
            return spec
        elif inspect.isclass(obj):
            return getargspec(obj.__init__)
        elif isinstance(obj, object) and \
             not isinstance(obj, type(getargspec.__get__)):
            # type(getargspec.__get__) is the method-wrapper
            # type; bound built-in methods of that type are
            # not supported.  Otherwise, we already know the
            # instance is callable, so it must have a
            # __call__ method defined.  Return the arguments
            # it expects.
            return getargspec(obj.__call__)
    except NotImplementedError:
        # If a nested call to our own getargspec()
        # raises NotImplementedError, re-raise the
        # exception with the real object type to make
        # the error message more meaningful (the caller
        # only knows what they passed us; they shouldn't
        # care what aspect(s) of that object we actually
        # examined).
        pass
    raise NotImplementedError, \
        "do not know how to get argument list for %s" % \
        type(obj)

This version returns exactly the same argument specification tuple as inspect's getargspec() does with one notable exception: if called on a method, the argument list returned in the first tuple element will not include the implicit 'self' argument. The reason is that python implicitly supplies that argument so the caller does not pass it explicitly. I find it more useful to only return the argument specification as seen by callers. If you need a drop-in replacement for inspect.getargspec(), then you will need to slightly modify the method/classmethod case to not remove the first element in the argument list.
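
For illustration, here is a small sketch using a made-up Greeter class, covering the two cases inspect.getargspec() itself cannot handle:
class Greeter(object):
    def __init__(self, greeting, punctuation='!'):
        self.greeting = greeting
        self.punctuation = punctuation

    def __call__(self, name, shout=False):
        return '%s, %s%s' % (self.greeting, name, self.punctuation)

# Classes: the argument spec of __init__, minus the implicit 'self'.
print getargspec(Greeter)        # [['greeting', 'punctuation'], None, None, ('!',)]
# Callable instances: the argument spec of __call__, minus 'self'.
print getargspec(Greeter('hi'))  # [['name', 'shout'], None, None, (False,)]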

Monday, July 16, 2007

Python: Aggregating function arguments

Python has three ways to pass arguments to functions: enumerated named arguments, unenumerated named arguments, and unnamed positional arguments. Enumerated named arguments are familiar to most programmers since most modern languages use this style of naming arguments (perl being a notable exception). For example, the following function specifies that it accepts 3 arguments, and assigns those arguments the names larry, moe, and curly for the scope of the function:
    def Stooge(larry, moe, curly):
        ...

If you call this function as Stooge(1, 2, 3) then the variable named larry equals 1, moe equals 2, and curly equals 3 when the function Stooge starts. Python, like C++ and Java, also allows you to specify the arguments explicitly; calling the function as Stooge(moe=2, curly=3, larry=1) or Stooge(1, 2, curly=3) again causes larry to equal 1, moe to equal 2, and curly to equal 3 when the function starts. I call this form of argument passing enumerated named arguments since names are assigned to each argument and all acceptable arguments are enumerated in the function declaration.

Python also supports unenumerated named arguments by specifying a "catch all" argument, prefixed with two asterisks. For example:
    def Stooge2(larry, moe, **kw):
        ...

In this case, Stooge2 accepts two arguments, larry and moe, that may be specified either positionally or by name just as in the previous example. However, it also accepts any number of additional named arguments. For example, we could call the function as Stooge2(1, moe=2, shemp=3) or Stooge2(1, 2, shemp=3, curly=4). In both cases, as before, larry would start equal to 1 and moe would start equal to 2. However, now the kw argument would be populated with a dictionary mapping all other named parameters with their argument values. For example, it might contain {'shemp': 3, 'curly': 4}.

Before we move on to unnamed positional arguments, let me interrupt to touch on the point of this posting: how do you iterate over all named arguments whether they be enumerated or not?

If your function enumerates all accepted named arguments, then you can trivially get a dictionary mapping the argument names to their values if you call the builtin function locals() at the beginning of your function. For example:
    def WhichStooges(larry, moe, curly):
        stooges = locals()
        ...

This would populate stooges with a dictionary with keys, "larry", "moe", and "curly". You could then iterate through the arguments and their values with a standard loop over stooges.items().

Now, if you add unenumerated named arguments into the picture, it gets a bit trickier. The most straightforward way is to use the fact that "catch all" argument is a standard dictionary and update it from locals() at the beginning of the function:
    def WhichStooges2(larry, moe, **stooges):
        stooges.update(locals())
        ...

The only problem with this approach is that stooges still appears in the argument list, which is probably not what you want. This can be remedied like so:
    def WhichStooges2(larry, moe, **stooges):
        stooges.update(locals())
        del stooges['stooges']
        ...

Which only leaves the minor issue of the requirement for locals() to be called at the top of the function, before any other variables are defined in the function's scope. Wouldn't it be nice if we could enumerate the function arguments anywhere in the function? And wouldn't it be even better if we could encapsulate the logic for aggregating the function arguments into a utility function?

Before I get to the solution to those problems, for the sake of completeness I should cover unnamed positional arguments too. Unnamed positional arguments are additional positional arguments that are captured in a single argument by prefixing the argument name with a single asterisk (*) in Python. For example:
    def WhichStooges3(larry, moe, *args):
        ...

In this case, larry and moe may still be passed values either by name or position as in previous examples. In addition, extra values may be specified but they cannot be named. Calling this function as WhichStooges3(1, 2, 3, 4) causes larry to start with the value 1, moe to start with the value 2, and args to start as a tuple containing (3, 4). The rules for mixing unnamed positional arguments and named arguments are non-trivial and covered in the Python documentation so I won't rehash them here.

Finally, we can construct one utility function that returns a dictionary of all named parameters (enumerated or not) as well as a list of all unnamed positional parameters. By using Python's inspect module we can encapsulate the logic into a single common routine that can be called anywhere within a function's scope.
    def arguments():
        """Returns tuple containing dictionary of calling function's
        named arguments and a list of calling function's unnamed
        positional arguments.
        """
        from inspect import getargvalues, stack
        posname, kwname, args = getargvalues(stack()[1][0])[-3:]
        posargs = args.pop(posname, [])
        args.update(args.pop(kwname, []))
        return args, posargs

This routine removes the 'catch all' arguments (i.e. the positional catch all argument prefixed with a single asterisk and/or the keyword catch all argument prefixed with two asterisks) from the returned dictionary of named arguments for you.
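
For example, a quick sketch of what arguments() returns when all three argument styles are mixed:
    def WhichStooges4(larry, moe, *args, **kw):
        named, extra = arguments()
        # named -> {'larry': 1, 'moe': 2, 'shemp': 3}
        # extra -> (4, 5)
        return named, extra

    WhichStooges4(1, 2, 4, 5, shemp=3)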


Update 2009/09/29:
I updated the arguments() function to fix a bug that was brought to my attention by drewm1980's comment.

Friday, July 13, 2007

Parsing Japanese addresses

Last night Steven Bird, Ewan Klein, and Edward Loper gave a presentation about their Natural Language Toolkit at the monthly baypiggies meeting. The gist of the presentation seemed to be that their toolkit is just that: a set of basic tools commonly needed in implementing more complicated natural language processing algorithms and a set of corpora for training and benchmarking those algorithms. Given their background as academics, this makes sense as it allows them to quickly prototype and explore new algorithms as part of their research. However, I got the impression that a number of the attendees were hoping for more of a plug-and-play complete natural language processing solution they could integrate into other programs without needing to be versed in the latest research themselves.

When I get some time, I would like to try using NLTK to solve a recurring problem I encounter at work: parsing Japanese addresses. There is a commercial tool that claims to do a good job parsing Japanese postal addresses, but I've found the following python snippet does a pretty good job on the datasets I've been presented so far:
import re

# Beware of greedy matching in the following regex lest it
# fail to split 宮城県仙台市泉区市名坂字東裏97-1 properly
# as (宮城県, None, 仙台市, 泉区, 市名坂字東裏97-1)
# In addition, we have to handle 京都府 specially since its
# name contains 都 even though it is a 府.
_address_re = re.compile(
    ur'(京都府|.+?[都道府県])(.+郡)?(.+?[市町村])?(.+?区)?(.*)',
    re.UNICODE)

def splitJapaneseAddress(addrstr):
    """Splits a string containing a Japanese address into
    a tuple containing the prefecture (a.k.a. province),
    county, city, ward, and everything else.
    """
    m = _address_re.match(addrstr.strip())
    (province, county, city, ward, address) = m.groups()
    address = address.strip()
    # 東京都 is both a city and a prefecture.
    if province == u'東京都' and city is None:
        city = province
    return (province, county, city, ward, address)
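
For example, the address from the comment above splits like so:
print splitJapaneseAddress(u'宮城県仙台市泉区市名坂字東裏97-1')
# -> (u'宮城県', None, u'仙台市', u'泉区', u'市名坂字東裏97-1')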

I should add that, unlike English, it does not make sense to separate and store the Japanese street address as its own value since the full address string is commonly what is displayed. So even though the routine above returns the street address as the final tuple item, I never actually use the returned value for anything.

Anyway, as you can see this regular expression is pretty naive. During last night's meeting I kept thinking that I should put together a corpus of Japanese addresses and their proper parses so that I can experiment with writing a better parser. The Natural Language Toolkit seems to be designed for doing just this kind of experimentation. I'm hoping that next time I'm given a large dataset for import into our database at work I can justify the time to spend applying NLTK to the task.

Thursday, July 5, 2007

Python and MochiKit for the win

Last Friday we concluded the first ever NTT MCL prototyping contest. The rules were simple: we could form teams of up to 3 employees and had one month to prototype any idea we wanted. We had to submit an entry form with our idea and the team members at the beginning of the contest. The judging of submissions would not only consider the technical aspects of the idea but also the feasibility of developing it into a full-scale product to be sold/marketed by NTT Communications or one of our sister companies. Basically, cheap market research. :)

Obviously, I cannot go into the details of the submissions, except to say that one team (of three) implemented theirs in C++, one team (of three) used Java, another team (of two) used C and perl, and my team (of two) used Python and JavaScript. Of course, we all implemented our own project ideas so the amount of work per project varied greatly.

The verdict is in: my team won. And I think it was all thanks to the rapid prototyping made possible by modern scripting languages. The C/perl team dropped out at the last minute because a content provider their project depended on went off-line the day before the deadline and presentation. The other two teams (using C++ and Java) had interesting ideas and working prototypes, but in both cases the prototypes were just barely functional. It was a prototyping contest, so that is to be expected.

However, we demonstrated a fully-working dynamic web-based application with real-time updates (graphs and charts would literally change while you were looking at them in response to external data). Not to sound like I'm bragging, but it was polished.

I have to say, I haven't done full-on web development in years, and it was refreshing to see how much easier it has gotten. In particular, I found that I could apply more-or-less traditional client-server design with the client implemented in JavaScript and running in the browser, using AJAX to make requests to a JSON server implemented in python. MochiKit, as promised, made JavaScript suck less. Come to think of it, since I used Bob Ippolito's MochiKit on the client and his simplejson python module on the server, I guess you could say Bob won the competition for us.

Anyway, the one thing that really disappointed me was that no one asked how we did it. I didn't actually care whether we won or not, but I am really proud of the technology under the hood. I expected that, presenting to 20+ engineers at a research and development company, someone would say "cool, how did you do that?" To my chagrin, not one person asked (although my coworker Zach did come by my office later to ask). I know it is cheesy, but I was really hoping someone would ask if I used Ruby on Rails so I could respond, "no, it was Kelly on Caffeine." :)

In case anyone out there reading this is curious: I didn't use TurboGears, Pylons, or Django either. I'll be the first to admit it was just a prototype rather than a full-blown production web application, but I found the python cgi and wsgiref modules, flup's FastCGI WSGI server, and Bob Ippolito's simplejson module more than adequate to implement a fast JSON server that interfaced with a PostgreSQL database backend. No proprietary templating languages, no object-relational mappers trying (futilely) to apply python syntax to SQL, no cryptic stack traces through 18 layers of unfamiliar framework code. Just SQL queries, JSON, and good old fashioned client-server request handling (where requests came via CGI). All of the user interface components were implemented by logic that executed on the client side. I can't imagine any web application framework being faster, in terms of either developer time or CPU time.
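
For the curious, the server side amounted to little more than WSGI glue around simplejson. The sketch below is not the actual contest code: handle_request() is a made-up stand-in for the real application logic (which issued SQL queries against PostgreSQL), and it uses wsgiref's built-in development server rather than flup, but it shows the general shape:

import simplejson
from wsgiref.simple_server import make_server

def handle_request(params):
    # Stand-in for the real application logic; just echo the parameters.
    return {'echo': params}

def application(environ, start_response):
    # Read the JSON request body, dispatch it, and return a JSON response.
    length = int(environ.get('CONTENT_LENGTH') or 0)
    if length:
        params = simplejson.loads(environ['wsgi.input'].read(length))
    else:
        params = {}
    body = simplejson.dumps(handle_request(params))
    start_response('200 OK', [('Content-Type', 'application/json'),
                              ('Content-Length', str(len(body)))])
    return [body]

if __name__ == '__main__':
    make_server('localhost', 8000, application).serve_forever()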

Given the choice, however, I would have preferred to not have had to use JavaScript to implement the client. Suck "less" indeed.

Wednesday, July 4, 2007

Kupo! - Idempotent Perl Obfuscator

I wrote a perl obfuscator back in 2002 for use on a project at work. Jokes about perl's already obfuscated syntax aside, we had evaluated a number of alternatives including perlcc and perl2exe, but both failed to meet our needs for reasons I can't remember some 5 years later. Source filtering hacks such as Acme::Bleach were ruled out because they are trivially reversible.

Anyway, I finally got sign-off from my manager back in May of 2006 to open source our obfuscator under the GPL. It has been up on my public CVS repository ever since, but I just now finally got around to putting together its own site. That said, it appears the state of the art (such as it is) has advanced considerably since I first developed this tool in 2002. For example, while I have not evaluated it yet, Stunnix appears to have put together a superior perl obfuscator. I suspect that if Stunnix's product had been around in 2002, I would never have had to delve into the deepest darkest recesses of TIMTOWTDI to write our own.

If I had to do it all over again, I would probably have used Syntax::Highlight::Perl::Improved for the syntax parsing rather than trying to do it myself (in perl, no less). Hairball doesn't begin to describe perl's syntax. In any event, what is done is done. And, for better or for worse, it is now open sourced for others to use or improve:
http://www.posi.net/software/kupo/

The whole ordeal really drove home the importance of simple syntax in language design. Maybe one of these days I'll get a chance to write up my experiences with perl's implementation from when I was writing a library to simplify embedding perl in C programs (hint: perlembed doesn't even begin to cut it). The lesson I walked away from with that exercise was that the face a language presents to the world, i.e. its syntax, can tell you a lot about what its implementation looks like. And I don't want to go there again.

Sunday, June 24, 2007

Useless geolocation information?

I just ran across Windows' geographical location information API hidden amongst the national language support functions. One thing that caught my eye was that you can get latitude and longitude coordinates via the API. For example, in python:

>>> from ctypes import *
>>> nation = windll.kernel32.GetUserGeoID(16)
>>> buf = create_string_buffer(20)
>>> GEO_LATITUDE=0x0002
>>> GEO_LONGITUDE=0x0003
>>> windll.kernel32.GetGeoInfoA(nation, GEO_LATITUDE,
... byref(buf), sizeof(buf), 0)
6
>>> buf.value
'39.45'
>>> windll.kernel32.GetGeoInfoA(nation, GEO_LONGITUDE,
... byref(buf), sizeof(buf), 0)
8
>>> buf.value
'-98.908'

At first glance, this looks pretty cool. You can get geospatial coordinates on any Windows XP or Vista box. Except that the API only returns the latitude and longitude of the country selected in the "Regional and Language Options" control panel. That is: the coordinates returned are for the user's selected location (i.e. country).

Sure enough, plugging the returned latitude and longitude into Yahoo Maps, I find that the coordinates fall squarely in the middle of the country. A good thousand miles from where I sit in Palo Alto, California. Of course, being that I don't have a GPS receiver attached to my laptop, I didn't really expect the coordinates to be accurate.

For the life of me, I can't think of a single application of this information. But there it is: someone at Microsoft had to populate the database mapping countries to geospatial coordinates, and some engineer had to code the API to retrieve them. For what purpose? What can you do with the latitude and longitude of a country (whatever that is supposed to mean)? Can anyone think of a use for this data?

Wednesday, June 20, 2007

"Web 2.0" - Intelligent Design for the Internet

Personally, I hate the term "web 2.0".

Assigning a discrete version number to the web seems to completely miss the point of the evolutionary, organic growth of the World Wide Web that the "2.0" moniker is ostensibly trying to describe. It presumes that there was a clear-cut switch from the "1.0" web and the "2.0" web. There wasn't. That doesn't even make sense: the web isn't some singular entity that is being released by a central authority with new features on a periodic basis. Implying as much just demonstrates you don't know how the web works.

It is the web. There was no 1.0, there is no 2.0, there will never be a 2.1 or 3.0 or whatever. Any more so than there was a life 1.0, life 2.0, life 2.1 or whatever (Second Life jokes aside). Calling it "web 2.0" is the moral equivalent of intelligent design for the Internet. Sure, there are standardization bodies for the protocols the web is built on, but do you really believe the modern web itself can be "best explained by an intelligent cause, not an undirected process such as natural selection"? I don't think so.

This has been out for a few months now, but if you haven't already seen it, a group of Kansas State University students put together a movie that describes the current state of evolution of the web:

Unfortunately, they make the mistake of attaching the 2.0 misnomer to the web. Which is truly a shame since they seem to "get it" in all other respects. The web is us and the last time I checked, there was no us 2.0.

Tuesday, June 19, 2007

Python: property attribute tricks

Here is my twist on the Python Cookbook recipe for class property definition: instead of using a non-standard doc="" variable for passing the documentation string to the property() function, I use a decorator that copies the doc-string from the decorated function and passes that to property() explicitly. First, my decorator:

def prop(func):
    """Function decorator for defining property attributes

    The decorated function is expected to return a dictionary
    containing one or more of the following pairs:
        fget - function for getting attribute value
        fset - function for setting attribute value
        fdel - function for deleting attribute
    This can be conveniently constructed by the locals() builtin
    function; see:
    http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/205183
    """
    return property(doc=func.__doc__, **func())

An example of using this decorator to declare a managed attribute:

class MyExampleClass(object):
    @prop
    def foo():
        "The foo property attribute's doc-string"
        def fget(self):
            return self._foo
        def fset(self, value):
            self._foo = value
        return locals()
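
And a quick sanity check that it behaves as advertised (just a sketch, assuming the two definitions above):

obj = MyExampleClass()
obj.foo = 42
assert obj.foo == 42
assert MyExampleClass.foo.__doc__ == "The foo property attribute's doc-string"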

The only subtle difference between my method and the property attribute definition method described in the Python Cookbook is that my version lets you document the attribute using python's standard doc-string syntax.

Monday, June 18, 2007

Python 2.5 consumes more memory than 2.4.4

Since python 2.5 contains a new object allocator that is able to return memory to the operating system, I was surprised to find that python 2.5.1 consumes more memory at start-up than 2.4.4 does (at least on Windows). I had been assuming that 2.5.1 could only use the same or less memory than python 2.4.4. However, simply starting each interactive interpreter on my Windows XP machine, the Task Manager reports:
Version "Mem Usage" "VM Size"
Python 2.4.4 3440k 1788k
Python 2.5.1 5828k 3868k

The specific builds tested were:
  • Python 2.4.4 (#71, Oct 18 2006, 08:34:43) [MSC v.1310 32 bit (Intel)] on win32
  • Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32
Both were installed using the Windows installers from the python.org web site as of today.

Of course, depending on your application, I would expect python 2.5.1 to still use less memory overall. But what surprised me was how much of a handicap python 2.5.1 has after just loading the interpreter: it starts out consuming over 2 megabytes more memory.

Friday, June 15, 2007

PowerPoint centimeters

Me, I'm 5'10" tall. That would be approximately 178 centimeters. Or 168 PowerPoint centimeters.

Unfortunately, Google Calculator doesn't recognize PowerPoint centimeters as a unit yet so you'll need to whip out a calculator and do the conversion yourself:
1 inch = 2.4 PowerPoint centimeters
1 PowerPoint centimeter = 1.0583 centimeters

How did this unit come about? From Microsoft's Knowledge Base article 189826:
To improve its usability, PowerPoint slightly misdefines the size of a centimeter to make the invisible gridlines fall at convenient points on the ruler. With this conversion, there are 5 picas per centimeter and the gridlines fall at very convenient points on the ruler. So convenient, in fact, that working in the metric system is really easier than working in the English system.

Apparently, the gridlines in PowerPoint 97 fell on pica boundaries, which aligned nicely with the inch ruler since a pica is traditionally defined as 1/6 of an inch. But if you used PowerPoint 97 in one of those metric-using nations, the pica gridlines didn't fall on convenient points on the centimeter ruler. The solution: redefine the centimeter to be a whole number of picas, which are themselves an integral fraction of an inch. In other words, Microsoft redefined the PowerPoint centimeter in terms of the inch.
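
Putting numbers to that (taking the Knowledge Base's figures at face value; the 12-picas-per-inch value is the KB's own, as noted in the update below):

picas_per_inch = 12.0    # per the KB article (see the update below)
picas_per_pp_cm = 5.0    # "5 picas per centimeter" per the KB
pp_cm_per_inch = picas_per_inch / picas_per_pp_cm
print "1 inch = %.1f PowerPoint centimeters" % pp_cm_per_inch
print "1 PowerPoint centimeter = %.4f centimeters" % (2.54 / pp_cm_per_inch)
print "5 feet 10 inches = %.0f PowerPoint centimeters" % (70 * pp_cm_per_inch)

which prints 2.4, 1.0583, and 168, matching the conversions above.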

Microsoft had the arrogance to redefine an international standard measurement unit to fit their software's user interface. What makes the arrogance all the more palpable is that their Knowledge Base article acknowledging the existence of PowerPoint centimeters doesn't sound the least bit apologetic for the discrepancy.

Personally, I kind of prefer PowerPoint centimeters to metric centimeters: 168 PowerPoint centimeters tall is a lot easier to remember to an American like me than 177.8 metric centimeters tall is. <wink>

Update 2007/06/15 03:45pm:
My wife pointed out that apparently Microsoft redefined the size of a pica too: their Knowledge Base claims 12 picas to an inch, whereas there are supposed to be only 6 to an inch. I would like to suggest the name "PowerPicas" for these new pint-sized picas. That would make me 840 PowerPicas tall.

Python's unittest module ain't that bad

Collin Winter was kind enough to speak to BayPiggies last night about his unittest module replacement, test_harness. The basic premise of the talk was that unittest does not support extensions very well, so he wrote his own testing framework that does. The same argument is presented in Collin's blog posting titled "Python's unittest module sucks".

However, I have an issue with several of Collin's claimed deficiencies in unittest: they simply aren't there. For example, he claimed that extensions cannot be composed (i.e. multiple extensions cannot be applied to a single test case) easily. I raised the point in the meeting that python's decorators are trivially composable and the TODO annotation described in his blog is trivially implementable as a decorator, usable via unittest, nose, or just about any testing framework. In his presentation, he claimed TODO annotations required over a hundred lines of code across 5 classes to implement using unittest. This simply isn't true: I implemented it in 8 lines while he spoke; adding some polish it's up to 11 plus doc-string, but nowhere near 100 and there isn't a class in sight:

def TODO(func):
    """unittest test method decorator that ignores
    exceptions raised by test

    Used to annotate test methods for code that may
    not be written yet. Ignores failures in the
    annotated test method; fails if the test
    unexpectedly succeeds.
    """
    def wrapper(*args, **kw):
        try:
            func(*args, **kw)
            succeeded = True
        except:
            succeeded = False
        assert succeeded is False, \
            "%s marked TODO but passed" % func.__name__
    wrapper.__name__ = func.__name__
    wrapper.__doc__ = func.__doc__
    return wrapper

Collin demonstrated a platform-specific test annotation in his framework. He claimed this would require almost 200 lines of code to implement in unittest, but that too is an overstatement. I had it implemented before he could finish the slide:

def PlatformSpecific(platformList):
    """unittest test method decorator that only
    runs test method if os.name is in the
    given list of platforms
    """
    def decorator(func):
        import os
        def wrapper(*args, **kw):
            if os.name in platformList:
                return func(*args, **kw)
        wrapper.__name__ = func.__name__
        wrapper.__doc__ = func.__doc__
        return wrapper
    return decorator

The point is that python decorators are a language feature that allow you to trivially wrap any callable with another callable; the latter of which can perform any pre- or post- processing or even avoid calling the decorated function at all. You get transparent composition for free:

class ExampleTestCase(unittest.TestCase):
    @TODO
    def testToDo(self):
        MyModule.DoesNotExistYet('boo')

    @PlatformSpecific(('mac', ))
    def testMacOnly(self):
        MyModule.SomeMacSpecificFunction()

    @TODO
    @PlatformSpecific(('nt', 'ce'))
    def testComposition(self):
        MyModule.PukePukePuke()

(If you aren't familiar with decorators in python, IBM has a pretty thorough article on the subject.)

For the record, I also implemented a proof-of-concept of Collin's reference counting example in a similarly succinct decorator. In the example presented at BayPiggies, Collin ran the test case 5 times, checking a reference count after each run. I missed how he was getting reference counts (len(gc.get_referrers(...)) maybe?), so you need to fill in how to get your object reference counts:

def CheckReferences(func):
    def wrapper(*args, **kw):
        refCounts = []
        for i in range(5):
            func(*args, **kw)
            refCounts.append(XXXGetRefCount())  # fill in your own counter
        assert min(refCounts) == max(refCounts), \
            "reference counts changed"
    wrapper.__name__ = func.__name__
    wrapper.__doc__ = func.__doc__
    return wrapper

Adding the repetition count as a parameter would be trivial (see the PlatformSpecific decorator above for an example of how, or the sketch below). I hard-coded 5 repetitions since that is what Collin used in his presentation.
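
Something along these lines ought to do it (just a sketch; CheckReferencesN is a made-up name, and XXXGetRefCount is still the placeholder from above):

def CheckReferencesN(repetitions=5):
    def decorator(func):
        def wrapper(*args, **kw):
            refCounts = []
            for i in range(repetitions):
                func(*args, **kw)
                refCounts.append(XXXGetRefCount())  # placeholder, as above
            assert min(refCounts) == max(refCounts), \
                "reference counts changed"
        wrapper.__name__ = func.__name__
        wrapper.__doc__ = func.__doc__
        return wrapper
    return decorator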

In all, it took me about 5 minutes to write all three decorators and test them (admittedly, I already had unittest test cases to try my decorators on). I didn't tinker a bit nor was there any poking or prodding of unittest; I didn't subclass anything. You can implement these extensions using standard python syntax and the standard python unittest module. To claim otherwise is simply disingenuous.

Of course, Collin's test_harness module uses decorators too, so Collin was clearly aware of their existence. Which prompted me to question Collin's claims of 100+ lines to implement these features using unittest when simple decorators are sufficient. His response was that his numbers were the number of lines of code that would be necessary to implement the TODO and platform-specific annotations using the unittest module without decorators. Which seemed inconsistent with his examples, involving decorators, of how easy it is to use these annotations with test_harness. I wanted to ask him about this contradiction face-to-face after the presentation, but unfortunately he had to catch the Google shuttle home immediately after his talk.

One point that Collin did repeatedly come back to was that logging extensions cannot be implemented using decorators. For example, you cannot have the unittest module log the test run results to a database by wrapping test methods in decorators. In theory you just need to implement your own TestRunner and TestResult subclasses and pass the TestRunner subclass to unittest.TestProgram(). However, if Sebastian Rittau's XML logger TestRunner class for unittest is any indication, changing loggers is non-trivial.
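
For reference, the bare minimum looks something like the sketch below: a toy "logger" that just collects outcomes in a list (a real one might write them to a database), plus a runner to drive it. This is my own sketch, not Sebastian's code.

import unittest

class ListLoggingTestResult(unittest.TestResult):
    """TestResult subclass that records (test, outcome) pairs."""
    def __init__(self):
        unittest.TestResult.__init__(self)
        self.log = []
    def addSuccess(self, test):
        unittest.TestResult.addSuccess(self, test)
        self.log.append((str(test), 'ok'))
    def addFailure(self, test, err):
        unittest.TestResult.addFailure(self, test, err)
        self.log.append((str(test), 'FAIL'))
    def addError(self, test, err):
        unittest.TestResult.addError(self, test, err)
        self.log.append((str(test), 'ERROR'))

class ListLoggingTestRunner:
    """Minimal TestRunner: run the suite, then report from the log."""
    def run(self, test):
        result = ListLoggingTestResult()
        test(result)
        for name, outcome in result.log:
            print name, outcome
        return result

if __name__ == '__main__':
    unittest.main(testRunner=ListLoggingTestRunner())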

Collin said in his presentation, and I would have to agree, extending unittest logging is painful; composing multiple loggers is prohibitively painful. Of course, if more TestRunner implementations were included in the python standard library, half of this argument would be moot as there would be less need to extend. Right now, only a text logger TestRunner is included.

But to be honest, I don't expect most people really need to replace the logging mechanism (which may be why the standard library doesn't include more loggers). Marking tests as TODO or platform-specific or whatever is pretty universal; recording test run results to a database for analysis is probably far outside the realm of what most people (or even companies outside the Fortune 500 for that matter) need from their test framework. Which may be more of a comment on the sad state of testing than anything, but I digress. In any event, Collin's re-implementation adds value by facilitating logger composition, but to say that not facilitating logger composition makes unittest "suck" seems like a gross overstatement to me.

In all, I left BayPiggies last night having thought a lot more about unittest than I ever have before. And I can't help but think that, for the vast majority of us python hackers down in the trenches, python's unittest ain't that bad.

Update 2007/06/15 10:05am:
I found the lines-of-code numbers quoted in Collin's presentation in his blog also. My memory was pretty close on the supposed ~100 lines to implement TODO annotations. But it looks like I may have confused his ~200 lines quote for implementing composition of TODO and reference counting with the supposed number of lines to implement platform-specific test annotations. To be clear, though, composition using decorators as I described above requires 0 core classes and 2 lines of code (see my testComposition example above).

Sunday, June 10, 2007

Keyword arguments in XML-RPC

This isn't the least bit novel (I know I've been using this trick for years), but nonetheless here is a way to simulate named arguments in XML-RPC. XML-RPC only natively supports positional parameters, but by passing a single positional argument that is itself an XML-RPC struct (which is actually a mapping), you can simulate named and/or optional arguments. Rather than reproduce a sample XML-RPC document demonstrating this usage, I'll refer you to one of my earlier posts that utilized this technique; you'll see that the method is called with two named parameters: path and args.

If you are familiar with perl, you may also be aware of the trick perl 5, which likewise only natively supports positional arguments, uses to simulate named parameters. In perl 5, it is common to pass a hash of name/value pairs as arguments. However, what perl actually does behind the scenes, and which is different from this XML-RPC trick, is to serialize the hash into an array of alternating names and values; it then passes this array as the positional argument list for the subroutine being called. The called subroutine then de-serializes the name and value pairs from the argument array, reconstructing the original hash. This flattening of a hash has to be a documented protocol between the subroutine and its callers.

Of course, you could do exactly the same thing using XML-RPC: serialize the argument dictionary into an array of alternating names and values and populate the method's param list with this array's elements. The XML-RPC server method could then reconstruct the original dictionary from the param list.

But XML-RPC also supports passing dictionaries directly, using the struct data type. Hence my original suggestion. Since supporting named (or optional) arguments requires documenting a protocol between the caller and the method anyway, we might as well make that protocol as straightforward as possible: rather than serialize and deserialize a dictionary of named arguments, just pass the dictionary as-is, as the one and only positional argument.
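
To make the convention concrete, here is a minimal sketch using the standard library's SimpleXMLRPCServer and xmlrpclib. The method name list_directory is made up for illustration; the path and args parameters mirror the earlier post mentioned above.

# server.py -- run this in one process
from SimpleXMLRPCServer import SimpleXMLRPCServer

def list_directory(params):
    # 'params' is the single positional argument: an XML-RPC struct
    # (marshalled from a python dict) carrying the "named" arguments.
    path = params['path']             # required "named" argument
    args = params.get('args', [])     # optional "named" argument
    return 'listing %s with options %r' % (path, args)

server = SimpleXMLRPCServer(('localhost', 8001))
server.register_function(list_directory)
server.serve_forever()

# client.py -- run this in another process
import xmlrpclib

proxy = xmlrpclib.ServerProxy('http://localhost:8001/')
# The dict is marshalled as a single XML-RPC struct parameter.
print proxy.list_directory({'path': '/tmp', 'args': ['-l']})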