The Other Kelly Yancey: July 2007

Monday, July 23, 2007

Python: Serializer benchmarks

I am working on a project in which clients will be submitting more data than my current server implementation knows what to do with. The current implementation doesn't use all of the submitted data because I won't know what the quality of the data will be until the client is deployed in the wild. I want to record all of the submitted data, though, in the expectation that a future implementation will be able to use it. So I was considering formats for logging the submitted data such that it would be easy to parse in the future.

Since I'm already storing a subset of the submitted data in a database, the most obvious solution is to make a table of submissions which has a column for each submitted data element. However, it turns out that this is quite slow and given that I'm not sure how much of the extra data I'll ever need or when I may update the server implementation to use it, I hate to pay a hefty price to store it now. For now, I can consider the data write-only. If and when I need to use that data, then I can write an import script that updates the server database using the saved data.

So I've been considering simply logging the submissions to a file. It is considerably faster to append to a flat file than it is to write to a database -- which makes sense since the database supports read/write access, whereas I only need write-only access for now.

The next question is what format to use when writing the data to the log file. I have a python dictionary of the submitted data; at first I considered writing the dictionary to the log file in JSON format. The JSON format is relatively easy to convert to/from python data structures and python has quality implementations to do it. Furthermore, unlike the pickle text format, it is trivial to visually interpret the serialized data. This latter point is also important to me since I need to be able to judge the quality of the data in order to discern what portions I can use in the future.
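A minimal sketch of the append-only JSON log described above, assuming one JSON document per line. It is written for Python 3 and uses the standard-library json module as a stand-in for simplejson; the helper names log_submission and read_submissions, the file name, and the field names are all hypothetical.

```python
import json
import os
import tempfile

def log_submission(logfile, submission):
    """Append one submission dictionary to an open log file as one JSON line."""
    logfile.write(json.dumps(submission, sort_keys=True) + "\n")

def read_submissions(path):
    """Yield each logged submission back as a dictionary.

    This is the part a future import script would build on.
    """
    with open(path) as logfile:
        for line in logfile:
            if line.strip():
                yield json.loads(line)

# Demonstration against a throwaway file; the field names are made up.
log_path = os.path.join(tempfile.mkdtemp(), "submissions.log")
with open(log_path, "a") as logfile:
    log_submission(logfile, {"client": "alpha", "speed": 42})
    log_submission(logfile, {"client": "beta", "speed": 17})

recovered = list(read_submissions(log_path))
```

Because each record sits on its own line, the file can be skimmed by eye or with ordinary text tools, which is the property that made JSON attractive here in the first place.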

However, to my chagrin, it turns out that the JSON module I have been using, simplejson, is slower than I had imagined. Profiling of my server implementation found that, after the database update logic, serializing the submitted data into JSON format was my second largest consumer of CPU cycles. I hate the thought of wasting so much time logging the data when it is an operation that is essentially pure overhead.

Hence I started considering other serialization formats, benchmarking them as I went. Here are the results of my benchmarks:

Serializer	Run 1 (secs)	Run 2 (secs)	Mean (secs)
pyYAML 3.05	21953.18	25482.61	23717.89
pySyck 0.61.2	3107.06	2805.38	2956.22
pprint	2364.91	2368.42	2366.67
pickle	1509.31	1665.16	1587.23
pickle/protocol=2	1359.40	1330.71	1345.05
simplejson 1.7.1	710.78	604.13	657.46
cPickle	159.27	172.26	165.77
repr	73.50	77.24	75.37
cjson 1.0.3	63.94	74.28	69.11
cPickle/protocol=2	50.97	57.72	54.34
marshal	12.52	13.32	12.92

All numbers were obtained using the timeit module to serialize the dictionary created by the expression "dict([ (str(n), n) for n in range(100) ])".
The tests were run under Python 2.5 (r25:51908, Mar 3 2007, 15:40:46) built using [GCC 3.4.6 [FreeBSD] 20060305] on freebsd6. The simplejson, cjson, pyYAML, and pySyck modules were installed from their respective FreeBSD ports (I had to update the FreeBSD pySyck port to install 0.61.2 since it otherwise installs 0.55).
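A minimal sketch of how a comparison like this could be set up with timeit. It is written for Python 3 and limited to standard-library serializers, which is an assumption on both counts: the figures above came from Python 2.5, and cPickle, simplejson, cjson, pyYAML, and pySyck are not covered. The benchmark helper and its iteration count are hypothetical, and absolute timings will not match the table.

```python
import json
import marshal
import pickle
import pprint
import timeit

# The same test dictionary the benchmarks above serialize.
data = dict([(str(n), n) for n in range(100)])

# Standard-library serializers only; each entry is a zero-argument callable
# so that timeit can invoke it directly.
serializers = {
    "pprint": lambda: pprint.pformat(data),
    "pickle": lambda: pickle.dumps(data, protocol=0),
    "pickle/protocol=2": lambda: pickle.dumps(data, protocol=2),
    "json": lambda: json.dumps(data),
    "repr": lambda: repr(data),
    "marshal": lambda: marshal.dumps(data),
}

def benchmark(number=1000):
    """Return {serializer name: seconds taken for `number` serializations}."""
    return dict((name, timeit.timeit(func, number=number))
                for name, func in serializers.items())

if __name__ == "__main__":
    results = benchmark()
    # Slowest first, mirroring the ordering of the table.
    for name, secs in sorted(results.items(), key=lambda item: -item[1]):
        print("%-20s %10.4f" % (name, secs))
```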

I guess I should not have been surprised, but it turns out that simply calling repr() on the dictionary is almost 9 times faster than calling simplejson.dumps(). In fact, taking repr() as a baseline (100%), I calculated how long each of the other serializers took relative to repr():

Serializer	Mean (secs)	Relative to Baseline
pyYAML 3.05	23717.89	31469%
pySyck 0.61.2	2956.22	3922%
pprint	2366.67	3140%
pickle	1587.23	2106%
pickle/protocol=2	1345.05	1785%
simplejson 1.7.1	657.46	872%
cPickle	165.77	220%
repr	75.37	100%
cjson 1.0.3	69.11	91.7%
cPickle/protocol=2	54.34	72.1%
marshal	12.92	17.1%

The numbers in the last column express each serializer's mean time as a percentage of the time repr() took, so values above 100% are slower than repr() and values below 100% are faster.
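The last column can be reproduced from the mean timings alone. A short sketch in Python 3, with the values copied from the table (the variable names are illustrative):

```python
# Mean timings in seconds, copied from the table above.
mean_secs = {
    "pyYAML 3.05": 23717.89,
    "pySyck 0.61.2": 2956.22,
    "pprint": 2366.67,
    "pickle": 1587.23,
    "pickle/protocol=2": 1345.05,
    "simplejson 1.7.1": 657.46,
    "cPickle": 165.77,
    "repr": 75.37,
    "cjson 1.0.3": 69.11,
    "cPickle/protocol=2": 54.34,
    "marshal": 12.92,
}

# Each serializer's time as a percentage of the repr() baseline.
baseline = mean_secs["repr"]
relative = dict((name, 100.0 * secs / baseline)
                for name, secs in mean_secs.items())

for name, pct in sorted(relative.items(), key=lambda item: -item[1]):
    print("%-20s %8.1f%%" % (name, pct))
```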

So now I'm thinking of sticking with JSON as my log format, but using the cjson module rather than simplejson. cPickle's latest binary format (protocol=2) is even faster, but I would lose the ability to visually scan the log file to get a feel for the quality of the data I'm not currently using.

Now, before I get a horde of comments I should point out that I am aware that simplejson has an optional C speedups module. Unfortunately, it does not appear to be installed by default on either FreeBSD (my server) or on Windows (my current client). I wouldn't be the least bit surprised if the C version of simplejson is just as fast as the cjson module, but it doesn't matter if it isn't installed. As such, it looks like I'll be switching to cjson for my JSON serialization needs from now on.

Update 2007/07/25 07:07pm:: In response to paddy3118's comment, I added benchmarks for the python pprint module to the tables above.
Update 2007/07/27 12:26pm:: In response to David Niergarth's comment, I added benchmarks for pyYAML 3.05 and pySyck 0.61.2.

Sunday, July 22, 2007

Python: Mapping arguments to their default values

Using inspect.getargspec or the getargspec implementation in my previous post, we can build a dictionary mapping a callable's argument names to their default values:

def getargmap(obj, default=None):
    """Get dictionary mapping callable's arguments to their
       default values

    Arguments without default values in the callable's argument
    specification are mapped to the value given by the default
    argument.
    """
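    spec = getargspec(obj)
    argmap = dict.fromkeys(spec[0], default)
    # spec[3] is None when the callable declares no default
    # values, so guard against that before slicing.
    defaults = spec[3] or ()
    if defaults:
        argmap.update(zip(spec[0][-len(defaults):], defaults))
    return argmap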
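
A usage sketch for readers on Python 3, where inspect.getargspec is no longer available and inspect.getfullargspec plays the same role for plain functions. This is an adaptation under that assumption rather than the code above; the connect function is a made-up example.

```python
import inspect

def getargmap(func, default=None):
    """Map a function's argument names to their default values.

    Arguments without a default are mapped to `default`.
    """
    spec = inspect.getfullargspec(func)
    argmap = dict.fromkeys(spec.args, default)
    if spec.defaults:
        # Defaults always belong to the last len(defaults) arguments.
        argmap.update(zip(spec.args[-len(spec.defaults):], spec.defaults))
    return argmap

# Hypothetical function used only to exercise getargmap().
def connect(host, port=5432, timeout=30):
    pass

argmap = getargmap(connect)
```

Here argmap maps host to None (no default declared) and port and timeout to 5432 and 30.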

Python: A (more) generic getargspec()

In my last post, I presented a generic way to get the arguments passed to the current function such that you can iterate through them. This time, I present a way to get the arguments that a callable accepts/expects. Actually, the standard inspect module already has a getargspec() function that returns the argument specification of a function. However, it only works on functions and methods, not any other python callable. It turns out that there is no way to get the argument specification for built-in callables, but we can implement a version of getargspec() that can get the specification for classes and callable objects:

import inspect

def getargspec(obj):
    """Get the names and default values of a callable's
       arguments

    A tuple of four things is returned: (args, varargs,
    varkw, defaults).
      - args is a list of the argument names (it may
        contain nested lists).
      - varargs and varkw are the names of the * and
        ** arguments or None.
      - defaults is a tuple of default argument values
        or None if there are no default arguments; if
        this tuple has n elements, they correspond to
        the last n elements listed in args.

    Unlike inspect.getargspec(), can return argument
    specification for functions, methods, callable
    objects, and classes.  Does not support builtin
    functions or methods.
    """
    if not callable(obj):
        raise TypeError, "%s is not callable" % type(obj)
    try:
        if inspect.isfunction(obj):
            return inspect.getargspec(obj)
        elif hasattr(obj, 'im_func'):
            # For methods or classmethods drop the first
            # argument from the returned list because
            # python supplies that automatically for us.
            # Note that this differs from what
            # inspect.getargspec() returns for methods.
            # NB: We use im_func so we work with
            #     instancemethod objects also.
            spec = list(inspect.getargspec(obj.im_func))
            spec[0] = spec[0][1:]
            return tuple(spec)
        elif inspect.isclass(obj):
            return getargspec(obj.__init__)
        elif isinstance(obj, object) and \
             not isinstance(obj, type(getargspec.__get__)):
            # We already know the instance is callable,
            # so it must have a __call__ method defined.
            # Return the arguments it expects.
            return getargspec(obj.__call__)
    except NotImplementedError:
        # If a nested call to our own getargspec()
        # raises NotImplementedError, re-raise the
        # exception with the real object type to make
        # the error message more meaningful (the caller
        # only knows what they passed us; they shouldn't
        # care what aspect(s) of that object we actually
        # examined).
        pass
    raise NotImplementedError, \
          "do not know how to get argument list for %s" % \
          type(obj)

This version returns exactly the same argument specification tuple as inspect's getargspec() does with one notable exception: if called on a method, the argument list returned in the first tuple element will not include the implicit 'self' argument. The reason is that python implicitly supplies that argument so the caller does not pass it explicitly. I find it more useful to only return the argument specification as seen by callers. If you need a drop-in replacement for inspect.getargspec(), then you will need to slightly modify the method/classmethod case to not remove the first element in the argument list.
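
For comparison, a sketch of the same caller's-eye view in Python 3, where inspect.signature accepts functions, bound methods, classes, and callable objects, and omits the implicit first argument for the latter three. The Greeter class and the caller_args helper are hypothetical names used only for illustration.

```python
import inspect

class Greeter(object):
    """Made-up class with a constructor, a method, and __call__."""

    def __init__(self, greeting, punctuation="!"):
        self.greeting = greeting
        self.punctuation = punctuation

    def __call__(self, name, shout=False):
        text = "%s, %s%s" % (self.greeting, name, self.punctuation)
        if shout:
            return text.upper()
        return text

    def rename(self, greeting):
        self.greeting = greeting

def caller_args(obj):
    """Return the parameter names a caller has to think about."""
    return list(inspect.signature(obj).parameters)

class_args = caller_args(Greeter)                    # from __init__, no self
instance_args = caller_args(Greeter("Hello"))        # from __call__, no self
method_args = caller_args(Greeter("Hello").rename)   # bound method, no self
```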

Monday, July 16, 2007

Python: Aggregating function arguments

Python has three ways to pass arguments to functions: enumerated named arguments, unenumerated named arguments, and unnamed positional arguments. Enumerated named arguments are familiar to most programmers since most modern languages use this style of naming arguments (perl being a notable exception). For example, the following function specifies that it accepts 3 arguments, and assigns those arguments the names larry, moe, and curly for the scope of the function:

    def Stooge(larry, moe, curly):
       ...

If you call this function as Stooge(1, 2, 3) then the variable named larry equals 1, moe equals 2, and curly equals 3 when the function Stooge starts. Python, unlike C++ and Java, also allows you to specify the arguments explicitly by name; calling the function as Stooge(moe=2, curly=3, larry=1) or Stooge(1, 2, curly=3) again causes larry to equal 1, moe to equal 2, and curly to equal 3 when the function starts. I call this form of argument passing enumerated named arguments since names are assigned to each argument and all acceptable arguments are enumerated in the function declaration.
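A runnable version of the three call styles just described. The function body, which simply returns its arguments, is an addition so the result can be inspected.

```python
def Stooge(larry, moe, curly):
    # Return the arguments so each call style can be compared.
    return (larry, moe, curly)

positional = Stooge(1, 2, 3)
by_name = Stooge(moe=2, curly=3, larry=1)
mixed = Stooge(1, 2, curly=3)
```

All three calls bind larry to 1, moe to 2, and curly to 3.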

Python also supports unenumerated named arguments by specifying a "catch all" argument, prefixed with two asterisks. For example:

    def Stooge2(larry, moe, **kw):
       ...

In this case, Stooge2 accepts two arguments, larry and moe, that may be specified either positionally or by name just as in the previous example. However, it also accepts any number of additional named arguments. For example, we could call the function as Stooge2(1, moe=2, shemp=3) or Stooge2(1, 2, shemp=3, curly=4). In both cases, as before, larry would start equal to 1 and moe would start equal to 2. However, now the kw argument would be populated with a dictionary mapping all other named parameters to their argument values. For the second call, it would contain {'shemp': 3, 'curly': 4}.
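The same two calls as runnable code; the return statement is an addition for inspection.

```python
def Stooge2(larry, moe, **kw):
    # kw collects every named argument other than larry and moe.
    return larry, moe, kw

first = Stooge2(1, moe=2, shemp=3)
second = Stooge2(1, 2, shemp=3, curly=4)
```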

Before we move on to unnamed positional arguments, let me interrupt to touch on the point of this posting: how do you iterate over all named arguments whether they be enumerated or not?

If your function enumerates all accepted named arguments, then you can trivially get a dictionary mapping the argument names to their values if you call the builtin function locals() at the beginning of your function. For example:

    def WhichStooges(larry, moe, curly):
       stooges = locals()
       ...

This would populate stooges with a dictionary with keys, "larry", "moe", and "curly". You could then iterate through the arguments and their values with a standard loop over stooges.items().
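A small runnable illustration of the locals() approach. The dict() copy and the return statement are additions for demonstration; the copy keeps the snapshot independent of anything the function binds afterwards.

```python
def WhichStooges(larry, moe, curly):
    # Taken first, before any other local name is bound.
    stooges = dict(locals())
    unrelated = "bound after the snapshot, so not included"
    return stooges

stooges = WhichStooges(1, 2, 3)
for name, value in sorted(stooges.items()):
    print("%s = %r" % (name, value))
```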

Now, if you add unenumerated named arguments into the picture, it gets a bit trickier. The most straightforward way is to use the fact that the "catch all" argument is a standard dictionary and update it from locals() at the beginning of the function:

    def WhichStooges2(larry, moe, **stooges):
       stooges.update(locals())
       ...

The only problem with this approach is that stooges still appears in the argument list, which is probably not what you want. This can be remedied like so:

    def WhichStooges2(larry, moe, **stooges):
       stooges.update(locals())
       del stooges['stooges']
       ...

Which only leaves the minor issue of the requirement for locals() to be called at the top of the function, before any other variables are defined in the function's scope. Wouldn't it be nice if we could enumerate the function arguments anywhere in the function? And wouldn't it be even better if we could encapsulate the logic for aggregating the function arguments into a utility function?
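A short demonstration of both points: the merged dictionary the recipe produces, and what goes wrong when locals() is consulted after other variables exist. LateLocals is a hypothetical function added only to show the second point.

```python
def WhichStooges2(larry, moe, **stooges):
    stooges.update(locals())
    del stooges['stooges']
    return stooges

def LateLocals(larry, moe):
    count = 2                 # an ordinary local, not an argument
    return dict(locals())     # too late: count is picked up as well

merged = WhichStooges2(1, 2, shemp=3)
polluted = LateLocals(1, 2)
```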

Before I get to the solution to those problems, for the sake of completeness I should cover unnamed positional arguments too. Unnamed positional arguments are additional positional arguments that are captured in a single tuple argument by prefixing the argument name with a single asterisk (*) in Python. For example:

    def WhichStooges3(larry, moe, *args):
       ...

In this case, larry and moe may still be passed values either by name or position as in previous examples. In addition, additional values may be specified but they cannot be named. Calling this function as WhichStooges3(1, 2, 3, 4) causes larry to start with the value 1, moe to start with the value 2, and args to start as the tuple (3, 4). The rules for mixing unnamed positional arguments and named arguments are non-trivial and covered in the Python documentation so I won't rehash them here.
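As runnable code, with a return statement added for inspection:

```python
def WhichStooges3(larry, moe, *args):
    # args holds any positional values beyond larry and moe.
    return larry, moe, args

result = WhichStooges3(1, 2, 3, 4)
no_extras = WhichStooges3(1, 2)
```

When no extra positional values are passed, args is simply an empty tuple.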

Finally, we can construct one utility function that returns a dictionary of all named parameters (enumerated or not) as well as a list of all unnamed positional parameters. By using Python's inspect module we can encapsulate the logic into a single common routine that can be called anywhere within a function's scope.

    def arguments():
        """Returns tuple containing dictionary of calling function's
           named arguments and a list of calling function's unnamed
           positional arguments.
        """
        from inspect import getargvalues, stack
        posname, kwname, args = getargvalues(stack()[1][0])[-3:]
        posargs = args.pop(posname, [])
        args.update(args.pop(kwname, []))
        return args, posargs

This routine removes the 'catch all' arguments (i.e. the positional catch all argument prefixed with a single asterisk and/or the keyword catch all argument prefixed with two asterisks) from the returned dictionary of named arguments for you.
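
A variant sketched for Python 3, under two assumptions: the interpreter is CPython (so inspect.currentframe() is available), and only declared parameters should be reported. The locals mapping returned by inspect.getargvalues holds every local bound so far, so this sketch copies just the declared names into a fresh dictionary instead of modifying that mapping. The demo function is hypothetical.

```python
import inspect

def arguments():
    """Return (named, positional) arguments of the calling function.

    named is a dict of declared and ** arguments; positional is a
    list of the extra * arguments.
    """
    frame = inspect.currentframe().f_back
    try:
        names, posname, kwname, local_vars = inspect.getargvalues(frame)
        # Copy only declared parameters; other locals are ignored.
        named = dict((name, local_vars[name]) for name in names)
        posargs = list(local_vars[posname]) if posname else []
        if kwname:
            named.update(local_vars[kwname])
        return named, posargs
    finally:
        # Drop the frame reference to avoid a reference cycle.
        del frame

def demo(larry, moe, *rest, **others):
    temp = "a local bound before arguments() is called"
    return arguments()

named, posargs = demo(1, 2, 3, 4, shemp=5)
```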

Update 2009/09/29:: I updated the arguments() function to fix a bug that was brought to my attention by drewm1980's comment.

Friday, July 13, 2007

Parsing Japanese addresses

Last night Steven Bird, Ewan Klein, and Edward Loper gave a presentation about their Natural Language Toolkit at the monthly baypiggies meeting. The gist of the presentation seemed to be that their toolkit is just that: a set of basic tools commonly needed in implementing more complicated natural language processing algorithms and a set of corpora for training and benchmarking those algorithms. Given their background as academics, this makes sense as it allows them to quickly prototype and explore new algorithms as part of their research. However, I got the impression that a number of the attendees were hoping for more of a plug-and-play complete natural language processing solution they could integrate into other programs without needing to be versed in the latest research themselves.

When I get some time, I would like to try using NLTK to solve a recurring problem I encounter at work: parsing Japanese addresses. There is a commercial tool that claims to do a good job parsing Japanese postal addresses, but I've found the following python snippet does a pretty good job on the datasets I've been presented so far:

  import re

  # Beware of greedy matching in the following regex lest it
  # fail to split 宮城県仙台市泉区市名坂字東裏97-1 properly
  # as (宮城県, None, 仙台市, 泉区, 市名坂字東裏97-1)
  # In addition, we have to handle 京都府 specially since its
  # name contains 都 even though it is a 府.
  _address_re = re.compile(
     ur'(京都府|.+?[都道府県])(.+郡)?(.+?[市町村])?(.+?区)?(.*)',
     re.UNICODE)
  def splitJapaneseAddress(addrstr):
     """Splits a string containing a Japanese address into
        a tuple containing the prefecture (a.k.a. province),
        county, city, ward, and everything else.
     """
     m = _address_re.match(addrstr.strip())
     if m is None:
         # No prefecture suffix found, so there is nothing
         # sensible to split.
         raise ValueError("unrecognized address: %r" % addrstr)
     (province, county, city, ward, address) = m.groups()
     address = address.strip()
     # 東京都 is both a city and a prefecture.
     if province == u'東京都' and city is None:
         city = province
     return (province, county, city, ward, address)

I should add that, unlike English, it does not make sense to separate and store the Japanese street address as its own value since the full address string is commonly what is displayed. So even though the routine above returns the street address as the final tuple item, I never actually use the returned value for anything.

Anyway, as you can see this regular expression is pretty naive. During last night's meeting I kept thinking that I should put together a corpus of Japanese addresses and their proper parses so that I can experiment with writing a better parser. The Natural Language Toolkit seems to be designed for doing just this kind of experimentation. I'm hoping that next time I'm given a large dataset for import into our database at work I can justify the time to spend applying NLTK to the task.
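
A rough sketch of what such an experiment could start from, before NLTK enters the picture: a Python 3 port of the regular expression above, checked against a tiny hand-labelled corpus. The names split_address, CORPUS, and mismatches are hypothetical, and the three entries are illustrative examples rather than a real dataset.

```python
import re

# Python 3 port of the pattern above; str literals are already Unicode.
ADDRESS_RE = re.compile(
    r'(京都府|.+?[都道府県])(.+郡)?(.+?[市町村])?(.+?区)?(.*)')

def split_address(addrstr):
    """Return (prefecture, county, city, ward, rest), or None if no match."""
    m = ADDRESS_RE.match(addrstr.strip())
    if m is None:
        return None
    province, county, city, ward, rest = m.groups()
    # 東京都 is both a city and a prefecture.
    if province == '東京都' and city is None:
        city = province
    return (province, county, city, ward, rest.strip())

# Illustrative corpus of (address, expected parse) pairs.
CORPUS = [
    ('宮城県仙台市泉区市名坂字東裏97-1',
     ('宮城県', None, '仙台市', '泉区', '市名坂字東裏97-1')),
    ('東京都千代田区丸の内1-1',
     ('東京都', None, '東京都', '千代田区', '丸の内1-1')),
    ('愛知県知多郡東浦町緒川',
     ('愛知県', '知多郡', '東浦町', None, '緒川')),
]

def mismatches(corpus):
    """Return (address, expected, actual) for every wrong parse."""
    return [(addr, expected, split_address(addr))
            for addr, expected in corpus
            if split_address(addr) != expected]

failures = mismatches(CORPUS)
```

Growing the corpus and watching the failures list is a cheap way to compare this regex against any replacement parser.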

Thursday, July 5, 2007

Python and MochiKit for the win

Last Friday we concluded the first ever NTT MCL prototyping contest. The rules were simple: we could form teams of up to 3 employees and had one month to prototype any idea we wanted. We had to submit an entry form with our idea and the team members at the beginning of the contest. The judging of submissions would not only consider the technical aspects of the idea but also the feasibility of developing it into a full-scale product to be sold/marketed by NTT Communications or one of our sister companies. Basically, cheap market research. :)

Obviously, I cannot go into the details of the submissions, except to say that one team (of three) implemented theirs in C++, one team (of three) used Java, another team (of two) used C and perl, and my team (of two) used Python and JavaScript. Of course, we all implemented our own project ideas so the amount of work per project varied greatly.

The verdict is in: my team won. And I think it was all thanks to the rapid prototyping made possible by modern scripting languages. The C/perl team dropped out at the last minute because a content provider their project depended on went off-line the day before the deadline and presentation. The other two teams (using C++ and Java) had interesting ideas and working prototypes, but in both cases the prototypes were just barely functional. It was a prototyping contest, so that is to be expected.

However, we demonstrated a fully-working dynamic web-based application with real-time updates (graphs and charts would literally change while you were looking at them in response to external data). Not to sound like I'm bragging, but it was polished.

I have to say, I haven't done full-on web development in years, and I was refreshed at how much easier it has gotten. In particular, I found that I could apply more-or-less traditional client-server design with the client implemented in JavaScript and running in the browser, using AJAX to make requests to a JSON server implemented in python. MochiKit, as promised, made JavaScript suck less. Come to think of it, since I used Bob Ippolito's MochiKit on the client and simplejson python module on the server, I guess you could say Bob won the competition for us.

Anyway, the one thing that really disappointed me was that no one asked how we did it. I didn't actually care whether we won or not, but I am really proud of the technology under the hood. I expected that, presenting to 20+ engineers at a research and development company, someone would say "cool, how did you do that?" To my chagrin, not one person asked (although my coworker Zach came by my office later to ask). I know it is cheesy, but I was really hoping someone would ask if I used Ruby on Rails so I could respond, "no, it was Kelly on Caffeine." :)

In case anyone out there reading this is curious: I didn't use TurboGears, Pylons, or Django either. I'll be first to admit it was just a prototype rather than a full-blown production web application, but I found the python cgi and wsgiref modules, flup's FastCGI WSGI server, and Bob Ippolito's simplejson module more than adequate to implement a fast JSON server that interfaced with a PostgreSQL database backend. No proprietary templating languages, no object-relational mappers trying (futilely) to apply python syntax to SQL, no cryptic stack traces through 18 layers of unfamiliar framework code. Just SQL queries, JSON, and good old fashioned client-server request handling (where requests came via CGI). All of the user interface components were implemented by logic that executed on the client-side. I can't imagine any web application framework being faster either in terms of developer or CPU time.
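
To give a feel for how little code that style needs, here is a minimal sketch of a JSON-answering WSGI application in Python 3. It is not the contest code: json_app and call_app are hypothetical names, the standard-library json module stands in for simplejson, the payload is made up, and the flup/FastCGI and PostgreSQL pieces are left out.

```python
import json
from wsgiref.util import setup_testing_defaults

def json_app(environ, start_response):
    """WSGI application that answers every request with a JSON document."""
    # Made-up payload; a real handler would run its SQL queries here.
    payload = {"path": environ.get("PATH_INFO", "/"), "status": "ok"}
    body = json.dumps(payload).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]

def call_app(app, path):
    """Invoke a WSGI app in-process, without starting a server."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], json.loads(body.decode("utf-8"))

status, headers, reply = call_app(json_app, "/stats")
```

The same json_app callable could be handed to wsgiref.simple_server.make_server for local testing, or to any other WSGI server.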

Given the choice, however, I would have preferred to not have had to use JavaScript to implement the client. Suck "less" indeed.

Wednesday, July 4, 2007

Kupo! - Idempotent Perl Obfuscator

I wrote a perl obfuscator back in 2002 for use on a project at work. Jokes about perl's already obfuscated syntax aside, we had evaluated a number of alternatives including perlcc and perl2exe, but both failed to meet our needs for reasons I can't remember some 5 years later. Source filtering hacks such as Acme::Bleach were ruled out because they are trivially reversible.

Anyway, I finally got sign-off by my manager back in May of 2006 to open source our obfuscator under the GPL. It has been up on my public CVS repository ever since then but I just now finally got around to putting together its own site. That said, it appears the state of the art (such as it is) has advanced considerably since I first developed this tool in 2002. For example, while I have not evaluated it yet, Stunnix appears to have put together a superior perl obfuscator. I suspect that if Stunnix's product had been around in 2002, I would have never had to delve into the deepest darkest recesses of TIMTOWTDI to write our own.

If I had to do it all over again, I would probably have used Syntax::Highlight::Perl::Improved for the syntax parsing rather than trying to do it myself (in perl, no less). Hairball doesn't begin to describe perl's syntax. In any event, what is done is done. And, for better or for worse, it is now open sourced for others to use or improve:
http://www.posi.net/software/kupo/

The whole ordeal really drove home the importance of simple syntax in language design. Maybe one of these days I'll get a chance to write up my experiences with perl's implementation from when I was writing a library to simplify embedding perl in C programs (hint: perlembed doesn't even begin to cut it). The lesson I walked away with from that exercise was that the face a language presents to the world, i.e. its syntax, can tell you a lot about what its implementation looks like. And I don't want to go there again.
