Friday, July 13, 2007

Parsing Japanese addresses

Last night Steven Bird, Ewan Klein, and Edward Loper gave a presentation about their Natural Language Toolkit (NLTK) at the monthly BayPIGgies meeting. The gist of the presentation seemed to be that their toolkit is just that: a set of basic tools commonly needed when implementing more complicated natural language processing algorithms, plus a set of corpora for training and benchmarking those algorithms. Given their background as academics, this makes sense, since it lets them quickly prototype and explore new algorithms as part of their research. However, I got the impression that a number of the attendees were hoping for more of a plug-and-play, complete natural language processing solution that they could integrate into other programs without needing to be versed in the latest research themselves.

When I get some time, I would like to try using NLTK to solve a recurring problem I encounter at work: parsing Japanese addresses. There is a commercial tool that claims to do a good job parsing Japanese postal addresses, but I've found that the following Python snippet does a pretty good job on the datasets I've been given so far:
import re

# Beware of greedy matching in the following regex lest it
# fail to split 宮城県仙台市泉区市名坂字東裏97-1 properly
# as (宮城県, None, 仙台市, 泉区, 市名坂字東裏97-1).
# In addition, we have to handle 京都府 specially since its
# name contains 都 even though it is a 府.
# (ur'...' is a raw Unicode literal in Python 2.)
_address_re = re.compile(
    ur'(京都府|.+?[都道府県])(.+郡)?(.+?[市町村])?(.+?区)?(.*)',
    re.UNICODE)

def splitJapaneseAddress(addrstr):
    """Splits a string containing a Japanese address into
    a tuple containing the prefecture (a.k.a. province),
    county, city, ward, and everything else.
    """
    m = _address_re.match(addrstr.strip())
    (province, county, city, ward, address) = m.groups()
    address = address.strip()
    # 東京都 is both a city and a prefecture.
    if province == u'東京都' and city is None:
        city = province
    return (province, county, city, ward, address)
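
To make the greedy-matching caveat in the comment concrete, here is a minimal sketch that runs the same pattern next to a variant in which every lazy `.+?` has been turned into a greedy `.+`. It is written for Python 3, so plain string literals stand in for the `ur''` prefix, and the names `LAZY_RE` and `GREEDY_RE` are made up for this illustration.

```python
import re

# The pattern from the snippet above, as a Python 3 string literal.
LAZY_RE = re.compile(
    r'(京都府|.+?[都道府県])(.+郡)?(.+?[市町村])?(.+?区)?(.*)')

# Hypothetical variant: each lazy ".+?" replaced by a greedy ".+".
GREEDY_RE = re.compile(
    r'(京都府|.+[都道府県])(.+郡)?(.+[市町村])?(.+区)?(.*)')

addr = '宮城県仙台市泉区市名坂字東裏97-1'

lazy_parts = LAZY_RE.match(addr).groups()
greedy_parts = GREEDY_RE.match(addr).groups()

print(lazy_parts)
# ('宮城県', None, '仙台市', '泉区', '市名坂字東裏97-1')

print(greedy_parts)
# ('宮城県', None, '仙台市泉区市', None, '名坂字東裏97-1')
```

With the greedy city group, the match runs on to the second 市 (the one that begins 市名坂), so the ward is swallowed into the city and the remainder loses its first character.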

I should add that, unlike with English-language addresses, it does not make sense to separate out and store the street portion of a Japanese address as its own value, since the full address string is what is commonly displayed. So even though the routine above returns the street address as the final tuple item, I never actually use that value for anything.

Anyway, as you can see, this regular expression is pretty naive. During last night's meeting I kept thinking that I should put together a corpus of Japanese addresses and their proper parses so that I can experiment with writing a better parser. The Natural Language Toolkit seems to be designed for just this kind of experimentation. I'm hoping that the next time I'm given a large dataset to import into our database at work, I can justify spending the time to apply NLTK to the task.
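
As a rough idea of what such a corpus experiment could look like, the sketch below scores the regex against a handful of hand-labeled addresses. It is plain Python 3 and does not use NLTK. The names `split_japanese_address`, `CORPUS`, and `evaluate` are hypothetical, the `None` check for non-matching input is an addition, and apart from the first address (taken from the comment in the snippet) the addresses use real place names with house numbers made up for illustration.

```python
import re

# Python 3 port of the pattern from the post.
_ADDRESS_RE = re.compile(
    r'(京都府|.+?[都道府県])(.+郡)?(.+?[市町村])?(.+?区)?(.*)')

def split_japanese_address(addrstr):
    """Return (province, county, city, ward, rest), or None if no match."""
    m = _ADDRESS_RE.match(addrstr.strip())
    if m is None:
        return None
    province, county, city, ward, address = m.groups()
    # 東京都 is both a city and a prefecture.
    if province == '東京都' and city is None:
        city = province
    return (province, county, city, ward, address.strip())

# Tiny hand-labeled corpus: (raw address, expected parse).
CORPUS = [
    ('宮城県仙台市泉区市名坂字東裏97-1',
     ('宮城県', None, '仙台市', '泉区', '市名坂字東裏97-1')),
    ('京都府京都市中京区寺町通1-1',
     ('京都府', None, '京都市', '中京区', '寺町通1-1')),
    ('東京都千代田区丸の内1-1',
     ('東京都', None, '東京都', '千代田区', '丸の内1-1')),
    ('北海道虻田郡倶知安町北1条東3',
     ('北海道', '虻田郡', '倶知安町', None, '北1条東3')),
    # A city whose name itself contains 市 trips up the lazy match.
    ('三重県四日市市諏訪町1-5',
     ('三重県', None, '四日市市', None, '諏訪町1-5')),
]

def evaluate(parser, corpus):
    """Return (exact-match accuracy, list of (raw, expected, actual))."""
    failures = []
    for raw, expected in corpus:
        actual = parser(raw)
        if actual != expected:
            failures.append((raw, expected, actual))
    accuracy = (len(corpus) - len(failures)) / len(corpus)
    return accuracy, failures

accuracy, failures = evaluate(split_japanese_address, CORPUS)
print('exact-match accuracy: %.2f' % accuracy)
for raw, expected, actual in failures:
    print(raw, 'expected', expected, 'got', actual)
```

On this toy corpus the regex gets four of the five addresses right. It splits 四日市市 after the first 市, which is exactly the kind of case a labeled corpus would surface and a better parser would need to handle.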
