Friday, July 16, 2010

HTML encoding of form inputs

I suppose this is common knowledge amongst professional web developers, but I just discovered that if a user enters characters into an HTML form input that are not representable in the character set of the page containing the form, browsers will HTML-encode the non-representable characters when the form is submitted. I just spent over an hour helping a coworker track down a bug in one of our web applications that was due to this poorly documented -- but reasonable -- behavior.

I say "reasonable" because, as obscure as it is, this is really the best thing I think a browser can do given the situation.

To recap, here is the scenario:
  • You have a web page with a form in it that is served using some locale-specific encoding. In our case it was Shift-JIS, but the default ISO8859-1 encoding leads to the same problem.
  • The user enters text into a form input field that is not representable in the displayed page's character set or encoding. For example, entering Cyrillic characters into a form displayed on an ISO8859-1 page.
  • When the user submits the form, the browser tries to convert the inputs to the encoding of the page. Any character not representable in that page's character set or encoding has its Unicode code point encoded as an HTML numeric character reference (e.g. &#452; for the character DŽ).
  • The web application or CGI receiving this input needs to a) know the character encoding of the page that was used to submit the form data so it knows how to interpret the data as characters and b) be prepared to convert any embedded HTML numeric character references back to their corresponding characters.
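
As a rough illustration of steps (a) and (b), here is a minimal Python sketch. The function name decode_form_fields and the sample field name are made up for this example; it assumes the application already knows which encoding the form's page was served in, and it uses only the standard library's urllib.parse.parse_qs and html.unescape.

```python
import html
from urllib.parse import parse_qs


def decode_form_fields(raw_body: bytes, page_encoding: str) -> dict:
    """Decode a URL-encoded form body submitted from a page in page_encoding.

    decode_form_fields is a hypothetical helper, not part of any library.
    """
    # (a) Percent-decode the fields and interpret the resulting bytes in the
    # encoding of the page the form was served in.
    fields = parse_qs(
        raw_body.decode("ascii"),
        keep_blank_values=True,
        encoding=page_encoding,
    )
    # (b) Convert numeric character references such as "&#1055;" back into
    # the characters they stand for.
    return {
        name: [html.unescape(value) for value in values]
        for name, values in fields.items()
    }


if __name__ == "__main__":
    # "café" followed by two Cyrillic letters, as a browser might submit it
    # from an ISO8859-1 page.
    body = b"name=caf%E9+%26%231055%3B%26%231088%3B"
    print(decode_form_fields(body, "iso-8859-1"))
```

One caveat with this approach: html.unescape also decodes named references like &amp;, and the server has no way to tell a reference generated by the browser from one the user literally typed into the field.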

I like that last part where web applications (or CGIs) have to know the encoding of an HTML page served to the client in order to be able to properly parse input from that client. This fact shatters any remaining fantasies I had of HTTP being stateless.

Anyway, the real surprise is that a web application or CGI needs to be prepared to decode HTML character references in form input. A quick check of perl's CGI.pm and python's cgi module indicates that neither of them does this decoding automatically. And considering that information on the web regarding this behavior is sparse, I suspect that most web developers are unaware of it. At the time of writing, I can only find two references [1][2] that document HTML character reference encoding in the scenario described above.

Luckily, there is a really simple solution: always serve pages in UTF8 encoding and always expect form input to be in UTF8 encoding. One of the many great things about UTF8 encoding is that all characters are representable, so you never have to worry about the browser resorting to HTML character reference encoding.
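
To see why UTF8 sidesteps the problem, the browser's fallback can be imitated in Python with the "xmlcharrefreplace" error handler of str.encode. This is only an approximation of what browsers do, and encode_like_browser is a name invented for this sketch.

```python
def encode_like_browser(text: str, page_encoding: str) -> bytes:
    """Roughly imitate how a browser encodes form input for submission.

    Hypothetical helper: characters that page_encoding cannot represent are
    replaced with decimal numeric character references.
    """
    return text.encode(page_encoding, errors="xmlcharrefreplace")


if __name__ == "__main__":
    cyrillic = "\u041f\u0440"  # two Cyrillic letters
    # On an ISO8859-1 page the letters cannot be represented, so they come
    # out as character references.
    print(encode_like_browser(cyrillic, "iso-8859-1"))
    # On a UTF8 page every character is representable, so no references
    # are ever needed.
    print(encode_like_browser(cyrillic, "utf-8"))
```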

Monday, July 12, 2010

Sharp pointy sticks

This past Saturday my wife and I bought recurve bows with the intent to make a hobby out of archery. It turns out there are a number of parks with free-to-the-public archery ranges in the Bay Area. A couple of the ranges we have found are:
We actually got started a couple of weeks ago when we went for a free lesson from the Kings Mountain Archery Club. That was a lot of fun and our instructors were very friendly, helpful, and patient. They give free two-hour introductory lessons about once a month; information about how to sign up is available on their web site.

We picked up our bows and arrows at a shop intimidatingly called Predator's Archery down in Gilroy. Much like our experience with Kings Mountain Archery, the staff at Predator's Archery were pretty friendly and helpful and they offer a great "starter package" which includes everything you need as well as 5 lessons.

Friday, July 2, 2010

In other news: Subversion still sucks

OK, I've had 4 months to get used to Subversion now. And it is growing on me. Or perhaps it is Stockholm Syndrome. But there are still a lot of annoyances.

After being bashed as a "troll" by one of Subversion's authors for daring to suggest it wasn't all ponies and rainbows, I thought I would check to see if others were sharing my pain transitioning from CVS to Subversion.

Not surprisingly, they were. I found a wonderful summary of all the frustrations I've been experiencing, thoughtfully compiled by no less than David O'Brien of the FreeBSD community.

I would add to his list, as my friend John pointed out in comments to my previous post, that it is really annoying to have to depend on external tools (ironically, CVS) to see commits across branches.

Thursday, July 1, 2010

Serving file downloads with non-ASCII filenames

Recently, while helping out one of my coworkers, it came to my attention that there is no universally agreed-upon way to download a file to a web browser while suggesting a filename that contains non-ASCII characters.

The common way to tell a browser to download a file (rather than try to display it in-browser) is to include a Content-Disposition header in the HTTP response; the header's value should be "attachment". Additionally, the server can include a filename parameter in the Content-Disposition header as a suggestion to the browser for what filename to save the file as.

As a bit of history, the Content-Disposition header was originally defined in RFC 1806 which was obsoleted and replaced by RFC 2183. However, the Content-Disposition header was originally defined for use in MIME messages and, while RFC 2616 (HTTP 1.1) makes reference to the Content-Disposition header, it does so only to note that:
Content-Disposition is not part of the HTTP standard, but since it is widely implemented, we are documenting its use and risks for implementors.

Luckily, while not officially standardized for use in HTTP, the Content-Disposition header is "widely implemented" indeed; it seems that all modern browsers implement the header. If the web server responds to a request with application/octet-stream data and a Content-Disposition header of "attachment", your browser will display the familiar "Save As..." dialog. If the server included a filename parameter in that Content-Disposition header, your browser will likely pre-fill the filename input field of the "Save As..." dialog with the specified filename.

But here is where things start getting murky.

RFC 2183 skirts the issue of international filenames by disclaiming responsibility:
Current [RFC 2045] grammar restricts parameter values (and hence Content-Disposition filenames) to US-ASCII. We recognize the great desirability of allowing arbitrary character sets in filenames, but it is beyond the scope of this document to define the necessary mechanisms.

So, as long as the downloaded files' names are always representable in the ASCII character set, any browser should properly display the filename (although I've seen rumors that some browsers, such as IE, do enforce a limit on the length of the filename). However, I work at a Japanese company, making products largely for the Japanese market, so we don't have the privilege of assuming the whole world is ASCII.

By the way, in case you are curious, even iso-8859-1 (latin1) isn't consistently supported across browsers, so Europeans are left high-and-dry too.

You are probably thinking, like I was, that surely this is a solved problem. And actually, it is. Kind of. The Content-Disposition header originates with the MIME protocol which, since the publication of RFC 2231 in 1997, now supports non-ASCII character encodings for header values. So, for example, the filename "foo-ä.html" can be represented in the Content-Disposition header like so:
Content-Disposition: attachment; filename*=UTF-8''foo-%c3%a4.html
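
For what it's worth, producing that extended parameter is straightforward. Below is a small Python sketch; rfc2231_filename_param is a name I made up, and it leans on urllib.parse.quote to percent-encode the UTF8 bytes of the filename. Note that quote emits uppercase hex digits (%C3%A4 rather than %c3%a4); the two forms are equivalent for percent-decoding purposes.

```python
from urllib.parse import quote


def rfc2231_filename_param(filename: str) -> str:
    """Build an RFC 2231-style extended filename parameter.

    Hypothetical helper for illustration only.
    """
    # safe="" makes quote() percent-encode everything except ASCII letters,
    # digits, and "_.-~"; non-ASCII characters are encoded as UTF-8 bytes.
    encoded = quote(filename, safe="", encoding="utf-8")
    # The empty field between the two apostrophes is the optional language tag.
    return "filename*=UTF-8''" + encoded


if __name__ == "__main__":
    print("Content-Disposition: attachment; " + rfc2231_filename_param("foo-\u00e4.html"))
```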


The problem is that few browsers actually implement this RFC 2231 syntax. Firefox 3.6 and Opera 10 appear to support it. For Internet Explorer, on the other hand, Microsoft's developers chose to simply perform URL-style percent-decoding and then interpret the result as bytes of UTF8-encoded characters. So a server would need to send the Content-Disposition header as
Content-Disposition: attachment; filename="foo-%c3%a4.html"
for an MSIE user to see "foo-ä.html" in the "Save As..." dialog.

Despite requests from IETF working group members to fix it, Google's Chrome browser also does not comply with RFC 2231, preferring to follow Microsoft's lead and use simple URL-style percent decoding.

As a result, there is no consistent cross-browser way to suggest a non-ASCII filename for a file download. I'm sure it doesn't help that the Content-Disposition header has never formally been part of the HTTP specification, yet it is used by all major browsers to implement file download functionality.

Julian Reschke has compiled a test suite and publishes a nifty page illustrating all of the incompatibilities between browsers regarding handling of the Content-Disposition header. In addition, as part of the IETF Network Working Group, he is working on an RFC to formally define the interpretation of the Content-Disposition header in the HTTP context.

Unfortunately, because the ambiguity has been left unresolved for so long, some web servers have adopted the MSIE/Chrome encoding technique for their non-ASCII filenames. Actually, my gut feeling is that probably most have, although I don't have any hard numbers to back up that claim. The good news is that the MSIE/Chrome encoding is only used for parameters in the form filename="...", while the RFC 2231-style encoding used by Firefox, Opera, and Julian's proposal uses filename*=... (note the presence of the * in the RFC 2231 format to differentiate it), so it is possible for the two to coexist in the same Content-Disposition header.

In fact, probably the most important section of Julian's proposal is section 4.2, where he defines the HTTP client's behavior when the server responds with both filename=... and filename*=..., allowing for an easy upgrade path for MSIE and Chrome.

For now, however, Julian's test results show that when presented both traditional and extended formats, only Firefox and Opera will select the extended filename*=... format.

This opens an opportunity for those of us that need to serve file downloads containing non-ASCII filenames: we can include the filename in the non-standard encoding supported by MSIE and Chrome first in the Content-Disposition header, followed by the filename in extended RFC 2231 encoding. According to Julian's tests, MSIE and Chrome will always take the first parameter while Firefox and Opera will properly select the extended-syntax parameter, no matter what order it appears in.

For example, if the server includes the header:
Content-Disposition: attachment; filename="foo-%c3%a4.html"; filename*=UTF-8''foo-%c3%a4.html
all four major browsers should properly display the filename "foo-ä.html" in the "Save As..." dialog. Unfortunately, WebKit-based browsers, like Apple's Safari browser, would display the raw percent-encoded value "foo-%c3%a4.html" as the filename. At least for now, though, I'm afraid this is the best we can do.
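
For anyone wanting to generate such a header, here is a minimal Python sketch of the workaround. The function name content_disposition_value is my own invention, and the sketch assumes the filename is percent-encoded as UTF8 for both parameters, with the parameters separated by semicolons. The hex digits come out uppercase (%C3%A4), which is equivalent to the lowercase form shown above for percent-decoding purposes.

```python
from urllib.parse import quote


def content_disposition_value(filename: str) -> str:
    """Build a Content-Disposition value carrying both filename forms.

    Hypothetical helper for illustration only.
    """
    # Percent-encode the UTF-8 bytes of the name; safe="" leaves only ASCII
    # letters, digits, and "_.-~" unencoded.
    encoded = quote(filename, safe="", encoding="utf-8")
    # The quoted parameter goes first for browsers that simply percent-decode
    # it; the extended filename* parameter follows for browsers that
    # understand the RFC 2231 syntax.
    return f"attachment; filename=\"{encoded}\"; filename*=UTF-8''{encoded}"


if __name__ == "__main__":
    print("Content-Disposition: " + content_disposition_value("foo-\u00e4.html"))
```

Browsers that do neither kind of decoding will still show the raw percent-encoded name, so this does nothing for the Safari case mentioned above.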