i18n and social web: We still haven't figured it out
Published by Martin Kleppmann on 24 Nov 2008.
Internationalisation (i18n for short, where 18 represents the 18 letters in the middle of this long
word) is still an unsolved problem.
A lot of things fall under i18n, such as correct handling of
character sets (not everybody uses the Latin alphabet), time zones (not everybody uses PST or GMT),
numbers (not everybody uses the dot as decimal separator), currencies (there is a world outside the
US dollar), writing direction (not everybody writes from left to right) and of course translation
into different languages. Traditionally, i18n has been an issue which software engineers have loved
to ignore, because (a) it’s difficult, (b) it’s not cool, and (c) if you’re in North America, you
can find enough customers in North America for the first few years, so there isn’t a strong business
requirement to work internationally.
Now that we have the web, and people from many different
languages and cultures interacting on the web, getting internationalisation right is absolutely
essential. And it’s getting better – I think most developers now appreciate that you’ve got to use
Unicode, that you’ve got to store dates and times with timezones, that you need to make text
translatable. OK, at least that’s something. But I think that soon we will get to a point where the
traditional approach to translating applications breaks down.
Let me explain.
The way software
engineers usually make their applications work in multiple languages is to mark every human-readable
bit of text in the program in some particular way, to extract those bits of text, give them to a
translation bureau, and get back an equivalent set of texts to substitute in-place. The result
should hopefully be that the application appears entirely in a new language.
Even this simple form
of translation isn’t working everywhere. For example, if you update your Twitter status in a German
web browser, it says that you updated “weniger als 5 Sekunden ago”. Here “ago” is an English word,
but the rest is German; to a German-speaker, this reads like “vor less than 5 seconds” reads to an
English-speaker. This is simply a case of some of those bits of text getting forgotten; not a big
problem, and I’m sure they will fix it soon, but enough to make German-speakers uncomfortable using
the application because it’s constantly offending against their sense of language.
However, things
get harder once you start building sentences based on variables which may change. The Twitter
message is a simple example of this: the time is a variable, and because it’s in the past (not in
the future), we need to stick “ago” on the end, in English at least. The code which does this will
look something like this:
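A minimal sketch in plain Ruby follows. In Rails, distance_of_time_in_words is a view helper; the simplified stand-in below is an assumption so that the example runs on its own, and time_ago is a hypothetical name.

```ruby
# Simplified, hypothetical stand-in for Rails' distance_of_time_in_words helper,
# so that this sketch runs without Rails. It only handles seconds and minutes.
def distance_of_time_in_words(from_time, to_time)
  seconds = (to_time - from_time).abs.round
  if seconds < 5
    "less than 5 seconds"
  elsif seconds < 60
    "less than a minute"
  else
    minutes = seconds / 60
    minutes == 1 ? "1 minute" : "#{minutes} minutes"
  end
end

# Hypothetical helper: describe a past moment relative to now.
def time_ago(time, now = Time.now)
  time_in_words = distance_of_time_in_words(time, now)
  # English word order is baked in here: the variable first, then "ago".
  "#{time_in_words} ago"
end

now = Time.now
puts time_ago(now - 3, now)
```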
To translate this program, make sure that the function
distance_of_time_in_words returns words in the right language (e.g. for German, “weniger
als 5 Sekunden” – Twitter does this correctly), and replace
"#{time_in_words} ago" with the right grammatical construction for the target language
(e.g. for German, "vor #{time_in_words}" – this is the bit Twitter has missed out).
Because time_in_words is used as a variable, we can insert words before or after it to make the right
grammatical construction. Fairly flexible, but is this enough?
Over the past few days I have been developing a Facebook application. It’s not even internationalised,
it’s just plain English, but nevertheless I have come across a problem of grammatical constructions.
Take for example a message in the news feed, which might look different depending on who looks at it:
Mike has given John a pat on the shoulder because he has done well. (seen by a bystander)
Mike has given you a pat on the shoulder because you have done well. (seen by John)
You have given John a pat on the shoulder because he has done well. (seen by Mike)
First of all, note that the subject (Mike) and object (John) can be either a name,
or ‘you’. Facebook offers the facility to do this using the
<fb:name> tag; as developers, we are
even given the opportunity to say whether we want the “you” to start with a capital letter
(depending on whether it occurs at the beginning or in the middle of the sentence). Next, we use a
pronoun – “because he has done well”. This depends on the gender of John, but we know from his
profile that he is male, so it has to be “he” not “she”. Facebook does this using the
<fb:pronoun> tag.
So far, so good. Now note that the verb form changes between second and third person (have/has).
There is no way in Facebook to do this (it was requested more than a year ago but
still hasn’t happened). In case you ever
wondered why the Facebook news feed is always in the simple past but never in present perfect –
that’s because in English, the verb forms for second and third person are the same in simple past,
which is not the case in other tenses! (In “You threw a banana at Bob” and “Alice threw a banana at
you”, the verb is “threw” in both cases; in “You have thrown a banana at Bob” and “Alice has thrown
a banana at you”, the verb forms “have thrown” and “has thrown” are not the same.)
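The viewer-dependent substitutions described above can be sketched in plain Ruby. This is not Facebook's API: Person, name_for, pronoun_for, have_or_has and pat_message are all hypothetical names, and the have/has choice is written by hand because the platform tags do not cover it.

```ruby
Person = Struct.new(:name, :gender)

# Simplification for this sketch: only two pronouns are handled.
PRONOUNS = { male: "he", female: "she" }.freeze

# Name as seen by the viewer: "you" if the viewer is that person.
def name_for(person, viewer, capitalize: false)
  return person.name unless person == viewer
  capitalize ? "You" : "you"
end

# Subject pronoun for the person, as seen by the viewer.
def pronoun_for(person, viewer)
  person == viewer ? "you" : PRONOUNS.fetch(person.gender)
end

# English auxiliary: "have" for second person, "has" for third person.
def have_or_has(person, viewer)
  person == viewer ? "have" : "has"
end

def pat_message(giver, receiver, viewer)
  "#{name_for(giver, viewer, capitalize: true)} #{have_or_has(giver, viewer)} given " \
  "#{name_for(receiver, viewer)} a pat on the shoulder because " \
  "#{pronoun_for(receiver, viewer)} #{have_or_has(receiver, viewer)} done well."
end

mike = Person.new("Mike", :male)
john = Person.new("John", :male)
anna = Person.new("Anna", :female)

puts pat_message(mike, john, anna) # seen by a bystander
puts pat_message(mike, john, john) # seen by John
puts pat_message(mike, john, mike) # seen by Mike
```

Even in this small case, every helper encodes a rule of English grammar, so none of it carries over to another language unchanged.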
The part “a pat
on the shoulder” stays the same in all sentences, but that’s just because English happens to work
that way – in other languages or cultures, it may have to be modified depending on e.g. the gender
or even the age of the people involved. Or the verb (have/has given) may have to change depending on
some parameter of the direct object (‘a pat on the shoulder’). There may be languages in which the
name of a person changes if it is used as an indirect object. And so on.
You can get up to a
certain point by duplicating text, e.g. providing a separate snippet of text for each possible
combination of values which may occur; then translators can deal with each individually, and ensure
that it is correct. But very quickly you get into a situation where the number of combinations is so
large that this approach just doesn’t scale.
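To get a feel for how quickly the combinations multiply, here is a rough count in Ruby. The dimensions, their values, and the numbers of message types and languages are illustrative assumptions, not taken from any real application.

```ruby
# Hypothetical grammatical dimensions that might affect one sentence template.
DIMENSIONS = {
  viewer:          [:giver, :receiver, :bystander],
  giver_gender:    [:male, :female],
  receiver_gender: [:male, :female],
  formality:       [:informal, :formal]
}.freeze

# Every combination of values would need its own hand-translated snippet.
def combinations(dimensions)
  first, *rest = dimensions.values
  first.product(*rest)
end

# Total snippets to translate and maintain across the whole application.
def snippet_count(dimensions, message_types:, languages:)
  combinations(dimensions).size * message_types * languages
end

puts combinations(DIMENSIONS).size
puts snippet_count(DIMENSIONS, message_types: 20, languages: 10)
```

Under these assumptions a single message already has 24 variants, and 20 message types in 10 languages come to 4800 snippets. Adding one more two-valued dimension doubles both figures.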
The bottom line is that in general you can’t build a
grammatically correct sentence by just sticking words together, even if you make the word order
variable. In languages with declension and other grammatical changes to words, or languages where
the word order changes depending on some variable, or languages with separable verbs… well, I
don’t know what happens then. It’s an unsolved problem.
(There is another subtle point here, which
is that cultural differences may mean that even if you translate something in a grammatically
correct way, it may have the wrong connotations in the reader’s culture; for instance, consider a
culture in which patting on the shoulder is considered offensive. But in such cases, it could be
replaced with another gesture which has the desired meaning. I will just concentrate on the
grammatical level for now.)
The only step I’ve seen so far towards making internationalisable grammar is the ‘pluralize’
function which you find in some frameworks such as Rails. It takes a number (such as 0, 1, 2, 29 or
518514) and a singular noun (such as ‘camel’) and returns the text in the appropriate singular or
plural form (e.g. “29 camels”) – and it works in a wide selection of languages (e.g. in German,
pluralize(29, 'Kamel') == '29 Kamele').
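As a rough illustration of what a language-aware pluralize involves, here is a minimal sketch. It is not how Rails implements it: the rule table, the word-form table and the locale: parameter are assumptions made for this example, and a real system would have translators supply the forms.

```ruby
# Hypothetical per-language plural rules: each maps a count to a plural category.
PLURAL_RULES = {
  en: ->(n) { n == 1 ? :one : :other },
  de: ->(n) { n == 1 ? :one : :other },
  # Polish distinguishes "few" (ending in 2-4, except 12-14) from "many".
  pl: ->(n) {
    if n == 1
      :one
    elsif (2..4).cover?(n % 10) && !(12..14).cover?(n % 100)
      :few
    else
      :many
    end
  }
}.freeze

# Hypothetical word forms, keyed by locale, singular noun and plural category.
WORD_FORMS = {
  en: { "camel" => { one: "camel", other: "camels" } },
  de: { "Kamel" => { one: "Kamel", other: "Kamele" } },
  pl: { "wielbłąd" => { one: "wielbłąd", few: "wielbłądy", many: "wielbłądów" } }
}.freeze

def pluralize(count, singular, locale: :en)
  category = PLURAL_RULES.fetch(locale).call(count)
  form = WORD_FORMS.fetch(locale).fetch(singular).fetch(category)
  "#{count} #{form}"
end

puts pluralize(29, "camel")
puts pluralize(29, "Kamel", locale: :de)
puts pluralize(22, "wielbłąd", locale: :pl)
```

Note that a simple singular/plural switch is not enough for every language, which is why the sketch separates the rule that picks a category from the table of word forms.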
To illustrate the problem, here is one of the source files from
Bid for Wine. It is a very minor feature – it just creates a short
paragraph of text explaining the purpose of a particular form which a buyer needs to fill in.
However, there can be a lot of different things in that form, depending on the type of auction. This
generated text gets sent out by email (to both the seller and the buyer) and is also displayed on
the website, and few people will ever notice how much effort has gone into building that paragraph
of text.
And you thought English was simple?
You can even see the remains of grammatical cases (you_or_they vs. you_or_them). If this needs
to be translated one day, we will need to duplicate the logic and mess about with the way variables
are inserted into the snippets of text to make the grammar work. It will be horrible to maintain.
But that’s the best we can do with the state of the art. Shouldn’t there be a better way?