If you were expecting a website called WorldReady.net, but arrived at my site instead, don’t panic! You are in the right place!
I created WorldReady.net a few years ago, and as of yesterday, it now redirects to my new site — pbi18n.com. All the same blog articles and other content are here.
I started worldready.net a few years ago at a time when I was contemplating trying to build a company around web, software and mobile internationalization consulting. But then, as these things often go, I ended up going to work at Intel instead. I realized there really wasn’t much personal gain in trying to maintain a worldready.net ‘brand’ if I wasn’t going to do anything with it.
So, as of yesterday, I’ve simply set up a redirect from worldready.net to this site, where my ‘brand’ is — myself! P for Paul, B for Bennett, and i18n for everything else.
If you were searching for something and get an error page instead, never fear – just enter your query again in the search box provided, and you should get to where you wanted to go without further delay.
My sincere apologies for the inconvenience of my brute-force redirect, and I hope the major search engines will catch up with my changes soon.
After installing iOS 6.0 today, I noticed a large number of new locales in the Region Format list (Settings.app). Here is my own “diff delta” comparing the Region Format list on my phone now, with what I had copied into an Excel spreadsheet before I upgraded:
Below are just the new locales in iOS 6, not the complete list:
In the Irish language, the vowels (a, e, i, o, u) may take an acute accent (e.g. é).
People in Ireland are complaining that if they send SMS in Irish, they are being charged up to 3 times as much compared to writing their message in English. Effectively, a usage tax on language choice?
This is a fact of life with SMS messaging. You might be surprised to learn that SMS uses its own extremely limited character set, called GSM 03.38, and it’s not even compatible with ISO-8859-1. Anything you type in a text message that is not in the GSM 03.38 character set will cause the message to be encoded in UCS-2. There are a couple of big problems with this:
UCS-2 is kind-of-Unicode — an extremely old and outdated kind, which really has no merit these days. If you look around on the internet, many people seem to think that UCS-2 and UTF-16 are the same thing because they are both 16-bit, but they are absolutely not the same standard. UCS-2 does not support all of Unicode. UTF-16 does. Because of this confusion, some people think that an SMS message containing a character not encoded in GSM 03.38 will cause the message to be encoded in UTF-16, but this is incorrect. UCS-2 will be used.
The SMS spec restricts one message to 140 octets. So you can probably guess what happens: if your message is encoded in GSM 03.38, you can type up to 160 characters in a single message (GSM 03.38 is a 7-bit character set). But if your message gets encoded in UCS-2 because you typed one character that’s not in the GSM charset, now you can only fit 70 16-bit characters into one SMS message. As a user, you’re not going to know this, of course!
Here’s the catch: the 7-bit GSM 03.38 character set includes é, but it doesn’t include any other vowel (a, i, o or u) with an acute accent. Therefore, sending 160 é chars in a single SMS message is OK, but replace one é with another accented vowel (e.g. ú) and the entire message is encoded as 16-bit UCS-2, with the result that the message must now be split into three messages – 70 chars + 70 chars + 20 chars to accommodate all 160 characters – at 3x the cost to the user.
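The segment arithmetic above can be sketched in a few lines. This is a simplified illustration: `GSM_BASIC` below is a partial rendering of the GSM 03.38 basic character table, not the full specification, and the segment sizes ignore the per-segment header overhead real carriers apply to concatenated messages.

```python
# Sketch: estimate SMS segment count based on whether every character
# fits in the GSM 03.38 basic set. GSM_BASIC is illustrative, not the
# complete 03.38 table.
GSM_BASIC = set(
    "@£$¥èéùìòÇØøÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§¿abcdefghijklmnopqrstuvwxyzäöñüà\n\r"
)

def sms_segments(message: str) -> int:
    if all(ch in GSM_BASIC for ch in message):
        per_segment = 160   # 7-bit GSM 03.38 encoding
    else:
        per_segment = 70    # 16-bit UCS-2 encoding
    return -(-len(message) // per_segment)  # ceiling division

print(sms_segments("é" * 160))          # 1 segment: é is in GSM 03.38
print(sms_segments("é" * 159 + "ú"))    # 3 segments: ú forces UCS-2
```

A single ú turns a one-segment message into three, exactly the cost trap described above.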
Very interesting EU report: “Europeans and their Languages” 2012 edition.
46% of Europeans are monolingual, 19% are bilingual, 25% are trilingual and 10% speak 4 or more languages.
Just 2% of Europeans believe their children don’t need to learn a second language.
The ability to speak at least two languages has actually decreased in Eastern Europe. Five countries are called out in particular:
Slovakia (-17 percentage points to 80%)
the Czech Republic (-12 percentage points to 49%)
Bulgaria (-11 percentage points to 48%)
Poland (-7 percentage points to 50%)
Hungary (-7 percentage points to 35%)
Several countries are achieving trilingual ability (trilingual defined as national language + English + another language): Luxembourg (84%), the Netherlands (77%), Slovenia (67%), Malta (59%), Denmark (58%), Latvia (54%), Lithuania (52%) and Estonia (52%)
English is much more likely to be cited by respondents as the first (i.e. most fluent) foreign language spoken (32%) than the second (11%) or third (3%)
ICANN revealed the applications received for new generic top-level domains (gTLDs) this week, which will be equivalent to .com, .net, etc. Among them are several internationalized domain names (IDN). Very interesting to see Google and Amazon attempting to own several, at $185,000 USD each:
For convenience/added clarification I have added the English translations in parentheses after each domain.
All web browsers pass language information to web servers when requesting content. If your website makes content available to users in a variety of languages, this information can be read by code running on the server and used to return content in the most appropriate language. Many websites (e.g. Google) use this information.
I was searching around on the internet for a test page that would show this raw data to me. I couldn’t quite find what I was looking for, so I wrote my own. Feel free to use it yourself to see what your browser is telling web services about your language preferences.
Here is an example of the data one web browser (IE9) sent to my web server:
In this case, the browser is telling the server the following:
My most preferred language is US English – give me that content if you have it.
If you don’t have US English, ok, just give me generic English if you have it.
If you don’t have English at all, that’s fine, give me Swiss German if you have it.
Don’t have Swiss German either? OK, give me generic German.
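The preference list above comes from the `Accept-Language` request header. Here is a minimal sketch of parsing it into an ordered preference list; the header string is a hypothetical reconstruction matching the IE9 example described above (US English, generic English, Swiss German, generic German), and the q-values are illustrative.

```python
# Sketch: parse an Accept-Language header into (tag, q) pairs,
# lowercasing tags (browser capitalization is inconsistent) and
# sorting by q-value, highest preference first.
def parse_accept_language(header: str):
    prefs = []
    for part in header.split(","):
        piece = part.strip()
        if ";q=" in piece:
            tag, q = piece.split(";q=")
            prefs.append((tag.strip().lower(), float(q)))
        else:
            prefs.append((piece.lower(), 1.0))  # no q means q=1.0
    return sorted(prefs, key=lambda p: p[1], reverse=True)

print(parse_accept_language("en-US,en;q=0.8,de-CH;q=0.5,de;q=0.3"))
# [('en-us', 1.0), ('en', 0.8), ('de-ch', 0.5), ('de', 0.3)]
```

A production parser should also tolerate malformed q-values and empty headers, for the reasons covered below.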
Things To Watch Out For
In the process of creating my test page I was reminded that, while all browsers send language information to the server, there are differences in implementation from one browser to the next that can make developers’ lives difficult:
Language list: The supported language list varies significantly from one browser to another, both in the number of languages and variants each browser lets users choose from and in how some of the language IDs are implemented, even for major languages. Chinese is certainly a major world language, but some of the “newer” RFC 5646 variants are not implemented uniformly. Here are the Chinese varieties the current browsers support today:
As you can see, IE “wins” here: it is the only browser that supports the ISO 15924 script identifiers Hans (Simplified Han) and Hant (Traditional Han) added in RFC 5646 within the past couple of years, and the only one to support Chinese (Macao SAR). Chrome’s options were surprisingly limited. You can find this kind of variation or inconsistent browser support for other languages as well.
Capitalization is inconsistent: You’ll probably want to lowercase everything when handling these values in code, for consistency.
The “q” factor is inconsistent: Probably not a big deal, since it should not change the overall order of preference, but if you do use it, I found Chrome to be slightly “off” in its weight values compared to IE and Firefox.
IE user-specified values: Internet Explorer allows users to specify their own unique value, so you may see some strange values appearing. While it may be garbage, it also may be a serious attempt by the user to overcome a limitation in the available predefined list. Firefox/Chrome just allow users to choose from predefined lists.
No data: The string you receive from the browser may be blank if the user has removed all languages. Internet Explorer has always allowed users to do this, and Firefox also allows it. Chrome does not, as it ties these settings into the language settings for the browser UI itself.
The Bottom Line
When you’re writing your case statement to parse the browser language string, make sure you cover the variants as well. Don’t assume all Chinese-speaking users will send you zh-CN or zh-TW. Decide what to do with the neutral zh. And so on. The bottom line is that a proper case statement is going to be more complicated than you may have initially believed.
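One way to sketch that case statement for Chinese is a fallback table. The mappings below are illustrative choices, not a standard: which localization a neutral `zh` or a `zh-MO` user should receive is a product decision.

```python
# Sketch: map incoming Chinese language tags (lowercased) to a supported
# localization, covering RFC 5646 script subtags (Hans/Hant) as well as
# older region-only tags. Fallback targets here are illustrative.
ZH_FALLBACKS = {
    "zh-hans": "zh-CN", "zh-hans-cn": "zh-CN", "zh-hans-sg": "zh-CN",
    "zh-hant": "zh-TW", "zh-hant-tw": "zh-TW", "zh-hant-hk": "zh-HK",
    "zh-cn": "zh-CN", "zh-sg": "zh-CN",
    "zh-tw": "zh-TW", "zh-hk": "zh-HK", "zh-mo": "zh-HK",
    "zh": "zh-CN",  # decide explicitly what neutral zh means for you
}

def resolve_chinese(tag: str) -> str:
    # Lowercase first: browser capitalization is inconsistent.
    return ZH_FALLBACKS.get(tag.lower(), "zh-CN")

print(resolve_chinese("zh-Hant"))  # zh-TW
print(resolve_chinese("ZH-MO"))    # zh-HK
```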
I hope you found this little guide useful. Let me know in the comments.
Your users are international. They deserve to experience your software in a way which makes sense to them. Just as a US user has a right to expect that the date displayed on your website is in US format (month-day-year), a UK user has the right to see that same date displayed in day-month-year format. This often leads to confusion: is 11/12/2010 November 12th or December 11th? There is simply no way to know, and both the US and UK user visiting your site will form their own interpretation. Only one will be right. And I feel like I’m stating the obvious – bottom line, you should know this already, and you should want to do the right thing.
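The 11/12/2010 ambiguity is easy to demonstrate: the same string parses to two different dates depending on which locale convention you assume.

```python
from datetime import datetime

# The identical string yields different dates depending on the
# format convention assumed when parsing:
us = datetime.strptime("11/12/2010", "%m/%d/%Y")  # US: month first
uk = datetime.strptime("11/12/2010", "%d/%m/%Y")  # UK: day first

print(us.strftime("%B %d"))  # November 12
print(uk.strftime("%B %d"))  # December 11
```

The safe fix is to never guess: either render dates with a locale-aware formatter keyed to the user’s preference, or use an unambiguous form such as ISO 8601 (2010-11-12).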
Today, Google launched their new “Google Instant” search service. It’s slowly rolling out to users – I’m still seeing the old Google homepage, for example – but unlike many Google product launches, this one is not rolling out to all users worldwide.
According to the launch information, it’s now rolling out to only a few locations internationally:
Users in other markets will apparently be able to access Google Instant, but only if they are signed in with a Google account – in other words, it won’t appear to most users. Google says they will be expanding into other locations and languages over a period of several months.
The slow roll out is hardly surprising. In fact, this type of functionality – search as you type – is very challenging from an internationalization perspective. Search results that update with each letter and word entered work reasonably well for languages such as English and other Western European languages. Other languages are much more complex:
Languages which don’t put spaces between words (e.g. Japanese, Simplified & Traditional Chinese)
Languages which require an Input Method Editor to compose text (Japanese, Simplified & Traditional Chinese, Korean, etc)
Complex script languages where the character already typed may change based on subsequent characters (Arabic, Persian, Urdu, Thai, etc)
Bidirectional languages in cases where a user may type one word in Arabic (right-to-left) followed by another word in English (left-to-right) followed by another in Arabic (right-to-left again) – the search as typed would need to follow the logical order while the string typed was displayed in visual order.
It will be very interesting to watch if Google tackles these languages for its Instant feature, and see how they attempt to solve this very challenging internationalization problem.
Before you start localizing software or a web application for real, it’s necessary to go through a testing process called pseudo-localization (fake translation).
What does pseudo-localization look like? Here are a few examples of an HTML button – first unlocalized, and then containing pseudo-localized strings:
Pseudo-localized strings commonly have the following properties: they increase the length of a string, typically by 30% on average; they add delimiters to the beginning and end of each string; and they may add characters from other languages. Using this pseudo-localization technique, it is possible to discover several internationalization issues, and we’ll cover each of them in this article.
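A minimal pseudo-localization transform, combining the three properties just listed, might look like this sketch. The delimiter characters, padding character and accented-character mapping are illustrative choices; real tooling typically offers more variety.

```python
# Sketch: pad the string ~30%, wrap it in delimiters so truncation and
# concatenation are visible, and swap in accented look-alikes so
# encoding/font problems surface immediately.
ACCENTED = str.maketrans("aeiouAEIOU", "àéîöüÀÉÎÖÜ")

def pseudo_localize(s: str) -> str:
    padded = s + "~" * max(1, round(len(s) * 0.3))  # ~30% expansion
    return "[" + padded.translate(ACCENTED) + "]"

print(pseudo_localize("Save"))    # [Sàvé~]
print(pseudo_localize("Cancel"))  # [Càncél~~]
```

If the UI shows `[Sàvé` with the closing bracket cut off, you have a truncation bug; if `[Click][here]` renders as two fragments, you have a string-fragmentation bug.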
Goals of pseudo-localization
Pseudo-localization has four main goals:
To ensure that your build environment is capable of producing localized builds as well as the English build.
To verify that the developer has created code that is localizable.
To verify that the user interface is capable of containing the translated strings.
To verify the strings are displayed without corruption.
Let’s look at each of these in turn:
There are multiple strategies for creating localized software, but all of them require work by a build engineer. Pre- or post-compile localization, language packs, Windows MUI (multilingual user interface), creation of bundled resource packages for translation, etc., all require thought and clear decisions to be made and communicated between the localization team, build engineers and developers. Will the localization tools integrate with the build environment? If so, how? Where should developers place the resource files they create? Where should translated files be placed so the build process can pick them up and create localized versions of the software? The pseudo-localization process can – and should – exercise all of these issues before the user interface coding is finalized, and issues found should be addressed before translation begins.
Ideally, the pseudo-localization process should be built into your daily (or nightly) build process. This is the single best way to know whether your entire project’s codebase and build environment is ready for localization or not.
Typically, this means that resources that should be translated – strings, images, dialogs, etc – have been extracted from the business logic of the application and placed in separate resource files. Pseudo-localization tests will find these resource files, translate them and return pseudo-localized files for the build process to consume when it creates localized builds. It’s quite common for developers to miss some strings here and there and leave them hardcoded (unavailable to localization). By pseudo-localizing all of the strings available to localizers, you can quickly identify areas which may have been overlooked. Apart from the visible UI, less obvious strings such as error messages need to be translated and care should be taken to fully test the scenarios which can result in strings or other resources being exposed to the user.
User Interface design
Even after the developer extracts strings into resource files for localization, problems can be found in the user interface (UI). The length of translated strings will vary, sometimes considerably, from one language to the next. If the UI design fails to accommodate those changes in string lengths in pseudo-localized builds, you can bet that you’ll run into problems with UI layout during the real localization stage. While it may be possible in some cases for the localization team to adjust widths, layout, etc., as much as possible you should be looking for ways to automatically adjust the layout to accommodate the strings.
Apart from lengthening the string, adding delimiters to the beginning and end of the lengthened string makes it easier to detect truncation issues – where a string is cut off before the end of the string. Delimiters surrounding each string are also very beneficial in identifying issues with concatenated strings – strings combined with variable values (e.g. “There are ” + fileCount + ” files in folder ” + folderName + “.”), which can be exceptionally difficult to translate. Delimiters also identify artificially fragmented strings (e.g. “Click” and “here” existing as two separate strings in the resource file but output in the UI as one element) – once again, such strings can be exceptionally difficult to translate.
As discussed above, lengthening strings and adding delimiters to the beginning and end of each string will identify many issues, but pseudo-localization also gives you an opportunity to test for character encoding issues. By inserting characters from other languages into the pseudo-localized strings, you will be able to identify whether the code is capable of handling localized resources without corruption or font issues.
Note that you will only be able to test the UI display for character encodings using this method – it will still be necessary to perform additional testing on the product for input, storage and retrieval of non-ASCII data, CRUD operations and so on.
Weaknesses in Pseudo-localization
Although pseudo-localization is highly beneficial in most respects, you should not expect to identify 100% of issues that real translation will run into. In particular, one weakness of pseudo-localization is the degree to which you increase the length of strings. Obviously, some languages tend to use longer words and sentences than others. Finnish, Russian and German often tend to be the longest, while other languages, such as Chinese, may be shorter than the English version.
Earlier, I mentioned that a 30% increase in string length was a reasonable average – however, in some cases 30% won’t be enough. In my experience, 30% is OK for a sentence or paragraph of moderate length, but you may encounter difficulties with really short strings of just 1 or 2 words – in some situations you may find that the translated string is longer than the English string by 100% or even more.
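One way to act on that observation is to make the padding factor depend on string length rather than using a flat 30%. The thresholds and factors below are illustrative, drawn from the rough figures above, not measured data.

```python
# Sketch: length-dependent expansion factor for pseudo-localization.
# Very short strings can double (or more) in translation, while long
# paragraphs average out around +30%.
def expansion_factor(source_len: int) -> float:
    if source_len <= 10:
        return 2.0   # 1-2 word strings: allow 100% growth
    if source_len <= 30:
        return 1.6   # short phrases: allow 60% growth
    return 1.3       # sentences and paragraphs: ~30% growth

print(expansion_factor(4))    # 2.0
print(expansion_factor(200))  # 1.3
```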
A Role for Machine Translation?
Machine Translation (MT) provides an ability to instantly translate content into a variety of target languages. While the translation is by no means perfect, I believe there is some benefit to adding MT into the pseudo-localization process. While you will still need to implement all the pseudo-localization techniques mentioned above, it may be beneficial to add an MT pseudo-localized test pass as well. By using MT, we can improve the ability of pseudo-localization to discover cases where a 30% padding of string length is insufficient.
At first glance you may think this is a crazy question. Everyone knows how numbers are written, right? We learned in school – 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 … and so on: 100, 1,000, 100,000, 1,000,000. Well, what you learned in school was correct for your particular world view – your culture, your locale. But other cultures have different writing schemes for numbers.
Regardless of whether you translate your product into a different language, you always need to remember that individual cultures format numbers differently, even if they speak English. This is especially clear in date formats such as dd/mm/yyyy (UK, Ireland, etc.), mm/dd/yyyy (USA) or yyyy/mm/dd (South Africa). But even for regular numbers, different formats exist. Decimal separators and thousands separators vary. Digit groupings vary: India, for example, separates the first three digits and then every two digits after that. (For example: 12,34,56,789.00)
Various other formats also exist around the world, including some which swap the meaning of the comma and dot. (For example: Germany would write the number as 123.456.789,00)
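The two conventions just described can be sketched by hand. (Python’s `locale` module can do this too, but it depends on which locales are installed on the host system, so a manual sketch is more portable for illustration.)

```python
# Sketch: Indian digit grouping (last three digits, then pairs)
# and German grouping (dot as thousands separator).
def group_indian(n: int) -> str:
    s = str(n)
    if len(s) <= 3:
        return s
    head, tail = s[:-3], s[-3:]
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    if head:
        groups.insert(0, head)
    return ",".join(groups + [tail])

def group_german(n: int) -> str:
    grouped = format(n, ",")           # 123,456,789
    return grouped.replace(",", ".")   # 123.456.789

print(group_indian(123456789) + ".00")  # 12,34,56,789.00
print(group_german(123456789) + ",00")  # 123.456.789,00
```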
Let me take the same number (123456789 + two decimal places) and format it correctly for three different cultures:
Which is correct for you?
So, a correctly internationalized application should be able to display a number correctly formatted for a user’s preferred culture. What does this have to do with calculators?
World’s first “localized” calculator?
Calculators are hardly a recent invention – the all-electronic pocket calculator revolution started just about 40 years ago. I was somewhat surprised, therefore, to learn that Casio has only just created the first calculator, the DJ-120D, that can group numbers using Indian digit grouping. It can also be set to use a comma as the decimal symbol and periods to separate the digit groups.
In other words, Casio just created an internationalized, world-ready pocket calculator. And it only took them 40 years?
I wonder how much longer before cultures who use number systems other than 0-9 will get a calculator that works for them?