Any plan for I18n / gettext / locale in std?

mgavioli · March 22, 2024, 10:04pm

Any plan to add an internationalisation package to the std library, for instance along the lines of GNU gettext?

Or would it make more sense to help (one of) the existing projects? On GitHub, there are 2 or 3 of them, but they seem to lack momentum, with just one contributor each and no work done for months. Do they know that some ‘official’ (yes, I know, I know…) work is going on in this area?

Also, is there any concept of locale in the library now? I looked in the possibly related sections, but I found nothing.

Thanks!

dee0xeed · March 22, 2024, 10:15pm

Some time ago I asked a question here, and there was no clear answer

chung-leong · March 23, 2024, 6:16am

I don’t think it belongs in std, since there are different approaches to localization depending on the languages involved. If you only need English and Spanish, then a simple approach with a dictionary file would be sufficient. On the other hand, if your software must cover Slavic languages, where rules for pluralization are totally wild (24 = plural, 25 = singular), you’d need a far more sophisticated framework like Fluent.
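To make the pluralization point concrete, here is a minimal sketch (in Python, for brevity) of plural-form selection in the style of a gettext `Plural-Forms` header. The English and Polish rules are the ones commonly listed for gettext catalogs; the names `plural_en`, `plural_pl`, `FORMS` and `ngettext_like` are made up for this illustration and are not part of any library.

```python
# Plural-form selection in the style of gettext's "Plural-Forms" header.
# All names here are hypothetical; only the two rules mirror gettext usage.

def plural_en(n):
    # gettext rule: nplurals=2; plural=(n != 1)
    return 0 if n == 1 else 1

def plural_pl(n):
    # gettext rule: nplurals=3;
    # plural=(n==1 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2)
    if n == 1:
        return 0
    if 2 <= n % 10 <= 4 and (n % 100 < 10 or n % 100 >= 20):
        return 1
    return 2

# Each language pairs its rule with one message form per plural category.
FORMS = {
    "en": (plural_en, ["{n} file", "{n} files"]),
    "pl": (plural_pl, ["{n} plik", "{n} pliki", "{n} plików"]),
}

def ngettext_like(lang, n):
    """Pick the message form for `n` using the language's plural rule."""
    rule, forms = FORMS[lang]
    return forms[rule(n)].format(n=n)

if __name__ == "__main__":
    for n in (1, 2, 5, 12, 22, 24, 25):
        print(ngettext_like("en", n), "|", ngettext_like("pl", n))
```

With a rule like this, the Polish form chosen depends on the last one or two digits of the number, not just on whether it equals 1, which is why a two-form "singular/plural" dictionary is not enough.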

mgavioli · March 23, 2024, 9:38am

Thanks for your comment: the complexities you hint at are real, but the problem is also rather well-known and largely (if not completely) solved, as packages like GNU gettext show.

If an application does not need a feature provided by the library or if it can provide a lighter ad-hoc solution, fine: it will not use the feature (which will not be linked in the executable and will not contribute to the app ‘weight’).

On the other hand, for instance, any application developed in Europe or for the European market is likely to fare poorly if it cannot be set up in a variety of languages (Germanic, Romance/Neo-Latin and/or Slavic + special cases like Finnish, Greek or Hungarian), with minimum possible limitation. Same for applications developed in India or Malaysia or Japan or Singapore (not to mention a couple of currently unmentionable countries) which are likely to require (or greatly benefit from) a variety of languages from the most disparate linguistic families.

In my humble opinion, in the 21st century, this qualifies as something a language devised today should make easily available (in a reasonably solid and general way).

But, of course, these decisions are the cross result of several, possibly conflicting, sets of priorities. So, we shall see.

mgavioli · March 23, 2024, 11:16am

It seems reasonable to deduce that on this topic no work is going on at least in the main language repo.

I even found an issue ( #893 ) about implementing locale in std library which Andrew Kelley himself closed stating: “No locale in standard library. Locale will have to be solved with a third party package.”

Even if the original PR seemed to aim for much more (possibly too much), this seems a rather surprising statement: most languages have (built-in) the concept of locale and ways to at least detect the locale of the current user / process and to set it, even if more sophisticated tasks (like locale-dependent string processing) are delegated to third-party libraries.

Avoiding a centralised concept of locale might lead to multiple competing, even conflicting, ‘inventions of the wheel’…

chung-leong · March 23, 2024, 2:38pm

Localization is far from satisfactory today for certain languages like Polish. In the end, it’s an intractable problem. Programmers can’t solve what they don’t understand. Given current tech trends, we’ll probably just throw in the towel. Some AI at the OS level will just do the translation for us, perhaps with guidance from native speakers in QA.

kristoff · March 23, 2024, 3:17pm

One problem deriving from localization is a dependency on all the Unicode metadata needed to perform things like sorting, comparing and normalizing funky strings. That’s like a good 10 MB or more IIRC.

I’m in the process of reinventing my own wheel with Zine (I’m adding initial i18n support as we type) and I would be curious to see what is a good list of features that other i18n systems implement.

For now a good starting point for building some i18n support into a Zig application is Ziglyph

mgavioli · March 23, 2024, 6:16pm

Well, well…I feel the topic is kind of spiraling out of control…

Internationalisation is not “implementing all the world-wide knowledge about everything having to do with any possible language in any possible point of space and in any possible writing system”.

It is ‘simply’ (cough cough) making an application ready to accept sets of translations for different locales, to retrieve the right set once given a specific locale (with well-defined fallback mechanism if not found) and to provide the supplied (uniquely determinable and fixed) translation for a given key string.
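As a rough illustration of that description (translation sets per locale, a well-defined fallback, a fixed translation per key), here is a minimal Python sketch. The catalog contents, the `fallback_chain` and `translate` names, and the choice of `en` as the final fallback are all assumptions made for the example, not an existing API.

```python
# Hypothetical in-memory catalogs: locale -> {key: translation}.
CATALOGS = {
    "en": {"greeting": "Hello", "farewell": "Goodbye"},
    "pt": {"greeting": "Olá", "farewell": "Adeus"},
    "pt_BR": {"farewell": "Tchau"},
}

def fallback_chain(locale, default="en"):
    """Return the locales to try, most specific first: pt_BR -> pt -> en."""
    chain = [locale]
    if "_" in locale:
        chain.append(locale.split("_", 1)[0])  # language without territory
    if default not in chain:
        chain.append(default)
    return chain

def translate(key, locale, catalogs=CATALOGS):
    """Look `key` up along the fallback chain; return the key if nothing matches."""
    for candidate in fallback_chain(locale):
        text = catalogs.get(candidate, {}).get(key)
        if text is not None:
            return text
    return key  # last resort, similar to gettext returning the msgid

if __name__ == "__main__":
    print(translate("farewell", "pt_BR"))  # found in the most specific catalog
    print(translate("greeting", "pt_BR"))  # falls back to the "pt" catalog
    print(translate("greeting", "de_DE"))  # falls back to the default catalog
```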

In addition, a reasonable i18n mechanism may provide the correct (or at least the most common) currency symbol, decimal and digit group separators for the supported locales. Where “supported locales” means that there is no need to support the Tibetan language as spoken in Andorra and written in the Egyptian demotic script, because such a combination is not actually attested.

Assuming an approximate estimate of 500 attested combinations of ISO 639 - ISO 15924 - ISO 3166, this builds up to a table of ca. 3000-4000 u16 / u8 values, hardly an impressive amount of data. And tables of this kind are readily available.

This does not include providing the actual translations (remember: of the actual literal strings present in the application) for a given locale; this is what localisation covers and the current trend is that each company / community will provide the translations for a specific application and in a specific locale via hired specialists or motivated mother-tongue users (which is definitely ‘good enough’ from the perspective of the implementation programming language).

It also does not include supporting all the possible peculiarities of any specific orthography, like upper-case/lower-case conversions (a difference which most writing systems beyond ours do not even have), which may be locale dependent. Or what happens to the German ‘ß’ when hyphenated, or even less where to hyphenate a word. All of this belongs to a level of text processing which tends to be specific to certain kinds of applications and which it is reasonable to delegate to specialised libraries when required. This is more or less the area of the ziglyph library quoted above (thanks, @kristoff!), which I had already come across and which is a very interesting and welcome endeavour in itself!

So, it seems to me, if we limit ourselves to i18n (the first 3 paragraphs above), we are speaking (or at least I was initially speaking) of something relatively bounded and of potentially general interest.

As a last, specific point, above I quoted several times the concept of locale, which is one of the foundational concepts of all these different approaches and goals; and I am a little surprised that nobody seems interested in basic support for it (to retrieve the current locale for the current user/process and to set it at least for the current process), which is not so obvious to implement in a cross-platform way…
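For the POSIX side only, a minimal sketch of "retrieve the current locale for the current process" could look like the following. It assumes the usual POSIX precedence for the message locale (`LC_ALL`, then `LC_MESSAGES`, then `LANG`) and the common `language_TERRITORY.codeset@modifier` layout; the function names are hypothetical, and other platforms expose the user locale through their own system APIs, which is where the cross-platform difficulty comes from.

```python
import os

def detect_message_locale(environ=None, default="C"):
    """Return the first non-empty value among LC_ALL, LC_MESSAGES, LANG."""
    env = os.environ if environ is None else environ
    for name in ("LC_ALL", "LC_MESSAGES", "LANG"):
        value = env.get(name)
        if value:  # unset and empty variables are both skipped
            return value
    return default

def parse_locale(value):
    """Split 'de_DE.UTF-8@euro' into (language, territory, codeset, modifier)."""
    value, _, modifier = value.partition("@")
    value, _, codeset = value.partition(".")
    language, _, territory = value.partition("_")
    # Missing parts are reported as None rather than empty strings.
    return language, territory or None, codeset or None, modifier or None

if __name__ == "__main__":
    current = detect_message_locale()
    print(current, parse_locale(current))
```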

Thanks for all the comments!

Sze · March 23, 2024, 7:56pm

I don’t see how that makes it something that is needed in the standard library, just make a good community package and it will easily become a de facto standard, without becoming a hindrance for anyone that has no use for it.

If Zig gets optional dependencies that can be declared with the package system, the community will be able to create optional localization packages, that way nobody has to download stuff they don’t use. I really don’t want another language that adds lots of bloat that is rarely used (just for more convenience), other languages that do that are annoying when it comes to download and install sizes, because they can’t resist adding more stuff.

I think putting data that is only used by a fraction of programs into the initial download of a language, is a sign that the package manager of the language isn’t good.

There are many applications / packages that don’t have any interaction with users or need for localization. The way I see it, the best case we can hope for is that the standard library remains small or even shrinks, containing the things that are needed by Zig itself and the rest will just be handled with the package system.

Personally I think if somebody programming a microcontroller, or trying to output some stuff on some old device, won’t care about it, it might be better as a package.

I have used languages that pull in the whole kitchen sink although you aren’t really using it and it wasn’t fun.

I think in the future there may be a point where a distinction needs to be made between a core library (the current std) and a new standard library (which I would see as a collection of packages that are designed as a cohesive unit). If the standard library keeps its name, then maybe the bigger library building upon it would be an extended library or something like that.

But currently I think the only difference between that imagined extended library and community packages, would be that the former would take greater care and community involvement to design the library code and coordinate efforts across multiple packages, instead of just somebody creating a package.

Overall I think that is still a long way from now and only becomes really relevant after the issue “audit the standard library API and implementations” (#1629 in ziglang/zig on GitHub) is done. Until then I would say just create the packages you want as third party packages and raise relevant issues, if for example the build system lacks some feature that is absolutely necessary to make it feasible.

I think a big part of the community is focused on what is necessary to drive Zig itself forwards, but nothing stops you from forming a sub community of Zig developers that care a lot about localization and then together create a good package for that.

kristoff · March 23, 2024, 9:12pm

I never talked about providing a bunch of translations. I talked about unicode metadata.

Do you want to display a list of items sorted alphabetically? You need Unicode collation information because otherwise non-Latin localizations will show a weird sorting order to the user.

Does your application have any text that might be broken into multiple lines? Maybe user-provided text or even just a builtin text element of the application, like a tooltip. Well you need to know how to segment text correctly, which is information provided by the Unicode standard.
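A small Python sketch of the sorting problem: comparing strings by code point puts uppercase ASCII before lowercase and accented letters after `z`. The `rough_key` function below is only a crude stand-in written for this example (decompose, drop combining marks, casefold); it is not the Unicode Collation Algorithm and ignores locale-specific rules such as Swedish placing `ä` after `z`, which is exactly the kind of data a real collation table has to carry.

```python
import unicodedata

# Normalise the literals to composed form so the demo does not depend on
# how this file happens to be encoded.
words = [unicodedata.normalize("NFC", w)
         for w in ["zebra", "Ärger", "apple", "Zoo", "éclair"]]

# Plain sorted() compares code points, not alphabetical positions.
by_code_point = sorted(words)

def rough_key(text):
    """Crude approximation: strip combining marks and ignore case."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()

roughly_collated = sorted(words, key=rough_key)

if __name__ == "__main__":
    print(by_code_point)
    print(roughly_collated)
```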

I would personally consider these (plus more usecases covered by the unicode standard) must-haves for any application that tries to support i18n, that said I don’t know if this is traditionally considered table stakes for i18n or not.

The problem though is that unicode, on top of being chunky, is a moving target and so it’s probably a good idea to not tie people’s ability to access information from the latest version of unicode to Zig’s release frequency.

Hence my opinion that this stuff doesn’t belong in the stdlib.

Additionally, currently the stdlib only offers formatting for comptime known fmt strings and I don’t know if there are any plans to ever add to it support for runtime formatting. It’s my understanding that one would want i18n formatting to be at least partially done over formatting strings loaded from a specific translation. So in that case, at least given the status quo of Zig’s stdlib, this would have to be something implemented separately anyway.
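To illustrate why translated format strings call for runtime formatting, here is a minimal Python sketch. The `TEMPLATES` table and `render` function are hypothetical; the point is only that the template is picked at run time from a translation, and that named placeholders let a translation reorder the arguments, which a format string fixed at compile time cannot express.

```python
# Hypothetical per-locale templates, as they might be loaded from
# translation files at run time.
TEMPLATES = {
    "en": "{user} uploaded {count} files",
    "de": "{count} Dateien wurden von {user} hochgeladen",
}

def render(locale, **values):
    """Format the template for `locale`, falling back to English."""
    template = TEMPLATES.get(locale, TEMPLATES["en"])
    # The format string is only known here, at run time.
    return template.format(**values)

if __name__ == "__main__":
    print(render("en", user="Ana", count=3))
    print(render("de", user="Ana", count=3))
```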

I’m mentioning this last point mainly to highlight how the whole “have localization percolate through stdlib based formatting” isn’t really a thing in Zig at the moment. Maybe it should be, maybe not, I genuinely don’t know.

This would definitely be a welcome package. There are a lot of foundational things that Zig doesn’t have yet because it’s a young language with an even younger ecosystem. I mentioned Ziglyph a bunch but before dude_the_builder decided to sit down and make it happen, we had no Zig native way of dealing with unicode.

If you have interest & expertise in these subjects then it might be that you’re the right person to create a first Zig native implementation of a package that fetches the user’s locale information. Creating a high-quality package that offers some i18n support might incentivize Zig application creators to add support for translations.

I think one reason why this is not yet an existing Zig package is because most people just create CLI tools that don’t really strongly need i18n support, but there are some GUI applications being written in Zig (like Pixi and Ghostty) and there’s a very promising gamedev scene that could definitely be interested in using such a package.