[RFC] Internationalization system rework

I’m happy to see the upcoming fluent function (#1040), but with the current configuration system it won’t solve some of the awkwardness of the localization system, and it will create some more. Namely:

  • Basic config options like title and description cannot be configured language-specifically.
  • Tucking all locale-aware operations into functions preserves backwards compatibility for config files, but not for themes/templates. In my opinion, the opposite outcome is desired. Writing a verbose expression like {{ fluent(lang=lang, key="description") | default(value=config.description) }} is way too clunky compared to having a language-aware config struct.
  • Fluent is too awkward for use in configuring themes. One should not need to mess with a domain-specific language to adjust parameters like author_name or bio. Hugo still retains the config.languages key with the ability to transparently override global options alongside the i18n function for strings translation.
  • Moving language codes to unic_langid::LanguageIdentifier breaks links for sites that did not specify languages with the canonical BCP 47 syntax.

I propose completely redesigning how languages are handled both internally and in the user-facing API.

  • Store languages as LanguageIdentifier but allow mapping an alias to them which will be used in URLs. This is needed because Fluent treats numbers and dates differently based on language. Having an incorrect language should be a fatal error.
  • Replace config.languages: Vec<Language> and config.translations: HashMap<String, TranslateTerm> with a single config.languages: HashMap<LanguageIdentifier, Language>.
  • Move language-specific config options (taxonomies, title, description, extra etc.) into Language.
  • Create a newtype struct LanguageConfig(Config) that is generated by overriding specific fields of Config with language-specific options from Language and pass that as the config object to templates.
  • Deprecate the trans function for general use, but keep it as an interface for getting options for a specific language. So, trans(lang="lang", key="key") would basically return config.languages._lang_.extra._key_, which is the new equivalent of config.translations._lang_._key_.
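To make the override idea concrete, here is a minimal sketch of how a LanguageConfig could be derived from the global config. The structs are hypothetical, trimmed-down stand-ins (the real Config has many more fields, and extra holds arbitrary TOML values rather than strings), and languages are keyed by plain strings here instead of LanguageIdentifier to keep the example dependency-free.

```rust
use std::collections::HashMap;

/// Site-wide options; a hypothetical, trimmed-down stand-in for Zola's `Config`.
#[derive(Clone, Debug)]
struct Config {
    title: Option<String>,
    description: Option<String>,
    extra: HashMap<String, String>,
}

/// Per-language overrides; `None` means "inherit the global value".
#[derive(Clone, Debug, Default)]
struct Language {
    title: Option<String>,
    description: Option<String>,
    extra: HashMap<String, String>,
}

/// Newtype handed to templates in place of the raw `Config`.
#[derive(Debug)]
struct LanguageConfig(Config);

impl LanguageConfig {
    /// Clones the global config and overrides the localizable fields.
    fn new(global: &Config, lang: &Language) -> Self {
        let mut merged = global.clone();
        if lang.title.is_some() {
            merged.title = lang.title.clone();
        }
        if lang.description.is_some() {
            merged.description = lang.description.clone();
        }
        // Language-specific `extra` keys win over global ones.
        merged.extra.extend(lang.extra.clone());
        LanguageConfig(merged)
    }
}

/// Sketch of the reduced `trans`: a plain lookup in one language's `extra`.
fn trans<'a>(
    languages: &'a HashMap<String, Language>,
    lang: &str,
    key: &str,
) -> Option<&'a str> {
    languages.get(lang)?.extra.get(key).map(String::as_str)
}
```

With this shape, a template would read config.title directly and get the localized value, while trans stays available for reaching into another language's extra.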

Pretty much every module that deals with the config and templates needs to be changed. Because this change would be pretty big (I estimate 500+ lines excluding tests and benches), I want to open a discussion about how people want the localization system to function and to get a go-ahead from those working on Zola or using it in production environments.

I’ve already mentioned some of these on the PR created by @XAMPPRocky. They have not responded to my comments in a while and the current progress would need to be rewritten to accommodate these architectural changes, so I’d like to take over and file a new PR. I’m a high school student who currently has a crap-ton of time, so apart from review and the occasional question, I can handle this.

Sorry for not linking relevant stuff, but I triggered the abuse detection with my references.

Fwiw I didn’t know there was something that I specifically needed to respond to. I’m only a contributor myself so I can only offer a perspective and I wouldn’t feel comfortable saying it should be one way or another.

That looks good overall, but I’m not sure about putting taxonomies in Language: that means people with a single language have to create that config even though they don’t really need it.

If you do the PR, I can find and ping some people with multi-language setups to look at it; it’s easier to judge this kind of thing in action.

Implemented this in #1148.

Where you put a string should not change based on whether you have a single or multi language site. If it’s a UI string it belongs in a language specific location, even if the site will only have one language. Anything else and you make the experience painful for multi-lingual sites and make the transition from single to multi hard. Config files with programmatic values can be shared. User facing strings are by definition in some language and should be saved accordingly.

The only reason there is a default_language is for the RSS feed template. Going from a single language to multiple languages is pretty niche.

Let’s take a step back from the code and see what we are trying to accomplish. There are roughly two steps to that RFC:

  1. Improving the configuration of languages so it’s more consistent
  2. Add support for Fluent

Let’s forget about 2 until we are done with 1. What are we trying to accomplish with 1?

  • Make it more consistent to declare languages? Add more
  • From the user PoV, what’s the deal with language and language_alias from the PR?
  • Which part of a config is localizable? title, description… what else?
  • What about translations?
  • What do users need as localized data from the config in templates? Can we create a site variable holding that instead of cloning the Config every time in the PR?
  • What’s an example diff of config.toml with those changes, for someone using multiple languages and someone who doesn’t?
  • Anything else not captured in the PR currently?

Let’s answer these, see exactly what we need to change, and figure out a plan.

No, it’s not. It’s done infrequently because existing tooling is so awful. Even static generators and simpler CMS systems where this should be easy make it absurdly difficult, and even when they accomplish it whatever language you added is always somehow second class to the original.

As a result people that seriously want to move to a localized site often end up switching site generators or CMSs in the process because it’s easier to build from the ground up with something with good multi-language support than upgrade an existing site built with something where there isn’t an easy upgrade path.

From the user PoV, what’s the deal with language and language_alias from the PR?

First, I want to establish that it’s necessary for technical reasons.

Language codes unambiguously describe languages, using a standardized syntax. Fluent needs to know the exact language, because some of its behavior is language specific (e.g. number classes, date formatting), so it uses the language identifier. We also want to select the correct translation of a theme. This is not possible with arbitrary strings, because one might call the same language de-AT, de_at or german.

However, there are cases where the rigidity of language codes causes problems, like:

  • some sites might be using an ad-hoc or country-based language naming scheme in their URLs (e.g. {base_url}/tw for the language zh-TW, Mandarin Chinese as spoken in Taiwan, or {base_url}/us for US English). We don’t want to break links to these sites.
  • one might want to use a specific language variant for Fluent, but a more generic variant in URLs (e.g. {base_url}/en for content in the en-AU dialect and spelling).

Thus, we should introduce a mechanism for giving arbitrary names for translations of the site. This should be entirely optional, because the examples above are rare edge-cases. This means indexing languages by code, but using the override in URLs. This is language_alias.
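A minimal sketch of that lookup, assuming a hypothetical Language entry with an optional language_alias field and a hypothetical translation_url helper (neither name comes from the PR as written; plain strings stand in for LanguageIdentifier keys):

```rust
use std::collections::HashMap;

/// Hypothetical per-language entry; only the field relevant here is shown.
#[derive(Debug, Default)]
struct Language {
    /// Optional free-form name used in URLs instead of the language code.
    language_alias: Option<String>,
}

/// Builds `{base_url}/{alias or code}` for a configured language.
/// Returns `None` when the code is not configured at all.
fn translation_url(
    base_url: &str,
    code: &str,
    languages: &HashMap<String, Language>,
) -> Option<String> {
    let lang = languages.get(code)?;
    // Languages are indexed by code, but the alias (if any) wins in the URL.
    let segment = lang.language_alias.as_deref().unwrap_or(code);
    Some(format!("{}/{}", base_url.trim_end_matches('/'), segment))
}
```

A site that never sets an alias is unaffected: its translations stay under the language code.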

If an alias is not specified, the translation will be available under the canonicalized code of the language. This is lossy and is caused by re-serializing the deserialized identifier. The canonicalization changes underscores to dashes and region subtags to uppercase.
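To illustrate why the round trip is lossy, here is a simplified, std-only stand-in for the canonicalization. It only mimics the casing and separator rules and does not call unic_langid; the canonicalize function is hypothetical.

```rust
/// Simplified imitation of parsing and re-serializing a language code.
/// Assumption: only the separator and casing rules are modelled, not validation.
fn canonicalize(code: &str) -> String {
    code.split(|c: char| c == '-' || c == '_')
        .enumerate()
        .map(|(i, subtag)| {
            let is_alpha = subtag.chars().all(|c| c.is_ascii_alphabetic());
            if i > 0 && is_alpha && subtag.len() == 2 {
                // Region subtags such as `at` become `AT`.
                subtag.to_ascii_uppercase()
            } else if i > 0 && is_alpha && subtag.len() == 4 {
                // Script subtags such as `hant` become `Hant`.
                let lower = subtag.to_ascii_lowercase();
                let (first, rest) = lower.split_at(1);
                format!("{}{}", first.to_ascii_uppercase(), rest)
            } else {
                // The primary language subtag and anything else is lowercased.
                subtag.to_ascii_lowercase()
            }
        })
        .collect::<Vec<_>>()
        .join("-")
}
```

Both de_at and de-AT end up as de-AT, so a site that used de_at in its URLs would need an alias to keep its old links.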

I agree the name is confusing, but clear documentation can fix that. I would like help in that regard, because my English skills certainly aren’t good enough for writing easy-to-understand technical documentation. We should focus on the “under what URL will the translation be available” aspect.

Which part of a config is localizable? title, description… what else?

Let’s look at each of the current fields in config.toml:

  • base_url: there is currently no multihost mechanism ✗
  • title and description: used in feeds; templates might use them (e.g. on the home page) ✓
  • theme: currently isn’t possible, sites generally want to have a unified design language ✗
  • highlight_code, highlight_theme and compile_sass: related to theming, which should be consistent between languages. ✗
  • generate_feed and build_search_index: these are generally large for CJK languages, so some users might want to disable them. AFAIK Zola quits early if search is enabled for an unsupported language, so a global setting isn’t an option. ✓
  • taxonomies: taxonomies contain pages for a single language. Taxonomy names and items should also be translatable. ✓
  • ignored_content, link_checker, slugify and extra_syntaxes: probably pointless, also hard to implement ✗
  • search: the same concern for size (mentioned in the existing documentation) as build_search_index. ✓
  • extra: unifies the features of extra and translations, see below. ✓

There could be use cases for the crossed-out options, but I can’t come up with any.

What about translations?

In 0.12, it is the only way to set language-specific values. With extra under languages, it will be redundant.

Also, it can only contain string values for arbitrary reasons.

I agree. I understand the purpose of language_alias, but I’m thinking about ways to make it easier for users to understand.
For example, we could keep code as the free-form version and add a canonical_code for the canonicalized version. If code is already a valid canonical code, everything is fine. If it isn’t, we can display a warning telling the user to fill in canonical_code, with a link to where they can find the correct value. Names are up for bikeshedding of course, it’s just the alias notion that I find slightly confusing.
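A rough sketch of that fallback, assuming hypothetical looks_canonical and resolve_canonical_code helpers. The shape check is deliberately crude (language, optionally followed by a two-letter region) and is not a real BCP 47 parser.

```rust
/// Crude shape check: `xx`/`xxx`, optionally followed by `-YY`.
/// Assumption: a stand-in for real parsing, which a library would do.
fn looks_canonical(code: &str) -> bool {
    let mut parts = code.split('-');
    let lang_ok = parts.next().map_or(false, |l| {
        (2..=3).contains(&l.len()) && l.chars().all(|c| c.is_ascii_lowercase())
    });
    let region_ok = parts
        .next()
        .map_or(true, |r| r.len() == 2 && r.chars().all(|c| c.is_ascii_uppercase()));
    lang_ok && region_ok && parts.next().is_none()
}

/// Picks the canonical code, or returns the warning text to show the user.
fn resolve_canonical_code(code: &str, canonical_code: Option<&str>) -> Result<String, String> {
    match canonical_code {
        // An explicit `canonical_code` always wins.
        Some(explicit) => Ok(explicit.to_string()),
        None if looks_canonical(code) => Ok(code.to_string()),
        None => Err(format!(
            "Warning: `{}` is not a canonical language code; please set `canonical_code`",
            code
        )),
    }
}
```

One caveat: a purely syntactic check cannot tell that a site using tw means Taiwan, since tw is also the valid code for the Twi language, so such sites would still have to set canonical_code by hand.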

I agree with everything, except for search. Wanting to index different parts of the content per language is probably niche enough that it’s not important to implement, at least initially. Chinese and Japanese support for search are already disabled by default due to the size increase they cause to the Zola binary. Korean is not supported by the underlying library afaik.

I like that 👍


For a site variable, we could have:

title
description
extra
code
canonical_code

We can easily add more but I don’t know if people access taxonomies/search/slugify in templates from the config object so let’s leave them out for now.
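As a sketch of what such a variable could look like without cloning the Config, the view below only borrows from it. All type and field names besides the five listed above are hypothetical, and extra is simplified to string values.

```rust
use std::collections::HashMap;

/// Hypothetical per-language section of the config.
struct LanguageOptions {
    title: Option<String>,
    description: Option<String>,
    canonical_code: Option<String>,
    extra: HashMap<String, String>,
}

/// Hypothetical, trimmed-down global config.
struct Config {
    title: Option<String>,
    description: Option<String>,
    languages: HashMap<String, LanguageOptions>,
}

/// Read-only view exposed to templates; borrows instead of cloning.
#[derive(Debug)]
struct LocalizedSite<'a> {
    title: Option<&'a str>,
    description: Option<&'a str>,
    extra: &'a HashMap<String, String>,
    code: &'a str,
    canonical_code: Option<&'a str>,
}

fn localized_site<'a>(config: &'a Config, code: &str) -> Option<LocalizedSite<'a>> {
    // Borrow the key stored in the map so the view does not depend on `code`.
    let (key, lang) = config.languages.get_key_value(code)?;
    Some(LocalizedSite {
        // Fall back to the global value when the language does not override it.
        title: lang.title.as_deref().or(config.title.as_deref()),
        description: lang.description.as_deref().or(config.description.as_deref()),
        extra: &lang.extra,
        code: key.as_str(),
        canonical_code: lang.canonical_code.as_deref(),
    })
}
```

The trade-off is that extra here is the language's own table only; merging it with a global extra would require allocating a new map.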

I like this idea. The keys in the languages table will be the free-form strings, and LocalizedConfig will contain the language identifier. Thinking about it, it allows more flexibility than code-based indexing.

I don’t want this discussion to continue on forever, so I’ll present my current opinion. Of course, I’ll file the PR with your final choice.

canonical is more of a big scary word than alias IMO. In the config file, we could simply call the LanguageIdentifier fields code or language_code, as they function the same as the code in HTML’s lang attribute. Neither of these is perfect: writing language_code under languages is verbose, but in the main section, code does not imply that it is the default language’s code.

For the in-template nomenclature, I would like to bring up prior art: Hugo calls the free-form version Lang, the full name (e.g. English) LanguageName and the code LanguageCode. However, the docs seem unclear and have to show examples to disambiguate.

I’m in favor of language_code for the canonicalized code. This clearly describes its purpose: it’s a canonical, machine-readable, strictly defined identifier of a language. code alone does not imply what it describes.

For the free-form version, code is not good, as its most important aspect is that it can be set to anything. language_name can be confused with the full name of a language (e.g. English). My preference is still language_alias, but some other suggestions are translation_path, simply language, or language_id (as id can also refer to keys of a mapping).

Completely fine for me! I’m not great at naming things either.

My issue with alias is that pages can have aliases, so the term is already used in Zola but would mean something different here. I would prefer language if I have to pick, but I can see that having both language and language_code hold codes could be confusing.

One way to fix it would be to have the code be the key of the table, like Hugo does (see “Multilingual mode” in the Hugo docs), and to look at how they handle arbitrary user language codes.
We do need the right code to load Fluent, right? Does it only support official language codes or could you use your own code? Is it to integrate with themes?

Whoops, I forgot about that. Now it makes sense.

Let’s do that then. If one wants to tweak templates, reading the docs is not a big thing to ask.

Fluent functions take in a LanguageIdentifier, so the code must conform – in form – to the BCP 47 syntax. unic_langid does not do validation AFAIK, so setting random text is possible, though I don’t know either how Fluent would react.

My idea of a solution is:

  • index languages by the freeform string (this way we don’t have to give it a user-facing name)
  • call it language in templates (RTFM)
  • try to parse language_code from the index. If not possible, display a warning that Fluent will be disabled (might break themes later on) and that the theme’s default language will be used (might not be the expected behavior).

Fluent would be disabled by loading a dummy fluent function into Tera that displays a detailed explanation.
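A minimal sketch of such a dummy function. The TemplateFn alias is a hypothetical stand-in for a template function (named string arguments in, rendered text or an error out) and is not Tera's actual function trait; the error wording is only an example.

```rust
use std::collections::HashMap;

/// Hypothetical shape of a template function; Tera's real trait is not used here.
type TemplateFn = Box<dyn Fn(&HashMap<String, String>) -> Result<String, String>>;

/// Builds the placeholder that replaces `fluent` when the language code is invalid.
fn disabled_fluent(language: &str) -> TemplateFn {
    let language = language.to_string();
    Box::new(move |args: &HashMap<String, String>| {
        let key = args.get("key").map(String::as_str).unwrap_or("<missing key>");
        // Always fail, but explain why and how to fix it.
        Err(format!(
            "`fluent` was called for key `{}`, but Fluent is disabled because `{}` \
             is not a valid language code. Set `language_code` for this language \
             to enable it.",
            key, language
        ))
    })
}
```

Failing only when a template actually calls fluent means sites that never use it can keep an arbitrary language name without any breakage.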

This is my first attempt at an error message:

Warning: {} isn’t a valid language code. This means multilingual themes won’t work correctly.
You can optionally set a code in language_code but keep using this name.

The current iteration:

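use unic_langid::LanguageIdentifier;

/// Tries to parse a free-form language name as a language identifier.
/// Prints a warning and returns `None` if the name is not a valid code.
fn try_parse_language(name: &str) -> Option<LanguageIdentifier> {
    if let Ok(id) = name.parse::<LanguageIdentifier>() {
        Some(id)
    } else {
        eprintln!(
            "Warning: {} is not a valid language code. This means multilingual themes \
            won't work correctly.\nYou may keep using this name. You can optionally set a code \
            in `language_code`.",
            name
        );
        None
    }
}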

@keats Where should I define that? We already have a Site, wouldn’t that cause confusion?

@BertalanD

Sorry for disappearing a bit, I just had a baby. site could be whatever name we want, where do we use it?

Congratulations!

Hi, and congratulations @Keats on the baby!

I just posted a comment on GitHub about unified path resolution, which is in my view an important feature for i18n so that you don’t have to localize all URLs everywhere. Don’t hesitate to reply if you have feedback.

I’ll take some time to read the complete proposal here and react to it 🙂

Thanks!
The next version will definitely be 100% focused on i18n polishing (+ fixing inconsistencies with paths). I know I said that before, but this time for real™

That’s brilliant! I really like this idea of language-specific extra configuration. It could make it easier to support different baseURL for every language. I have two questions about this proposal:

  1. I’m concerned about taxonomy translations. How can we link to translations for taxonomy pages like we do for content pages? Also, should the localized taxonomy name be used for output paths (URLs), for taxonomy frontmatter declarations, or both?

Maybe some people would like to have localized taxonomy names in their frontmatter. Personally, I would only like them in URLs, because it’s a less error-prone process when translating a single article (a taxonomy term may have different translations depending on who is doing the translation).

taxonomies = [
  { name = "author" }
]

[languages]
[languages.fr]
translate_taxonomies = "url" # one of "url", "frontmatter", "both" or "none"
taxonomies = [
  { name = "auteurice", id = "author" }
]

So I can write:

+++
title = "Ma traduction"
[taxonomies]
author = ["southerntofu"]
+++

This way there is no need to worry about whether someone translates author as “auteur”, “auteure”, or “auteurice” (gender variance). This approach would still allow language-specific taxonomies, by not specifying an id. The same translation strategy could be applied to taxonomy terms.
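A minimal sketch of how the id and translate_taxonomies options could interact. The Taxonomy struct, the TranslateTaxonomies enum and the method names are hypothetical and only model the behavior described above.

```rust
/// Where the localized taxonomy name should be used (hypothetical option).
#[derive(Debug, Clone, Copy, PartialEq)]
enum TranslateTaxonomies {
    Url,
    Frontmatter,
    Both,
    None,
}

/// A per-language taxonomy declaration, e.g. `{ name = "auteurice", id = "author" }`.
struct Taxonomy {
    name: String,
    /// Name of the language-agnostic taxonomy this one translates, if any.
    id: Option<String>,
}

impl Taxonomy {
    /// Falls back to `name` for language-specific taxonomies without an `id`.
    fn canonical_id(&self) -> &str {
        self.id.as_deref().unwrap_or(self.name.as_str())
    }

    /// Key authors write under `[taxonomies]` in the front matter.
    fn frontmatter_key(&self, mode: TranslateTaxonomies) -> &str {
        if matches!(mode, TranslateTaxonomies::Frontmatter | TranslateTaxonomies::Both) {
            self.name.as_str()
        } else {
            self.canonical_id()
        }
    }

    /// Path segment used for the taxonomy's output pages.
    fn url_segment(&self, mode: TranslateTaxonomies) -> &str {
        if matches!(mode, TranslateTaxonomies::Url | TranslateTaxonomies::Both) {
            self.name.as_str()
        } else {
            self.canonical_id()
        }
    }
}
```

With the "url" mode, the French pages would be generated under auteurice while translators keep writing author in the front matter.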

  2. I was already (mis)using the translations for this, but faced the problem that the theme’s translations were not merged with the site’s translations. If I understand your PR correctly, this problem is fixed in add_theme_extra().

However, how does that work when the default language is not the same in the theme and in the site? Language-agnostic configuration will gladly be merged, but will a default-French site display default-English translation strings from the theme’s extra? I believe this could be a problem, which may justify keeping the translations field separate from extra.