
[Discussion] Localization (again)



Hi again. Localization became a topic again today, and after some extensive feedback from Michael on his vision of things, I decided to write a quick prototype of the system.

1) Language files
The system should be simple and allow easy modification of game strings. Localization strings should be placed in simple .txt files; the first part of the file name marks the language used, while the rest of the file name is ignored:
-) Format: {lang}xxxxx.txt
-) Example: english_units.txt

All files associated with a language will be collated into a single language dictionary.

2) File format
Since the files are simple text files, the format should also be simple, easy to edit and, foremost, readable.
Each translation entry (id string from now on) is described first by its id string on a single line, followed by its attribute values, each on a single line. Each attribute name is followed by its translation string, separated by whitespace. An id string entry is terminated by an empty line; for the sake of parsing correctness, any extra empty lines are ignored. Lines can be commented out with a semicolon ;.
-) Format:

;;; comment
id string
{attribute} {translation string}
{attribute} {translation string}

-) Example:


athen_hero_pericles
generic Pericles
specific Periklēs
tooltip Hero Aura: Buildings construct much faster within his vision. Temples are much cheaper during his lifetime. Melee 2x vs. all cavalry.
history Pericles was the foremost Athenian politician of the 5th Century.
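To make the rules above concrete (id line, attribute lines, blank-line terminator, ; comments), here is a minimal sketch of the parsing-and-collation step in C++. This is a simplified illustration, not the actual parser (that follows in section 4): it copies strings into a std::map instead of tokenizing in place, and the `collate` helper name is mine.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// One entry: attribute name -> translation string.
using Entry = std::map<std::string, std::string>;

// Collate several language files (passed here as in-memory text)
// into a single dictionary keyed by id string.
std::map<std::string, Entry> collate(const std::vector<std::string>& files)
{
    std::map<std::string, Entry> dict;
    for (const std::string& text : files)
    {
        std::istringstream in(text);
        std::string line, id;
        while (std::getline(in, line))
        {
            if (!line.empty() && line.back() == '\r')
                line.pop_back();                    // tolerate CR+LF files
            if (!line.empty() && line[0] == ';')
                continue;                           // comment line
            if (line.find_first_not_of(" \t") == std::string::npos)
            {
                id.clear();                         // empty line ends the entry
                continue;
            }
            size_t ws = line.find_first_of(" \t");
            if (id.empty())
                id = line;                          // first line: the id string
            else if (ws != std::string::npos)       // "{attribute} {translation}"
            {
                size_t val = line.find_first_not_of(" \t", ws);
                dict[id][line.substr(0, ws)] =
                    val == std::string::npos ? "" : line.substr(val);
            }
        }
    }
    return dict;
}
```

All files of one language would simply be fed through the same call, giving the single language dictionary described in section 1.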



3) Current translation text
Now of course, the first concern of Michael was how to get all the current translations from XML into the new files. It took a few minutes of scribbling to put together a tiny conversion tool. It simply walks through a set of sub-directories and collates the XML files of each directory into a single file.


using System.Xml;
using System.IO;

namespace WFGLocalizationConverter
{
    class Program
    {
        static string GetValue(XmlNode node, string key)
        {
            if (node == null || (node = node[key]) == null)
                return null;
            return node.InnerText;
        }

        static void CreateTranslationFile(string path)
        {
            string[] files = Directory.GetFiles(path, "*.xml", SearchOption.TopDirectoryOnly);
            if (files.Length == 0)
                return; // nothing to do here

            StreamWriter outfile = new StreamWriter(string.Format("english_{0}.txt", Path.GetFileName(path)));
            outfile.WriteLine(";; Generated by WFGLocalizationConverter\n");

            foreach (string file in files)
            {
                XmlDocument doc = new XmlDocument();
                doc.Load(file);

                XmlNode identity = doc["Entity"]["Identity"];
                if (identity == null)
                    continue; // not all entities have <Identity> tags

                string generic  = GetValue(identity, "GenericName");
                string specific = GetValue(identity, "SpecificName");
                string tooltip  = GetValue(identity, "Tooltip");
                string history  = GetValue(identity, "History");

                if (generic == null && specific == null && tooltip == null && history == null)
                    continue; // no useful data for us

                // write it down
                outfile.WriteLine(Path.GetFileNameWithoutExtension(file));
                if (generic != null)  outfile.WriteLine("generic {0}", generic);
                if (specific != null) outfile.WriteLine("specific {0}", specific);
                if (tooltip != null)  outfile.WriteLine("tooltip {0}", tooltip);
                if (history != null)  outfile.WriteLine("history {0}", history);
                outfile.WriteLine();
            }

            outfile.Close(); // clean-up & flush
        }

        static void Main(string[] args)
        {
            foreach (string path in Directory.GetDirectories("data/"))
                CreateTranslationFile(path);
        }
    }
}



Running this tiny piece on "simulation\templates\", I get a full list of collated translation files:

english_campaigns.txt
english_gaia.txt
english_other.txt
english_rubble.txt
english_special.txt
english_structures.txt
english_units.txt

4) Loading translation files
Now that we've converted all of this into the language files, we need to read them back in C++. In order to minimize memory usage, we load the entire file into a buffer, treat it as a string and tokenize it in place. The tokenized strings are then put into a hash map (std::unordered_map<size_t, TEntry>). Even though a sorted vector indexed with binary search would be more memory efficient, we resort to the hash map for simplicity.
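For comparison, the sorted-vector alternative mentioned above could look like the sketch below. This is my own illustration, not part of the proposal; the TEntry type stands in for the one defined in the module that follows.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Sorted-vector alternative to the hash map: entries are kept ordered by
// id hash and looked up with binary search. Slightly more memory efficient
// (no buckets), at the cost of O(log n) lookups and a sort after loading.
template <typename TEntry>
struct SortedDict
{
    std::vector<std::pair<size_t, TEntry>> entries; // (hash, entry), sorted by hash

    void finalize() // call once after all entries are inserted
    {
        std::sort(entries.begin(), entries.end(),
                  [](const auto& a, const auto& b) { return a.first < b.first; });
    }

    const TEntry* find(size_t hash) const
    {
        auto it = std::lower_bound(entries.begin(), entries.end(), hash,
                                   [](const auto& e, size_t h) { return e.first < h; });
        return (it != entries.end() && it->first == hash) ? &it->second : nullptr;
    }
};
```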

The code itself is written as a small C/C++ module (yeah, sorry - it's 175 lines):


#include <stdio.h>       // FILE* suits us a bit better in this case
#include <stdlib.h>      // system
#include <string.h>      // strchr, strpbrk, memcmp
#include <unordered_map> // lazy today

// good old k33 hash
inline size_t hash_k33(const char* str)
{
    size_t hash = 5381;
    while (int c = *str++)
        hash = ((hash << 5) + hash) + c; // hash * 33 + c
    return hash;
}

// tests if the range is ALL-whitespace
inline bool is_whitespace(const char* begin, const char* end)
{
    while (begin < end) {
        if (*begin != ' ' && *begin != '\t')
            return false; // found a non-ws char
        ++begin;
    }
    return true; // this is all whitespace
}

// advances to the next line
inline char* next_line(char* str)
{
    return (str = strchr(str, '\n')) ? ++str : nullptr;
}

// reads a valid line (skipping comments and empty lines)
const char* read_line(char*& str)
{
    char* line = str;
    do {
        if (*line == ';' || *line == '\n' || *line == '\r')
            continue; // next line
        char* end = strpbrk(line, "\r\n"); // seek to end of line
        if (is_whitespace(line, end))      // is it an all-whitespace line?
            continue; // skip line
        // windows CR+LF ? +2 chars : +1 char
        str = *end == '\r' ? end + 2 : end + 1; // writeout ptr to next line
        *end = '\0'; // null-term this line, turning it into a C-string
        return line;
    } while (line = next_line(line));
    return nullptr; // no more lines
}

// gets an attribute length
inline size_t attr_len(int attrid)
{
    static size_t attrlens[] = { 0, 7, 8, 7, 7 };
    return attrlens[attrid];
}

// gets the attribute id [1..4] of this line; 0 if not an attribute
int attr_id(const char* line)
{
    static const char* attributes[] = { 0, "generic", "specific", "tooltip", "history" };
    for (int i = 1; i <= 4; i++) {
        size_t len = attr_len(i);
        if (memcmp(line, attributes[i], len) == 0) { // startsWith match
            const char* end = line + len;
            if (*end != ' ' && *end != '\t')
                return 0; // it's not a valid attribute!
            return i; // it's a valid attribute
        }
    }
    return 0; // it's not a valid attribute
}

// UTF8 Translation Entry
struct TEntry
{
    const char* idstring; // id string of the translation entry
    const char* generic;  // 'generic' attribute string
    const char* specific; // 'specific' attribute string
    const char* tooltip;  // 'tooltip' attribute string
    const char* history;  // 'history' attribute string

    void set(int attrid, const char* line)
    {
        line += attr_len(attrid) + 1;                 // skip keyword +1 char
        while (*line == ' ' || *line == '\t') ++line; // skip any additional whitespace
        *((const char**)this + attrid) = line;        // hack
    }
};

// UTF8 dictionary
struct Dictionary
{
    char* mBuffer; // buffer
    size_t mSize;  // buffer size
    std::unordered_map<size_t, TEntry> mEntries;

    Dictionary(FILE* f)
    {
        // get the file size
        fseek(f, 0, SEEK_END);
        size_t fsize = ftell(f);
        fseek(f, 0, SEEK_SET);

        char* str = mBuffer = new char[(mSize = fsize) + 1];
        // read all the data in one go
        fread(mBuffer, fsize, 1, f);
        mBuffer[fsize] = '\0'; // null-terminate the buffer for the str* functions

        const char* line = read_line(str);
        if (line) do {
            TEntry entry = { 0 };
            if (attr_id(line) == 0) { // not an attribute; great!
                entry.idstring = line;
                int attrid;
                while ((line = read_line(str)) && (attrid = attr_id(line)))
                    entry.set(attrid, line);
                // emplace entry into the hash table:
                mEntries[hash_k33(entry.idstring)] = entry;
            }
        } while (line);
    }

    ~Dictionary()
    {
        delete[] mBuffer;
        mBuffer = nullptr;
        mEntries.clear();
    }

    inline const TEntry* at(const char* idstring) const         { return &mEntries.at(hash_k33(idstring)); }
    inline const TEntry* operator[](const char* idstring) const { return &mEntries.at(hash_k33(idstring)); }
    inline const TEntry* at(size_t idhash) const                { return &mEntries.at(idhash); }
    inline const TEntry* operator[](size_t idhash) const        { return &mEntries.at(idhash); }
};

struct Entity
{
    const TEntry* descr;
    // ...
    Entity(const TEntry* descr) : descr(descr) {}

    void Print() // print the unit
    {
        printf("%s\n", descr->idstring);
        if (descr->generic)  printf("generic %s\n", descr->generic);
        if (descr->specific) printf("specific %s\n", descr->specific);
        if (descr->tooltip)  printf("tooltip %s\n", descr->tooltip);
        if (descr->history)  printf("history %s\n", descr->history);
        printf("\n");
    }
};

int main()
{
    if (FILE* f = fopen("english_gaia.txt", "rb")) {
        Dictionary english(f);
        fclose(f);
        Entity(english["fauna_bear"]).Print();
        Entity(english["flora_bush_badlands"]).Print();
        system("pause");
    }
    return 0;
}




-----------
I'll put the main focus on how the Dictionary is actually used. Each entity is assigned a translation entry which contains all the strings it needs. These translation entries can be retrieved from the dictionary by their id string or its hash, and this is done only once, when the Entity type is instantiated.

Here's a snippet of this in action:

struct Entity
{
    const TEntry* descr;
    // ...
    Entity(const TEntry* descr) : descr(descr) {}

    void Print() // print the unit
    {
        printf("%s\n", descr->idstring);
        if (descr->generic)  printf("generic %s\n", descr->generic);
        if (descr->specific) printf("specific %s\n", descr->specific);
        if (descr->tooltip)  printf("tooltip %s\n", descr->tooltip);
        if (descr->history)  printf("history %s\n", descr->history);
        printf("\n");
    }
};

int main()
{
    if (FILE* f = fopen("english_gaia.txt", "rb")) {
        Dictionary english(f);
        Entity(english["fauna_bear"]).Print();
        Entity(english["flora_bush_badlands"]).Print();
        system("pause");
    }
    return 0;
}

And its output in the console:

fauna_bear
specific Bear

flora_bush_badlands
specific Hardy Bush
history A bush commonly found in dry flatlands and rocky crags.

This is it for now. What you should do now is discuss! You can take a look at the converted translation files below.

Regards,
- RedFox

english_campaigns.txt

english_gaia.txt

english_other.txt

english_rubble.txt

english_special.txt

english_structures.txt

english_units.txt

Edited by feneur

That seems mostly to deal with entity templates, which are certainly one part of the translation issue, but not all. We'd still have to come up with custom solutions for JS, JSON, and GUI XML? Personally I'm leaning toward the approach taken on http://trac.wildfiregames.com/ticket/67 that uses existing libraries, tools, and data formats. I don't have experience translating software, but if someone who does says that's a common approach, it's a major consideration. If the tools are out there to do what we want, why reinvent the wheel?

It's true that this:


msgid ""
msgstr ""
"Project-Id-Version: PACKAGE VERSION\n"
"POT-Creation-Date: 2013-06-15 21:58+0200\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"

#. (itstool) path: object/localizableAttribute
#: binaries/data/mods/public/gui/pregame/mainmenu.xml:343
msgid "Learn To Play"
msgstr ""

#. (itstool) path: object/localizableAttribute
#: binaries/data/mods/public/gui/pregame/mainmenu.xml:344
msgid "The 0 A.D. Game Manual"
msgstr ""

#. (itstool) path: object/localizableAttribute
#: binaries/data/mods/public/gui/pregame/mainmenu.xml:360
msgid "Single Player"
msgstr ""

#. (itstool) path: object/localizableAttribute
#: binaries/data/mods/public/gui/pregame/mainmenu.xml:361
msgid "Challenge the computer player to a single player match."
msgstr ""

#. (itstool) path: object/localizableAttribute
#: binaries/data/mods/public/gui/pregame/mainmenu.xml:375
msgid "Multiplayer"
msgstr ""

#. (itstool) path: object/localizableAttribute
#: binaries/data/mods/public/gui/pregame/mainmenu.xml:376
msgid "Fight against one or more human players in a multiplayer game."
msgstr ""

#. (itstool) path: object/localizableAttribute
#: binaries/data/mods/public/gui/pregame/mainmenu.xml:390
msgid "Tools & Options"
msgstr ""

#. (itstool) path: object/localizableAttribute
#: binaries/data/mods/public/gui/pregame/mainmenu.xml:391
msgid "Game options and scenario design tools."
msgstr ""

#. (itstool) path: object/localizableAttribute
#: binaries/data/mods/public/gui/pregame/mainmenu.xml:405
msgid "History"
msgstr ""

#. (itstool) path: object/localizableAttribute
#: binaries/data/mods/public/gui/pregame/mainmenu.xml:406
msgid "Learn about the many civilizations featured in 0 A.D."
msgstr ""

#. (itstool) path: object/localizableAttribute
#: binaries/data/mods/public/gui/pregame/mainmenu.xml:422
msgid "Exit"
msgstr ""

#. (itstool) path: object/localizableAttribute
#: binaries/data/mods/public/gui/pregame/mainmenu.xml:423
msgid "Exit Game"
msgstr ""

#. (itstool) path: object/localizableAttribute
#: binaries/data/mods/public/gui/pregame/mainmenu.xml:450
msgid "[font=\"serif-bold-16\"]Alpha XIII: Magadha[/font]\\n\\nWARNING: This is an early development version of the game. Many features have not been added yet.\\n\\nGet involved at: play0ad.com"
msgstr ""

#. (itstool) path: object/localizableAttribute
#: binaries/data/mods/public/gui/pregame/mainmenu.xml:459
msgid "Website"
msgstr ""

#. (itstool) path: object/localizableAttribute
#: binaries/data/mods/public/gui/pregame/mainmenu.xml:460
msgid "Click to open play0ad.com in your web browser."
msgstr ""

#. (itstool) path: object/localizableAttribute
#: binaries/data/mods/public/gui/pregame/mainmenu.xml:473
msgid "Chat"
msgstr ""

#. (itstool) path: object/localizableAttribute
#: binaries/data/mods/public/gui/pregame/mainmenu.xml:474
msgid "Click to open the 0 A.D. IRC chat in your browser. (#0ad on webchat.quakenet.org)"
msgstr ""

#. (itstool) path: object/localizableAttribute
#: binaries/data/mods/public/gui/pregame/mainmenu.xml:487
msgid "Report a Bug"
msgstr ""

#. (itstool) path: object/localizableAttribute
#: binaries/data/mods/public/gui/pregame/mainmenu.xml:488
msgid "Click to visit 0 A.D. Trac to report a bug, crash, or error"
msgstr ""

#. (itstool) path: object/localizableAttribute
#: binaries/data/mods/public/gui/pregame/mainmenu.xml:518
msgid "WILDFIRE GAMES"
msgstr ""

#. (itstool) path: action/localize
#: binaries/data/mods/public/gui/pregame/mainmenu.xml:530
msgid "Build: %(buildDate)s - %(buildDetails)s"
msgstr ""

at first glance seems less friendly than this:


;; Generated by WFGLocalizationConverter

fauna_bear
specific Bear

fauna_boar
specific Boar

fauna_camel
specific Camel

fauna_chicken
specific Chicken

fauna_deer
specific Deer

fauna_elephant
specific Elephant

fauna_elephant_african_bush
specific African Bush Elephant

fauna_elephant_african_infant
specific African Elephant (Infant)

fauna_elephant_asian
specific Asian Elephant

fauna_elephant_north_african
specific North African Elephant

fauna_fish
specific Tuna Fish
tooltip Collect food from this bountiful oceanic resource.

fauna_fish_tilapia
specific Tilapia Fish
tooltip Collect food from this bountiful riparian resource.

fauna_fish_tuna
specific Tuna Fish
tooltip Collect food from this bountiful oceanic resource.

but I'd be wary of spending time to solve problems other people have already solved over the past decades. It seems tools are out there to work with .po files, sourcing them from files in multiple languages. Writing a tool to parse our own custom translation format is yucky enough; do we want to be doing the same for entity and GUI XML, let alone arbitrary JS scripts and JSON data files?


That seems mostly to deal with entity templates, which are certainly one part of the translation issue, but not all. We'd still have to come up with custom solutions for JS, JSON, and GUI XML? Personally I'm leaning toward the approach taken on http://trac.wildfire...s.com/ticket/67 that uses existing libraries, tools, and data formats. I don't have experience translating software, but if someone who does says that's a common approach, it's a major consideration. If the tools are out there to do what we want, why reinvent the wheel?

Indeed it only considers entity templates for now, but adding GUI localization conversion would also be trivial. C# is perfect for these little conversion tools; it took me around 30 minutes to throw together that program.

I think the first concern of Michael was that most modders are not familiar with translation software and thus, for the sake of moddability, a clear and simple file format would be preferred. Writing a quick C# program that can take care of the conversion will barely take a full day.

I'm drawing on experience working as a developer on proprietary projects, and the K.I.S.S. approach has always worked best. In this case the 'prototype' translation system is extremely simple - it only took me a few hours to throw that parser together. Yes, it's that simple.

We should also not turn our back on the speed and performance benefits we'd get from this. All of the game strings would be stored in a contiguous block of memory, so there is no fear of fragmentation or allocation overhead, and the parsing speed is pretty amazing. Why waste time on complex third-party libraries that might or might not do the trick, if we can get away with ~150 lines of code that do it very efficiently?

but I'd be wary of spending time to solve problems other people have already solved over the past decades. It seems tools are out there to work with .po files, sourcing them from files in multiple languages. Writing a tool to parse our own custom translation format is yucky enough, do we want to be doing the same for entity and GUI XML, let alone arbitrary JS scripts and JSON data files?

You'd be amazed what people can do with Notepad++ alone. I had a friend from China who threw together a 100k ancillary/trait script for a mod; all done by hand. If we keep it simple and easy to edit, we'll definitely have them translated in no time.

Thanks for replying to this discussion; I really do agree all of this needs to be thoroughly discussed, so that the best possible solution is used. :) I'm just advocating two things: 1) translation file simplicity, 2) the huge memory/performance gain from using a dictionary object.


I can only agree with a translation method that allows existing tools. There are really lots of PO translation tools, so if PO can be used, that would be best. PO isn't the only option - Android and iOS also have their own file formats - but PO is certainly the best known.

I've translated software before, and normally translation projects only gain followers when a standard service can be used. Translators don't like checking out an SVN repository or editing a file in a text editor they're not used to.

The number of translation tools is quite big, and they offer many different features. Some are open source and can be self-hosted; others provide automated syncing with a repo.

Unless you can make all those tools yourself, I'd go for PO files.

Also note that translators sometimes want context (it's hard to translate a single word when you don't know where it's used). And strings can contain variables (you can't split them up, as the order of words might be completely different in another language). This can all be done in PO files.
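For illustration, context and variables look like this in a PO file (the strings and source references below are invented for the example, not taken from the game):

```po
#. A msgctxt line disambiguates short or ambiguous words.
#: gui/session/menu.js:120
msgctxt "chess piece"
msgid "Knight"
msgstr "Springer"

#. Variables stay inside the string, so translators can reorder them;
#. plural forms get one msgstr per plural rule of the target language.
#: gui/summary.js:58
msgid "%(count)s unit trained"
msgid_plural "%(count)s units trained"
msgstr[0] "%(count)s Einheit ausgebildet"
msgstr[1] "%(count)s Einheiten ausgebildet"
```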


You'd face multiple problems by implementing your own system.

You are using a keyword-based system, which is bad for the translators. Just look at the examples posted by historic_bruno: you have to translate some arbitrary keyword, which means you first have to look up the English "translation" to get its meaning and only then translate. It's also bad for the developers, as they now have to look in two places: the source code where the keyword is used, and the English translation.

Why not just use the English phrases in the source code and extract them from there (as is the case now anyway)? The developers just have to maintain the source code, and the translators get an English phrase or real sentence to work with. That's what gettext does.

How do you do updates? How do you inform translators that the English translation of a keyword changed? What do you do with old, obsolete or unmaintained translations? How do you inform translators about new keywords? You'd have to write tools for all these situations, and probably more (like context and plurals). Gettext and PO editors have all these features already.

And I just have to second zoot: PO files are text files. If you don't want to use a PO editor with its nice comfort functions, or an online translation system like Transifex, then just use a plain text editor.


Hello.

I just want to chime in and vote for the existing method of translation, namely PO files. Besides saving you the added work of making your own solution, PO files and PO editors are a great way to translate software, with advanced features, and they're easy to use.

I did some translating for The Battle for Wesnoth in the past, and I can say that PO-file translation has many advantages:

1. Source comments: The developers can use comments to inform translators about the context of the string they are translating, and translators can use comments to indirectly communicate with each other about the translation.

2. Translation memory: A PO editor lets you build a "translation memory" from the source language and its translations. The editor memorizes your translations, and if a non-translated string matches its memory, it suggests the relevant translation. This means a great deal to translators, as they can agree on the correct words to use across strings for a more standardized and consistent translation.

3. Efficient updates: When a new language file comes out, it's easy to update the translation. The editor automatically adds the new strings and removes obsolete ones, allowing a more up-to-date translation in tandem with the translation memory.

4. Simplification: I'm not a developer myself, so I can't say much about this, but a lot of errors come from translators having to work with the source; PO files (mostly) eliminate this problem.

In conclusion, PO files will save the developers and translators a lot of work. Do use them, please :)

Regards,

snwl


You'd face multiple problems by implementing your own system.

You are using a keyword-based system, which is bad for the translators. Just look at the examples posted by historic_bruno: you have to translate some arbitrary keyword, which means you first have to look up the English "translation" to get its meaning and only then translate. It's also bad for the developers, as they now have to look in two places: the source code where the keyword is used, and the English translation.

Why not just use the English phrases in the source code and extract them from there (as is the case now anyway)? The developers just have to maintain the source code, and the translators get an English phrase or real sentence to work with. That's what gettext does.

How do you do updates? How do you inform translators that the English translation of a keyword changed? What do you do with old, obsolete or unmaintained translations? How do you inform translators about new keywords? You'd have to write tools for all these situations, and probably more (like context and plurals). Gettext and PO editors have all these features already.

And I just have to second zoot: PO files are text files. If you don't want to use a PO editor with its nice comfort functions, or an online translation system like Transifex, then just use a plain text editor.

Think of the historians, too - the people who actually have to write all the English text of the game. It's a lot harder to change a bunch of visible names if they're spread out across multiple files; having a central file that contains all the text is both logical and efficient. Having the text spread out between files also implies string fragmentation in memory - there is no efficient way of creating an actual dictionary that way; you'll be forced to fragment memory.

AAA games use systems like this because of the performance implications. Performance has long been a bottleneck of 0 A.D., so why do we insist on implementing non-standard (in the context of the gaming industry, not GNU applications) methods? Take a look at any AAA game - Mass Effect, Total War: Shogun 2, Skyrim, Battlefield 3 - they all use a dictionary system (if you don't believe me, check their game files yourself). It works and it's efficient.

Writing a system that generates .po files from these .txt files isn't hard. It can even detect updated strings by cross-referencing the old and new dictionaries. IMHO a tool can be used that generates a .po file for translation and then converts it back to .txt - the game strings themselves should remain in collated .txt files.
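The .txt -> .po direction could be sketched like this (my own minimal illustration of the proposed converter, not an actual tool - it ignores quote escaping and the PO header): the English text becomes the msgid, and the id string plus attribute go into an extracted comment so a companion tool can map the filled-in msgstr back into a {lang}_*.txt dictionary.

```cpp
#include <map>
#include <sstream>
#include <string>

// For each attribute of one dictionary entry, emit a PO entry.
// The "#." comment carries the id string + attribute as the round-trip key.
std::string to_po(const std::string& idstring,
                  const std::map<std::string, std::string>& attrs)
{
    std::ostringstream out;
    for (const auto& [attr, text] : attrs)
    {
        out << "#. " << idstring << ' ' << attr << "\n"; // round-trip key
        out << "msgid \"" << text << "\"\n";             // English text is the key
        out << "msgstr \"\"\n\n";                        // translator fills this in
    }
    return out.str();
}
```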

Hello.

I just want to chime in and vote for the existing method of translation, namely PO files. Besides saving you the added work of making your own solution, PO files and PO editors are a great way to translate software, with advanced features, and they're easy to use.

I did some translating for The Battle for Wesnoth in the past, and I can say that PO-file translation has many advantages:

1. Source comments: The developers can use comments to inform translators about the context of the string they are translating, and translators can use comments to indirectly communicate with each other about the translation.

2. Translation memory: A PO editor lets you build a "translation memory" from the source language and its translations. The editor memorizes your translations, and if a non-translated string matches its memory, it suggests the relevant translation. This means a great deal to translators, as they can agree on the correct words to use across strings for a more standardized and consistent translation.

3. Efficient updates: When a new language file comes out, it's easy to update the translation. The editor automatically adds the new strings and removes obsolete ones, allowing a more up-to-date translation in tandem with the translation memory.

4. Simplification: I'm not a developer myself, so I can't say much about this, but a lot of errors come from translators having to work with the source; PO files (mostly) eliminate this problem.

In conclusion, PO files will save the developers and translators a lot of work. Do use them, please :)

Regards,

snwl

Since using .po files for the translation editors is becoming a very strong argument, I could develop tools that handle the intermediate .txt -> .po and .po -> .txt conversion.

I'll just try to address each of your arguments and see how it ties in with a .txt -> .po / .po -> .txt converter.

1) Source comment: The dictionary text files have comments, so this is not an issue :)

2) Translation memory: Comparing which strings have changed is nothing difficult; it just requires that we keep the previous .po file for comparison. So when a new .po file is generated, it can be compared to the old one.

3) Efficient update: If we stick to the .txt -> .po conversion this shouldn't be an issue.

4) Simplification: In this case the dictionary .txt files are very simple, so it's difficult to make any errors. At any rate, C-style format strings should not be used inside translation strings - that's just bad design.
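The comparison mentioned in point 2 could be sketched like this (my own minimal illustration; real PO tooling does the same job with msgmerge). Given the previous and the freshly generated English dictionaries, it reports which id strings are new and which had their text changed:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

struct DictDiff
{
    std::vector<std::string> added;   // id strings new in the fresh dictionary
    std::vector<std::string> changed; // id strings whose English text changed
};

inline DictDiff diff(const std::unordered_map<std::string, std::string>& oldDict,
                     const std::unordered_map<std::string, std::string>& newDict)
{
    DictDiff d;
    for (const auto& [id, text] : newDict)
    {
        auto it = oldDict.find(id);
        if (it == oldDict.end())
            d.added.push_back(id);        // keyword didn't exist before
        else if (it->second != text)
            d.changed.push_back(id);      // English text was updated
    }
    return d;
}
```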

In any case, the .po format indeed has its upsides, and since so many editors already exist for it, it makes sense to use that format. In my opinion the best middle-ground solution to gain the benefits of both systems would be to:

1) Use dictionary .txt for the game strings

2) Have a tool that can convert .txt to .po

3) Have the same tool also convert .po back to .txt

I'll update the post later on with this new tool.


Think of the historians, too - the people who actually have to write all the English text of the game. It's a lot harder to change a bunch of visible names if they're spread out across multiple files; having a central file that contains all the text is both logical and efficient.

But then again, they have to know the intended meaning of the keyword and where it is shown ingame. Isn't your data laid out badly if the information is too fragmented? Better to merge some data files than to collect the information later.

Having the text spread out between files also implies string fragmentation in memory - there is no efficient way of creating an actual dictionary that way; you'll be forced to fragment memory.

I don't really get what you want with your dictionary (string pool?). It doesn't make sense for English if the English text is in the source or data files. You've already parsed the XML and hold the information in some object - why do another lookup in a dictionary? For other languages, where you have to translate, gettext of course uses a hashmap, so there you have your dictionary.


But then again, they have to know the intended meaning of the keyword and where it is shown ingame. Isn't your data laid out badly if the information is too fragmented? Better to merge some data files than to collect the information later.

I don't really get what you want with your dictionary (string pool?). It doesn't make sense for English if the English text is in the source or data files.

Regarding the 'dictionary' and how it works. You use an id string to define a single translation entry:


;; en-US_units.txt
athen_champion_marine
generic Athenian Marine
specific Épibastēs Athēnaïkós
history .

Now in the entity templates section you use the id string as a reference to the defined translation entry:


<Entity ...>
<Identity>
<TranslationEntry>athen_champion_marine</TranslationEntry>
</Identity>
</Entity>

In order to translate your unit, you'll just generate a new language file 'de_units.txt'. Also notice how I don't have to redefine the 'specific' name, since it's already defined in 'en-US_units.txt':


;; de_units.txt
athen_champion_marine
generic Athener Marine
history .

In this case the information is not fragmented at all. If you add the text into your entity descriptions, that's what creates the fragmentation. You could look at it both ways, but ultimately, using a collated text file for the translations is a neat way to keep track of your game strings without spreading them all over the code.

You've already parsed the XML and hold the information in some object - why do another lookup in a dictionary? For other languages, where you have to translate, gettext of course uses a hashmap, so there you have your dictionary.

If you take a closer look at the code snippet I posted, you'll see that it loads the entire translation file and tokenizes it in place. This means only one allocation and a fast tokenization of the data. The addresses of the tokenized strings are put into TEntry structures, which go into a hashmap. There are no string allocations.

During unit initialization you'll only have to reference the dictionary once (you usually don't change game language during runtime):


Entity* e = Entity::FromFile("athen_champion_marine.xml");
// behind the scenes: e->Descr = unit_dictionary["athen_champion_marine"];

printf("%s", e->Descr->generic); // output: "Athenian Marine"

Regardless of any lookups, there are two gains to this method:

1) All game text is collated into specific text files. If you are new to the project, you can easily edit a name by opening up the text file and hitting Ctrl+F for your unit name. No need to go looking through tons of templates.

2) Due to the simplicity of this method, all the text can be loaded in one go, and you don't need to create a separate std::string object for each game string. You avoid memory fragmentation this way (which is the main point) - string objects are notorious for fragmenting memory.

Edited by RedFox

2) Due to the simplicity of this method, all the text can be loaded in one go, and you don't need to create a separate std::string object for each game string. You avoid memory fragmentation this way (which is the main point) - string objects are notorious for fragmenting memory.

If this is a concern, can't we include the POT file* in the game data, which can then be loaded "in one go" into a map at startup? When a game string is later found in an XML file, the string is translated into an address in the map, which is stored in the appropriate data structure in memory; this address can then be used to look up the string to invoke gettext on. Won't that accomplish the same thing?

If I understand you correctly, it seems to be more of a concern with how gettext is invoked than with gettext itself.

(* The POT file is a collection of all the source strings that have been marked for translation in the source code and data.)

Edited by zoot

If this is a concern, can't we include the POT file* in the game data, which can then be loaded "in one go" into a map at startup;

You could, yes, but POT and PO files carry much more extra data. The idea of a simple text file is to keep the amount of boilerplate text minimal. That way, loading the file in one go makes much more sense, since the text is tightly packed.

when a game string is later found in an XML file, the string is translated into an address in the map, which is stored in the appropriate data structure in memory; this address can then be used to lookup the string to invoke gettext on. Won't that accomplish the same thing?

If I understand you correctly, it seems to be more of a concern with how gettext is invoked than with gettext itself.

(* The POT file is a collection of all the source strings that have been marked for translation in the source code and data.)

Did you notice how complicated what you suggest is? Let's break it down:

1) Gettext loads the .po;

2) Load entity XML -> translation string -> hash;

3) Invoke gettext -> translation string hash -> get translated string;

4) Store (create a copy of) the translated string in a map using the original hash.

It's basically a search-and-replace method using gettext. That goes through sooooo many layers it's not funny. What I proposed was:

1) Load the language .txt;

2) Load entity XML -> get id string -> hash;

3) Invoke dictionary -> return TEntry.

It's very simple and very straightforward. Even the C++ code for it is tiny.
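To substantiate "tiny": a hedged sketch of that dictionary lookup step, with hypothetical names (the real engine structures will differ), looks roughly like this:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <unordered_map>

// Illustrative stand-in for a translation entry.
struct TEntry { const char* generic = nullptr; };

using Dictionary = std::unordered_map<uint64_t, TEntry>;

// FNV-1a hash of the id string pulled from the entity XML.
static uint64_t Fnv1a(const char* s)
{
    uint64_t h = 1469598103934665603ull;
    while (*s) { h ^= (unsigned char)*s++; h *= 1099511628211ull; }
    return h;
}

// Load entity XML -> get id string -> hash; invoke dictionary -> TEntry.
const TEntry* Lookup(const Dictionary& dict, const char* idstring)
{
    auto it = dict.find(Fnv1a(idstring));
    return it == dict.end() ? nullptr : &it->second;
}
```

The entity then just keeps the returned TEntry pointer around; no per-lookup string work happens after load.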

Does it accomplish the same thing?

No, the whole mechanics differ and the amount of 'memory pressure' is several orders of magnitude higher for the gettext version. A good analogy might be texture mipmaps: if you use a mipmapped format (for example .dds), you only have to load the texture from the file and write it to VRAM. If you use a non-mipmapped format, you'll have to generate the mipmaps on the fly, which is several times slower, depending on the actual texture size. This is a very noticeable performance difference in game development.

The same could apply here. Using the dictionary txt format, you'll only have to load the file once and you'll have your translated strings. The gettext method inherently requires creation of temporary strings and has an additional layer of 'search & replace' complexity.


You could, yes, but POT and PO files have a much more extra data. The idea of a simple text file is to have minimal amounts of garbage text. This way, loading the file in one go makes much more sense since the text is tightly packed.

I wasn't referring so much to the format of the file as to its contents. POs are normally converted to the binary MO format anyway, which is more tightly packed than the format you are suggesting.

Did you notice how complicated what you suggest is?

Did you notice how much functionality you are throwing out by using the allegedly simpler custom format? I'm just trying to point out that the options seem to be either something on the level of complexity I suggested, the IMO strongly neutered format you suggested, or standard PO. I would prefer the latter, but it's always a good idea to at least take note of the alternatives.


If you took a closer look at the code snippet I posted, you can see that it loads the entire translation file and tokenizes it in-place. This means only 1 allocation and a fast tokenization of the data. The addresses of those tokenized strings are put into TEntry structures, which are put into a hashmap. There are no string allocations.

You forget about the id strings/keywords, which you have to load from the XML files. So basically the number of allocations is only 1/3, because you merged generic, specific and history into one keyword.

The only real difference with your approach is you do this string pooling for the english messages too.

The lookup and the message catalog are highly optimized in GNU gettext. It hashes the string and looks it up in the message catalog, which is by default an MO file[1]. It also caches results[2]. I wouldn't be surprised if the whole catalog is memory-mapped, but I'm not sure about it (looking at strace, it sure looks like it).

Btw, tinygettext which is used in the proposed patch[3], uses the PO files directly, no MO files. Not sure how the lookup is done there (probably building a hashmap on load).

[1] http://www.gnu.org/s...t.html#MO-Files

[2] http://www.gnu.org/s...timized-gettext

[3] http://trac.wildfire...s.com/ticket/67


I wasn't referring so much to the format of the file as to its contents. POs are normally converted to the binary MO format anyway, which is more tightly packed than the format you are suggesting.

You are right about that. A binary format would win in any case.

Did you notice how much functionality you are throwing out by using the allegedly simpler custom format? I'm just trying to point out that the options seem to be either something on the level of complexity I suggested, the IMO strongly neutered format you suggested, or standard PO. I would prefer the latter, but it's always a good idea to at least take note of the alternatives.

I agree that it's neutered; the whole point was to have simple-to-edit text files so that anyone could jump in and edit whatever they wanted without having to use anything more complex than perhaps Notepad.

You forget about the id strings/keywords, which you have to load from the XML files. So basically the number of allocations is only 1/3, because you merged generic, specific and history into one keyword.

The number of string allocations is only one: the buffer for the text file. An entity won't reference any hashes or id strings once it has loaded a translation entry pointer.

The only real difference with your approach is you do this string pooling for the english messages too.

Yes, which is the point of it.

The lookup and the message catalog are highly optimized in gnu gettext. It hashes the string and looks it up in the message catalog which is by default a MO file[1]. It also caches results[2]. I wouldn't be surprised if the whole catalog is memory-mapped, but i'm not sure about it (looking at strace it sure looks like it).

Now that's something different. Having a binary file would be the fastest way to do this, but it would also mean you can't jump in and edit the game texts as you go. The goal is to be able to translate the strings with or without a third party tool (I'll get back to you on the 'with' part).

-------------------

I've been working on a small command line tool that allows .TXT to .POT conversion and .PO to .TXT conversion. It's written in C# 2.0, so you'll need Mono 2.0 or .NET 2.0 to run it.


Usage:
-? --help Shows this usage text
-o --overwrite Forces overwrite of an existing file
-p --pot <src> <dst> Converts TXT to POT
-t --txt <src> <dst> Converts PO to TXT

Converting TXT to POT:
wfgpotext -o -p en-US_units.txt wfg_units.pot

Converting PO to TXT:
wfgpotext -o -t de_units.po de_units.txt

Given a base input translation file: en-US_test.txt:


;; Generated by WFGLocalizationConverter

army_mace_hero_alexander
generic Army of Alexander the Great.
specific Army of Alexander the Great
tooltip This is what an army would look like on the Strat Map.
history The most powerful hero of them all - son of Philip II.

We run the .TXT -> .POT command:


wfgpotext -o -p data/en-US_test.txt wfg_test.pot

This will generate a .pot template file. You can use this file to keep track of string changes with your favorite .PO editing software.


# POT Generated by wfgpotext
msgid ""
msgstr ""
"Project-Id-Version: \n"
"POT-Creation-Date: \n"
"PO-Revision-Date: \n"
"Last-Translator: \n"
"Language-Team: \n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"


#
#: data/en-US_test.txt:4
msgctxt "army_mace_hero_alexander_generic"
msgid "Army of Alexander the Great."
msgstr ""

#
#: data/en-US_test.txt:5
msgctxt "army_mace_hero_alexander_specific"
msgid "Army of Alexander the Great"
msgstr ""

#
#: data/en-US_test.txt:6
msgctxt "army_mace_hero_alexander_tooltip"
msgid "This is what an army would look like on the Strat Map."
msgstr ""

#
#: data/en-US_test.txt:7
msgctxt "army_mace_hero_alexander_history"
msgid "The most powerful hero of them all - son of Philip II."
msgstr ""

An example translation with Poedit: imported wfg_test.pot, translated and saved as ee_test.po:

wfg_poedit1.png

ee_test.po:


# POT Generated by wfgpotext
msgid ""
msgstr ""
"Project-Id-Version: \n"
"POT-Creation-Date: \n"
"PO-Revision-Date: \n"
"Last-Translator: \n"
"Language-Team: \n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"X-Generator: Poedit 1.5.5\n"

#
#: data/en-US_test.txt:4
msgctxt "army_mace_hero_alexander_generic"
msgid "Army of Alexander the Great."
msgstr "Aleksander Suure armee"

#
#: data/en-US_test.txt:5
msgctxt "army_mace_hero_alexander_specific"
msgid "Army of Alexander the Great"
msgstr "Aleksander Suure armee"

#
#: data/en-US_test.txt:6
msgctxt "army_mace_hero_alexander_tooltip"
msgid "This is what an army would look like on the Strat Map."
msgstr "Armee näeb Strat Kaardil selline välja."

#
#: data/en-US_test.txt:7
msgctxt "army_mace_hero_alexander_history"
msgid "The most powerful hero of them all - son of Philip II."
msgstr "Kõigist võimsam kangelane - Philippos II poeg."

Now to get the translation back to .txt, I'll have to call the conversion tool again:


wfgpotext -o -t data/ee_test.po ee_test.txt

And the resulting translation text file:


; Translation file generated from ee_test.po

army_mace_hero_alexander
generic Aleksander Suure armee
specific Aleksander Suure armee
tooltip Armee näeb Strat Kaardil selline välja.
history Kõigist võimsam kangelane - Philippos II poeg.

You can test out the tool and check the generated files from the attached archive:

wfgpotext.zip

Edited by RedFox

Now that's something different. Having a binary file would be the fastest way to do this, but it would also mean you can't jump in and edit the game texts as you go. The goal is to be able to translate the strings with or without a third party tool (I'll get back to you on the 'with' part).

If we use cached, on-the-fly conversion from text to binary, it would still be moddable. (Regardless of which 'catalog' format is selected.) Though I suppose that would not be compatible with tinygettext.


If we use cached, on-the-fly conversion from text to binary, it would still be moddable. (Regardless of which 'catalog' format is selected.) Though I suppose that would not be compatible with tinygettext.

But it would work perfectly with the text dictionary system, no?

I think we should provide the tools to integrate .po (wfgpotext.exe), but the game itself shouldn't rely on it. We don't need yet another third party library that somewhat does what we need if we squeeze hard enough. In the end, the amount of code for the text dictionary is still 150 lines, which is maintainable enough.

Most people won't end up using fancy translation software, but in case they want to, we'll have the PO template for them and the tool that converts it back to .txt.


But it would work perfectly with the text dictionary system, no?

Not really. As you said, it would still be slower than binary.

Most people won't end up using fancy translation software

If we make it excessively hard, obviously they won't. The end result will just be undermaintained and unfinished translations.


Not really. As you said, it would still be slower than binary.

Ahh, not that. I mean on-the-fly caching would be very easy to do with the text system.
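As a rough illustration of what that on-the-fly caching could look like (using std::filesystem here purely for brevity; the engine's own VFS would do this differently, and the function name is hypothetical), the staleness check is just a timestamp comparison:

```cpp
#include <cassert>     // for the demo below
#include <chrono>      // for the demo below
#include <filesystem>
#include <fstream>     // for the demo below

namespace fs = std::filesystem;

// Rebuild the binary cache whenever no cached counterpart exists,
// or the source .txt has been modified since the cache was written.
bool CacheIsStale(const fs::path& txt, const fs::path& cache)
{
    return !fs::exists(cache) ||
           fs::last_write_time(cache) < fs::last_write_time(txt);
}
```

On load, the game would check something like CacheIsStale("en-US_units.txt", "en-US_units.bin") and reparse plus rewrite the cache only when it returns true, keeping the text files authoritative and moddable.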

If we make it excessively hard, obviously they won't. The end result will just be undermaintained and unfinished translations.

You are right about that. We can however provide the necessary .PO files for anyone who wants to translate the game text, right? That would give us the benefit of keeping the game strings easy to mod and easy to translate.

In any case the game strings should be collated to text files like en-US_units.txt, since in the end it makes modding the visual name of an entity very easy.


Ahh, not that. I mean on-the-fly caching would be very easy to do with the text system.

What data would you cache?
You are right about that. We can however provide the necessary .PO files for anyone who wants to translate the game text, right? That would give us the benefit of keeping the game strings easy to mod and easy to translate.

I can't readily imagine what the process would be. Copying the POs back and forth manually? As long as I don't have to do it, I guess I can't stop you :)


What data would you cache?

A cached text collection can just be a binary array. The first 4 bytes would be the 'length' of the array, followed by the data. We don't actually need to store the string lengths, but we could do it to save the CPU some processing time. Also, if we store entries with an 'idhash' we somewhat lose the point of having a visible 'id string', but we should store it anyway, because the debugging info might be useful to us later.

First it would require a redesign of the TEntry structure:


// Describes a TEntry shallow UTF-8 string reference
struct RefStr { unsigned short len; const char* str; };

// UTF-8 Translation Entry
struct TEntry
{
RefStr idstring; // 'id string' of this entry
RefStr generic; // 'generic' attribute string
RefStr specific; // 'specific' attribute string
RefStr tooltip; // 'tooltip' attribute string
RefStr history; // 'history' attribute string
};

Now we need a binary version of TEntry that is more compact and also contains the 'id hash'. We'll store the strings using the same layout as a TEntry, to make them compatible (this will be useful later...). This also means that the pointers in the RefStr structures will be garbage and need to be recalculated when a BTEntry is being loaded:


// Binary UTF-8 Translation Entry
struct BTEntry
{
size_t idhash; // 'id hash' of this entry
RefStr idstring; // .str = data
RefStr generic; // .str = idstring.str + idstring.len + 1
RefStr specific; // .str = generic.str + generic.len + 1
RefStr tooltip; // .str = specific.str + specific.len + 1
RefStr history; // .str = tooltip.str + tooltip.len + 1
char data[];
};

We can then load all of the binary entries in one swoop. This implementation is a lot more efficient, since the storage for the TEntry structures is in the loaded binary file itself. If you want to make this even more efficient, you can insert a binary search table into the header of the file. As for having to recalculate the strings: it would probably be better if the string pointers were saved as offsets from data instead...


#include <cstdio>        // FILE, fseek, ftell, fread
#include <cstdlib>       // malloc
#include <unordered_map>

void LoadEntries(FILE* f)
{
	// put it here for the sake of this example:
	std::unordered_map<size_t, TEntry*> entries;

	fseek(f, 0, SEEK_END); size_t fsize = ftell(f); fseek(f, 0, SEEK_SET);
	size_t* mem = (size_t*)malloc(fsize);
	fread(mem, fsize, 1, f);
	// note: 'mem' must outlive 'entries' - every TEntry points into it

	size_t numEntries = mem[0];
	BTEntry* bte = (BTEntry*)(mem + 1);
	for(size_t i = 0; i < numEntries; i++)
	{
		size_t offset = 0;
		auto fixptr = [&](RefStr& rs) {
			if(rs.len) {
				rs.str = bte->data + offset; // update the string pointer
				offset += rs.len + 1;        // advance past the '\0'
			} else rs.str = nullptr;         // we don't have this string
		};
		fixptr(bte->idstring);
		fixptr(bte->generic);
		fixptr(bte->specific);
		fixptr(bte->tooltip);
		fixptr(bte->history);

		// convert BTEntry to a TEntry pointer by skipping the hash
		entries[bte->idhash] = (TEntry*)&bte->idstring;
		// next chunk
		bte = (BTEntry*)(bte->data + offset);
	}
}
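For completeness, the matching writer is equally small. This is a hedged sketch rather than tested engine code: SrcEntry and SaveEntries are hypothetical names, the pointer half of each RefStr is written as zeros (the loader recomputes it), and it assumes writer and loader are built with the same compiler/ABI so struct padding matches.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Same shallow string reference layout as the loader side.
struct RefStr { unsigned short len; const char* str; };

// Source-side entry, before packing (illustrative only).
struct SrcEntry {
    size_t idhash;
    std::string idstring, generic, specific, tooltip, history;
};

// Pack entries into the [numEntries][BTEntry...] blob the loader expects.
std::vector<char> SaveEntries(const std::vector<SrcEntry>& entries)
{
    std::vector<char> out;
    auto put = [&](const void* p, size_t n) {
        out.insert(out.end(), (const char*)p, (const char*)p + n);
    };
    size_t count = entries.size();
    put(&count, sizeof count);
    for (const SrcEntry& e : entries) {
        const std::string* s[5] =
            { &e.idstring, &e.generic, &e.specific, &e.tooltip, &e.history };
        put(&e.idhash, sizeof e.idhash);
        for (int i = 0; i < 5; i++) {
            RefStr r{};                           // zero-init, padding included
            r.len = (unsigned short)s[i]->size(); // .str stays null; fixed on load
            put(&r, sizeof r);
        }
        for (int i = 0; i < 5; i++)               // packed strings follow the header
            if (!s[i]->empty())
                put(s[i]->c_str(), s[i]->size() + 1); // keep the '\0'
    }
    return out;
}
```

Saving the whole dictionary is then a single fwrite of the returned buffer.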

I can't readily imagine what the process would be. Copying the POs back and forth manually? As long as I don't have to do it, I guess I can't stop you :)

Wouldn't really have to copy anything. A single batch script can run wfgpotext to generate a new POT template for anyone who wants to do translation. Those PO files can be easily converted back to .TXT. Heck, I should just edit wfgpotext to run the whole life-cycle, updating PO templates and text files in a single go.

It would need to:

1) Generate PO Templates from TXT files:

en-US_{0}.txt => wfg_{0}.pot

2) PO translations to TXT:

{0}_{1}.po => {0}_{1}.txt

3) Make sure en-US_{0} is never overwritten, since they're our baseline. That would be sad indeed.

Edited by RedFox

How would the translated .po or .txt be committed to SVN, if not by copying it from the translator? Would they send their entire harddrive by carrier pigeon? :D

Haha :D Yeah. I'm afraid you can't avoid having to copy some files if you're working on a project... ;)


I think the first concern of Michael was that most modders are not familiar with translation software and thus, for the sake of moddability, a clear and simple file would be preferred.

I don't see what translation necessarily has to do with modders, unless they also want to be translators, but those are two separate groups with different needs. The way I see it, the only people who should care about translations are:

1) Non-English players who need them

2) People who actually translate the game

3) Programmers who implement the translation system

Frankly, group (1) won't care in the least how it's implemented, as long as it works. Group (2) will care, and we have several of them saying we should use a standard system, and they are providing links to communities and tools built around such systems. Group (3) will do whatever needs to be done, so I think the focus shouldn't be on what's simplest to program today, but what makes translation work as well as possible now and in the future.

If I'm understanding the concept correctly, we will in either case have a tool that extracts translatable strings from the game data (XML, JS, whatever) and creates .po or .txt files. So the average modder need not know anything about translations: they will write their mod following the conventions of 0 A.D. w.r.t. translation support, and then whoever from group (2) does the translating will use the same process as those translating the game proper. As far as I'm concerned, the average modder shouldn't care or need to care much about the process of translation (unless they are also a translator, in which case they approach it from that p.o.v. rather than from modding).

I'm drawing on experience working as a developer on proprietary projects, and the K.I.S.S. approach has always worked best. In this case the 'prototype' translation system is extremely simple: it only took me a few hours to throw that parser together. Yes, it's that simple.

We're not a proprietary project. And on the contrary, my experience with 0 A.D. is that the clever solutions programmers came up with over the years, to avoid using existing 3rd party libraries and obvious solutions, tend to become very complicated over time and are now the most fragile parts of the engine, because nobody in a few hours of work can foresee all the issues of a complex system that others have encountered over potentially decades of development, and they're no longer around to maintain that fragile code.

We should also not turn our backs on the speed and performance benefits we'll get from this. All of the game strings would be stored in a contiguous block of memory. There is no fear of fragmentation or allocation overhead, and the parsing speed is pretty amazing. Why waste time on complex third-party libraries that might or might not do the trick we need, if we can get away with 150 lines of code that do it very efficiently?

From a simplicity perspective, if you look at the patch on http://trac.wildfiregames.com/ticket/67, that solution doesn't look particularly more complicated than yours: most of the patch is the source for the tinygettext lib, so we're using other people's work there instead of our own, while the changes to the engine itself for GUI XML support are trivial. Can you explain why that approach would be noticeably less efficient than yours? As I understand it, it's only a matter of loading the .po's into a dictionary instead of .txts. We can benefit from compiling the .po's into binary .mo's for faster loading, if necessary (in fact my suggestion would be that we integrate that into the archive builder and preconvert them for releases, like we already do with XMLs, DAEs, and PNGs).

(Of course we can't have a fair comparison until both approaches are working with the same data, GUI pages will be messier to translate in any case than entity templates, and we don't yet know how the approach of #67 will extend to entity templates)

Can you elaborate on our needs that gettext doesn't meet but that your solution does? Nothing hand wavy like "it's easier for modding" without evidence that modders should be constrained by our choice of translation system. So far the argument is sounding like we need to justify using the proposed arbitrary .txt format and get around its shortcomings, and create another tool to convert between .po and .txt to work with everything that's already compatible with gettext, which is a very strange angle to approach this from.

Think of the historians, too. The people who actually have to write all the English text of the game - it's a lot harder to change a bunch of visible names if they're spread out across multiple files. In this case having a central file that contains all the text is both logical and efficient.

...which historians are we speaking for? If you're saying non-programmers don't want to edit XML files, I fully agree, that's why we should have a GUI-based entity editor for Atlas. That is certainly worth putting effort into and part of having an easily moddable game. Currently an entity editor really only needs to be an XML parser with a GUI generated from the schema, anyone could write that and someone should, it becomes more complicated if the actual text data of the entity is replaced by a key into a proprietary .txt file that needs a custom parser and dictionary. For the historians, finding the correct entity to change would be no more difficult than searching by plain English keywords, kinda the way the object panel of Atlas works now.

But there's always the simplest non-technical solution: we could have the historians create a Trac ticket with their suggested changes, which a developer commits in the appropriate file. Or they could ask which file to modify and how to do so, I think if they did this once, it wouldn't need explaining twice, it's not exactly rocket science to figure out the entity template naming conventions and as you say, Notepad++ is more than up to the task :) (if they make a mistake, that's why we have SVN/git)

Performance has long been a bottleneck of 0 A.D., so why do we insist on implementing non-standard (in context of the gaming industry, not gnu applications) methods?

Any evidence that text loading and storage is a performance bottleneck in 0 A.D.? Honestly, I would be delighted if that were the case; it would mean we squashed all the other, more serious ones. What about evidence that using a gettext-based solution will be noticeably slower than yours? (As I've said before in discussion of the GUI system, I get 800+ fps on the main menu, so there's clearly some wiggle room there, and I haven't seen any evidence that game lag is caused by text display/rendering/storage.)

