As developers, we’ve all been there: stuck with a project that needs translation support for multiple languages, aiming for roughly 80-90% accuracy with minimal manual intervention. Many systems currently use i18n only for language selection, but to take translation quality to the next level, we need to provide context for each UI string used in the app.
One approach is to record each UI string along with the surrounding code snippet where it occurs. These records can then be stored in a vector database and used to build a Retrieval-Augmented Generation (RAG) pipeline that generates a context description for each UI string. The descriptions can be supplied during translation to improve accuracy, since some words have multiple meanings and are easily mistranslated without context.
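The first step of that approach, collecting each UI string together with its surrounding code, could be sketched roughly as follows. This is only a minimal illustration: it assumes UI strings are wrapped in a hypothetical `t("...")` call, and the names `UIStringRecord` and `extract_ui_strings` are made up for this example. Embedding the snippets and loading them into a vector database is left out.

```python
import re
from dataclasses import dataclass

# Assumption: UI strings appear as t("...") or t('...') calls in the source.
# Adjust the pattern to whatever i18n helper the project actually uses.
T_CALL = re.compile(r"""\bt\(\s*(["'])(?P<text>.*?)\1\s*\)""")


@dataclass
class UIStringRecord:
    text: str      # the UI string itself
    line_no: int   # 1-based line where it occurs
    snippet: str   # surrounding lines of code, used later as context


def extract_ui_strings(source: str, radius: int = 2) -> list[UIStringRecord]:
    """Return every t(...) string with `radius` lines of code around it."""
    lines = source.splitlines()
    records = []
    for i, line in enumerate(lines):
        for match in T_CALL.finditer(line):
            start = max(0, i - radius)
            end = min(len(lines), i + radius + 1)
            records.append(
                UIStringRecord(
                    text=match.group("text"),
                    line_no=i + 1,
                    snippet="\n".join(lines[start:end]),
                )
            )
    return records


# Hypothetical component source, used only to exercise the function.
SAMPLE = '''function ScalePicker() {
  // Dropdown listing musical scales
  const label = t("Romanian minor");
  return <Option value="romanian_minor">{label}</Option>;
}'''

records = extract_ui_strings(SAMPLE, radius=1)
```

Here the snippet kept for the string carries the comment about musical scales, which is the kind of signal a context generator would need to tell the two senses of "minor" apart.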
I’ve been experimenting with LibreTranslate, but I’m not getting the desired results. For instance, when I provide a sentence in the format ‘“{UI String}” means {Context}’, it doesn’t seem to use the context correctly. Take the example of ‘Romanian minor’: without proper context, ‘minor’ is treated as a minor in the sense of age rather than the musical scale.
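For reference, the experiment described above might look something like the sketch below. It assumes a self-hosted LibreTranslate instance at `http://localhost:5000` (an assumption, not something stated here) and uses the service's `/translate` endpoint with the `q`, `source`, `target`, and `format` fields. The helper names are hypothetical, and the target language `de` is only an example.

```python
import json
import urllib.request


def build_context_query(ui_string: str, context: str) -> str:
    """Format the input as '"{UI String}" means {Context}'."""
    return f'"{ui_string}" means {context}'


def build_translate_request(
    text: str,
    source: str,
    target: str,
    base_url: str = "http://localhost:5000",  # assumed local instance
) -> urllib.request.Request:
    """Prepare (but do not send) a POST request to the /translate endpoint."""
    payload = {"q": text, "source": source, "target": target, "format": "text"}
    return urllib.request.Request(
        f"{base_url}/translate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def translate(request: urllib.request.Request) -> str:
    """Send the request and return the translated text from the response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))["translatedText"]


query = build_context_query("Romanian minor", "a musical scale")
request = build_translate_request(query, source="en", target="de")
# translate(request) would send it; this needs a running LibreTranslate server.
```

One likely reason this format falls short is that a machine translation engine translates the whole input as ordinary text, so the explanation is translated along with the string instead of being treated as an instruction, and the UI string still has to be separated out of the result afterwards.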
The question is, how can we improve the translation accuracy of our systems? Is it by fine-tuning our models, or is there a better approach to providing context for our UI strings?