Have you ever tried to extract translation pairs from unstructured web pages in low-resource languages? It’s a daunting task, especially with inconsistent formatting and unreliable delimiters. My team and I have been struggling with this while collecting parallel data for English and Itsekiri, a low-resource language. We’ve tried manual inspection and hand-written regular-expression rules, but the accuracy is far from satisfactory.
I’ve been searching for a better approach: a technique that can read the text and separate the segments in one language from those in the other. I came across papers like Segment-any-Text, but they focus on splitting text into units like sentences and paragraphs, not on separating segments by language.
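For reference, here is a minimal check of that limitation. This is a sketch assuming the wtpsplit package (the implementation accompanying the Segment-any-Text paper) and one of its small models, "sat-3l-sm":

```python
# Sketch: sentence segmentation with Segment-any-Text (SaT) via wtpsplit.
# Assumes `pip install wtpsplit`; "sat-3l-sm" is one of its small models.
from wtpsplit import SaT

sat = SaT("sat-3l-sm")
mixed = ("Aujourd’hui, nous allons parler des citrons et des limes. "
         "Today, we will talk about lemons and limes.")
print(sat.split(mixed))
# Returns sentence boundaries only: the French and English sentences
# come back in one unlabeled list, so separating them by language is
# still left to us.
```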
The ideal solution would take a mixed-language input text, like the example below, and output the aligned segments in each language:
Input: Aujourd’hui, nous allons parler des citrons et des limes. Today, we will talk about lemons and limes. Les citrons et les limes sont tous les deux acides. Both lemons and limes are sour.
Output:
Lang_1 | Lang_2
Aujourd’hui, nous allons parler des citrons et des limes. | Today, we will talk about lemons and limes.
Les citrons et les limes sont tous les deux acides. | Both lemons and limes are sour.
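To make the goal concrete, here is a rough sketch of the kind of lightweight pipeline I have in mind, not a finished solution: a naive punctuation-based sentence splitter plus per-sentence language ID. I use the langdetect package here purely for illustration; it handles the French/English toy example but does not know Itsekiri, so on our corpus the detect() call would have to be replaced with a custom classifier (e.g. character n-gram scores learned from small seed texts in each language):

```python
# Rough sketch of the desired pipeline (assumptions noted above).
# Requires `pip install langdetect`.
import re
from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0  # langdetect is randomized; fix the seed

def split_sentences(text):
    # Naive splitter: break after ., !, or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def pair_segments(text, lang_1="fr", lang_2="en"):
    # 1. Tag each sentence with a language code and merge consecutive
    #    same-language sentences into runs.
    runs = []
    for sent in split_sentences(text):
        lang = detect(sent)
        if runs and runs[-1][0] == lang:
            runs[-1] = (lang, runs[-1][1] + " " + sent)
        else:
            runs.append((lang, sent))
    # 2. Pair each lang_1 run with the lang_2 run that follows it.
    return [(a, b) for (la, a), (lb, b) in zip(runs, runs[1:])
            if la == lang_1 and lb == lang_2]

text = ("Aujourd’hui, nous allons parler des citrons et des limes. "
        "Today, we will talk about lemons and limes. "
        "Les citrons et les limes sont tous les deux acides. "
        "Both lemons and limes are sour.")
for fr, en in pair_segments(text):
    print(fr, "|", en)
```

On the toy input this prints the two rows of the table above; the open question is what to put in place of detect() so the same pattern holds up for English–Itsekiri.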
I’d prefer an approach that’s language-agnostic and can handle our entire corpus without being computationally intensive. Has anyone found a solution to this problem?