February 27, 2008
It's a cliche to say that open source breaks down barriers, but every day I learn about a new way that's happening. Here's one barrier that open source can help to bring down, incrementally: the language barrier.
I have a long and ongoing love for and fascination with Japan -- the culture, the language, the whole shebang. Back about a decade and a half ago I decided I was going to teach myself the language. Since I didn't have money for classes, I homebrewed my own self-teaching method. I went out and bought a grammar guide, and then two copies of a given book -- one in Japanese, the other an English translation -- and sat with them side-by-side, comparing the two on a sentence-by-sentence and phrase-by-phrase level. It worked, up to a point, and while I'm no native speaker I can certainly figure out a fair amount of what's put in front of me as long as I have a dictionary.
I didn't know it at the time, but this parallel-texts technique is also one of the best ways to teach a computer to translate between languages.
Here's how this works. You take two languages you want to translate between -- English and Japanese, for instance -- and you obtain a body of text that exists in both languages. The bigger the parallel corpus, as it's called, the better: a corpus of a billion or more words tends to be a good place to start, but more is always better. You then perform a statistical analysis on the corpus and draw as many parallels as you can between the two texts, on a sentence-by-sentence and phrase-by-phrase level. Those parallels can then be used to translate in either direction between the two languages. I'm leaving out a great deal -- like how to compensate for languages with varying word orders or conflicting agglutinative structures -- but that's the basic idea.
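To make the "draw parallels" step a little more concrete, here's a minimal sketch in Python. It is not how any particular product does it; it's a stripped-down, word-level version of a classic textbook approach (expectation-maximization in the style of IBM Model 1), and the three-sentence romanized Japanese/English corpus, along with the function names, are made up purely for illustration. A real system would work with phrases, word order, and vastly more data.

```python
from collections import defaultdict


def train_word_translation_table(sentence_pairs, iterations=10):
    """Estimate P(target word | source word) with a few rounds of EM."""
    target_vocab = {word for _, target in sentence_pairs for word in target}
    uniform = 1.0 / len(target_vocab)
    # Start with no preference: every target word is equally likely.
    table = defaultdict(lambda: uniform)  # keyed by (target_word, source_word)

    for _ in range(iterations):
        pair_counts = defaultdict(float)
        source_totals = defaultdict(float)
        for source, target in sentence_pairs:
            for target_word in target:
                # Share one unit of credit for this target word among the
                # source words in the same sentence, weighted by current belief.
                norm = sum(table[(target_word, s)] for s in source)
                for source_word in source:
                    share = table[(target_word, source_word)] / norm
                    pair_counts[(target_word, source_word)] += share
                    source_totals[source_word] += share
        # Re-estimate the probabilities from the expected counts.
        table = {
            pair: count / source_totals[pair[1]]
            for pair, count in pair_counts.items()
        }
    return table


def best_translation(table, source_word):
    """Return the target word with the highest estimated probability."""
    candidates = {t: p for (t, s), p in table.items() if s == source_word}
    return max(candidates, key=candidates.get)


# Hypothetical toy corpus: (source sentence, target sentence), pre-tokenized.
toy_corpus = [
    (["akai", "hon"], ["red", "book"]),
    (["akai", "kuruma"], ["red", "car"]),
    (["furui", "hon"], ["old", "book"]),
]

table = train_word_translation_table(toy_corpus)
for word in ["akai", "hon", "kuruma", "furui"]:
    print(word, "->", best_translation(table, word))
```

The point of the sketch is that nobody tells the program which word means what. It only sees which words keep turning up together across sentence pairs, and each pass shifts more credit toward the pairings that explain the data best. That's also why the size and cleanliness of the corpus matter so much.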
I got curious as to whether or not there were any open source projects that dealt with machine translation -- not just the software, but also creating and maintaining clean and dependable parallel texts. That's actually the hardest and most labor-intensive part: supplying the translation algorithms with valid data. And as it turns out, there are indeed several such projects already in progress. On the software side, there are programs like Moses (licensed under the LGPL); on the corpus side, there are projects like the OpenOffice corpus, which uses the translated documentation from said program as a parallel text.
There are a couple of ways a company could make a commercial implementation of these tools, keep them open source, and still make money. The most obvious approach is to give away the software with a rudimentary corpus of maybe a few hundred thousand words. For a fee, the company would provide you with a much larger corpus -- or for a yearly subscription fee you could get a corpus that is kept up-to-date with recent texts (such as news or periodicals). (Language Weaver, which produces a closed-source machine translation system, has a vaguely similar yearly subscription / maintenance model.)
The real value in such a system, then, isn't the algorithms themselves -- it's the data you use to power them. Such a scheme also makes it possible for people to collaborate openly on creating their own corpus for specialized applications if one doesn't exist already -- although, as with any collaborative knowledge project (Wikipedia being the biggest example), genuine expertise is something you may want to actually pay for.
I should point out that I don't think machine translation can work 100% of the time -- it'll always need human guidance and training to be useful. But as a labor-saving system and a way to knock down barriers between people, we're only just scratching the surface of what's possible -- and for me the best way to develop such things is right out in the open, where they belong.