Commentary

Serdar Yegulalp
 

Talk To Me, Openly

It's a cliche to say that open source breaks down barriers, but every day I learn about a new way that's happening. Here's one barrier that open source can help to bring down, incrementally: the language barrier.

I have a long and ongoing love for and fascination with Japan -- the culture, the language, the whole shebang.  About a decade and a half ago I decided I was going to teach myself the language.  Since I didn't have money for classes, I homebrewed my own self-teaching method.  I went out and bought a grammar guide, and then two copies of a given book -- one in Japanese, the other an English translation -- and sat with them side-by-side, comparing the two on a sentence-by-sentence and phrase-by-phrase level.  It worked, up to a point, and while I'm no native speaker I can certainly figure out a fair amount of what's put in front of me as long as I have a dictionary.

I didn't know it at the time, but this parallel-texts technique is also one of the best ways to teach a computer to translate between languages.

Here's how this works.  You take two languages you want to translate between -- English and Japanese, for instance -- and you obtain a given text that has been translated into both languages.  The bigger the parallel corpus, as it's called, the better: a corpus of a billion or more words tends to be a good place to start, but more is always better.  You then perform a statistical analysis on each corpus and draw as many parallels as you can between the two texts, on a sentence-by-sentence and phrase-by-phrase level.  The parallels drawn between the two texts can then be used to translate between both languages.  I'm leaving out a great deal -- like how to compensate for languages with varying word orders or conflicting agglutinative structures -- but that's the basic idea.
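To make the "statistical analysis" step a little more concrete, here is a minimal sketch in Python of one classic way to draw word-level parallels from sentence pairs: an expectation-maximization loop in the style of IBM Model 1.  This is only an illustration under stated assumptions, not a description of how any particular product works.  The three-sentence corpus (English paired with romanized Japanese) and the names train_word_table and best_match are made up for the example, and a real system would need vastly more data plus the handling of word order that I'm glossing over.

```python
from collections import defaultdict

def train_word_table(pairs, iterations=20):
    """Estimate t(f | e): how likely target word f is a translation of source word e."""
    # Start with an equal guess for every word pair that shares a sentence pair.
    t = {(f, e): 1.0 for src, tgt in pairs for e in src for f in tgt}
    for _ in range(iterations):
        count = defaultdict(float)   # expected number of times e translates to f
        total = defaultdict(float)   # expected number of times e translates to anything
        for src, tgt in pairs:
            for f in tgt:
                # Split the credit for f among the source words in the same sentence.
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    share = t[(f, e)] / norm
                    count[(f, e)] += share
                    total[e] += share
        # Re-estimate the probabilities from the expected counts.
        t = {(f, e): c / total[e] for (f, e), c in count.items()}
    return t

def best_match(t, word):
    """Return the target word with the highest estimated probability for `word`."""
    candidates = {f: p for (f, e), p in t.items() if e == word}
    return max(candidates, key=candidates.get)

# Hypothetical toy parallel corpus: (English tokens, romanized Japanese tokens).
corpus = [
    (["red", "book"], ["akai", "hon"]),
    (["red", "car"], ["akai", "kuruma"]),
    (["blue", "car"], ["aoi", "kuruma"]),
]

table = train_word_table(corpus)
for english in ["red", "book", "car", "blue"]:
    print(english, "->", best_match(table, english))
```

Nothing in the code knows any English or Japanese.  It only notices which words keep turning up together across sentence pairs, which is the same trick I was using with my two copies of the same book.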

I got curious as to whether there were any open source projects that dealt with machine translation -- not just the software, but also creating and maintaining clean and dependable parallel texts.  That's actually the hardest and most labor-intensive part: supplying the translation algorithms with valid data.  And as it turns out, there are indeed several such projects already in progress.  On the software side, there are programs like Moses (licensed under the LGPL); on the corpus side, there are projects like the OpenOffice corpus, which uses the translated documentation from said program as a parallel text.

There are a couple of ways a company could make a commercial implementation of these tools, keep them open source, and still make money.  The most obvious approach is to give away the software with a rudimentary corpus of maybe a few hundred thousand words.  For a fee, the company would provide you with a much larger corpus -- or for a yearly subscription fee you could get a corpus that is kept up-to-date with recent texts (such as news or periodicals).  (Language Weaver, which produces a closed-source machine translation system, has a vaguely similar yearly subscription / maintenance model.)

The real value in such a system, then, isn't the algorithms themselves -- it's the data you use to power them.  Such a scheme also makes it possible for people to collaborate openly on creating their own corpus for specialized applications if one doesn't exist already -- although, as with any collaborative knowledge project (Wikipedia being the biggest example), genuine expertise is something you may want to actually pay for.

I should point out that I don't think machine translation can work 100% of the time -- it'll always need human guidance and training to be useful.  But as a labor-saving system and a way to knock down barriers between people, we're only now scratching the surface of what's possible -- and for me the best way to develop such things is right out in the open, where they belong.

