Winds of Change
The Demise of the Language Barrier
As autumn briskly descends on New England, I find myself admiring nature’s capacity to silently transmute subtle changes into full-fledged displays of majestic beauty. A leaf here, a leaf there… they go practically unnoticed for several weeks – and then suddenly one morning you wake up and look through the window, amazed to see that nearly all the trees on the street are ablaze.
Reflecting on this past year, I can’t help but draw the analogy to Digital Trowel’s own progress. It seems unbelievable that only one year ago we were a small team of 7 engineers crowded together in 2 small rooms, with not much more than an untested NLP platform and a vision. As the weeks passed, we added an engineer here, a linguist and a mathematician there, and suddenly we are a mature, full-fledged commercial company with over 40 developers, selling products and data, ablaze with a proven, breakthrough NLP technology.
But we’re still hungry! And our team of scientists is cooking up a storm 🙂
We’ve just hired another mathematician and a computer scientist, so we now have 5 PhD-level scientists on our R&D team. We strongly emphasize the research side of the business, because we know that being at the top is a very fragile state. So while we’re very proud to be the first company to produce consistent commercial results with over 90% accuracy and recall across multiple semantic fields, we realize there are still many challenges to be met, and we’re determined to be the first to meet them.
While I’m not yet at liberty to divulge the full extent of our newest developments, I am able to give you a taste of some of the advances we’ve made.
One of the most exciting projects we’re looking into is a multi-lingual text-mining platform codenamed SNUG, for Semantically Negotiated Universal Grammar. Admittedly, I don’t think this is exactly what Noam Chomsky had in mind when he first proposed his theory of Universal Grammar, but it may not be too far off. Let me explain:
Linguists use the term Universal Grammar (UG) to denote a theory according to which all humans possess an inherent “hard-wired” capability to acquire a language. It is this linguistic “hardware” we’re all born with that allows children to learn the grammar of a language even when the linguistic data available to them is insufficient.
A quick example is in order. A child born in the U.S. must learn to form questions out of assertions in English. He may, for example, hear the assertion:
“My sister is pregnant.”
And over time infer that the proper interrogative form of this statement is:
“Is my sister pregnant?”
He may then (subconsciously) form a rule of grammar in his mind, whereby questions in English are formed by moving the first auxiliary verb to the beginning of the sentence. We would then expect that children faced with an assertion such as:
“My sister who is pregnant will be blessed with happiness.”
would form the incorrect question:
* “Is my sister who pregnant will be blessed with happiness?”
Interestingly enough – they don’t! Only the correct form is acquired by children:
“Will my sister who is pregnant be blessed with happiness?” (Of course she will!)
The claim (asserted by UG supporters) is that children simply don’t encounter enough sentences as complicated as the one above to make a learned choice. So how do they do it? Well, simply put – they are born knowing. More accurately, they’re born with a set of rules that are triggered and activated in a certain way once they are exposed to sentences in their language.
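For the programmers among you, the failure is easy to make concrete. Here is a toy sketch in Python of the naive linear rule (the word list and the logic are entirely my own simplification for this post, and have nothing to do with our platform):

```python
# A toy demonstration of the naive, linear rule a child might infer:
# "move the FIRST auxiliary verb to the front to form a question".
# The tiny auxiliary list is a simplification for this example only.
AUXILIARIES = {"is", "are", "was", "were", "will", "can", "must"}

def naive_question(sentence: str) -> str:
    """Front the first auxiliary found, scanning left to right."""
    words = sentence.rstrip(".").split()
    for i, word in enumerate(words):
        if word.lower() in AUXILIARIES:
            aux = words.pop(i)
            rest = [words[0].lower()] + words[1:] if words else []
            return " ".join([aux.capitalize()] + rest) + "?"
    return sentence  # no auxiliary found; leave the sentence unchanged

print(naive_question("My sister is pregnant."))
# Is my sister pregnant?  (right answer, wrong reason)

print(naive_question("My sister who is pregnant will be blessed with happiness."))
# Is my sister who pregnant will be blessed with happiness?  (the starred form!)
```

The correct rule is structure-dependent: it fronts the auxiliary of the main clause, not the first one in the string. And that, UG supporters argue, is exactly the kind of knowledge children never had to learn.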
Of course, things are far more complicated than these simplistic examples, and many questions immediately arise. How is it, for instance, that other languages, such as German and French, form questions by placing the main verb at the beginning of the sentence? For example:
French: Parlez-vous anglais?
German: Sprechen Sie Englisch?
Both mean: Do you speak English?
But a literal translation would be the ungrammatical: * Speak you English?
So, obviously, the theory must be significantly more complex. But don’t worry, that concludes today’s lesson in Linguistics 101 :-). We now return to what got us here in the first place – our new multi-lingual text-mining platform SNUG (we thought it was kind of cute to build SNUG using CARE :-)).
So, first things first: no, we’re not out to decipher the way the human mind processes language; but yes, we are out to create a platform that will enable us to process and mine texts in any language.
For us humans, it’s quite hard to fathom learning a new language in, say, a week. But for our system, that’s actually an achievable feat. In order to effectively extract data from free text in a new language, we need to accomplish two things. First, we need to teach our system the new language. Well, not exactly the language itself, but rather the statistical distribution of words within it. We do this using an automatic training process, in which our system runs through huge text corpora and produces a statistical model of the language, one that is further refined every time we train the system.
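I obviously can’t share our actual training code, but a drastically simplified sketch of the idea might look like the following. The bigram model here is a stand-in of my own choosing; the real models are considerably richer:

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus_paths):
    """Estimate P(word | previous word) by counting adjacent word
    pairs across a collection of plain-text corpus files."""
    pair_counts = defaultdict(Counter)
    for path in corpus_paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                tokens = line.lower().split()
                for prev, curr in zip(tokens, tokens[1:]):
                    pair_counts[prev][curr] += 1
    # Turn the raw counts into conditional probabilities.
    model = {}
    for prev, followers in pair_counts.items():
        total = sum(followers.values())
        model[prev] = {word: count / total for word, count in followers.items()}
    return model
```

Refinement, in this picture, is just more counting: keep the raw counts around, add each new corpus to them, and renormalize.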
The second task is translating our semantic rulebooks, and this is where our linguists come into play. Since our rulebooks are basically composed of highly sophisticated weighted Context-Free Grammars, their translation amounts to a structure-preserving mapping over semantic rules (a semantic homomorphism). This mapping can be thought of as taking an English semantic-driven grammar as input and transmuting it into a semantic-driven grammar for a different language. Though languages vary considerably in their surface grammar, they tend to convey events and factual information in a surprisingly similar manner (or rather, the structure of the required semantic rules is quite similar).
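To make the homomorphism idea a bit more concrete, here is a toy sketch. The rule format, the lexicon and the reordering table are all invented for this post; the real rulebooks are far more sophisticated:

```python
# A rule here is (lhs, rhs, weight): upper-case symbols are semantic
# roles, lower-case strings are terminals. All names are illustrative.
ENGLISH_RULES = [
    ("ACQUISITION", ["BUYER", "acquires", "TARGET"], 0.9),
    ("APPOINTMENT", ["PERSON", "joins", "COMPANY"], 0.8),
]

def translate_rule(rule, lexicon, order):
    """Structure-preserving translation: swap terminals via a lexicon
    and permute the right-hand side into the target word order; the
    semantics (lhs, role symbols, weight) pass through untouched."""
    lhs, rhs, weight = rule
    localized = [lexicon.get(symbol, symbol) for symbol in rhs]
    reordered = [localized[i] for i in order]
    return (lhs, reordered, weight)

# A (drastically simplified) move to Subject-Object-Verb order.
JAPANESE_LEXICON = {"acquires": "baishuu-suru", "joins": "nyuusha-suru"}
SOV_ORDER = [0, 2, 1]  # subject first, then object, verb last

japanese_rules = [translate_rule(rule, JAPANESE_LEXICON, SOV_ORDER)
                  for rule in ENGLISH_RULES]
# e.g. ("ACQUISITION", ["BUYER", "TARGET", "baishuu-suru"], 0.9)
```

The point is that the semantic skeleton of each rule survives the trip intact; only the surface material and its order change.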
Most Central European languages share so many qualities with English that many of the rules can be translated verbatim. Some of the challenges arise when translating rulebooks to languages whose word order differs significantly from English (e.g. German, not to mention Subject-Object-Verb languages such as Japanese, Hindi and Turkish). Languages with a high level of agreement inflection, such as French, Italian and Spanish, are usually easier to parse as well, though rich agreement often goes hand in hand with the omission of overt pronouns (so-called pro-drop). So, for example, “I eat” in Italian is simply “mangio”. This turns out to be quite problematic for anaphora resolution.
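To see why, it helps to look at what a system must do just to recover the implied subject from the verb’s inflection. Here is a sketch of the general idea; the table is illustrative and covers only the present tense of “mangiare” (to eat):

```python
# Subject pronouns implied by Italian present-tense forms of
# "mangiare". In pro-drop languages the pronoun is usually omitted,
# so the parser has to reconstruct it before anaphora resolution
# can link it to an antecedent.
IMPLIED_SUBJECT = {
    "mangio": "io",       # I eat
    "mangi": "tu",        # you (sg.) eat
    "mangia": "lui/lei",  # he/she eats (note the ambiguity!)
    "mangiamo": "noi",    # we eat
    "mangiate": "voi",    # you (pl.) eat
    "mangiano": "loro",   # they eat
}

def restore_subject(tokens):
    """Insert the implied pronoun before a verb that lacks an overt
    subject, so anaphora resolution has an explicit mention to bind."""
    restored = []
    for i, token in enumerate(tokens):
        overt_subject = i > 0 and tokens[i - 1] in IMPLIED_SUBJECT.values()
        if token in IMPLIED_SUBJECT and not overt_subject:
            restored.append(IMPLIED_SUBJECT[token])
        restored.append(token)
    return restored

print(restore_subject(["mangio", "una", "mela"]))
# ['io', 'mangio', 'una', 'mela']   ("I eat an apple")
```

Notice that the third-person form is ambiguous between “he” and “she”, which is precisely where the resolver has to work hardest.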
Every new language poses a new and exciting challenge, but crucially, it does not require us to rewrite the code. All we need is a few large text corpora, an expert linguist, a savvy rulebook writer, and a week or two of intensive work, and our system will learn to “speak” a new language. If only it were so simple for people to learn…
Well, I guess that’s enough for this post. I hope you’ve enjoyed this quick “intro to linguistics”, and that you’re as excited as I am by the prospect of text-mining free of language barriers. Stay tuned for more new and interesting features in my next entry.
Meanwhile, wherever you are, I hope you have an autumn as beautiful as the one I am having in Boston. Even more importantly, I hope you take the time to appreciate the beauty of change all around – spoken without any words.