
Turing’s Test & The Stock Market

A Non-standard Introduction to Sentiment Analysis in 3 Parts

Part 1 – CAPTCHA to Gotcha:

A Brief History of Artificial Intelligence

Alan Turing was a prominent British mathematician and one of the most inspiring pioneers of modern computer science. In 1950, at the age of 38, he published his seminal paper Computing Machinery and Intelligence, which to this day remains probably the single most influential paper in the field of Artificial Intelligence (AI).

Since Digital Trowel’s core technology is based on machine learning, a modern offshoot of AI, it seems fitting (and fun!) to go back to basics and learn a bit about the history that continues to shape both the science itself and the challenges we face at DT.

Big words and complications aside, Turing begins his paper with the simple yet perplexing question: “Can machines think?” Realizing, however, that “thinking” is a highly ambiguous term, Turing immediately proposed an alternative question, one free of such obscurities. Instead of dealing with machines’ capacity for thinking, he focused on their capacity to emulate human thought. In simplified terms, the question he suggested was:

Could machines be made to simulate human thought well enough so as to fool a person into believing they were actually human?

This question is the essence of what has come to be called the Turing Test. It proceeds as follows: a human judge engages in a natural language conversation with one human and one machine, each of which tries to appear human. All participants are placed in isolated locations. If the judge cannot reliably tell the machine from the human, the machine is said to have passed the test. In order to test the machine’s intelligence rather than its ability to render words into audio, the conversation is limited to a text-only channel such as a computer keyboard and screen.

At the time of its publication, many people viewed the prospect of machines ever reaching the level of human computational power as an impossibility. In his paper, Turing, armed with his visionary intuition and razor-sharp mathematical analysis, set out to invalidate the contemporary objections, ending with a speculation of his own: that one day machines would indeed emulate human thought, thereby passing the Turing Test!

Inspired by the challenge, Digital Trowel’s groundbreaking technology has taken several huge steps forward in proving that Turing was right. The technology we’ve developed allows computers to extract not only the facts communicated by the text, but also the underlying sentiment or, if you will, the attitude associated with the message conveyed. In simple words, we’re enabling computers to understand the full meaning not only of the text, but of the subtext – just as a human would. But hold your horses! Before we continue, let’s try to explain why the problem is so difficult, so we can more fully appreciate the profundity of Digital Trowel’s achievement and its extraordinary implications.

Well, for one thing, it’s now sixty years later and the question Turing posed has yet to be settled. In fact, it is far from being resolved. Machines have beaten world chess champions, navigated spacecraft millions of miles from Earth, and even been used to prove mathematical theorems whose sheer computational complexity puts them beyond the reach of human beings, yet to date no computer has been shown to pass the test.

It may be argued that the challenge machines face is simply a matter of raw computing power. Some experts currently estimate that the human brain can perform some 38,000 trillion operations per second (that’s 3.8×10^16 operations!) and hold over 3,500 terabytes of memory. In comparison, the world’s most powerful supercomputers (e.g. IBM’s BlueGene) have a computational capacity of less than a “mere” 100 trillion operations per second (only 10^14) and less than 10 terabytes of storage. However, if this indeed is the case, and the capacity to “think” lies in raw computational power alone, then according to some versions of Moore’s Law (which predicts the rate at which computing performance evolves over time) machines will attain the required capacity by circa 2018. But we may have to wait a bit longer for an answer: futurist Raymond Kurzweil recently revised his earlier prediction that Turing-test-capable computers would be manufactured by 2020, deferring the predicted date to 2029 (I can’t help but wonder whether this prediction has anything to do with the fact that asteroid (35396) 1997 XF11 is anticipated to make a close approach to Earth late in 2028 🙂 ).

But what does all this have to do with Digital Trowel’s business? That is, unless we have secret plans in store for purchasing stock in Scientific American… (which we don’t!). Well, to answer that, consider first what is sometimes called a Reverse Turing Test. Imagine a modification of the Turing Test wherein the role of judge has been switched from human to machine. Now it’s the computer that has to determine whether it is “conversing” with a human or another machine. In fact, to some of you this may just ring a (quite annoying) bell. Take a look at the images below:

Ever wonder why, every once in a while, you’re prompted to decipher the jumbled-up letters in images such as these? Well, put simply, it’s because you’re taking part in a test that’s not meant for you: you’re serving as a participant in a Reverse Turing Test administered and judged by the security computers of the website with which you are attempting to engage. Humans have (or rather should have!) no problem deciphering the text in the above images, which incidentally are called CAPTCHAs (for Completely Automated Public Turing test to tell Computers and Humans Apart). However, the random distortions in a CAPTCHA make it nearly impossible for computers to decipher the letters. As a result, automated security programs can use these images and the respective responses to make certain it is a human attempting to engage with the website and not some malicious script.

Back to Digital Trowel. No, we don’t make CAPTCHAs. This may come as a disappointment, but we’re not even in the business of Turing Tests; reverse, straightforward or in any other direction :-). We are, however, in the business of using computers to achieve something no less elusive: deciphering the sentiment that lies hidden inside text.

Gleaning not only the formal meaning but also the sentiment associated with a text passage is crucial for any machine that hopes to “pass the Turing Test”. That said, passing the test is neither part of our technological agenda nor a component of our business plan. In fact, our aspirations are much more practical. We aim to use highly sophisticated technologies, powered by our cutting-edge machine-learning and linguistic algorithms, to analyze millions of lines of text, thus creating valuable business information that will help our customers make decisions in real time. In short:

Extracting and discerning the underlying sentiment allows us to transform otherwise inert texts into vibrant business opportunities.

But again, we’re getting ahead of ourselves. Now that we’ve laid the foundations for understanding what AI is all about, we’re ready to take a tour down the path of linguistic algorithm theory, focusing of course on the art of sentiment analysis. Or, as we like to call it at DT, Synergistic Sentiment Analysis, a term used for reasons that will become apparent in due course.

The second part of this survey presents an overview of sentiment analysis: what it is, what it does, and most importantly what it’s good for (hint: think unique business opportunities!). The third and final part will delve into the deep abyss of the algorithmic world in the hope of salvaging insight into the awesome technology we’ve developed at DT. By the end of this intro we hope you’ll understand not only what we do and why we do it, but also how we do it, and why we’re light-years ahead of anyone else in the field.

In the meantime, we hope you understand at least half of the words in the titles above 🙂

We’ve told the story, set the stage, laid the bait – are you hooked..?

All we can do is hope we “gotcha” !

Winds of Change

The Demise of the Language Barrier

As autumn briskly descends on New England I find myself admiring nature’s capacity to silently transmute subtle changes into full-fledged displays of majestic beauty. A leaf here, a leaf there… They practically go unnoticed for several weeks – and then suddenly one morning you wake up and look through the window, amazed to see that nearly all the trees on the street are ablaze.

View from my front porch in Somerville, MA

Reflecting on this past year, I can’t help but draw an analogy to Digital Trowel’s own progress. It seems unbelievable that only one year ago we were a small team of 7 engineers crowded into 2 small rooms, with not much more than an untested NLP platform and a vision. As the weeks passed, we added an engineer here, a linguist and a mathematician there, and suddenly we are a mature, full-fledged commercial company with over 40 developers, selling products and data, ablaze with proven, breakthrough NLP technology.

But we’re still hungry! And our team of scientists is cooking up a storm 🙂

We’ve just recently hired another mathematician and a computer scientist, so we now have 5 PhD level scientists on our R&D team. We strongly emphasize the research aspect of the business, because we know that being at the top is a very fragile state. So while we’re very proud to be the first company to produce consistent commercial results with over 90% accuracy and recall across multiple semantic fields, we realize there are still many challenges to be met, and we’re determined to be the first to meet them.

While I’m not yet at liberty to divulge the full extent of our newest developments, I am able to give you a taste of some of the advances we’ve made.

One of the most exciting projects we’re looking into is a multi-lingual text-mining platform codenamed SNUG, for Semantically Negotiated Universal Grammar. Admittedly, I don’t think this is exactly what Noam Chomsky had in mind when he first proposed his theory of Universal Grammar, but it may not be too far off. Let me explain:

Linguists use the term Universal Grammar (UG) to denote a theory according to which all humans possess an inherent “hard-wired” capability to acquire a language. It is this linguistic “hardware” we’re all born with that allows children to learn the grammar of a language even when the linguistic data available to them is insufficient.

A quick example is in order. A child born in the U.S.A. must learn to form questions out of assertions in English. He may, for example, hear the assertion:

“My sister is pregnant.”


And over time infer that the proper interrogative form of this statement is:

“Is my sister pregnant?”

He may then (subconsciously) form a rule of grammar in his mind, whereby questions in English are formed by moving the first auxiliary verb to the beginning of the sentence. We would then expect that children faced with an assertion such as:

“My sister who is pregnant will be blessed with happiness.”


Would form the incorrect question:

* “Is my sister who pregnant will be blessed with happiness?”

Interestingly enough – they don’t! Only the correct form is acquired by children:

“Will my sister who is pregnant be blessed with happiness?”   (Of course she will!)

The claim (asserted by UG supporters) is that children simply don’t encounter enough sentences as complicated as the one above to make a learned choice. So how do they do it? Well, simply put – they are born knowing. More accurately, they’re born with a set of rules that are triggered and activated in a certain way once they are exposed to sentences in their language.
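For the programmers among you, the naive “front the first auxiliary” rule is easy to state in code, and its failure on embedded clauses falls right out. Here’s a toy sketch (the auxiliary list and the function are purely illustrative; this is not how any child, or any real parser, works):

```python
# Toy illustration only: a structure-blind "front the first auxiliary" rule.
AUXILIARIES = {"is", "are", "was", "were", "will", "would", "can", "must"}

def front_first_auxiliary(sentence: str) -> str:
    """Form a question by moving the *first* auxiliary to the front --
    the (wrong) structure-independent rule discussed above."""
    words = sentence.rstrip(".").split()
    for i, word in enumerate(words):
        if word.lower() in AUXILIARIES:
            aux = words.pop(i)
            return " ".join([aux.capitalize(), words[0].lower()] + words[1:]) + "?"
    return sentence

# Works for the simple assertion:
print(front_first_auxiliary("My sister is pregnant."))
# -> Is my sister pregnant?

# But on the embedded clause it fronts "is" instead of "will",
# producing exactly the starred, ungrammatical form:
print(front_first_auxiliary("My sister who is pregnant will be blessed with happiness."))
# -> Is my sister who pregnant will be blessed with happiness?
```

Children, as noted, never produce that second output, which is precisely the UG argument.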

Of course, things are far more complicated than these simplistic examples, and many questions immediately arise. How, for example, is it that other languages, German and French for instance, form questions by placing the main verb at the beginning of the sentence? For example:


French: Parlez-vous anglais?

German: Sprechen Sie Englisch?

Both mean: Do you speak English?

But a literal translation would be the ungrammatical: * Speak you English?

So, obviously, the theory must be significantly more complex. But don’t worry, that concludes today’s lesson in Linguistics 101 :-). We now return to what got us here in the first place: our new multi-lingual text-mining platform SNUG (we thought it was kind of cute to build SNUG using CARE :-)).

So, first things first: no, we’re not out to decipher the way the mind processes language; but yes, we are out to create a platform that will enable us to process and mine texts in any language.

For us humans, it’s quite hard to fathom learning a new language in, say, a week. But for our system that’s actually an achievable feat. To effectively extract data from free text in a new language, we need to accomplish two things. First, we need to teach our system the new language. Well, not exactly the language itself, but rather the statistical distribution of words in it. We do this using an automatic training process, in which our system runs through huge text corpora, producing a statistical model of the language that is further refined every time we train the system.
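To give a feel for what a “statistical model of the language” can mean at its very simplest, here’s a toy bigram model. Our actual training process is far more sophisticated; this sketch only illustrates the idea of counting word distributions over a corpus:

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count word-pair frequencies -- a drastically reduced stand-in for
    the statistical model of word distributions described above."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
    return counts

def bigram_prob(counts, prev, word):
    """P(word | prev) estimated from raw counts (no smoothing)."""
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0

# A two-sentence "corpus" (real training uses huge corpora):
corpus = ["the company acquired the startup", "the startup was acquired"]
model = train_bigram_model(corpus)
print(bigram_prob(model, "the", "startup"))  # "startup" follows "the" 2 times out of 3
```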

The second task is translating our semantic rulebooks, and here is where our linguists come into play. Since our rulebooks are essentially composed of highly sophisticated weighted Context-Free Grammars, their translation amounts to a structure-preserving mapping of semantic rules (a semantic homomorphism). This mapping can be thought of as taking an English semantic-driven grammar as input and transmuting it into a semantic-driven grammar for a different language. Though languages vary considerably in their actual spoken grammar, they tend to convey events and factual information in a surprisingly similar manner (or rather, the structure of the required semantic rules is quite similar).
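To make the idea of a structure-preserving translation concrete, here’s a toy sketch. The rule format, slot names and weights below are invented for illustration; the point is only that the rule’s structure and weight survive while the language-specific lexicon changes:

```python
# Illustrative rule format -- slots, lexicon and weight are invented.
english_rule = {
    "relation": "Acquisition",
    "pattern": ["ORG", "ACQUIRE_VERB", "ORG"],   # structural skeleton
    "lexicon": {"ACQUIRE_VERB": ["buy", "acquire", "purchase"]},
    "weight": 0.9,
}

def translate_rule(rule, lexicon_map):
    """Structure-preserving translation: copy the rule unchanged except
    for its language-specific lexicon (a 'homomorphism' in spirit)."""
    translated = dict(rule)
    translated["lexicon"] = {
        slot: [lexicon_map.get(w, w) for w in words]
        for slot, words in rule["lexicon"].items()
    }
    return translated

# Toy English-to-German word map (illustrative only):
de = translate_rule(english_rule, {"buy": "kaufen", "acquire": "erwerben",
                                   "purchase": "erstehen"})
print(de["pattern"], de["weight"])   # skeleton and weight are untouched
print(de["lexicon"]["ACQUIRE_VERB"])
```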

Most Central European languages share so many qualities with English that many of the rules can be translated verbatim. Some of the challenges arise when translating rulebooks to languages whose word order significantly differs from English’s (e.g. German, not to mention Subject-Object-Verb languages such as Japanese, Hindi and Turkish). Languages with a high level of agreement inflection, such as French, Italian and Spanish, are usually easier to parse as well, though high agreement inflection often entails the omission of overt pronouns. So, for example, “I eat” in Italian is simply “mangio”. This turns out to be quite problematic for anaphora resolution.

Every new language poses a new and exciting challenge, but crucially, it does not require us to rewrite the code. All we need is a few large text corpora, an expert linguist, a savvy rulebook writer, and a week or two of intensive work, and our system will learn to “speak” a new language. If only it were so simple for people to learn…

Well I guess that’s enough for this post. I hope you’ve enjoyed this quick “intro to linguistics”, and are as excited as I am by the prospect of text-mining – free of language barriers. Stay tuned for more new and interesting features in my next entry.

Meanwhile, wherever you are, I hope you have an autumn as beautiful as the one I am having in Boston. Even more importantly, I hope you take the time to appreciate the beauty of change all around – spoken without any words.

New Year’s Blessings

On the occasion of the new Hebrew year, I thought I’d make this entry a bit lighter than usual and present blessings for the new year in the spirit of NLP, spiced up with some advice, so here goes:

When setting the goals for extraction results, try to find the optimal balance between recall and precision. Remember: Aiming high is good, just make sure there’s an easy way down!


Do not attempt to debug a rulebook for more than 24 hours straight. We’ve been there. We’ve done that. If you think you’re stuck now, you still haven’t seen nada!


Remember the chain of language processing from the last entry? First comes the HTML Converter; it hands the info down to CARE, which does all the hard work and in turn passes the parsed relations to the Post Processor, so that it can rest on its laurels.


Perfect “Anaphora Resolution” is a myth. You can try. For a while, you may even believe you’ve done it. Our prediction: In the end the bubble will burst. Or you’ll go crazy trying to solve all the problems. Or both.

Make sure that all the crucial information that comes in as input goes out as output.


Remember the engineer who didn’t take our advice and corrected extraction rules through the night?

Just like life itself, retrieving the ultimate extractions for a given relation may be an extremely tedious and laborious task. Lighten up, add some spice. Make fun of yourself!


Do everything you do with love and CARE!


But remember that if you’re not enjoying it, you’re probably not doing it right! 🙂

Here from Cambridge, MA, wishing you all a wonderful year full of wonderful experiences, CARE-ing, happiness and love!

Two Buddhist monks were traveling from monastery to monastery on their spiritual journey to enlightenment when they chanced upon a beautiful young woman standing by the bank of a stream.

Sorrowfully, the young woman could not cross the stream for the water had risen and it would ruin her silk robe. Without hesitating the older monk lifted the woman in his arms and carried her across, placing her gently on the opposite bank. The younger monk was taken by surprise – Buddhist monks were not supposed to touch women, let alone carry them – but decided to keep his silence, and the two proceeded on their journey for long hours without exchanging a word.

Finally at dusk the two monks arrived at their lodging and the younger monk could no longer hold his tongue. He turned to the older monk: “Tell me,” he asked, hardly concealing his reproachful tone, “Why did you carry that woman? We monks are not supposed to touch women at all!”

The older monk smiled amusedly and calmly replied: “I have let go of that woman many hours ago. Are you still carrying her in your mind?”

Now I bet you’re wondering what the heck this Zen story has to do with Natural Language Processing.

Well, it does – just read on 🙂

It’s time to tell you a little more about the general extraction process we follow at DT.

You see, CARE doesn’t really work alone. Rather, it is assisted by both an HTML converter and a post processor.

The following diagram will help to illustrate the process:

[Diagram: HTML Converter → CARE → Post Processor]

As you can see, CARE is “sandwiched” between the HTML Converter and the Post Processor. This blog entry will focus on the Post Processor, but first, for the sake of completeness, a few words about the HTML Converter:

The HTML Converter can be thought of as a “pre-processor”. It begins by downloading selected HTML code from the internet, and then performs the following tasks:

  • Cleans the HTML code leaving only the English text
  • Cleans the text of advertisements and other garbage sections
  • Classifies the text according to its content and determines what CARE rulebooks should run on each section of the text
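As a rough illustration of the first task, stripping markup down to plain text, here’s a minimal sketch using Python’s standard html.parser. Our actual converter also removes ads and classifies sections, which this toy entirely ignores:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Minimal stand-in for the 'clean the HTML, keep the text' step:
    strips tags and skips script/style content."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

page = "<html><script>var x=1;</script><body><p>Alan Turing was a mathematician.</p></body></html>"
extractor = TextExtractor()
extractor.feed(page)
print(" ".join(extractor.parts))  # -> Alan Turing was a mathematician.
```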

So for example, sections describing a person’s biography will be marked and designated for the PPC (Person Position Company) and Education rulebooks, while sections of text containing addresses will be sent to the Contact Information rulebook.

Next, one or more of CARE’s rulebooks are run on the text. The text tagged by CARE is what’s fed to the Post Processor.

The Post Processor is essentially an extremely potent script that runs on CARE’s tagged output and turns it into sensible, readable information. Along the way it performs several intricate tasks, some technical and some semantic in nature. In a less-than-perfect analogy, the Post Processor can be thought of as the younger monk in our story, still holding on to all the events in their original context, of which the older monk (aka CARE) has long since let go 🙂

Following is a brief survey of the Post Processor’s most notable tasks:

  • Performing idiosyncratic regular expression substitutions and fixes
  • Resolving co-references and anaphoric phrases
    • For example, the post processor will replace pronouns such as “he” or “she” with the relevant person’s name. Likewise, if the phrase “the company” appears in an extracted relation, it will be replaced by the appropriate company name
  • Matching of entities and merging of relations – this is to ensure that information is not unnecessarily duplicated, and that all data is matched with the correct entities and only with them
  • Assertion and filtering of output relations – a final validation of the information’s integrity
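To give a taste of the co-reference step, here’s a deliberately naive sketch that substitutes the most recently seen proper names for “he”/“she” and “the company” in extracted relations. Real anaphora resolution is far more involved; the field names here simply mirror our examples:

```python
def resolve_pronouns(relations):
    """Naive co-reference pass: substitute the most recently seen proper
    names for 'he'/'she' and 'the company' in relation fields."""
    last_person, last_company = None, None
    resolved = []
    for rel in relations:
        rel = dict(rel)  # don't mutate the caller's data
        name = rel.get("NAME", "")
        if name.lower() in {"he", "she"} and last_person:
            rel["NAME"] = last_person
        elif name:
            last_person = name
        employer = rel.get("EMPLOYER", "")
        if employer.lower() == "the company" and last_company:
            rel["EMPLOYER"] = last_company
        elif employer:
            last_company = employer
        resolved.append(rel)
    return resolved

rels = [{"NAME": "Mr. Alan H. Fishman", "EMPLOYER": "Washington Mutual Inc."},
        {"NAME": "he", "EMPLOYER": "the company"}]
print(resolve_pronouns(rels)[1])
# -> {'NAME': 'Mr. Alan H. Fishman', 'EMPLOYER': 'Washington Mutual Inc.'}
```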

An example of part of the post processor’s capability is in order. Assume CARE is fed the following input sentence:

“Micha Breakstone earned his masters in math at Hebrew University (2007, cum laude), and then went on to study for his Ph.D. in cognitive sciences at Hebrew University. Subsequently he was invited to study for one year of his Ph.D at MIT”

CARE will parse the information correctly, but its output may at times look like gibberish to anyone who is not an expert engineer. Moreover, pronouns and conjunctions must be interpreted, as can be seen:

<S><PERSONDEGREE><_NAME><PERSON>Micha Breakstone</PERSON></_NAME> <_STATUS>earned</_STATUS> his <_DEGREE><TS type="SUPER_GROUP" id="1"/><TS type="GROUP" id="2"/>masters</_DEGREE> in <_SPECIALTY>math</_SPECIALTY><TE type="GROUP" id="2"/> at <_UNIVERSITY><TS type="GROUP" id="2"/>Hebrew University</_UNIVERSITY> (<_YEAR><TS type="GROUP" id="3"/>2007</_YEAR>, <_DISTINCTION>cum laude</_DISTINCTION><TE type="GROUP" id="3"/>)<TE type="GROUP" id="2"/><TE type="SUPER_GROUP" id="1"/></PERSONDEGREE>, <PERSONDEGREE><_NAME>and</_NAME> then went on to <_STATUS>study</_STATUS> for <TS type="SUPER_GROUP" id="1"/><TS type="GROUP" id="2"/>his <_DEGREE>Ph.D.</_DEGREE> in <_SPECIALTY>cognitive sciences</_SPECIALTY><TE type="GROUP" id="2"/> at <_UNIVERSITY>Hebrew University</_UNIVERSITY><TE type="SUPER_GROUP" id="1"/></PERSONDEGREE>. Subsequently <PERSONDEGREE><_NAME>he</_NAME> was invited to <_STATUS>study</_STATUS> for one year of <TS type="SUPER_GROUP" id="1"/><TS type="GROUP" id="2"/>his <_DEGREE>Ph.D</_DEGREE><TE type="GROUP" id="2"/> at <_UNIVERSITY>MIT</_UNIVERSITY><TE type="SUPER_GROUP" id="1"/></PERSONDEGREE> </S>


It is the Post Processor that saves the day, tidying up, putting things in order, and all in all making sense of the myriad unresolved anaphora and tags, to obtain a completely readable and simple extraction as follows:
<Education>
<DEGREE>masters</DEGREE>
<DISTINCTION>cum laude</DISTINCTION>
<NAME>Micha Breakstone</NAME>
<SPECIALTY>math</SPECIALTY>
<STATUS>earned</STATUS>
<UNIVERSITY>Hebrew University</UNIVERSITY>
<YEAR>2007</YEAR>
</Education>
<Education>
<DEGREE>Ph.D.</DEGREE>
<NAME>Micha Breakstone</NAME>
<SPECIALTY>cognitive sciences</SPECIALTY>
<STATUS>study</STATUS>
<UNIVERSITY>Hebrew University</UNIVERSITY>
</Education>
<Education>
<DEGREE>Ph.D.</DEGREE>
<NAME>Micha Breakstone</NAME>
<STATUS>study</STATUS>
<UNIVERSITY>MIT</UNIVERSITY>
</Education>


So, as you can see, next week I’m off to MIT 🙂

In the next entries we’ll try and focus in more detail on each of the Post Processor’s tasks, discussing strategies for anaphora resolution, relation merging, etc. But meanwhile, wish me luck, continue to ask questions or comment, and in the spirit of Zen, remember to let go of any unproductive burdens you may be carrying in your mind.

It’s About Time!

First of all I want to thank you all for your emails and comments. Part of the idea behind this blog is to establish an active community. Your questions are already challenging us to create even better rulebooks for information extraction… so keep the ideas and challenges coming!

Speaking of challenges, today I’ll present a few amusing (yet formidable) examples we encountered while writing our Financial Events rulebook.

Unlike the Person Position Company (PPC) and Education rulebooks, which aim to extract biographical information, the Financial Events rulebook aims to capture any and all events pertaining to different companies and organizations. As such, it is action-oriented, and must therefore be sensitive to complex verb constructions such as negation, modalities (e.g. could, would, may) and speculative tones (e.g. “it is rumored that…”), as well as idiomatic usage and subjective interpretations of events.

Consider the following examples we’ve recently encountered:

AOL to buy Time Warner in historic merger – CNET News

Yahoo stalling tactics buy time with Microsoft

Now Is the Time to Buy Yahoo!

Although the word “time” is used quite differently in each of the above examples, our rulebook is able to ascertain that only the first sentence denotes an acquisition event, yielding the following welcome result:

<Acquisition>

<ACQUIRER>AOL</ACQUIRER>

<ACQUIRED>Time Warner</ACQUIRED>

<ACTION_ACQUISITION>to buy</ACTION_ACQUISITION>

<SOURCE>CNET News</SOURCE>

</Acquisition>

How are we able to achieve this? The secret lies in the synergetic interplay of our NER (Named Entity Recognition) component with the semantic rules handcrafted by our linguistic engineers, and the inherent competition between possible parses.

In the first sentence, Time Warner is identified as an organization by the NER, and thus any interpretation tagging it as such will receive a relatively greater weight. On the other hand, in the second and third sentences, “time” is identified as a regular word by the NER, so that interpretations consistent with this (vacuous) tagging are preferred.

Still, you may wonder, why isn’t Microsoft identified as an organization in the second sentence, leading to the (wrong) conclusion that Yahoo is buying Microsoft? The answer is that Microsoft is, in fact, identified as an organization, but the idiom “to buy time” competes with inflections of the verb “to buy”, receiving a greater weight which renders the “acquisition-less” parsing the most rewarding one.
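Here’s a toy re-creation of that competition. The tokens are assumed to arrive already NER-tagged, and two hand-weighted “rules” compete; all the weights and tag names below are invented for illustration, and the real system uses a full weighted grammar rather than a keyword scorer:

```python
# Invented weights: the idiom reading is assumed to outweigh
# the acquisition reading, as described above.
IDIOM_WEIGHT = 2.0      # "buy time" read as an idiom
ACQUIRE_WEIGHT = 1.5    # "buy <ORG>" read as an acquisition event

def best_reading(tagged_tokens):
    """tagged_tokens: list of (token, ner_tag) pairs, tag 'ORG' or 'O'.
    Score the competing readings and return the winner."""
    scores = {"no-event": 0.1}
    for (tok, _), (nxt, nxt_tag) in zip(tagged_tokens, tagged_tokens[1:]):
        if tok.lower() == "buy":
            if nxt.lower() == "time" and nxt_tag == "O":
                scores["idiom:buy-time"] = IDIOM_WEIGHT
            if nxt_tag == "ORG":
                scores["acquisition"] = ACQUIRE_WEIGHT
    return max(scores, key=scores.get)

# The NER has (correctly) tagged "Time Warner" as a single ORG unit:
print(best_reading([("AOL", "ORG"), ("to", "O"), ("buy", "O"),
                    ("Time Warner", "ORG")]))              # -> acquisition
print(best_reading([("Yahoo", "ORG"), ("stalling", "O"), ("tactics", "O"),
                    ("buy", "O"), ("time", "O"), ("with", "O"),
                    ("Microsoft", "ORG")]))                # -> idiom:buy-time
```

Note that in the second headline Microsoft is still tagged as an ORG; the idiom reading simply wins the competition, just as described above.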

In fact, this is only the tip of the iceberg. Deciphering the subtleties, connotations, and “between the lines” interpretations of newsfeeds is one of the greatest challenges we’ve taken upon ourselves. We’ve recently developed a whole new set of rulebooks to extract – or, if you will, divine – the sentiment attributed to sentences, paragraphs and articles.

Running our Sentiment Analysis (SA) engine on financial sites, we are able to automatically generate a sentiment score for each company mentioned. Likewise, applying the SA engine to health forums, we are able to ascertain how satisfied patients are with a certain drug or treatment. The possibilities are nearly endless, but the SA topic deserves an entry (or several) of its own, so enough for now 🙂

I’ll wrap up this entry with an amusing example:

A few weeks ago while testing our Financial Events rulebook, one of our engineers encountered the following suspicious extraction:

<Acquisition>

<ACQUIRER>Goldman Sachs</ACQUIRER>

<ACQUIRED>Treasury Department</ACQUIRED>

<ACTION_ACQUISITION>to Acquire</ACTION_ACQUISITION>

<ACTION_STATUS>in Talks</ACTION_STATUS>

</Acquisition>

Preparing himself for a new bout of debugging, our engineer turned to the source text from which the event was extracted.

You may want to have a look at the source yourself.

Quoting the article, it indeed turns out that “Goldman Sachs [is] in Talks to Acquire Treasury Department”.

It seems even Google Finance accepts satire from time to time (I especially enjoyed the description of the festivities: “Goldman recently celebrated record earnings by roasting a suckling pig over a bonfire of hundred-dollar bills”). Naturally, we took immediate action and hired two comedians to join our algorithms team in hope of improving CARE’s sense of humor 🙂

____________

On a more personal note, I just landed in the USA, and am very excited about beginning my studies at MIT in a few weeks… Meanwhile, please continue with your comments and questions. Till soon…

Yes We CARE!

It’s been nearly a month since the last post – so there’s much to catch up on! In this entry I’ll explain a little more about what makes CARE so special, and provide examples to prove it.

The beauty of CARE is that it combines several cutting-edge technologies to achieve its ultimate goal:

Extracting complex relations from free text with a precision and recall rate of over 90%.

In the professional lingo, CARE is what is known as a hybrid system. On the one hand, it employs supervised machine learning for Named Entity Recognition, and on the other, it involves Knowledge Engineering-based rules. 

If you’re not an expert you’re probably a bit lost, so let me explain :-).

CARE is comprised of 2 basic components. The first component uses automated algorithms to learn the semantics of specific words and phrases according to their statistical distributions in text corpora. This is referred to as the Named Entity Recognition (NER) module. The primary NER module is currently trained to identify salient entity categories such as: People, Organizations, Locations, Products, etc. However, the NER can essentially be trained to identify any category of entities.

On top of the NER comes the second component, which comprises a set of relation-specific linguistic rules. These rules are written by a team of engineers and linguists. Together, this collection of rules forms what is known as a grammar. Formally, CARE uses what is called a Context-Free Grammar (CFG), a term indicating that each rule applies independently of the context in which it occurs. More precisely, we use a Weighted CFG, meaning each rule is assigned a weight, so that rules may compete with each other, with the highest-weighted rule prevailing. It is this Weighted CFG component that allows CARE to decipher and extract higher-level relations from within the text.
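A minimal sketch of the “highest-weighted rule prevails” idea (the rules and weights below are invented for illustration; the real grammar is vastly richer and operates over full parse trees):

```python
# Invented toy grammar: two rules compete to label the same
# symbol sequence, and the heavier one wins.
WEIGHTED_RULES = [
    # (left-hand side, right-hand side, weight)
    ("PPC",       ("PERSON", "is", "POSITION", "at", "ORG"), 0.9),
    ("SMALLTALK", ("PERSON", "is", "POSITION", "at", "ORG"), 0.2),
]

def apply_best_rule(symbols):
    """Return the label of the highest-weighted rule whose right-hand
    side matches the symbol sequence, or None if nothing matches."""
    matches = [(lhs, w) for lhs, rhs, w in WEIGHTED_RULES
               if tuple(symbols) == rhs]
    return max(matches, key=lambda m: m[1])[0] if matches else None

print(apply_best_rule(["PERSON", "is", "POSITION", "at", "ORG"]))  # -> PPC
```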

The real beauty of this, though, is that the interface between the two components described above is flexible. Now, what could I possibly mean by a flexible interface? Well, put simply, it means that the two components interact. Instead of two isolated, immutable components, we have the two running in parallel and modifying each other’s output until an optimal result for both modules is obtained. As it turns out, it is exactly this synergetic cooperation between the automated machine-learning module and the semantically engineered grammar that gives CARE its winning edge.

Well, enough theory (at least for this entry :-)) let’s talk facts!

Recently, we ran an experiment to test our “Person-Position-Company” relation rulebook (PPC), on biography pages we selected randomly from the Internet. Before sharing the results with you, I’ll explain the task at hand and give a few examples so that you can get a feel of how complex it is.

Say CARE receives the following paragraph as input: 

Mr. Alan H. Fishman is the Chief Executive Officer and Director at Washington Mutual Inc. Previously, he was President and Chief Executive Officer at Independence Community Bank Corp. Prior to that, he served as the President and Chief Executive Officer at ContiFinancial Corp. from July 1999 to December 2000. Mr. Fishman is also the Founder and was the Managing Partner at Columbia Financial Partners LP, from 1992 to March 2001. 

How many different relations should be extracted from this paragraph? As a quick exercise (or rather, a task in occupational therapy), try “extracting” all the relevant information yourself. Meanwhile, I will leisurely run CARE (honestly, this is precisely what I am doing as I write these lines) and obtain the following results within milliseconds:

<document>

<PPC>
<EDATE>CURRENT</EDATE>
<EMPLOYER>Washington Mutual Inc.</EMPLOYER>
<NAME>Mr. Alan H. Fishman; Mr. Fishman</NAME>
<POSITION>Chief Executive Officer</POSITION>
</PPC>

<PPC>
<EDATE>CURRENT</EDATE>
<EMPLOYER>Washington Mutual Inc.</EMPLOYER>
<NAME>Mr. Alan H. Fishman; Mr. Fishman</NAME>
<POSITION>Director</POSITION>
</PPC>

<PPC>
<EDATE>PAST</EDATE>
<EMPLOYER>Independence Community Bank Corp.</EMPLOYER>
<NAME>Mr. Alan H. Fishman; Mr. Fishman</NAME>
<POSITION>President</POSITION>
</PPC>

<PPC>
<EDATE>PAST</EDATE>
<EMPLOYER>Independence Community Bank Corp.</EMPLOYER>
<NAME>Mr. Alan H. Fishman; Mr. Fishman</NAME>
<POSITION>Chief Executive Officer</POSITION>
</PPC>

<PPC>
<EDATE>December 2000</EDATE>
<EMPLOYER>ContiFinancial Corp.</EMPLOYER>
<NAME>Mr. Alan H. Fishman; Mr. Fishman</NAME>
<POSITION>President</POSITION>
<SDATE>July 1999</SDATE>
</PPC>

<PPC>
<EDATE>December 2000</EDATE>
<EMPLOYER>ContiFinancial Corp.</EMPLOYER>
<NAME>Mr. Alan H. Fishman; Mr. Fishman</NAME>
<POSITION>Chief Executive Officer</POSITION>
<SDATE>July 1999</SDATE>
</PPC>

<PPC>
<EDATE>CURRENT</EDATE>
<EMPLOYER>Columbia Financial Partners LP</EMPLOYER>
<NAME>Mr. Alan H. Fishman; Mr. Fishman</NAME>
<POSITION>Founder</POSITION>
</PPC>

<PPC>
<EDATE>March 2001</EDATE>
<EMPLOYER>Columbia Financial Partners LP</EMPLOYER>
<NAME>Mr. Alan H. Fishman; Mr. Fishman</NAME>
<POSITION>Managing Partner</POSITION>
<SDATE>1992</SDATE>
</PPC>
</document>

The above results are given in our post-processor's XML format, with the outermost tag indicating the relation caught (here PPC) and the inner tags indicating the different slots and entities (all quite self-explanatory, except perhaps EDATE, which stands for “End Date”, and SDATE, which stands, well, for “Start Date” 🙂).
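Since the output is plain XML, consuming it downstream is straightforward. Here is a minimal sketch using Python's standard library, run over a two-record snippet in the same format as the results above (note how SDATE is simply absent for open-ended roles):

```python
import xml.etree.ElementTree as ET

# A two-record snippet in the post-processor's output format.
xml_output = """
<document>
  <PPC>
    <EDATE>CURRENT</EDATE>
    <EMPLOYER>Washington Mutual Inc.</EMPLOYER>
    <NAME>Mr. Alan H. Fishman; Mr. Fishman</NAME>
    <POSITION>Director</POSITION>
  </PPC>
  <PPC>
    <EDATE>March 2001</EDATE>
    <EMPLOYER>Columbia Financial Partners LP</EMPLOYER>
    <NAME>Mr. Alan H. Fishman; Mr. Fishman</NAME>
    <POSITION>Managing Partner</POSITION>
    <SDATE>1992</SDATE>
  </PPC>
</document>
"""

def parse_ppc(xml_text):
    """Return each PPC record as a slot-name -> value dict."""
    root = ET.fromstring(xml_text)
    return [{child.tag: child.text for child in ppc} for ppc in root.iter("PPC")]

records = parse_ppc(xml_output)
```

From here the records drop straight into a database table or spreadsheet, which is exactly the "structured data" payoff discussed below.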

You can check for yourselves to see that we got all the relations (100% Recall) and that we got all of them right (100% Precision).

Pretty impressive, don't you think? Well, that's nothing! Imagine the results when we run this same automatic extraction over hundreds of thousands of biographies within minutes. Just think of the endless applications for such refined, structured data at practically unlimited scale…

Before we get to the punch line (the actual recall and precision results we obtained), I'll add a few technical details for those so inclined (non-experts can skip the next paragraph 🙂).

In our experiment we used a CRF-based NER model trained on the CoNLL 2003 shared-task data. The test set consisted of 10 web pages containing 113 instances of the PPC relation. We ran CARE (with the PPC rulebook) over these pages and then manually checked the results, using the following definitions to present the statistics:

Recall is the number of True Positive catches divided by the total number of True Instances (=113).

Precision is the number of True Positive catches divided by the sum of the True Positive catches and the False Positive catches.
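The two definitions above are short enough to state as code. The counts plugged in below are hypothetical, purely to show the arithmetic (they are not the counts from our experiment):

```python
def recall(true_positives, true_instances):
    """Fraction of the true instances that were caught."""
    return true_positives / true_instances

def precision(true_positives, false_positives):
    """Fraction of the catches that were correct."""
    return true_positives / (true_positives + false_positives)

# Hypothetical example: 100 correct catches out of 113 true instances,
# alongside 5 false catches.
r = recall(100, 113)
p = precision(100, 5)
```

Note the trade-off the two numbers capture: a rulebook that catches everything indiscriminately maximizes recall but wrecks precision, and vice versa, which is why both are reported.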

And finally (drums, please…) for the final results:

In the experiment described above, we obtained a Recall of 94.8% and a Precision of 96.0%.

So, now I have only three words left for you:

Yes we CARE!

__________________________________________________

In the next few weeks I’ll be presenting further examples of relation extraction on an ever-increasing level of complexity. If you have questions, comments or challenging ideas please drop me a note.

Also, if you’d like to see me extract a specific type of relation – don’t hesitate to write me. I’ll be back shortly with more exciting challenges and results – so be sure to stay posted!

Are you into Natural Language Processing? Text Mining? Developing Software for Information Extraction?

If so, take a deep breath, lean back and imagine….

Imagine a search-engine you can use to query the web as though it were your personal structured database.

Imagine verifying the facts of any article, CV or paper, by the time you finish reading it.

Imagine browsing the web at 1,000 pages per second, extracting valuable information as you whiz along.

That’s the kind of imagining I’ve been doing at Digital Trowel ever since I began working there in 2008 as a linguistic engineer, and I’m proud to say that in the very short time since, together with our team of engineers and scientists, we’ve managed to transform our imagination into cutting-edge technology. Now we want to invite you to continue imagining with us – which, essentially, is what this blog is all about.

But first, a little more about what we are enabling people to do with our technology if they do it with CARE. That's the name we've given to our unique NLP engine (an acronym for CRF-Assisted Relation Extraction). Incorporating highly sophisticated learning algorithms, CARE can scan millions of pages in a matter of minutes, extracting detailed biographical data and up-to-date contact info as it goes, making sense of free-language texts in any style and any format.

A simple example of how our technology can be applied: say you'd like to compile a list of all CFOs at major companies across Chicago, along with their contact info and employment history. Sure, all the information is already out there on the web, but good luck finding it with a standard search engine. Just for the heck of it, give it a try. You'll quickly appreciate the value of the automated technology we have pioneered, and of the staggering, unprecedented 90% accuracy rate we can deliver.

But the truth is, I believe CARE can and should be even better, which brings me back to you. While our team of world-renowned text-mining computer scientists and talented algorithm engineers has accomplished incredible results, we recognize that we've only just scratched the surface of CARE's true potential.

Now we want you to scratch deeper with us. CARE can essentially be used to extract any information from any source: match drugs with their side effects as reported in medical forums, compile lists of stock prices along with concurrent ratings posted on financial sites, or scan Wikipedia for correlations between major historical figures and events. All you need to do is imagine. So, in the coming weeks, I'll be posting a secured-access link that will enable you to play around with our engine.

Whether you’re developing software that requires access to information-extraction services, conducting theoretical or empirical research in discriminative-probabilistic algorithmic models for NLP, or simply have been dreaming of having access to a state-of-the-art text-mining engine tailor-made for your own needs – this is an opportunity you won’t want to miss.

The idea is to let you compose rulebooks that will then be loaded into CARE and applied to texts and URLs of your choice. Of course you get to use any information extracted.

And what do we get out of it? Are we crazy to be allowing access to our precious CARE over the web? Well, not quite… in fact, au contraire!

Pioneering these new frontiers of text-mining, we’ve discovered more directions to go in than we can possibly explore ourselves, so we are more than happy to share significant parts of our technology with developers and users who may suggest new features, report bugs and even contribute new information extraction rulebooks.

I myself am one of the senior rulebook composers at DT, which is good news for you: I will be in charge of moderating, commenting on, and helping anyone interested in honing and debugging the rulebooks they load into CARE on the site. This way you can enjoy our engine while helping us improve the spectacular results we've already achieved.

Once we upload an API to plug into CARE you’ll be able to start using it immediately, see what all the hype is about, and judge for yourself if it is justified. We are convinced you’ll conclude that it is.

Another hope of ours is that this blog will help establish a small but dynamic community of text-mining enthusiasts who can enjoy our technology as well as help us by challenging it to its limits!

We invite everyone and anyone to partake in this effort. This is a real chance for you to actually make a difference and, at the same time, take advantage of our breakthrough technology: create your own rulebook, load it into CARE, and run it on the content of your choice…

We look forward to your becoming a part of this process, and expect that soon you’ll be extracting information from the web in a way that you never have before. Until then, take CARE.