Digital Trowel was founded to help alleviate the information overload that has inevitably accompanied the growth of the web. Part of what we do is gather company and executive information, which we provide as relevant, up-to-date business information.
One of our biggest challenges is determining whether a company found in one source is the same company we found in another source. Using our proprietary semantic Identity Resolution engine, we perform advanced matching to decide whether two records describe the same company. This enables us to avoid duplicates in the database and to combine multiple information sources to create rich company and executive profiles. This process is referred to as “Identity Resolution” or “Match & Merge.”
Our matching system is based on advanced semantic and statistical models, natural language processing and machine learning. Using our massive database of web-sourced data, the system has created sets of positive and negative rules to decide whether two contacts should be matched.
Positive Rules
The inputs are first run through a set of positive rules that estimate the likelihood that the two inputs refer to the same entity. The company names are examined first, and the pair is given a rating between 0 and 1 based on the likelihood that the names refer to the same company.
For example: Luigi’s Pizzeria and Luigi’s Italian Restaurant will be matched and assigned a score of 1 as it is likely that the two names refer to the same company.
Next, the contact information of each company is examined in order to ascertain a true connection between the two. This includes the physical addresses of the two entities, the phone numbers, URLs, employee information and so on. Based on the similarities between all this information, the two entities are combined into a “matched set.”
“Fuzzy Matching” is used to identify spelling mistakes, and an advanced semantic process analyzes the actual meaning of the content, allowing the system to take into account synonyms and similar-meaning words (such as “restaurant” and “diner”). In addition, an abbreviation process considers whether YMCA is a name in its own right or just shorthand for Young Men’s Christian Association.
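The name-matching ideas above (fuzzy matching, synonyms and abbreviations) can be illustrated with a minimal Python sketch. This is not our proprietary engine: the synonym table, the function names and the use of the standard library's difflib are all assumptions made for illustration, and the ratings it produces will differ from those of the real system.

```python
from difflib import SequenceMatcher

# Hypothetical synonym table: similar-meaning words map to one canonical token.
SYNONYMS = {"diner": "restaurant", "eatery": "restaurant"}

def words(name):
    """Lowercase the name and strip punctuation from each word."""
    cleaned = ("".join(ch for ch in raw if ch.isalnum()) for raw in name.lower().split())
    return [w for w in cleaned if w]

def normalize(name):
    """Replace each word with its canonical synonym, if one is defined."""
    return " ".join(SYNONYMS.get(w, w) for w in words(name))

def is_acronym_of(short, long):
    """True if `short` is spelled by the initial letters of the words in `long`."""
    initials = "".join(w[0] for w in words(long))
    return len(initials) > 1 and "".join(words(short)) == initials

def name_score(a, b):
    """Return a rating between 0 and 1 for how alike two company names are."""
    if is_acronym_of(a, b) or is_acronym_of(b, a):
        return 1.0
    # SequenceMatcher tolerates small spelling differences (fuzzy matching).
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

if __name__ == "__main__":
    print(name_score("Joe's Diner", "Joe's Restaurant"))     # synonyms collapse to the same name
    print(name_score("YMCA", "Young Men's Christian Association"))  # abbreviation
    print(name_score("Acme Restarant", "Acme Restaurant"))   # spelling mistake
```

A production system would learn its synonym and abbreviation tables from data rather than hard-coding them, which is why the table here is deliberately tiny.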
Negative Rules
Every “matched set” is then processed through a set of negative rules. These rules estimate the likelihood that the two entities are in fact different companies, even though they met the standards of the positive rules and earned “matched set” status. The DUNS number, stock symbol and other recognized identifiers are examined for discrepancies. If this examination finds conflicting information, the set is assigned a “problematic” status and re-examined, and it is then either accepted as a match or dismissed as two distinct companies.
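As a rough illustration of a negative rule, the sketch below flags a matched pair as “problematic” when both records carry the same kind of identifier but with different values. The record layout, the field names and the function names are hypothetical; only the DUNS number and stock symbol come from the description above.

```python
# Hypothetical field names for identifiers that should never differ
# between two records describing the same company.
IDENTIFIER_FIELDS = ("duns", "stock_symbol")

def find_conflicts(a, b):
    """Return the identifier fields that both records have but disagree on."""
    return [
        field
        for field in IDENTIFIER_FIELDS
        if a.get(field) and b.get(field) and a[field] != b[field]
    ]

def review_status(a, b):
    """Assign 'problematic' if any identifier conflicts, otherwise 'matched'."""
    return "problematic" if find_conflicts(a, b) else "matched"

if __name__ == "__main__":
    # Made-up records for illustration only.
    first = {"name": "Acme Corp", "duns": "123456789", "stock_symbol": "ACME"}
    second = {"name": "Acme Corporation", "duns": "987654321"}
    print(find_conflicts(first, second))  # the DUNS numbers disagree
    print(review_status(first, second))
```

A missing identifier is not treated as a conflict here, since one source simply may not carry that field. Whether that is the right call depends on the data.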
Merge
After two entities are matched, the next step is to merge them correctly into one profile. Sources are assigned priority levels based on the quality, accuracy and recency of their information, so that any conflicting data can be resolved in favor of the source with the highest priority. Priority and source quality are assigned independently per attribute. In this way we can rely on the most accurate information available on one topic, regardless of how accurate the rest of that source's information is.
For example: if one source has great company financial data but poor phone records, each of these attributes receives its own quality and priority ranking, in addition to the overall priority score assigned to the source as a whole. The source would be considered when dealing with financial information but largely disregarded when handling phone records. In this way the most relevant and accurate information per category is utilized.
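The per-attribute priority idea can be sketched in a few lines of Python. The source names, attribute names and priority values below are invented for illustration, and the merge function is an assumed simplification rather than our actual merge logic.

```python
# Hypothetical per-attribute priorities: a higher number wins a conflict.
PRIORITY = {
    "source_a": {"revenue": 0.9, "phone": 0.2},  # strong financials, weak phone data
    "source_b": {"revenue": 0.5, "phone": 0.8},
}

def merge(records):
    """Merge {source_name: {attribute: value}} into a single profile.

    For each attribute, keep the value from the source with the highest
    priority for that attribute. Unknown sources or attributes get priority 0.
    """
    merged = {}
    best_priority = {}
    for source, attributes in records.items():
        for attribute, value in attributes.items():
            if value is None:
                continue  # a missing value never overrides a real one
            priority = PRIORITY.get(source, {}).get(attribute, 0.0)
            if attribute not in best_priority or priority > best_priority[attribute]:
                best_priority[attribute] = priority
                merged[attribute] = value
    return merged

if __name__ == "__main__":
    # Made-up records for illustration only.
    profile = merge({
        "source_a": {"revenue": "10M", "phone": "555-0100"},
        "source_b": {"revenue": "12M", "phone": "555-0199"},
    })
    print(profile)  # revenue comes from source_a, phone from source_b
```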
Precision and Recall
We need to strike the optimal balance between precision and recall. For our purposes, precision is the fraction of the matches we make that are correct, while recall is the fraction of the true matches available that we actually find. We have created advanced tools that allow us to dial the precision and recall up or down to find the right balance. The more “loosely” we match, the greater the opportunity to extract more information on the topic, leading to higher recall. The “tighter” the match, the less likely we are to connect the dots incorrectly, improving precision.
Depending on our customers’ needs and expectations, we can select the appropriate balance between precision and recall to offer the most accurate and rich concentration of knowledge to our customers.
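To make the trade-off concrete, the sketch below computes precision and recall for a set of scored candidate pairs at two different match thresholds. The scores and labels are made up for illustration, and treating a single score threshold as the “dial” is an assumption, not a description of our tools.

```python
def precision_recall(scored_pairs, threshold):
    """Compute (precision, recall) when pairs scoring >= threshold are matched.

    scored_pairs is a list of (score, is_true_match) tuples.
    """
    true_pos = sum(1 for score, truth in scored_pairs if score >= threshold and truth)
    false_pos = sum(1 for score, truth in scored_pairs if score >= threshold and not truth)
    false_neg = sum(1 for score, truth in scored_pairs if score < threshold and truth)
    # By convention, report 1.0 when there is nothing to measure.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 1.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 1.0
    return precision, recall

# Made-up candidate pairs: (match score, whether the pair is truly the same company).
PAIRS = [(0.95, True), (0.85, True), (0.75, False), (0.65, True), (0.40, False)]

if __name__ == "__main__":
    print(precision_recall(PAIRS, 0.9))  # tight matching: (1.0, 0.333...)
    print(precision_recall(PAIRS, 0.6))  # loose matching: (0.75, 1.0)
```

With the tight threshold every accepted match is correct but two true matches are missed. With the loose threshold all true matches are found at the cost of one incorrect match.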