Digital Trowel was founded to help alleviate the information overload that has inevitably accompanied the growth of the web. One part of what we do is gather company and executive information, which we provide as relevant and up-to-date business information.

One of our biggest challenges is determining if a company from one source is the same one as we found in another source. Using our proprietary semantic Identity Resolution engine, we perform advanced matching to know if this company is the same as that company. This enables us to avoid duplicates in the database and to combine multiple information sources to create rich company and executive profiles.  This process is referred to as “Identity Resolution” or “Match & Merge.”

Our matching system is based on advanced semantic and statistical models, natural language processing and machine learning. Using our massive database of web-sourced data, the system has created sets of positive and negative rules to decide if two contacts should be matched.

Positive Rules

The inputs are first exposed to a defined rule set made up of certain positive rules that are meant to examine the likelihood that the two inputs refer to the same entity. The names of the companies are examined first and based on the likelihood that they refer to the same company the pair is given a rating between 0 and 1.

For example: Luigi’s Pizzeria and Luigi’s Italian Restaurant will be matched and assigned a score of 1 as it is likely that the two names refer to the same company.

Next, the contact information of each company is examined in order to ascertain a true connection between the two. This includes the physical addresses of the two entities, the phone numbers, URLs, employee information and so on. Based on the similarities between all this information, the two entities are combined into a “matched set.”

“Fuzzy Matching” is utilized in the identification of spelling mistakes, and an advanced semantic process analyzes the actual meaning of the content, allowing the system to take into account synonyms and similar-meaning words (such as “restaurant” and “diner”). In addition, an abbreviation process considers whether YMCA is its own name, or just shorthand for Young Men’s Christian Association.
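
To make the three ideas above concrete, here is a minimal Python sketch. It is not our production engine: the tiny synonym and abbreviation tables are hypothetical, and the standard library's `difflib.SequenceMatcher` stands in for whatever fuzzy matcher a real system would use. It only illustrates how normalizing names first lets a simple string comparison produce a score between 0 and 1.

```python
from difflib import SequenceMatcher

# Hypothetical lookup tables, for illustration only.
SYNONYMS = {"diner": "restaurant"}
ABBREVIATIONS = {"ymca": "young men's christian association"}


def normalize(name: str) -> str:
    """Lowercase, unify apostrophes, expand abbreviations, map synonyms."""
    text = name.lower().replace("\u2019", "'").strip()
    text = ABBREVIATIONS.get(text, text)
    words = [SYNONYMS.get(word, word) for word in text.split()]
    return " ".join(words)


def name_score(a: str, b: str) -> float:
    """Return a similarity rating between 0 and 1 for two company names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()
```

With these tables, `name_score("Luigi's Diner", "Luigi's Restaurant")` treats the two names as identical after synonym mapping, while a misspelling such as "Resturant" still scores high because the fuzzy comparison tolerates the missing letter.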

Negative Rules

Every “matched set” is then processed through a defined rule set made up of negative rules. These examine the likelihood that the two entities are in fact different companies, regardless of the “matched set” status they earned by meeting the standards of the positive rules. The DUNS number, stock symbol and other recognized identifiers are examined for any discrepancies. If this examination finds a conflict in information, the set is assigned a “problematic” status, re-examined, and either accepted as a match or dismissed as two different companies.
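
A negative rule of this kind can be sketched in a few lines of Python. The field names (`duns`, `stock_symbol`) and the status strings are assumptions made for the example; the point is only that a conflict is flagged when both records carry an identifier and the values disagree, while a missing value is not treated as a conflict.

```python
# Identifiers that should agree whenever both records carry them.
# The field names are hypothetical, chosen for this illustration.
HARD_IDENTIFIERS = ("duns", "stock_symbol")


def find_conflicts(a: dict, b: dict) -> list:
    """List the identifiers on which two matched records disagree."""
    conflicts = []
    for key in HARD_IDENTIFIERS:
        value_a, value_b = a.get(key), b.get(key)
        # A missing value is not a conflict; only two differing values are.
        if value_a and value_b and value_a != value_b:
            conflicts.append(key)
    return conflicts


def review_status(a: dict, b: dict) -> str:
    """Flag a matched set as 'problematic' if any hard identifier conflicts."""
    return "problematic" if find_conflicts(a, b) else "matched"
```

A set flagged as “problematic” here would then go on to the re-examination step described above.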

Merge

After two entities are matched, the next step is to merge them correctly into one profile. Sources are assigned priority levels based on the quality, accuracy and recency of the information, so that any conflicting data can be resolved in favor of the source with the highest priority. Priority and source quality are assigned independently per attribute, and in this way we can refer to the most accurate information provided on one topic regardless of the accuracy of the remaining information provided by the source.

For example: If one source has great company financial data but poor phone records, aside from the overall priority score assigned to the source as a whole, each one of these attributes receives a quality and priority ranking; the source would be considered when dealing with financial information but basically disregarded when handling phone records. In this way the most relevant and accurate information per category is utilized.
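
The per-attribute idea can be sketched as follows. The source names, attribute names and numeric priorities are invented for the example, and a higher number is assumed to mean a more trusted source for that attribute.

```python
def merge_profiles(records: dict, priorities: dict) -> dict:
    """Merge matched records, choosing each attribute from its best source.

    records:    {source_name: {attribute: value}}
    priorities: {source_name: {attribute: rank}}, higher rank wins
                (sources or attributes without a rank default to 0).
    """
    merged = {}
    best_rank = {}
    for source, attributes in records.items():
        for attribute, value in attributes.items():
            if value is None:
                continue  # nothing to contribute for this attribute
            rank = priorities.get(source, {}).get(attribute, 0)
            if attribute not in best_rank or rank > best_rank[attribute]:
                best_rank[attribute] = rank
                merged[attribute] = value
    return merged


# Hypothetical sources: one strong on financials, the other on phone data.
records = {
    "source_a": {"revenue": 5_000_000, "phone": "555-0100"},
    "source_b": {"revenue": 4_200_000, "phone": "555-0199"},
}
priorities = {
    "source_a": {"revenue": 9, "phone": 2},
    "source_b": {"revenue": 4, "phone": 8},
}
profile = merge_profiles(records, priorities)
```

In this example the merged profile takes its revenue from `source_a` and its phone number from `source_b`, which is the behavior described above.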

Precision and Recall

We need to strike the optimal balance between precision and recall. For our purposes, precision is the fraction of the information we retrieve that is correct, while recall is the fraction of the available information that we actually retrieve. We have created advanced tools that allow us to dial the precision and recall up or down to find the right balance. The more “loosely” we match, the greater the opportunity to extract more information on the topic, leading to higher recall; the “tighter” the match, the less likely we are to connect the dots incorrectly, improving precision.

Depending on our customers’ wants and expectations, we can select the appropriate balance between precision and recall to offer the most accurate and rich concentration of knowledge to our customers.
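
The trade-off is easy to see on a toy example. In the sketch below, each candidate pair carries a match score and a hand-checked label saying whether the two records really are the same company; the numbers are made up. Precision is computed as true matches divided by all pairs we accepted, and recall as true matches divided by all pairs that should have been accepted.

```python
def precision_recall(scored_pairs, threshold):
    """Compute (precision, recall) when pairs scoring >= threshold are matched.

    scored_pairs: iterable of (score, is_same_company) tuples.
    """
    tp = fp = fn = 0
    for score, is_same in scored_pairs:
        accepted = score >= threshold
        if accepted and is_same:
            tp += 1          # correct match
        elif accepted and not is_same:
            fp += 1          # dots connected incorrectly
        elif not accepted and is_same:
            fn += 1          # true match we missed
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


# Made-up scores with hand-checked labels.
pairs = [(0.95, True), (0.85, True), (0.70, False), (0.65, True), (0.40, False)]

loose = precision_recall(pairs, threshold=0.6)   # (0.75, 1.0)
tight = precision_recall(pairs, threshold=0.8)   # precision 1.0, recall 2/3
```

Lowering the threshold picks up the true match at 0.65 but also lets in the false one at 0.70; raising it does the reverse.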

To close this series, this post will concentrate on how to use web-mined event information as variables in modeling and/or decisioning. For simplicity’s sake, I will break it down by sub-topic.

Introducing Event Data into Predictive Models

Event data can easily be introduced as predictor (independent) variables within pre-existing risk and marketing models in two basic modes.  The first adds them as later stage variables, which means the pre-existing model variables are entered on a forced basis (which replicates the current state of the model), and the event variables are subsequently allowed to enter.  This process ensures that the event variables are evaluated for their incremental contribution to the model, and do not displace any pre-existing model variables.  In contrast, the alternative mode starts the model development from scratch, and pre-existing model variables might be replaced by the event variables.  This approach may outmode the current model, but yield a more optimized set of factors.

The event input data should be coded into event types as well as time periods: for example, the number of litigation occurrences in the last 3, 6, 12, 18 and 24+ months. As a simple example, I’ve found very high correlation between the number of lawsuits from different parties and payment delinquency. Sometimes the source and amount of an event are also desirable, but from a practical perspective they create significant complexity (a single litigation event might now be exploded into many different combinations of source and amount, which need to be individually tested).
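
Here is one way the coding step might look in Python. I am assuming cumulative windows (an event one month old counts toward the 3, 6, 12, 18 and 24-month variables) plus a separate bucket for anything 24 months or older; the variable naming scheme is likewise just an illustration.

```python
from datetime import date

WINDOWS_MONTHS = (3, 6, 12, 18, 24)


def months_between(earlier: date, later: date) -> int:
    """Whole months elapsed from `earlier` to `later`."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    if later.day < earlier.day:
        months -= 1  # the last month is not yet complete
    return months


def event_counts(events, event_type: str, as_of: date) -> dict:
    """Turn dated events into time-window count variables for one event type.

    events: iterable of (event_type, event_date) tuples.
    """
    counts = {f"{event_type}_last_{m}m": 0 for m in WINDOWS_MONTHS}
    counts[f"{event_type}_over_24m"] = 0
    for kind, when in events:
        if kind != event_type or when > as_of:
            continue  # wrong type, or dated after the scoring date
        age = months_between(when, as_of)
        if age >= 24:
            counts[f"{event_type}_over_24m"] += 1
            continue
        for m in WINDOWS_MONTHS:
            if age < m:
                counts[f"{event_type}_last_{m}m"] += 1
    return counts


events = [
    ("litigation", date(2010, 11, 15)),
    ("litigation", date(2010, 5, 1)),
    ("litigation", date(2008, 1, 10)),
    ("partnership", date(2010, 12, 1)),
]
variables = event_counts(events, "litigation", as_of=date(2010, 12, 31))
```

The resulting dictionary can be fed to a model as six candidate predictor variables per event type.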

Treating Events as Triggers

Sometimes, events are hugely significant in their impact on risk and/or reward. An obvious example is M&A, which (believe it or not) is a variable ignored in risk models. The effects of these events cannot be easily quantified in models, and so they are best treated as “triggers” or decisioning input (for subsequent manual review and intervention).
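
In code, treating an event as a trigger simply means routing it to a review queue instead of a model variable. The sketch below assumes events arrive as dictionaries with a `type` field and that M&A is the only trigger type; both are assumptions for illustration.

```python
# Event types routed to manual review rather than scored (assumed set).
TRIGGER_EVENT_TYPES = {"m&a"}


def route_events(events):
    """Split events into (triggers for manual review, inputs for the model)."""
    triggers, model_inputs = [], []
    for event in events:
        if event["type"].lower() in TRIGGER_EVENT_TYPES:
            triggers.append(event)
        else:
            model_inputs.append(event)
    return triggers, model_inputs
```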

Sentiment Analysis

Sentiment Analysis of companies is one of the more interesting qualitative pieces of data that has recently become available, due to advances in web mining. Briefly stated, sentiment measures the positive or negative “buzz” about a company.  The firms that utilize product sentiment analysis use varying sources and methods to produce “sentiment scores”.  Minimally, sentiment analysis can be used to corroborate certain decisions, and may have predictive ability as well.  Like events, sentiment scores can be used as time-based model variables, or as external triggers.

To conclude, I would like to stress that there should be no doubt that select business events affect the risk and opportunity value of a company. Event data, and its accompanying sentiment, is available on a near-real-time basis on the Internet. Semantic analysis companies (such as Digital Trowel) have created a process that mines this data and presents it in coded form, which can be made available to scoring and decision models, as well as human monitors. Virtually any company relying on risk and/or potential models can incorporate this powerful information to enhance their accuracy, by employing it either as internal variables or as external decision factors.

Please contact me with any questions or comments. I can be reached by commenting on the blog, or via email at Steve (at) digitaltrowel.com

Check back soon for more in-depth exploration of the growing text-mining phenomenon.

- Steve

I would now like to explore the concept of “Business Events,” particularly their effect on company risk. First things first, let’s define risk. The traditional definition of risk is that a company will be unable to make the required payments on its debt obligations. This, of course, is a narrow financial definition, and if you’re a lender that’s probably exactly what you care about. If you’re a supplier, on the other hand, you probably view risk in a larger scope; for example, is your customer having financial difficulties, and will he demand to renegotiate payment terms on a more extended basis, renegotiate pricing in a downward direction, reduce his order commitments, and so on? Furthermore, although many risk models have a 1-2 year horizon, a short-term view is also needed, and web data can be used in that short-term (1-6 month) context.

Regardless of whether you’re a lender, supplier, analyst or salesperson, here are some events that affect the risk and growth behavior of a company, for better or worse, and that can be mined from web data in a very timely manner:

  • Litigation: When a company starts to have cash flow problems, one of its first reactions is to delay payment to some suppliers. At some point, payment delinquency moves from the “tolerant” stage to the litigation stage. But what happens if you’re modeling the risk of a company and do not have access to its AP/AR data? How do you recognize litigation without waiting for it to possibly appear in a financial report? Fortunately, there are fee-based web sources that detect and track litigation, including LexisNexis, Public Access to Court Electronic Records (PACER), and D&B. Publicized litigation that has made it into the media can be obtained at little or no direct cost; recent (2010) examples of major litigation include BP, Microsoft’s suit against Salesforce.com (patent infringement), Borg-Warner (asbestos product liability), and Chrysler (failure to pay suppliers). Of course, the above companies are large enough to withstand the litigation payouts and avoid default; but what does this do to their sales & marketing budget and supplier terms? In our more expanded view of risk, these are important topics!
  • Analyst recommendations: Analyst recommendations often, and quickly, affect a company’s stock price. Downward recommendations that cause the stock to fall place pressure on the company to compensate. A typical reaction is to cut expenses in order to boost earnings. Of course, this action does not bode well for the company’s sales & marketing efforts, or its suppliers.
  • Partnerships: Partnerships usually indicate positive growth activity, and by logical extension, lower the company risk.
  • M&A: M&A logically reduces the target company’s risk.  Although M&A (and even its announcement) should immediately change the risk score of the target company, this is usually not the case, since the scoring models have no way of quickly recognizing the event.
  • Key employee movement: When a company hires a heavyweight senior executive, it is invariably a growth move, which should lower risk (otherwise the executive would likely not take the new position).
  • Insider trading: The purchasing of shares by insiders is often a leading indicator that they expect the stock will go up in the near future (which is itself a leading indicator that the company will expand due to its increased market cap).
  • Product introductions:  A new product introduction is typically a leading indicator of growth, hype, success, and similar; these are all leading indicators that reflect a lowering of risk.
  • Product recalls: At a minimum, product recalls offer a negative distraction to sales. Sometimes, for example in the pharma sector, recalls can have a devastating effect on sales. Sometimes, for example in the auto industry, they may have a more temporary effect. But in either case, they diminish the strength of a company.
  • Financial announcements: Financial announcements are excellent leading indicators, on the upside and downside. They appear on the Internet well before they appear in the financial statements that are used to drive typical company risk models.
  • Competitive tracking: Significant changes in competitive activity greatly affect market potential models, and could well affect risk models. Competitive monitoring becomes increasingly important in economic downturns, since supplier loyalty is overshadowed by the customer need for cost reductions. Generally speaking, as direct competition grows, it becomes more formidable to deal with, and the competitive events including product, financial, employment, and so on should be quantified and incorporated into both risk and marketing models.

Whew! Now that we are all caught up on Business Events, check back for the third and final post of the series that will tie everything together.

Check back in a couple of days for Part 3: Using Web Mined Data to Enhance the Performance of Business Risk and Opportunity Models

Please contact me with any questions or comments. I can be reached by commenting on the blog, or via email at Steve (at) digitaltrowel.com

Cheers!

Steve

As you may know, business risk models have not fundamentally changed over the past 40 years. The famed Altman Z-score model, first published in 1968 by Edward Altman, is still being used as a pillar in the area of modeling bankruptcy. Why? Well, because risk models are typically founded on basic financial information such as working capital, total assets, retained earnings, EBIT, equity, sales, and similar financial statistics that reflect fundamental measurements of company health. Since the importance of these basic financial barometers hasn’t changed over time, the models that employ them haven’t needed to change either. It is true that improvements in risk model performance can be made by incorporating payment patterns; however, this is more suitable for internal customer scoring models, as finding enough reliable and ongoing payment data for an external risk model build and score is difficult indeed!
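
For readers who have not seen it, here is the Z-score in its commonly cited original form for publicly traded manufacturers, written as a short Python function. The coefficients and zone cut-offs are the ones usually quoted for that 1968 version; later variants for private or non-manufacturing firms use different weights, and the input figures in the example are invented.

```python
def altman_z(working_capital, retained_earnings, ebit,
             market_value_equity, sales, total_assets, total_liabilities):
    """Original (1968) Altman Z-score for public manufacturing companies."""
    x1 = working_capital / total_assets
    x2 = retained_earnings / total_assets
    x3 = ebit / total_assets
    x4 = market_value_equity / total_liabilities
    x5 = sales / total_assets
    return 1.2 * x1 + 1.4 * x2 + 3.3 * x3 + 0.6 * x4 + 1.0 * x5


def z_zone(z):
    """Classify a Z-score using the commonly quoted cut-offs."""
    if z > 2.99:
        return "safe"
    if z < 1.81:
        return "distress"
    return "grey"


# Invented figures (any consistent currency unit).
z = altman_z(working_capital=20, retained_earnings=30, ebit=15,
             market_value_equity=80, sales=120,
             total_assets=100, total_liabilities=50)
```

Notice that every input comes from a financial statement or a market quote, which is exactly why the model is only as fresh as the latest report.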

Having been in the business risk and opportunity-modeling arena for many years, I’ve come to the conclusion that the greatest weakness in business data modeling is quite simply the age of the data. There is no doubt that a downswing in EBIT spells bad news; but by the time that is recognized in a financial report, it’s very late in the game, and no modeling technique can overcome the limitations of old data. In my search for a better source of leading indicators, I naturally gravitated to the Internet. After all, the Internet offers an unparalleled, rich, dynamic source of data in both quantitative (e.g. financial reports) and qualitative (e.g. sentiment) form, and many of these are inherently powerful leading indicators of both risk and opportunity.

Not coincidentally, statistical package developers such as SAS and SPSS have already launched applications that combine text mining and analytics.  However, for many companies, it will be preferable to gather the data as a separate process, and then integrate it into their modeling/decisioning processes. Recently, I’ve found that incorporation of web data can improve the accuracy/timeliness of risk-based decisions by as much as 20%; even larger benefits can be expected in the area of market potential analytics.

Stay tuned for my upcoming musings on this topic. Part 2: Using Web Mined Data to Enhance the Performance of Business Risk and Opportunity Models

Please contact me with any questions or comments. I can be reached by commenting on the blog, or via email at Steve (at) digitaltrowel.com

Looking forward to an active dialogue.

- Steve

We’re back!

Hey Everyone,
So much has been happening here at Digital Trowel in the past months, and we’ve sort of let the blog fall to the side. But no more!
This blog is now a company-wide affair. You’ll be seeing regular postings from numerous members of our team, about everything from text mining and data analytics, to new product ideas and development updates.
First up – Steve Gasner, our Chief Data Officer, posting about risk modeling.
Please comment and reply. All are welcome. For any other questions or comments, please feel free to email yoni (at) digitaltrowel.com.
Enjoy!
