How to Find the ‘Signal’ in the Noise of Open Source Data

Finding the ‘signal’ in the noise of open source information is ultimately what drives value for your organization. Whether the goal is to support sales, investment decisions, or the fight against fraud, corruption, IP theft, or terrorism, it all depends on identifying the ‘signal’ in the data.

This article is intended to start a conversation about how to approach open source information successfully.

Internal Data Signals

Most clients are very comfortable finding the signals in their own internal data; solutions for that have existed for years. Think of the standard techniques used in data matching when looking for fraud inside your company. Matching HRIS data sets against payroll data sets will help identify false employees, and matching this data against your suppliers and vendors will help expose duplicate vendors.

Reviewing data entities such as phone numbers or addresses will also reveal fraud, collusion, and a list of other potential problems. This is well understood and well leveraged through software solutions like ACL. This kind of entity matching, rule-based checking, and monitoring for data anomalies has become standard over time. What has made it an automated process is the ability to match against structured data sets with clear rules for identifying which ‘signals’ indicate fraud.
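To make the internal case concrete, here is a minimal sketch of the shared-address check described above. The record layout, field names, and normalization rules are illustrative assumptions, not a real HRIS or vendor schema; production tools apply far richer matching logic.

```python
# Illustrative sketch: flag employees whose address matches a vendor's.
# Record shapes and the naive exact-match normalization are assumptions.

def normalize(address: str) -> str:
    """Normalize an address for naive exact matching."""
    return " ".join(address.lower().replace(".", "").split())

def shared_addresses(employees, vendors):
    """Return (employee_id, vendor_id) pairs whose addresses match."""
    vendor_index = {}
    for v in vendors:
        vendor_index.setdefault(normalize(v["address"]), []).append(v["id"])
    matches = []
    for e in employees:
        for vid in vendor_index.get(normalize(e["address"]), []):
            matches.append((e["id"], vid))
    return matches

employees = [{"id": "E1", "address": "12 King St. West"},
             {"id": "E2", "address": "99 Bay Street"}]
vendors = [{"id": "V7", "address": "12 king st west"}]

print(shared_addresses(employees, vendors))  # [('E1', 'V7')]
```

The key point is the one the article makes: because the inputs are structured, the matching rule is simple and the ‘signal’ (a shared address) is unambiguous.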

External, Open Source Signals

Now, let’s think about open sources. The Internet is the largest data set that exists, with information on everything imaginable. This data exists in many formats with no inherent structure (there are no ‘rows and columns’), which means developing the rules to identify a signal is far more complex and requires more creativity than most tools support today. Natural Language Processing (NLP) helps here: it identifies which entities in the data are useful and turns raw text into structured data.
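As a toy illustration of turning free text into ‘rows and columns’, the sketch below pulls a few entity types out of a sentence with regular expressions. A real pipeline would use an NLP library and far more robust patterns; everything here (the patterns, the sample text) is an assumption for demonstration only.

```python
import re

# Illustrative sketch: extracting a structured 'row' from unstructured
# text. Patterns are deliberately simple, for demonstration only.

PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "money": re.compile(r"\$\d[\d,]*(?:\.\d{2})?"),
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def extract_entities(text: str) -> dict:
    """Return a dict of entity lists extracted from free text."""
    return {name: pat.findall(text) for name, pat in PATTERNS.items()}

row = extract_entities(
    "On 2024-03-01 the vendor invoiced $1,250.00; contact 416-555-0199."
)
print(row)
```

Once text has been reduced to fields like these, the same rule-based techniques that work on internal data can be applied to open source data.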

Open Source Data for a Hedge Fund

Think of the hedge fund that wants to monitor the online policy pages of specific companies to learn which changes are being made to the fine print of products, or to spot legal actions. These signals let the hedge fund make faster investment decisions.

Signals may be as simple as, ‘alert the office immediately if product recall notices are posted’. Perhaps the signal comes earlier, through monitoring social media for sentiment on a new product release, with a threshold on negative versus positive feedback. Entities would include a company’s products, sentiment score, dates, locations, aggregate scores, and frequency of change over a time period.
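A threshold rule like the one just described can be sketched in a few lines. The score range, the 60% negative threshold, and the minimum-mentions guard are all illustrative assumptions, not recommendations.

```python
# Illustrative sketch: alert when the negative share of sentiment-scored
# mentions crosses a threshold. Scores in [-1, 1]; thresholds assumed.

def should_alert(scores, neg_ratio_threshold=0.6, min_mentions=20):
    """Alert only when there are enough mentions and negativity is high."""
    if len(scores) < min_mentions:
        return False  # too little data to trust the ratio
    negative = sum(1 for s in scores if s < 0)
    return negative / len(scores) >= neg_ratio_threshold

print(should_alert([-0.8] * 15 + [0.5] * 5))  # True (75% negative)
```

The minimum-mentions guard matters: a handful of angry posts is noise, not signal, which is exactly the distinction this article is about.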

Open Source Data for Insurance

Maybe it is the chief actuary of a large insurance firm who is eager to better understand the backgrounds of people applying for policies online. What can open source data reveal about a car insurance applicant and how might that affect the rate quoted?

Is there a news article about a previous conviction or an accident where the person was found to be negligent? Is it the fact that social media shows the potential policyholder standing beside his or her new sports car with the caption “had it going 210 on the 407 highway today and it felt great”? Does data such as this match a rule built for high-risk activities?

Identify the Signals

Obviously this is simplified for the sake of this article, but you get the point: the practitioner of open source data needs to take the time to identify what the signal looks like and which rules would cause an action to occur. The entities in the insurance example would include a first name, last name, location, date, vehicle mention, and references to an infraction, drinking or intoxication, or a criminal offense.
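A first-pass version of a high-risk rule for the insurance example might look like the sketch below. The term list, the speed pattern, and the 140 km/h cut-off are illustrative assumptions.

```python
import re

# Illustrative sketch: a keyword-plus-threshold rule for flagging
# high-risk activity in an applicant's post. Terms and limits assumed.

HIGH_RISK_TERMS = {"street racing", "dui", "intoxicated"}
SPEED = re.compile(r"\b(\d{2,3})\b")

def high_risk(post: str, speed_limit=140) -> bool:
    """Flag a post mentioning a risky term or a number >= speed_limit."""
    text = post.lower()
    if any(term in text for term in HIGH_RISK_TERMS):
        return True
    return any(int(n) >= speed_limit for n in SPEED.findall(text))

print(high_risk("had it going 210 on the 407 highway today"))  # True
```

Note the false positive lurking in the example: “407” (the highway number) also exceeds the threshold. Naive rules over unstructured text misfire like this constantly, which is why identifying what the signal actually looks like takes time and creativity.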

In the case of a ‘false employee’ alert triggered by an address shared with another employee or supplier, the actions are very clear.

Think of the security sector for a moment, and let’s assume we’re monitoring risks to employees traveling to a specific country. How many negative events would need to be reported before a ‘no travel’ ban is issued? Does it depend on where the activities are occurring? How frequently they are occurring? Who is being targeted? The time of day or year? Other factors?
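If the risk management team can answer those questions, the rule becomes straightforward to encode. The sketch below counts severity-weighted events in a rolling window; the window length, weights, and threshold are illustrative assumptions that a real duty-of-care program would set itself.

```python
from datetime import date, timedelta

# Illustrative sketch: recommend a 'no travel' ban when the weighted
# count of recent negative events crosses a threshold. Values assumed.

def travel_ban(events, today, window_days=30, threshold=5):
    """events: list of (event_date, severity_weight) tuples."""
    cutoff = today - timedelta(days=window_days)
    weight = sum(sev for d, sev in events if d >= cutoff)
    return weight >= threshold

events = [(date(2024, 6, 1), 2), (date(2024, 6, 10), 2),
          (date(2024, 6, 12), 1), (date(2024, 4, 1), 5)]
print(travel_ban(events, date(2024, 6, 15)))  # True
```

The April event falls outside the 30-day window and is ignored, which answers one of the questions above in code: yes, it depends on when the activities occurred.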

From a duty of care perspective, this would be known by the risk management apparatus of the company. If this is understood, we can build the data rules and create alerts with a workflow based on monitoring open source data.

What about predicting events using open source data monitoring? Is it possible to identify a company expansion by monitoring permit filings across jurisdictions? What about the frequency of new job postings by that company across geographies? Perhaps it is monitoring the growth and movement of their core suppliers to identify where the manufacturer is moving next. All of this can be done through open source data, but you need to have an idea of what signal to look for and in which open source data sets.
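One simple way to operationalize the permit-filing or job-posting idea is to flag a jurisdiction whose latest monthly count jumps well above its recent average. The data and the 2x factor below are toy assumptions.

```python
# Illustrative sketch: flag regions where the latest monthly count of
# permit filings (or job postings) spikes above the recent baseline.

def spike_regions(monthly_counts, factor=2.0):
    """monthly_counts: {region: [counts, oldest..newest]}."""
    flagged = []
    for region, counts in monthly_counts.items():
        history, latest = counts[:-1], counts[-1]
        baseline = sum(history) / len(history)
        if baseline and latest >= factor * baseline:
            flagged.append(region)
    return flagged

print(spike_regions({"texas": [2, 3, 2, 9], "ohio": [4, 4, 5, 5]}))
```

A jump from a baseline of roughly two filings a month to nine is the kind of anomaly that, cross-referenced with job postings and supplier movement, starts to look like an expansion signal.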

Finding Your Signals in Open Source Data

Open source data continues to grow every day. What was once unavailable is now free and open to access (think Google Maps – where was that 10 years ago?). Consider geolocation data (the ability to assign latitude and longitude by matching a textual reference against an open lookup table), advances in image matching, and so much more.
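That geolocation lookup is easy to picture in code. The sketch below matches a place name in text against a tiny toy gazetteer; real lookup tables (such as open gazetteer data sets) hold millions of entries, and the coordinates here are rounded illustrative values.

```python
# Illustrative sketch: assign latitude/longitude by matching a textual
# place reference against an open lookup table. Toy gazetteer, rounded
# coordinates; real tables are vastly larger.

GAZETTEER = {
    "toronto": (43.65, -79.38),
    "london": (51.51, -0.13),
}

def geolocate(text: str):
    """Return (place, lat, lon) for the first gazetteer match, else None."""
    lowered = text.lower()
    for place, (lat, lon) in GAZETTEER.items():
        if place in lowered:
            return place, lat, lon
    return None

print(geolocate("Protest reported in downtown Toronto this morning"))
```

Once free text carries coordinates, the ‘where’ questions raised earlier (where are the negative events occurring, where is the company expanding?) become answerable with the same rule-based tools.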

It is an amazing place to spend time, and we encourage any company looking to grow their open source programs to get creative, think differently, and think about what signals would look like to you. There is absolutely no doubt that open source data will drive value through revenue, lead generation, and loss reduction and prevention.

Good luck finding your signals in the noise! Let me know if we can help.