Advances in hardware, cloud computing, and more efficient database and warehousing solutions have enabled a new way of thinking about data analysis, particularly when it comes to data collection. The old way of thinking, built around the concept of sample size (often denoted ‘n’), is becoming a thing of the past.
Instead of collecting an appropriate sample of a dataset, why not go directly to the source and collect the dataset in its entirety? This new paradigm has pushed many industries to exploit large amounts of unstructured external data, which is significantly more difficult to manage than a structured dataset. In this blog post, we explore noise in data, how it arises, and why it is sometimes best to embrace all of the data instead of just some of it.
Where Noise Gets Introduced
Anytime you incorporate unstructured data into your analysis, you introduce a few additional challenges. The main challenge for our customers is that some noise (i.e., imperfect data) often gets included when previously unstructured datasets are structured. This noise is typically introduced when companies deploy automated processes to structure data from hundreds or even thousands of disparate sources, as is the case when collecting external and Web data.
No technology is immune to introducing noise, including BrightPlanet and our Natural Language Processing (NLP) engine. A common example at BrightPlanet occurs during entity extraction, where the names of people, companies, and places can be incorrectly identified.
Consider the following Los Angeles Times article on Huston Street, a closing pitcher for the Los Angeles Angels. The image below shows our entity extraction process. The green highlight indicates that the pitcher, Huston Street, is being extracted as a named place instead of a person. Mistakes like this, known as false-positive extractions, are a common way noise gets introduced.
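To see how a false positive like this can arise, here is a minimal, hypothetical sketch of a rule-based entity tagger. It is not BrightPlanet's actual NLP engine; the rule, suffix list, and function names are illustrative assumptions. The point is that a heuristic like "capitalized phrases ending in a street-like word are places" will misfire on a person whose surname happens to be Street.

```python
import re

# Hypothetical heuristic: capitalized phrases ending in a street-like
# suffix are tagged as places. This is exactly the kind of rule that
# misfires on people whose names contain those words.
PLACE_SUFFIXES = ("Street", "Avenue", "Boulevard")

def tag_entities(text):
    """Return (phrase, label) pairs for capitalized multi-word phrases."""
    entities = []
    # Match runs of two or more capitalized words, e.g. "Huston Street".
    for match in re.finditer(r"(?:[A-Z][a-z]+\s)+[A-Z][a-z]+", text):
        phrase = match.group().strip()
        label = "PLACE" if phrase.endswith(PLACE_SUFFIXES) else "PERSON"
        entities.append((phrase, label))
    return entities

# The pitcher's name triggers the place rule: a false positive.
print(tag_entities("Huston Street closed the game."))
# A name without a street-like suffix is tagged as a person.
print(tag_entities("Mike Trout hit a home run."))
```

A production NLP engine uses far richer context than this sketch, but the failure mode is the same: surface features of a name can outweigh the surrounding evidence that the text is about a person.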
Changing the Way We Think About Data
What companies need to realize is that the payoff of having access to all the data, even with its imperfections, is still significantly better than having access to only a small sample, or none of the data at all. Kenneth Cukier, Data Editor for The Economist, writes on this topic in his Foreign Affairs article, “The Rise of Big Data: How It’s Changing the Way We Think About the World”. Cukier maintains,
“When we increase the scale by orders of magnitude, we might have to give up on clean, carefully curated data and tolerate some messiness. This idea runs counter to how people have tried to work with data for centuries. Yet the obsession with accuracy and precision is in some ways an artifact of an information-constrained environment. Tapping vastly more data means that we can now allow some inaccuracies to slip in (provided the data set is not completely incorrect), in return for benefiting from the insights that a massive body of data provides.”
Embrace the Imperfection
At BrightPlanet, we harvest external data from all over the Web based on the problem you are trying to solve or the insight you are trying to gain. Because of the scale at which we are able to harvest and deliver data, there will be imperfections, but as Kenneth Cukier says, “Tapping vastly more data means that we can now allow some inaccuracies to slip in (provided the data set is not completely incorrect), in return for benefiting from the insights that a massive body of data provides.”