Most computer operations depend on categorizing data in bulk, and two types of data affect how quickly information can be assimilated and recalled: structured data and unstructured data.
Structured vs. Unstructured
For the most part, structured data refers to information with a high degree of organization, so that it fits cleanly into a relational database and is readily searchable by simple, straightforward search engine algorithms or other search operations. Unstructured data is essentially the opposite: it has no predefined fields for a query to rely on.
The lack of structure makes compiling unstructured data a time- and energy-consuming task. A company would benefit at every level of the business from a method of data analysis that reduces the costs unstructured data imposes on the organization.
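To make the contrast concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table, the customer names and the free-text note are all invented for illustration. The sketch shows why a question that takes one query against structured data becomes much harder when the same facts sit in free text.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO invoices (customer, amount) VALUES (?, ?)",
    [("Acme", 1200.0), ("Globex", 450.0), ("Acme", 300.0)],
)

# Structured: the field to search is known in advance, so one query suffices.
structured_total = conn.execute(
    "SELECT SUM(amount) FROM invoices WHERE customer = ?", ("Acme",)
).fetchone()[0]

# Unstructured: the same facts buried in a free-text note.
note = "Spoke with Acme today; they owe us 1200 for March and another 300 for April."

# A naive substring check only reveals that the name appears somewhere.
# It says nothing about which numbers are amounts or what they add up to.
mentions_acme = "acme" in note.lower()

print(structured_total)
print(mentions_acme)
```

Getting the same total out of the note would require deciding which numbers are amounts, which customer they belong to, and whether "March" and "April" matter. That extra interpretation is the cost the paragraph above describes.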
The Problem with Unstructured Data
Of course, if it were possible or feasible to instantly transform unstructured data into structured data, then creating intelligence from unstructured data would be easy.
However, structured data is akin to machine language, in that it makes information much easier for computers to handle, whereas unstructured data is (loosely speaking) usually meant for humans, who don’t easily interact with information in a strict database format.
Email is an example of unstructured data. The busy inbox of a corporate human resources manager might be arranged by date, time or size, but if it were truly fully structured, it would also be arranged by exact subject and content, with no deviation or spread. That is impractical, because people don’t generally write about precisely one subject, even in focused emails.
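The email case can be sketched with Python's standard email package. The message below is made up for illustration. The sketch shows that the headers behave like structured fields, while the body is free text that wanders across several topics.

```python
from email import message_from_string

# A made-up message; addresses and content are invented for illustration.
raw = """\
From: hr@example.com
To: staff@example.com
Subject: Benefits enrollment
Date: Mon, 06 Jan 2025 09:30:00 -0500

Hi all, enrollment closes Friday. Also, the parking garage is
closed next week, and please welcome Dana to the team.
"""

msg = message_from_string(raw)

# Headers are named fields, so they can be sorted and filtered like columns.
subject = msg["Subject"]
sender = msg["From"]

# The body is one block of free text. The subject line says "benefits",
# yet the body also covers parking and a new hire.
body = msg.get_payload()

print(subject)
print(sender)
print(body)
```

An inbox can be sorted on the header fields at no cost, but nothing in those fields reveals that this message also matters to someone searching for parking closures.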
Spreadsheets, on the other hand, would be considered structured data, which can be quickly scanned for information because it is arranged in rows and columns, much like a table in a relational database system. The problem that unstructured data presents is one of volume. Most business interactions are of this kind, and sifting through them to extract the necessary elements requires a huge investment of resources, as with a web-based search engine.
Because the pool of information is so large, current data mining techniques often miss a substantial amount of the available information, much of which could be game-changing if analyzed efficiently.
BrightPlanet’s Solution for Unstructured Data
BrightPlanet’s Deep Web harvesting platform provides a robust solution for collecting both structured and unstructured data from the Internet. BrightPlanet takes a unique approach to “connecting” those unconnected strands of information through the use of metadata.
BrightPlanet’s Deep Web harvesting technology employs multiple threads to mass-harvest scalable quantities of unstructured data. Harvests are based on multiple user-developed queries, with results (web pages, PDFs, XLS, PPT, XML, etc.) qualified through customizable filters. BrightPlanet developed four scoring algorithms that index the information by relevancy to further qualify the documents returned, so that the user sees only highly relevant content.
The final user interface displays the qualified results in a searchable database based on customizable facets (URL, filetype, source category, people mentioned, places mentioned, companies mentioned, custom keywords, etc.).
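The general shape of a filter, score and facet pipeline like the one described above can be sketched in a few lines of Python. This is not BrightPlanet's implementation: the documents, the allowed file types, the query terms and the term-counting score are all assumptions made for illustration, and the toy score stands in for the four proprietary scoring algorithms, which the article does not describe.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical harvested documents; all values invented for illustration.
documents = [
    {"url": "https://example.com/a.pdf", "filetype": "pdf",
     "text": "Quarterly supplier audit for Acme. Supplier risk is rising."},
    {"url": "https://example.com/b.html", "filetype": "html",
     "text": "Company picnic photos."},
    {"url": "https://example.org/c.xls", "filetype": "xls",
     "text": "Supplier price list, updated monthly."},
    {"url": "https://example.org/d.gif", "filetype": "gif", "text": ""},
]

ALLOWED_TYPES = {"pdf", "html", "xls", "ppt", "xml"}  # assumed filter setting
QUERY_TERMS = ["supplier", "audit"]                   # assumed user query


def score(doc):
    """Toy relevance score: number of query-term occurrences in the text."""
    text = doc["text"].lower()
    return sum(text.count(term) for term in QUERY_TERMS)


# Step 1: qualify results with a simple file-type filter.
candidates = [d for d in documents if d["filetype"] in ALLOWED_TYPES]

# Step 2: score the candidates on several threads; map() preserves input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(score, candidates))

# Step 3: drop documents with a zero score and rank the rest, best first.
qualified = sorted(
    ((s, d) for s, d in zip(scores, candidates) if s > 0),
    key=lambda pair: pair[0],
    reverse=True,
)

# Step 4: build one facet (file type) over the qualified results.
filetype_facet = Counter(d["filetype"] for _, d in qualified)

for s, d in qualified:
    print(s, d["url"])
print(dict(filetype_facet))
```

In this sketch the facet is just a count per file type. A production system would presumably extract richer metadata, such as the people, places and companies mentioned, and build a facet over each.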
Finding a way to analyze and create intelligence from the wealth of unstructured data available on the Web can be expected to give an organization a direct benefit: drastic increases in the overall effectiveness and speed of decision-making and implementation.
Photo courtesy of Brandon Doran