Deep Web Data Harvesting: Uncovering the Process

Deep Web data harvesting seems intimidating to many organizations, but it does not have to be. In this post, we walk through the process a user goes through with BrightPlanet to get the web data they are looking for, from collecting unstructured data to receiving it in the delivery format of their choice.

Why Deep Web Data Harvesting

There are plenty of out-of-the-box and free applications on the market that help you monitor content from the Surface Web, e.g. Google Alerts. These applications work well for monitoring and alerting you to new content; however, if the solution you are looking for requires more than a simple email alert about new content, you need to look to the Deep Web.

The problems you hope to solve with data found on the Deep Web are significantly more complicated and cannot be answered by a Google Alert. For these harder questions, we offer BrightPlanet’s Deep Web Intel Silo Services to help you not only find the data, but also analyze it.

1. Defining the Scope

The first step in developing a Deep Web Silo is to define the scope of the silo. In this step, you work with BrightPlanet’s Deep Web Investigators to uncover the answers you are hoping to get from the data. You help determine the data to collect from the web and what type of enrichments need to happen to gain actionable insight from the dataset.

2. Harvesting / Tagging

After the scope has been defined, BrightPlanet’s Deep Web Investigators work behind the scenes to harvest the data using BrightPlanet’s patented Deep Web Harvester. The Investigators configure the Deep Web Harvester to collect the content specified in the scope document. Often the end users do not care how BrightPlanet gets the data, only that the data is collected. If you want to learn more about BrightPlanet’s harvest capabilities, check out this blog post.

More important than the actual harvesting of the data is the normalizing and enriching of the content. Data collected from the web is completely unstructured, meaning that it’s just text. It also exists in many different formats: an HTML page, a Word document, a PowerPoint, etc. The true value in harvesting comes not only from getting unstructured data from the web, but from converting it, through tagging and processing, into a semi-structured format.

BrightPlanet Deep Web Investigators work with the end users to determine what types of custom tags are extracted to enrich the documents. The custom tags extracted help prepare the data for analytics and also help answer questions for virtually any domain since they are customizable.
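To make the idea of tagging concrete, here is a minimal sketch of turning plain text into a semi-structured record. It is only an illustration, not BrightPlanet's Deep Web Harvester: the `TAG_RULES` table, the `TaggedDocument` class, and the `tag_document` function are hypothetical names, and simple regular expressions stand in for whatever custom tag extraction a real project would define.

```python
import re
from dataclasses import dataclass, field

# Hypothetical tag rules (tag name -> pattern). Illustrative only; a real
# project would define tags that match the questions in its scope document.
TAG_RULES = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "dollar_amount": re.compile(r"\$\d[\d,]*(?:\.\d+)?"),
}


@dataclass
class TaggedDocument:
    source_format: str                       # e.g. "html", "docx", "pptx"
    text: str                                # normalized plain text
    tags: dict = field(default_factory=dict)  # tag name -> list of matches


def tag_document(text: str, source_format: str) -> TaggedDocument:
    """Collapse whitespace, then attach every rule match as a tag."""
    normalized = " ".join(text.split())
    tags = {}
    for name, pattern in TAG_RULES.items():
        matches = pattern.findall(normalized)
        if matches:
            tags[name] = matches
    return TaggedDocument(source_format, normalized, tags)
```

The text itself stays intact; the tags sit alongside it, which is what makes the record semi-structured and ready for filtering or analytics.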

3. Delivering the Data

Once the data has been harvested and tagged, BrightPlanet delivers it in a format suited to the end user. This last piece of BrightPlanet’s Silo services varies heavily with the scope of the project and the needs of the end users. The output is completely customizable, but some options are:

1. Email Alert

BrightPlanet users who want to monitor datasets for new content added online, or for existing content that changes, can receive the output in a customizable email alert.

Customer Example: A political organization wanted to monitor what type of content was being published about them on message boards, blogs, and in email newsletters. The customer also wanted to know when content had been changed on specific websites identified as their competitors.

BrightPlanet’s Deep Web Investigators automated the harvesting of content from specific blogs, message boards, newsletters, and competitor sites. BrightPlanet then configured two different email alerts. One alert sent a daily update of new mentions, while the other sent notifications about content changes on the competitors’ websites.
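One common way to tell new pages from changed pages is to store a fingerprint of each page between harvests. The sketch below shows that general technique under stated assumptions; it is not a description of how BrightPlanet implements its alerts, and `fingerprint` and `detect_changes` are hypothetical helper names.

```python
import hashlib


def fingerprint(page_text: str) -> str:
    """Hash whitespace-normalized text so layout-only changes are ignored."""
    normalized = " ".join(page_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def detect_changes(previous: dict, current_pages: dict):
    """Compare the latest harvest against stored fingerprints.

    previous:      url -> fingerprint from the last harvest
    current_pages: url -> page text from this harvest
    Returns (new_urls, changed_urls, updated_fingerprints).
    """
    new_urls, changed_urls = [], []
    updated = dict(previous)
    for url, text in current_pages.items():
        digest = fingerprint(text)
        if url not in previous:
            new_urls.append(url)        # candidate for a "new mentions" alert
        elif previous[url] != digest:
            changed_urls.append(url)    # candidate for a "content changed" alert
        updated[url] = digest
    return new_urls, changed_urls, updated
```

The two returned lists map naturally onto the two alerts in the example: one digest of new mentions, one notification of changed pages.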

2. Search Dashboard

Customers who want to search their aggregated content can use BrightPlanet’s Deep Web Silo Dashboard. The dashboard supports full-text search, and any custom tags can be used as facets to narrow the results and help you find the answer you’re looking for fast.
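For readers unfamiliar with faceted search, the idea is to combine a text query with filters on tag values, and to show how many results fall under each value. The following is a small in-memory sketch of that concept, assuming each document is a dictionary with hypothetical `text` and `tags` keys; it does not reflect the dashboard's actual implementation.

```python
from collections import Counter


def search(documents, query, facets=None):
    """Return documents containing the query text and matching every facet filter."""
    facets = facets or {}
    needle = query.lower()
    return [
        doc for doc in documents
        if needle in doc["text"].lower()
        and all(doc["tags"].get(name) == value for name, value in facets.items())
    ]


def facet_counts(documents, tag_name):
    """Count how many documents carry each value of one tag."""
    return Counter(
        doc["tags"][tag_name] for doc in documents if tag_name in doc["tags"]
    )
```

A typical flow would call `search` with only a query, show `facet_counts` for the hits, and then call `search` again with the facet the user clicked.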

Customer Example: Watch this video to see how a health research group uses BrightPlanet’s Silo Dashboard.

3. Access to Raw Data

Customers who need the raw data can be given access to the semi-structured datasets harvested and tagged by BrightPlanet. BrightPlanet can either offer a shared database, so that end users can access the data in real time, or deliver the data as a semi-structured flat file.

Customer Example: An HR staffing company serving Fortune 500 companies wanted to offer its end users more in the area of competitive intelligence. To help build this new service offering, BrightPlanet harvested all the job postings of the Fortune 500 companies and then delivered the harvested data in a semi-structured format, extracting the following items:

  • Job Title
  • Description
  • Qualifications
  • Company
  • Location

The HR staffing organization was then able to use BrightPlanet’s harvested data within their own environment to expand their service offering.
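As a rough picture of what a semi-structured flat file with those five fields could look like, the sketch below writes and reads one JSON object per line (JSON Lines). The file format, the `JobPosting` class, and the helper names are assumptions made for illustration; the actual delivery format is whatever the project scope specifies.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class JobPosting:
    # Field names mirror the five items listed above.
    job_title: str
    description: str
    qualifications: str
    company: str
    location: str


def write_jsonl(postings, stream):
    """Write one JSON object per line to a text stream."""
    for posting in postings:
        stream.write(json.dumps(asdict(posting)) + "\n")


def read_jsonl(stream):
    """Read postings back from a JSON Lines text stream, skipping blank lines."""
    return [JobPosting(**json.loads(line)) for line in stream if line.strip()]
```

Because each line is a self-contained record, the recipient can load the file into a database or analytics tool without parsing the original web pages.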

4. Custom Reports

BrightPlanet’s Deep Web Investigators can create custom reports for end users. Reports are completely customizable and vary by end user. A sample report can be found here.

Download Our Whitepaper

If you are interested in learning more about how unstructured data becomes actionable intelligence, download our whitepaper that expands on what was discussed in this blog post and provides more end user case studies.



Photo: Patrick Hoesly