Deep Web Harvest Engines vs. Search Engines – Finding Intel in a Growing Internet

Everyone and their mom knows how to “Google” stuff. The search engine long ago became a verb meaning the act of instantly finding answers on the Internet. Type your keyword in the search box (this technique doesn’t work as well) and BOOM! Millions of answers at your fingertips.

People across the business, government, and healthcare landscapes keep asking, “Is your technology like a Deep Web search engine?” BrightPlanet’s response is, “We are not Google. We are not a search engine; we are a Deep Web Harvest Engine.”

Different Solutions for Different Questions

To understand the major differences between a Harvest Engine and a search engine, it’s important to understand the problem that search engines are meant to solve.

Yesterday’s Search Engines

The problem search engines tried to tackle dates back to the early 1990s, as the Internet increased in popularity. Mostly static webpages were being added to the Internet, and users needed an easy way to find the pages that contained the information they were looking for.

Search engines like Google, AltaVista, Yahoo!, and Lycos created technologies that crawled through websites and indexed them as a way for users to identify pages of interest. Search engines tried to find the most relevant page containing the answer to what users were looking for.

The questions originally asked of search engines in the late ’90s were very basic. Students researching class reports replaced encyclopedias with the Internet, researchers created basic webpages to share their discoveries, and social sharing consisted of updating your GeoCities page. The Internet back then was largely non-commercial and viewed as a research tool.

Today’s Search Engines

Today’s Internet is significantly different; millions of webpages are published for all sorts of reasons beyond traditional research. Fifty-four percent of kids would ask Google before their teachers or parents.

Search engine companies developed systems able to index millions of webpages in a short time period, allowing users to search the resulting index quickly and accurately. Search engines don’t find or store all the content on a webpage; they simply lead you to the content’s location. Because they retain so little, search engines can get away with storing minimal information about each individual webpage.

Typically, when indexing a webpage, search engines store the most frequently mentioned words, the locations of those words, and any metadata (title of the webpage, URL of the webpage, keywords, etc.). The amount of data stored from each page is a crucial difference between search engines and Harvesters.
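The difference in stored data can be illustrated with a small Python sketch. This is only an illustration of the idea described above, not how any real search engine or the Deep Web Harvester is implemented; the function names, the example URL, and the `top_n` cutoff are all hypothetical.

```python
import re
from collections import Counter


def search_index_record(url, title, text, top_n=2):
    """Hypothetical search-engine-style record: keep only the top_n most
    frequent words, the positions of those words, and basic metadata."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    top_words = [word for word, _ in Counter(words).most_common(top_n)]
    positions = {
        word: [i for i, token in enumerate(words) if token == word]
        for word in top_words
    }
    return {"url": url, "title": title, "top_words": positions}


def harvest_record(url, title, text):
    """Hypothetical harvester-style record: keep the complete raw text
    alongside the same metadata."""
    return {"url": url, "title": title, "text": text}


page_text = (
    "Grants fund research. Research grants support new research "
    "on rare disorders."
)
index_entry = search_index_record("https://example.org/grants", "Grants", page_text)
full_entry = harvest_record("https://example.org/grants", "Grants", page_text)

# The index entry knows about the two most frequent words only;
# the harvested record still contains every word on the page.
print(index_entry["top_words"])
print("disorders" in full_entry["text"])
```

In this sketch, a word such as "disorders" is lost from the index-style record but remains available for later analysis in the harvested record.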

Search Engines and the Surface Web

Search engines like Google are really good at quickly finding Surface Web websites that answer basic questions. However, companies and organizations often have significantly harder questions than “How late is Burger King open?” Complex questions like those listed below require more than a search engine; they require a Deep Web Harvester:

Who is selling my products fraudulently online?
How many people have won grants on Fetal Alcohol Spectrum Disorders?
What are clinical trial patients saying about my experimental drug?
What new information has been published on my competitor’s website today?
Has anything changed in this insurance coverage plan that would affect a pharmaceutical company’s stock price?
What new breast cancer research has been published in the last month? What are people saying about it?

Deep Web Harvest Engine

Unlike a search engine, BrightPlanet’s Deep Web Harvester extracts every single word every time it accesses a webpage. Additionally, the Deep Web Harvester stores every single page harvested as a separate version in our database.
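One way to picture storing every harvest as a separate version is the following minimal Python sketch. It is an assumption-laden illustration, not BrightPlanet’s actual storage design: the `VersionedPageStore` class, its method names, the in-memory dictionary, and the use of a SHA-256 hash to detect changes are all hypothetical.

```python
import hashlib
from datetime import datetime, timezone


class VersionedPageStore:
    """Hypothetical in-memory store that keeps every harvest of a page
    as its own version instead of overwriting the previous one."""

    def __init__(self):
        self._versions = {}  # url -> list of version records, oldest first

    def save(self, url, text, harvested_at=None):
        """Append a new version containing the full text of the page."""
        version = {
            "harvested_at": harvested_at or datetime.now(timezone.utc),
            "text": text,
            "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        }
        self._versions.setdefault(url, []).append(version)
        return version

    def versions(self, url):
        """Return a copy of the version history for a URL."""
        return list(self._versions.get(url, []))

    def changed_since_last_harvest(self, url):
        """True if the two most recent versions have different content."""
        history = self._versions.get(url, [])
        if len(history) < 2:
            return False
        return history[-1]["sha256"] != history[-2]["sha256"]


store = VersionedPageStore()
url = "https://example.org/pricing"
store.save(url, "Plan A costs $10.", datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc))
store.save(url, "Plan A costs $12.", datetime(2024, 1, 1, 4, 0, tzinfo=timezone.utc))

print(len(store.versions(url)))
print(store.changed_since_last_harvest(url))
```

Keeping both versions, rather than only the latest, is what makes it possible to ask what changed on a page between two harvests.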

For example, suppose BrightPlanet is actively harvesting a list of 100 websites for a customer every four hours. The Deep Web Harvester then collects a new version of every single webpage found within every single one of the 100 domains every four hours.

To put that into perspective, let’s envision that each of those domains is a relatively small website (100 pages). In this scenario, every four hours we harvest content from 10,000 webpages (100 webpages multiplied by 100 domains). At six harvests per day, that is 42 harvests in one week, so this harvesting process stores 420,000 webpage versions per week. For comparison, BrightPlanet harvested over 52 million webpages in 30 days for one customer.
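The arithmetic in this scenario can be checked with a few lines of Python. The variable names are chosen here for illustration; the figures are the ones from the example above.

```python
# Figures from the example scenario above.
domains = 100
pages_per_domain = 100
harvest_interval_hours = 4

# Pages collected in a single harvest pass over all domains.
pages_per_harvest = domains * pages_per_domain

# Number of harvest passes in one week (7 days of 24 hours).
harvests_per_week = (7 * 24) // harvest_interval_hours

# Page versions stored after one week of harvesting.
pages_per_week = pages_per_harvest * harvests_per_week

print(pages_per_harvest)   # 10000
print(harvests_per_week)   # 42
print(pages_per_week)      # 420000
```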

Harvest Engine Advantages

A harvest engine has a number of advantages. The two largest are:

Analytic capabilities
Versioning of webpages

Because BrightPlanet harvests the actual raw text from webpages as opposed to storing metadata and only top keywords, we can integrate our harvested data directly into nearly any analytic technology using our OpenPlanet Enterprise Platform.

Combining BrightPlanet’s scalable harvesting capabilities with custom analytic technology helps customers visualize, analyze, and ultimately create intelligence from large data sets.

To learn more about our partner analytic technologies, please visit our partner page.

Free Demonstration to Answer Your Questions

Have questions about how a Deep Web Harvest Engine might help your organization or business? We offer free demonstrations to answer your questions about Big Data and the Deep Web. You can sign up here for a free demonstration call. We also have whitepapers on creating intelligence from Big Data and what is Big Data, which feature case studies describing how other companies have exploited Deep Web intelligence.

Photo: The Green Party