“Is this the Deep Web or Surface Web?” – We get asked this seemingly simple question often by customers and curious learners interested in our harvesting technology and the Deep Web in general.

The question seems straightforward, but like most things on the Internet, even the simplest questions have gray areas. We hope to shed some light on why the answer isn’t as simple as it may seem.

The Deep Web and Surface Web

We’ve covered the differences between the Deep Web and the Surface Web, and where they reside, in previous posts. This post doesn’t explain the difference between them. Instead, it explains why it has become very difficult to take a single Web page and classify it as one or the other.

When we categorize a harvested Web page as Deep Web, we base that on the process used to harvest the document, not on where the document resides. To us, a Deep Web document is one that was accessed through a Deep Web search, meaning a search that required submitting a query to a Web search form. These searches return pages that can be found on both the Surface Web and the Deep Web. Documents we classify as Surface Web, by contrast, were collected using a link crawling or spidering process.
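To make the rule concrete, here is a minimal sketch of classifying by collection process rather than by location. It is an illustration only, not our harvesting code: the names `CollectionMethod`, `HarvestedPage`, and `classify`, and the example URL, are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class CollectionMethod(Enum):
    # Hypothetical labels for the two collection processes described above.
    FORM_QUERY = "form_query"  # page reached by submitting a query to a Web search form
    LINK_CRAWL = "link_crawl"  # page reached by following links (spidering)


@dataclass
class HarvestedPage:
    url: str
    method: CollectionMethod


def classify(page: HarvestedPage) -> str:
    """Label a page by how it was collected, not by where it resides."""
    if page.method is CollectionMethod.FORM_QUERY:
        return "Deep Web"
    return "Surface Web"


# The same URL can receive either label depending on how it was harvested.
via_form = HarvestedPage("https://example.com/report", CollectionMethod.FORM_QUERY)
via_crawl = HarvestedPage("https://example.com/report", CollectionMethod.LINK_CRAWL)
print(classify(via_form))   # Deep Web
print(classify(via_crawl))  # Surface Web
```

Note that the label depends only on the `method` field. That is exactly why a single page, taken on its own, can be hard to pin to one category.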

There are a few reasons that classifying a document as solely Surface Web or Deep Web is tough.

The first reason is that different search engines collect very different data. Take Twitter, for example. Twitter results are included in Google search results because of Google’s relationship with Twitter, but they are not typically included in search engines like Yahoo! and Bing. Do you then classify Twitter as Deep Web or Surface Web? The answer isn’t always clear, but we tend to lean more towards Deep Web.

Another reason is that search engines often display only the first 1,000 results. With billions of unique Web pages online today, the vast majority of pages will never appear in a search engine result, regardless of the query. How would one classify these pages? A giant gray area clearly exists.

Web Harvesting Vs Indexing

It is often more important to understand the difference between indexing and Web harvesting than to settle the exact classification of data. Search engines index the Web and don’t actually collect all the data from each page. They only collect and temporarily store some of the page data, whereas harvesting permanently stores all the text data from pages. Compared to indexing, harvesting text from pages has a few advantages:

  1. It allows you to track changes to Web pages and also view content that has been removed from Web pages.
  2. Because the data is stored offline, it can be structured for further analysis.
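As a rough illustration of the first advantage, the sketch below compares two stored text snapshots of the same page and reports the lines that have since been removed. It assumes the harvested text is kept as plain strings; the function name `removed_lines` and the sample snapshots are made up for this example, and Python's standard `difflib` module stands in for whatever comparison a real harvesting system would use.

```python
import difflib


def removed_lines(old_text: str, new_text: str) -> list[str]:
    """Return lines present in the older snapshot but missing from the newer one."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    # ndiff prefixes lines that exist only in the first sequence with "- ".
    diff = difflib.ndiff(old_lines, new_lines)
    return [line[2:] for line in diff if line.startswith("- ")]


# Two hypothetical snapshots of one page, harvested at different times.
snapshot_old = "Product A: in stock\nProduct B: in stock\nContact: sales@example.com"
snapshot_new = "Product A: in stock\nContact: sales@example.com"

print(removed_lines(snapshot_old, snapshot_new))
```

An index that keeps only part of the page, and only temporarily, has no older copy to compare against, so this kind of check is possible only when the full text has been stored.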

