For 2013, it is important to tap into the rich resources of the Deep Web. The last extensive study estimating the size of the Deep Web was completed in 2001, a time when the internet consisted of only approximately three million domains. That study found that the Deep Web was then approximately 400-500 times the size of the Surface Web.
Today’s internet is significantly bigger, with an estimated 555 million domains, each containing thousands or millions of unique webpages. As the web continues to grow, so too will the Deep Web and the value attained from Deep Web content. This blog post includes:
- What the Deep Web is
- Why you should care about the Deep Web
- How to harvest the data from the Deep Web
- Why regular search engines can’t reach the Deep Web
Deep Web vs. Surface Web
We’ll start with a quick refresher on the difference between the Surface Web and the Deep Web.
Surface Web: Parts of the internet that can be found via link crawling techniques, meaning the content is linked and can be reached by following links from the homepage of a domain. Google can find this data.
Deep Web: Portions of the internet that cannot be accessed by a link-crawling search engine like Google. The only way a user can access this portion of the internet is by submitting a directed query to a web search form, which retrieves content from a database that is not linked. In layman’s terms, it is a search within a particular website.
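To make the distinction concrete, here is a minimal Python sketch of a toy site. Everything in it is hypothetical: the `LINKS` map, the `ARCHIVE` records, and the `crawl` and `site_search` helpers are illustrative assumptions, not a model of any real search engine. The link crawl reaches only pages that are linked from the homepage, while the archive records turn up only in response to a directed query.

```python
from collections import deque

# Hypothetical toy site: each page maps to the pages it links to.
LINKS = {
    "/": ["/news", "/sports"],
    "/news": ["/news/today"],
    "/sports": [],
    "/news/today": [],
}

# Hypothetical archive database: no page links to these records, so they
# are only returned in response to a search-form query.
ARCHIVE = {
    "/archive/1001": "Living Legacy in 24 Notes",
    "/archive/1002": "City council meeting recap",
}


def crawl(start="/"):
    """Breadth-first link crawl: collect every page reachable by links."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in LINKS.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def site_search(query):
    """Directed query: return archive records whose title contains the query."""
    q = query.lower()
    return sorted(url for url, title in ARCHIVE.items() if q in title.lower())


surface = crawl()               # what a link crawler can see
deep = site_search("legacy")    # what only the site's own search returns
print(sorted(surface))
print(deep)
```

In this toy setup, none of the archive URLs ever appear in the crawl results, which is the sense in which that content sits in the Deep Web.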
Getting the Data from the Deep Web
BrightPlanet’s core technology lies in its ability to harvest Deep Web content, or content that requires a directed query to a web search form.
The Deep Web Harvester (which is covered by seven individual patents) works directly off the source’s HTML forms to emulate exactly how a human user would interact with a site search through their web browser. However, unlike a single user, who can search only one site with one query at a time, BrightPlanet’s technology allows users to issue hundreds of queries to hundreds of different sites simultaneously.
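As a rough illustration of what "working off the HTML form" can mean, the Python sketch below reads a search form, keeps its hidden fields, fills in the text box, and builds the URL a browser would request. This is not BrightPlanet's implementation; the `FormParser` class, the `build_search_url` helper, the sample form, and the `news.example.com` address are all assumptions made for the example, and it only handles simple GET forms.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class FormParser(HTMLParser):
    """Collect the first form's action, method, and named input fields."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = "get"
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action") or ""
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self._in_form and attrs.get("name"):
            # Keep default values (e.g. hidden fields) as a browser would.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def build_search_url(page_url, form_html, text_field, query):
    """Build the GET URL a browser would request when the form is submitted."""
    parser = FormParser()
    parser.feed(form_html)
    params = dict(parser.fields)
    params[text_field] = query
    return urljoin(page_url, parser.action) + "?" + urlencode(params)


# Hypothetical search form, as it might appear on a newspaper site.
SAMPLE_FORM = """
<form action="/search" method="get">
  <input type="hidden" name="section" value="archive">
  <input type="text" name="q">
  <input type="submit" value="Search">
</form>
"""

url = build_search_url("https://news.example.com/home", SAMPLE_FORM, "q", "BrightPlanet")
print(url)
```

A real harvester would also need to handle POST forms, select boxes, cookies, and paging through results, none of which this sketch attempts.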
It is also important to distinguish between traditional searches and Deep Web harvesting. Traditional search technologies like Google index links and let you view the results; BrightPlanet takes it a step further and harvests all of the results. During a harvest, BrightPlanet extracts all of the text-based content from each of the results pages and then prepares the content for analysis, depending on the needs of our end users.
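The extraction step can be pictured with a small Python sketch, assuming each results page arrives as an HTML string. The `TextExtractor` class, the `extract_text` helper, and the sample page are hypothetical and only show the general idea of keeping visible text while dropping markup, scripts, and styles.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the text content of an HTML page as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical results page.
SAMPLE_PAGE = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Living Legacy in 24 Notes</h1>"
    "<p>An article about a bugler.</p>"
    "<script>track()</script></body></html>"
)

print(extract_text(SAMPLE_PAGE))
```

The resulting plain text is the kind of input that could then be handed to whatever analysis the end user needs.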
Why is this technology valuable?
- More Efficient
- More Content
- Allows users to view every change made to a web page, not just the current version
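The last point, seeing changes rather than only the current version, can be sketched by comparing two harvested snapshots of the same page. The Python example below uses the standard `difflib` module; the `page_changes` helper and the two snapshot strings are assumptions for illustration, not a description of how BrightPlanet stores or compares versions.

```python
import difflib


def page_changes(old_text, new_text):
    """Return the lines removed from and added to a page between two harvests."""
    removed, added = [], []
    for line in difflib.ndiff(old_text.splitlines(), new_text.splitlines()):
        if line.startswith("- "):
            removed.append(line[2:])
        elif line.startswith("+ "):
            added.append(line[2:])
    return removed, added


# Hypothetical snapshots of a homepage taken a few days apart.
monday = "Top story: Living Legacy in 24 Notes\nWeather: sunny"
friday = "Top story: City council recap\nWeather: sunny"

removed, added = page_changes(monday, friday)
print("removed:", removed)
print("added:", added)
```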
What Search Engines are Missing
The following example should put into perspective what exactly search engines may be missing.
The Argus Leader, the local newspaper of Sioux Falls, South Dakota, ran an article about BrightPlanet’s CEO, Steve Pederson (an avid bugler), titled “Living Legacy in 24 Notes.” At one point the article was on the homepage of the Argus Leader, a location reachable by a Surface Web search engine like Google. However, a few days after the article was featured on the front page, it was moved into the archive and can now only be reached via a search through the web form located on the Argus Leader’s site; it has left the Surface Web and entered the Deep Web.
The following two images demonstrate the differences between the Deep Web and the Surface Web. The image on the left is a search of what Google has indexed. The query (“BrightPlanet AND Steve Pederson site:argusleader.com”) tells Google that the only results we care about are from the Argus Leader domain. The search returns zero webpages indexed by Google that contain both the terms BrightPlanet and Steve Pederson.
However, the image on the right proves that there are results containing both terms. The search on the right was done using the search box provided by the Argus Leader website. This search returns results because the search box points to a large database that can only be accessed via the Argus Leader’s search. Google does not direct queries into any site searches, as it only finds documents by following links. Hence, the “Living Legacy in 24 Notes” article has fallen into the Deep Web.
When BrightPlanet collects Deep Web content, it executes exactly this type of search, placed directly into the search forms, and does so at very large scale: issuing thousands of search queries into thousands of Deep Web sites and pulling all the content back for analysis. Imagine being able to query every single online newspaper web search form within the U.S. simultaneously.
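The fan-out pattern described here, many queries against many sites at once, can be sketched in Python with a thread pool. The `SITES` data, the `search_site` stand-in, and the `harvest` helper are all hypothetical: in place of real network requests, each "site" is just a list of article titles, so the example shows only the concurrency pattern and not an actual harvester.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical stand-ins for the databases behind two site search forms.
SITES = {
    "paper-a": ["Bugler honors veterans", "School board vote"],
    "paper-b": ["Bugler plays at memorial"],
}


def search_site(site, query):
    """Stand-in for submitting `query` to `site`'s search form."""
    q = query.lower()
    return [title for title in SITES[site] if q in title.lower()]


def harvest(sites, queries, max_workers=8):
    """Run every (site, query) pair concurrently and collect the hits."""
    jobs = list(product(sites, queries))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves job order, so results line up with jobs.
        results = list(pool.map(lambda job: search_site(*job), jobs))
    return dict(zip(jobs, results))


results = harvest(SITES, ["bugler", "school"])
for (site, query), hits in results.items():
    print(site, query, hits)
```

Threads suit this kind of work because each real query would spend most of its time waiting on a remote site; a production system would also need rate limiting and error handling per site.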
The other major advantage of using a Deep Web harvest over a search engine is efficiency. Searching for the query BrightPlanet on the Argus Leader website returns the same single article. However, searching for BrightPlanet within the Argus Leader domain on Google returns 74 results (see image on left). All of these results are links to pages that no longer contain BrightPlanet, because Google is still searching an old version of the page’s content. When Google filters through a site for content, it filters through millions of links, oftentimes picking up irrelevant content. When BrightPlanet performs a Deep Web search on a site, it harvests only the content relevant to your queries.
We Can Tap the Deep Web for You
Interested in learning how you can tap into the Deep Web for your business? Sign up to schedule a free demonstration call with one of our Deep Web investigators. The opportunities are endless, and our investigators can talk with you to figure out which solution might be best for you.
Not ready for a demo? Check out our free whitepaper that features the content in this post along with additional information about the Deep Web and examples of how it can be tapped.