Being able to access data from sources that conventional search engines normally can't reach is an advantage. Think of accessing data like fishing in the ocean.
You might be able to catch plenty of fish with the equipment available to you (e.g., Google, Bing, Yahoo), but you want to catch the fish in the deeper waters that no one else is catching, in the places no one else is thinking of looking. Having BrightPlanet at your service is like having the ultimate fishing net that can reach all of those fish within seconds.
In this post, we discuss what the Deep Web is, how we harvest sources from it, and what that means for you.
Harvest Deep Web Sources to Cut Data Discovery Time
Harvesting data from the Deep Web is a unique BrightPlanet solution that uses our extraction tooling, built for large-scale website searches.
Deep Web harvests are the solution when you want quality content from a curated list of sources. You should not have to spend hours of your day crafting complex Google queries or scraping through websites in search of a single content source.
What Does Deep Web Mean?
The Deep Web refers to data stored in a website’s internal database. This data is typically not accessible via search engines, which index the Surface Web. Consequently, Deep Web data is accessible only by issuing queries to a website’s search form.
Once the query is issued, BrightPlanet extracts the search results. Our Harvester then detects “next” pages to ensure you receive all relevant documents.
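The query-then-paginate flow described above can be sketched in a few lines. This is a simplified illustration, not BrightPlanet's actual Harvester: `fetch_results` is a hypothetical stand-in for an HTTP client hitting a site's search form, and here it just simulates a site with three pages of results.

```python
def fetch_results(query, page):
    """Hypothetical stand-in for querying a site's search form.

    Returns (documents on this page, next page number or None).
    """
    pages = {
        1: (["doc-1", "doc-2"], 2),
        2: (["doc-3", "doc-4"], 3),
        3: (["doc-5"], None),  # last page: no "next" link detected
    }
    return pages.get(page, ([], None))

def harvest(query):
    """Collect every result document by following 'next' pages to the end."""
    documents, page = [], 1
    while page is not None:
        docs, page = fetch_results(query, page)
        documents.extend(docs)
    return documents

print(harvest("texas flooding"))  # documents from all three pages
```

The key point is the loop: a harvest does not stop at the first page of search results, but keeps following "next" links until the site reports no more.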
How Do Deep Web Harvests Work?
To gain access to Deep Web sources, BrightPlanet's Data Acquisition Engineering team configures each source. This means testing the search forms and filtering out irrelevant links. We also determine which text fields to extract from in order to guarantee clean results.
Configured sources are grouped in what BrightPlanet calls the Source Repository. For example, you could configure the top 50 news sources from Texas into one “Texas News” source category. When BrightPlanet harvests that source category, your search query is automatically issued to all 50 of those news sources. The harvest also extracts and processes the search results. Below are some examples of existing BrightPlanet Deep Web source categories:
- U.S. News
- International News
- Code Repositories
- Finance News
- Patents and Patent Applications
- Science News
- Academic Publishers
- Job Search Websites
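The repository-and-category structure can be pictured as a simple mapping from category names to configured sources, with one harvest fanning a query out to every source in a category. The names and data structures below are illustrative assumptions, not BrightPlanet's actual API.

```python
# Hypothetical source repository: each category groups configured sources.
SOURCE_REPOSITORY = {
    "Texas News": [
        "texastribune.org",
        "dallasnews.com",
        "houstonchronicle.com",
    ],
    "Finance News": ["ft.com", "bloomberg.com"],
}

def harvest_category(category, query):
    """Fan a single query out to every configured source in a category."""
    return [(source, query) for source in SOURCE_REPOSITORY[category]]

jobs = harvest_category("Texas News", "flood insurance")
print(len(jobs))  # one harvest job per configured source in the category
```

Because the fan-out happens per category, adding a 51st news source to "Texas News" means every future harvest of that category automatically covers it, with no change to the query.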
At this point, it’s easy to see how the scalability of BrightPlanet’s Deep Web harvests can dramatically reduce the time required for data collection.
With a single BrightPlanet harvest, multiple queries can be issued to each source. Many sources may also be grouped into one category, and within each event you can harvest multiple categories.
In quantitative terms, consider an example harvest event with three queries per source. You could potentially leverage:
- 3 queries
- 10 source categories
- 8 unique sources per category (80 total sources)
- 100 relevant matching documents per query, per source
- 3 x 10 x 8 x 100 = 24,000 harvested web documents
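The arithmetic above can be checked in a few lines:

```python
# Back-of-the-envelope yield for the example harvest event.
queries = 3
categories = 10
sources_per_category = 8
docs_per_query_per_source = 100

total_sources = categories * sources_per_category  # 80 sources in total
total_documents = queries * total_sources * docs_per_query_per_source
print(total_documents)  # 24000
```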
Of course, that was just an example. There is no limit to the maximum documents per site, the number of source categories, or the unique sources within those categories. In fact, the only limitation to configuring a website as a Deep Web source is the external website’s rate restrictions.