BrightPlanet

Frequently Asked Questions

We are often asked the same questions about the web and our technology.

Check out some of the most popular questions below. You can also find more robust answers at our blog, the Deep Web University.

What is the difference between the Surface Web, Deep Web, and the Dark Web?

The Surface Web is anything that can be found by a search engine. It normally consists of web data that is linked and can be reached through a clickable link. The Deep Web refers to any portion of the internet that cannot be found by a standard link-crawling search engine. The vast majority of this content falls into the Deep Web because you need to go through a web search form or enter a query to reach it.

The Dark Web refers to anything that is intentionally hidden. The most common place on the Dark Web is the Tor network, a private, anonymous network that can only be accessed via a special browser. The Dark Web makes up a small portion of the Deep Web.

Link to Blog Post: Clearing Up Confusion – Deep Web vs. Dark Web

How Big is the Deep Web?

In 2001, BrightPlanet completed a study to test the size of the Deep Web. Our initial findings revealed that search engines were searching only 0.03% of the pages available on the entire Internet.

Since 2001, BrightPlanet has not completed any additional studies to estimate the size of the Deep Web because of how much the Internet has grown. The Internet has become so vast that we now classify the Deep Web as infinite.

Link to Blog Post: How Big is the Internet?

How does BrightPlanet harvest Deep Web content?

BrightPlanet has patented technology to automate the process of directing queries into sites that have web search forms. Think of any site that requires you to type a search into a search box, such as a government search site or a travel site. Our Deep Web Harvester places queries directly into search forms at large scale and harvests the results of those queries to provide you with Deep Web data for analysis.
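As a rough illustration of directing queries into a search form, the sketch below builds one request URL per search term. It assumes a hypothetical form at https://example.gov/search that accepts a GET parameter named q. The function name build_form_queries is made up for this example, and this is not BrightPlanet's patented harvester.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def build_form_queries(action_url, field_name, terms, extra_fields=None):
    """Return one GET URL per search term for a (hypothetical) search form."""
    parts = urlsplit(action_url)
    urls = []
    for term in terms:
        # Start from any fixed form fields, then add the search term itself.
        fields = dict(extra_fields or {})
        fields[field_name] = term
        query = urlencode(fields)  # spaces become '+', special characters are escaped
        urls.append(urlunsplit((parts.scheme, parts.netloc, parts.path, query, "")))
    return urls


# Assumed form URL and field name, for illustration only.
urls = build_form_queries(
    "https://example.gov/search", "q", ["water permits", "air quality"]
)
for url in urls:
    print(url)
```

A real harvest would then fetch each URL and collect the result pages. That step is left out here because it depends on the target site.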

Link to Blog Post: What is a Deep Web Harvest?

What’s the difference between BrightPlanet and a Google Search?

BrightPlanet harvests data as opposed to indexing sites. When we harvest data, we extract all of the text from each individual web page that our harvester visits. When Google indexes data, it doesn't extract all of the text content. Google only stores a temporary reference to what it thinks is important, usually the most-mentioned keywords. BrightPlanet harvests are directed, allowing you to define the pool of data to be collected.
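The contrast between keeping all of a page's text and keeping only its most-mentioned keywords can be sketched with Python's standard library. This is a simplified illustration under assumed inputs: the sample HTML is made up, and neither function reflects how BrightPlanet or Google actually processes pages.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect every visible text run, skipping script and style content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def harvest_text(html):
    """Harvest-style: keep all of the page's text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def keyword_summary(text, top_n=3):
    """Keyword-style: keep only the most frequently mentioned words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word, _ in Counter(words).most_common(top_n)]


# Made-up sample page for illustration.
SAMPLE_HTML = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Permit search</h1>"
    "<p>Permit 42 was issued for the river permit zone.</p>"
    "</body></html>"
)

full_text = harvest_text(SAMPLE_HTML)
top_keyword = keyword_summary(full_text, top_n=1)
print(full_text)
print(top_keyword)
```

The harvested text keeps details such as the permit number, which a keyword-only summary would drop.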

Link to Blog Post: Why Deep Web Harvesting is Different than a Google Search

What’s the difference between structured and unstructured data?

Structured data is any data that has a defined form or structure, such as data stored in a database or in a spreadsheet with columns and rows.

Unstructured data lacks any standard form or consistency. It is typically made up of free-flowing blocks of text. Almost all data on the web is unstructured.
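A small sketch may make the distinction concrete. The company name, city, and year below are invented for illustration. With structured data, a field can be read by its column name. With unstructured text, the same fact has to be searched for.

```python
import csv
import io
import re

# Structured: every record has the same named columns (made-up data).
STRUCTURED = "name,city,founded\nAcme Corp,Springfield,1999\n"
rows = list(csv.DictReader(io.StringIO(STRUCTURED)))
founded_from_table = rows[0]["founded"]  # read directly by column name

# Unstructured: the same fact is buried in free-flowing text (made-up data).
UNSTRUCTURED = (
    "Acme Corp, which opened its doors in Springfield back in 1999, "
    "announced a new product line this week."
)
# There is no column to read, so we guess with a pattern for a four-digit year.
match = re.search(r"\b(?:19|20)\d{2}\b", UNSTRUCTURED)
founded_from_text = match.group(0) if match else None

print(founded_from_table, founded_from_text)
```

The pattern match is only a guess: free text gives no guarantee that the first four-digit year is the founding year, which is why unstructured data is harder to analyze.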

Link to Blog Post: “Structured vs. Unstructured Data”

How does BrightPlanet compare to a web extraction tool?

Companies like Kapow, Connotate, and Import.IO have technologies that are known as web extraction engines. Extraction engines rely on using the structure of web pages to extract content.

Extraction engines are not highly scalable because they need custom configuration to collect data from web pages. They also do not easily handle unstructured free text such as blog postings, news articles, and forums. BrightPlanet's harvest process structures text based on the elements in the text and the text itself, as opposed to the makeup of the web page.

Link to Blog Post: How Deep Web Harvesting Isn’t Your Traditional Web Extraction

What’s the difference between the Twitter Firehose and the Twitter API?

Twitter provides access to its data in a couple of different ways. The two major options are the Twitter Firehose and the Twitter API. Twitter's API gives access to a limited number of Tweets and, unlike the Firehose, does not guarantee access to all Tweets. To access all of the data available on Twitter, users should work with the full Firehose.

Link to Blog Post: Twitter Firehose vs. Twitter API: The Difference and Why You Should Care.

What type of data can you get from Twitter?

We collect data from Twitter using both Twitter's search API and the Twitter Firehose. Regardless of how we get the data, the data returned is nearly identical.

When we get data back from Twitter, we get additional metadata about the Tweet and the user that sent it. This includes the device used to send the Tweet, the number of users, any geo-coordinates of the Tweet, the time of the Tweet, the number of Twitter followers, etc. This allows for more thorough analysis of the data, such as finding patterns of life and better understanding how users are using Twitter.
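To show what that metadata can look like in practice, here is a minimal sketch that pulls a few of those fields out of a Tweet payload. The sample values are invented, and the field names (source, coordinates, created_at, user.followers_count) are assumed to follow Twitter's v1.1 Tweet object. The summarize_tweet helper is hypothetical and is not part of any BrightPlanet or Twitter library.

```python
import re
from datetime import datetime

# Made-up payload shaped like a Twitter API v1.1 Tweet object (assumption).
SAMPLE_TWEET = {
    "created_at": "Wed Oct 10 20:19:24 +0000 2018",
    "text": "Example Tweet text",
    "source": '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    "coordinates": {"type": "Point", "coordinates": [-96.73, 43.55]},
    "user": {"screen_name": "example_user", "followers_count": 1234},
}


def summarize_tweet(tweet):
    """Pull the device, time, location, and follower count out of a Tweet."""
    coords = tweet.get("coordinates")
    # The coordinates field is GeoJSON-style: longitude first, then latitude.
    lon, lat = coords["coordinates"] if coords else (None, None)
    return {
        # The source field is an HTML anchor; strip the tags to get the device.
        "device": re.sub(r"<[^>]+>", "", tweet.get("source", "")),
        "sent_at": datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y"),
        "longitude": lon,
        "latitude": lat,
        "followers": tweet["user"]["followers_count"],
    }


summary = summarize_tweet(SAMPLE_TWEET)
print(summary["device"], summary["sent_at"].isoformat(), summary["followers"])
```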

Link to blog post: What Type of Data You Can Get From Twitter

How does BrightPlanet compare to a media monitoring tool like Radian6?

Media monitoring and press clipping services are very common tools used by marketers who want to understand brands and how often they are being mentioned. We are often asked how BrightPlanet compares to companies that offer media monitoring services.

Most of these tools offer data to their users through a dashboard that can be accessed through a web browser. All of the customers of these services have access to the exact same dataset.

We are far removed from traditional media monitoring services in that we are a data harvesting technology company. Major differences include that we create custom datasets for our end users and let them define the data sources. Each of our customers has access to a dataset that they have helped define and understand. Our customers also get a fully customized output, which includes direct access to data through an API or some other format of their choosing.

What type of data can be collected from Facebook?

In April of 2015, Facebook made a change forcing all developers onto its Graph API v2.0 to protect its users. This heavily limited the amount of data that developers could take from Facebook, including the individual user pages that could be harvested. So what can we harvest from Facebook?

From Facebook, our harvesters can access public pages that can be liked. These are most often business and celebrity pages. An example of one of those pages is this page for CNN: https://www.facebook.com/cnn.

From a public page, we can harvest the posts individuals made to the page as well as posts created by the page itself, in this case CNN. Because of Facebook's terms of use, we can no longer collect data from personal pages that require a friend request.

How do you calculate sentiment?

Our Natural Language Processing (NLP) partner, Rosoka, helps us calculate sentiment (polarity) at the data point level. This means that instead of having a specific web page and calculating if that web page is positive or negative, we calculate if the mention of a data point (person, place, or company) on that web page is positive or negative. This gives you much better data for analysis and sentiment tracking.
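The difference between page-level and data-point-level sentiment can be sketched with a toy example. The word lists, sample text, and function names below are all invented for illustration. This is a deliberately naive lexicon approach and not Rosoka's method.

```python
import re

# Tiny made-up sentiment lexicon, for illustration only.
POSITIVE = {"excellent", "strong", "praised"}
NEGATIVE = {"terrible", "weak", "criticized"}


def score(text):
    """Positive word count minus negative word count."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def entity_sentiment(text, entities):
    """Score each entity using only the sentences that mention it."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return {
        entity: sum(score(s) for s in sentences if entity.lower() in s.lower())
        for entity in entities
    }


PAGE = "Acme delivered excellent results. Globex faced terrible delays."

page_level = score(PAGE)
by_entity = entity_sentiment(PAGE, ["Acme", "Globex"])
print(page_level, by_entity)
```

Scored as a whole, the page comes out neutral because the positive and negative words cancel out. Scored per data point, Acme is positive and Globex is negative, which is the more useful signal for tracking.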

Link to blog post: How Can You Analyze the Relevance and Sentiment of Online Data

How does BrightPlanet Anonymize Web Data?

We anonymize our harvesters for a number of reasons, including protecting customer assets and avoiding being blocked by web servers. We anonymize our web traffic by operating all of our harvesters in a cloud environment and passing all traffic through a web server that is not attributable to BrightPlanet.
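One common way to route traffic through an intermediate server is an HTTP proxy. The sketch below assumes that approach and uses a hypothetical proxy address, proxy.example.net. It is not a description of BrightPlanet's actual infrastructure, and it only builds the client without sending any request.

```python
import urllib.request

# Hypothetical proxy address, for illustration only.
PROXY_URL = "http://proxy.example.net:8080"


def build_proxied_opener(proxy_url):
    """Return an opener that sends HTTP and HTTPS requests via the given proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


opener = build_proxied_opener(PROXY_URL)
# opener.open("https://example.com/") would route the request through the proxy.
# It is not called here because the proxy above does not exist.
```

With this setup, the target site sees the proxy's address rather than the address of the machine running the harvester.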

Link to blog post: How Anonymization on the Web Works

Copyright © 2001-2019 BrightPlanet® Corporation. All Rights Reserved.