Common Deep Web and Big Data Questions Answered – Part 2

Welcome to the second post in our two-part series answering some of the questions we hear most often from visitors and customers. Last week, we posted Part 1, which focused on questions about the Deep Web and how we get data from it. This post focuses on questions about Big Data and how we enrich and structure the data we harvest from the Deep Web.

Question 5: What is OSINT data?

Short Answer:

OSINT stands for “Open Source Intelligence”, which refers to any unclassified information and includes anything freely available on the Web. OSINT is the opposite of closed source intelligence or classified information. Common OSINT sources include social networks, forums, business websites, blogs, videos, and news sources.

Link to Blog Post: What is OSINT and How Can Your Organization Use it?

Question 6: What’s the difference between structured and unstructured data?

Short Answer:

Structured data is any data that has a defined form or structure, such as records in a database or a spreadsheet with columns and rows.

Unstructured data lacks any standard form or consistency. It typically consists of free-flowing blocks of text. Almost all data on the Web is unstructured.
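To make the distinction concrete, here is a minimal Python sketch. The company names, cities, and dates are invented for illustration only. It shows the same facts held once as rows and columns and once as free text.

```python
import csv
import io

# Structured: rows and columns, so each value can be addressed by field name.
# (All names and values here are made up for the example.)
structured_csv = """company,city,founded
Acme Corp,Sioux Falls,2001
Globex,Chicago,1995
"""
rows = list(csv.DictReader(io.StringIO(structured_csv)))

# A question like "which companies are in Chicago?" is a simple field lookup.
chicago_companies = [row["company"] for row in rows if row["city"] == "Chicago"]

# Unstructured: the same facts buried in a free-flowing sentence.
unstructured_text = (
    "Globex, which opened its doors in Chicago back in 1995, announced "
    "a partnership with Acme Corp of Sioux Falls."
)

# With no fields to query, the most we can do without further processing
# is search the raw text for a string.
mentions_chicago = "Chicago" in unstructured_text
```

The structured version answers the question directly. The unstructured version only tells us that the word "Chicago" appears somewhere, not which company it belongs to.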

Link to Blog Post: “Structured vs. Unstructured Data”

Question 7: How does BrightPlanet structure unstructured data?

Short Answer:

To help structure harvested data from the Surface Web and Deep Web, BrightPlanet partners with analytic technologies. The most common way BrightPlanet structures data is through the process of entity extraction.

Entity extraction identifies key terms or entities within documents, such as people, organizations, and locations, and extracts them for additional analysis. Identifying entities gives previously unstructured data additional structure.
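As a rough illustration of the idea, the Python sketch below pulls entities out of a sentence using a small lookup list. This is a toy example and not BrightPlanet's or any partner's actual method. The `GAZETTEER` list, the `extract_entities` function, and the sample sentence are all hypothetical, and production entity extraction tools are far more sophisticated.

```python
import re

# Hypothetical lookup list of known entity names and their types.
GAZETTEER = {
    "Jane Doe": "PERSON",
    "Acme Corp": "ORGANIZATION",
    "Chicago": "LOCATION",
}

def extract_entities(text):
    """Return (name, type, start_offset) tuples in order of appearance."""
    found = []
    for name, entity_type in GAZETTEER.items():
        # \b anchors the match to word boundaries so that a name is not
        # matched inside a longer word.
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text):
            found.append((name, entity_type, match.start()))
    return sorted(found, key=lambda item: item[2])

document = "Jane Doe of Acme Corp spoke in Chicago about open source data."
entities = extract_entities(document)

# Each result is now a structured record that can be stored in a table.
for name, entity_type, offset in entities:
    print(f"{entity_type:<12} {name} (offset {offset})")
```

The free text goes in as a single string and comes out as rows of name, type, and position, which is the extra structure that makes the document easier to search and analyze.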

Link to Blog Post: The Data Enrichment Process Explained

Question 8: Do you work with data in multiple languages?

Short Answer:

BrightPlanet’s harvester can harvest content in whatever language it is published in online. We also have strategic partnerships with third-party technologies that support multiple languages during the enrichment and entity extraction phases.

Link to Blog Post: Harvesting and Enriching Data in Multiple Languages

Question 9: How does BrightPlanet’s technology compare to Kapow and Connotate?

Short Answer:

Companies like Kapow and Connotate have technologies that are known as Web extraction engines. Extraction engines rely on using the structure of webpages to extract content.

Extraction engines are not highly scalable because they need custom configuration to harvest data. They also do not easily handle unstructured free text such as blog posts, news articles, and forum threads. BrightPlanet’s harvest process structures content based on the text itself and the elements within it, as opposed to the makeup of the Web page.
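The general contrast can be sketched in a few lines of Python. This is only an illustration of the two approaches in the abstract, and it does not represent how Kapow, Connotate, or BrightPlanet are actually implemented. The two sample pages, the `post-body` class name, and both functions are hypothetical.

```python
import re
from html.parser import HTMLParser

# Two made-up pages carrying the same sentence in different layouts.
PAGE_A = '<html><body><div class="post-body">Acme Corp opens a Chicago office.</div></body></html>'
PAGE_B = '<html><body><article><p>Acme Corp opens a Chicago office.</p></article></body></html>'

def extract_by_structure(html):
    """Page-specific rule: only works where the layout matches PAGE_A."""
    match = re.search(r'<div class="post-body">(.*?)</div>', html, re.DOTALL)
    return match.group(1) if match else None

class _TextCollector(HTMLParser):
    """Collects the text content of a page and ignores its tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())

def extract_text(html):
    """Layout-independent: returns the page text whatever the markup."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.parts)

print(extract_by_structure(PAGE_A))  # the rule fits this layout
print(extract_by_structure(PAGE_B))  # the rule does not fit; returns None
print(extract_text(PAGE_B))          # the text is recovered regardless
```

The structure-based rule has to be rewritten for every new page layout, which is the configuration burden described above. The text-based approach works on both pages unchanged, and the text it returns can then be passed on for further processing.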

Link to Blog Post: How Deep Web Harvesting Isn’t Your Traditional Web Extraction

Question 10: What’s the difference between the Twitter Firehose and Twitter API?

Short Answer:

Twitter provides access to its data in a couple of different ways. The two major options are the Twitter Firehose and the Twitter API. The Twitter API returns only a limited number of tweets and, unlike the Firehose, does not guarantee access to all Twitter data. To access all of the data available on Twitter, users should work with a full Firehose provider like BrightPlanet.

Link to Blog Post: Twitter Firehose vs. Twitter API: The Difference and Why You Should Care.

Learn More

Want to learn more about Big Data on the Deep Web? Check out our additional free resources, available for download.

Photo: Raymond Bryson