BrightPlanet has provided terabytes of data for various analytic projects across many industries over the years. Our role is to locate open-source web data, harvest the relevant information, curate the data into semi-structured content, and provide a stream of data feeding directly into analytic engines or final reports.
These collections often contain data from dozens, hundreds, or even thousands of websites, using a variety of techniques to extract the best content from each one. Not all websites are created equal; the same harvesting technique is unlikely to work for every site.
Techniques for Harvesting Content
BrightPlanet has developed several techniques for harvesting data. One of the easiest to understand is a Surface Web harvest, or hyperlink crawl: start with a single URL, follow its outgoing links, and repeat. This is a typical process for collecting web data and is utilized in nearly every project we do.
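The crawl loop described above can be sketched as a simple breadth-first traversal. This is a minimal illustration, not BrightPlanet's actual harvester; to keep it self-contained it uses a small in-memory link map in place of real HTTP fetches and page parsing.

```python
from collections import deque

def crawl(start_url, get_links, max_pages=100):
    """Breadth-first hyperlink crawl: start with a single URL,
    follow its outgoing links, and repeat until the page budget
    is exhausted, visiting each URL at most once."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

# Stand-in for real fetching: a tiny hypothetical link graph.
LINKS = {
    "a.example": ["a.example/1", "b.example"],
    "a.example/1": ["a.example"],
    "b.example": ["c.example"],
}
pages = crawl("a.example", lambda url: LINKS.get(url, []))
```

In a real harvest, `get_links` would fetch the page over HTTP and extract anchor tags; the `seen` set is what keeps the crawl from looping forever on pages that link back to each other.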
Other techniques increase the harvest complexity, such as a Deep Web harvest. Instead of starting with a URL and following links, our Deep Web harvester can interact with a website’s internal search database, asking it different questions, each time collecting and harvesting the results as they are returned.
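The Deep Web pattern differs from a crawl in that the harvester drives a site's search interface with a list of queries and pools whatever comes back. A minimal sketch, with a hypothetical in-memory `search` function standing in for a website's internal search database:

```python
def deep_web_harvest(search, queries):
    """Ask a site's internal search a series of questions and
    collect the de-duplicated union of all results returned."""
    results = {}
    for query in queries:
        for doc in search(query):
            results[doc["id"]] = doc  # de-duplicate across queries
    return list(results.values())

# Hypothetical stand-in for a site's search endpoint.
FAKE_INDEX = [
    {"id": 1, "text": "acme fraud report"},
    {"id": 2, "text": "acme press release"},
    {"id": 3, "text": "widget fraud case"},
]
def search(query):
    return [d for d in FAKE_INDEX if query in d["text"]]

docs = deep_web_harvest(search, ["acme", "fraud"])
```

Because queries overlap, the same document can come back more than once; keying results by document ID keeps the final set clean.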
Our extensive harvest engine also supports many other techniques, such as a Dark Web harvester, REST-API harvester, and even a custom scripted engine. Each of these techniques allows many variations for diving deep into websites and extracting only relevant content, avoiding noise in the final dataset.
Curating Deep Web Content
As each document is harvested, it is normalized into a consistent stream of data and prepared for curation. It is important the content is properly prepared to optimize entity relationships. If data gets mangled, it can be mismatched and produce poor results. Garbage in, garbage out.
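The normalization step above amounts to mapping every harvested document, whatever its source, onto one consistent schema before curation. A minimal sketch with hypothetical field names (the real pipeline's schema is not described here):

```python
import re

def normalize(raw_doc):
    """Map a harvested document onto a consistent schema so the
    curation step sees clean, uniform input. Mangled or
    inconsistent fields here would propagate downstream:
    garbage in, garbage out."""
    body = raw_doc.get("body") or raw_doc.get("text") or ""
    return {
        "url": raw_doc.get("url", "").strip().lower(),
        "title": (raw_doc.get("title") or "").strip(),
        # Collapse runs of whitespace left over from HTML markup.
        "text": re.sub(r"\s+", " ", body).strip(),
    }

doc = normalize({
    "url": "HTTP://EXAMPLE.COM/Report",
    "title": "  Quarterly Report ",
    "body": "line one\n\n   line two",
})
```

Every document that reaches curation then has the same fields with the same cleanup applied, which is what makes reliable entity extraction possible.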
Each project has its own curation steps defined, because what works for a fraud project will be useless for a reputation management project. Our Rosoka integration allows us to be very flexible and extensive when it comes to creating custom named entity recognition solutions.
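The idea of per-project curation can be illustrated with a toy rule set. This is a hypothetical sketch using simple regular expressions, not the Rosoka engine or its API; it only shows how the same text yields different entities depending on which project's rules run.

```python
import re

# Hypothetical per-project entity rules: a fraud project might care
# about account numbers, a reputation project about brand mentions.
PROJECT_RULES = {
    "fraud": {"account": re.compile(r"\bACCT-\d{6}\b")},
    "reputation": {"brand": re.compile(r"\bAcmeCorp\b")},
}

def curate(text, project):
    """Run only the entity rules defined for the given project."""
    rules = PROJECT_RULES[project]
    return {name: pattern.findall(text) for name, pattern in rules.items()}

text = "AcmeCorp opened ACCT-123456 last week."
fraud_entities = curate(text, "fraud")
reputation_entities = curate(text, "reputation")
```

Swapping the rule set swaps the output entirely, which is why curation is configured per project rather than shared.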
Data Delivery through REST APIs
Lastly, data needs to be delivered to the analytics platform for further processing. At this point, the data is fully prepared for a third-party solution to ingest. The easiest and most common way to transfer it is via our REST API service.
Typically, our Data-as-a-Service ends with a REST API that our clients or partners can easily integrate into a dashboard, service, analytics engine, data visualization, reports, or even a simple data alert.
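On the client side, consuming such a feed typically means paging through an endpoint until it is drained. A minimal sketch, with a hypothetical in-memory feed standing in for HTTP GETs against a paged endpoint (the actual BrightPlanet API routes and parameters are not shown here):

```python
def fetch_all(get_page):
    """Drain a paged REST feed: request successive pages until
    the service returns an empty batch."""
    docs, page = [], 1
    while True:
        batch = get_page(page)
        if not batch:
            return docs
        docs.extend(batch)
        page += 1

# Stand-in for GET /documents?page=N against a hypothetical API.
FEED = {1: ["doc-1", "doc-2"], 2: ["doc-3"]}
docs = fetch_all(lambda page: FEED.get(page, []))
```

The same loop works whether the consumer is a dashboard, an analytics engine, or a simple alerting script; only what happens to `docs` afterward changes.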
BrightPlanet is the leader in providing deep Data-as-a-Service to our customers, delivering open-source web content through a simple-to-use service. Our customers do not need to worry about the complexities and details of harvesting, curating, and preparing data for analytics. Instead, they can focus on what they do best — creating intelligence.