
Providing Structure and Security to Web Data Harvesting

Over the years, one of our main focuses has been getting the most out of our Deep Web search. As more industries discover a pressing need to harvest web data and manage their Big Data, we’ve taken the necessary steps to provide structure and security for their businesses.

Here are some of the new and improved ways we are providing structure and security for our clients.

Deep Web Harvester

BrightPlanet’s Deep Web Harvester (DWH) solution is a full Software-as-a-Service (SaaS) platform for web harvesting with no onsite installation or configuration. All web traffic, including client sessions on our content portal, is served over connections secured with a valid SSL certificate.

DWH is hosted on Amazon Web Services (AWS) commercial infrastructure; this includes our harvest servers, document processing servers, databases, queues, front-end services, and our REST API. Each is monitored 24/7 for stability, performance, and overall server health.

The DWH service has been evolving and improving for over 15 years, constantly keeping up with the ever-changing landscape of the open-source web.

We have even expanded our offerings as social media platforms emerged, broadened our Deep Web harvesting support, added TOR (Dark Web) harvesting solutions, and expanded our source scripting engine to support complex systems with CAPTCHA, paywalls, and dynamic content.

BrightPlanet continues to innovate and expand our core harvest technology. Our core engine, which has been positioned for growth since its inception, will continue to expand and evolve as needed.

REST API

Many of our projects involve integration with external or third-party business intelligence (BI) solutions. We developed our REST API so that clients can customize these integrations without BrightPlanet needing to be directly involved.

There are no limitations to the BI solutions we can support; most platforms integrate easily, either with our RDBMS layer or through our REST API. Some of the tools that we have integrated with include Tableau, IBI, Palantir, Centrifuge, Saffron Technology, and DataVoyant.

A well-documented REST API is available to extract all harvested web data for use in any third-party solution. Clients frequently use this API for a wide variety of purposes, including integration with BI tools, development of external reports, integration with dashboards, and simple post-processing of content.
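As a rough illustration of the post-processing use case, the sketch below pages through harvested documents and tallies them by source domain. The page-numbered request, the `documents` key, and the `url` field are hypothetical placeholders, not BrightPlanet's documented API; the real names should come from the API documentation. The HTTP call is passed in as a function so the logic can be exercised without network access.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Iterator, List
from urllib.parse import urlparse

# Hypothetical page shape: {"documents": [{"url": "..."}, ...]}
Page = Dict[str, List[dict]]


def iter_documents(fetch_page: Callable[[int], Page]) -> Iterator[dict]:
    """Yield documents page by page until an empty page is returned."""
    page_number = 1
    while True:
        page = fetch_page(page_number)
        documents = page.get("documents", [])
        if not documents:
            return
        yield from documents
        page_number += 1


def count_by_domain(documents: Iterable[dict]) -> Counter:
    """Tally documents by the host name of their source URL."""
    return Counter(urlparse(doc["url"]).netloc for doc in documents)


def fake_fetch_page(page_number: int) -> Page:
    """Stand-in for a real HTTP GET; returns canned data for two pages."""
    pages = {
        1: {"documents": [{"url": "https://example.com/a"},
                          {"url": "https://example.org/b"}]},
        2: {"documents": [{"url": "https://example.com/c"}]},
    }
    return pages.get(page_number, {"documents": []})


if __name__ == "__main__":
    print(count_by_domain(iter_documents(fake_fetch_page)))
```

In a real integration, `fake_fetch_page` would be replaced by a function that issues an authenticated request to the actual endpoint and returns the parsed JSON body.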

In addition to our REST API, clients can access their AWS-hosted MySQL instance (Aurora) for an additional hosting fee. This opens up further possibilities for certain BI tools, which can run more complex SQL queries on the server and avoid additional client-side processing. Tools like Tableau benefit from direct access to MySQL.

Integrating Accessibility for Data Harvesting

Our DWH has several integration points, allowing our Data Acquisition Engineers to integrate it easily with any platform. A typical integration post-processes harvested content in another analytic platform, which is done through our existing REST API.

If analytics need to be integrated into our document processing pipeline, our open-pipeline platform accommodates inline processing, such as an entity extraction solution or machine translation. As new or custom features become necessary, these pipelines or APIs will be leveraged to accommodate the RTN tools and process.
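One way to picture inline processing is as a chain of stages, each of which takes a document and returns an enriched copy. The sketch below is a generic illustration, not the open-pipeline platform's actual interface; the stage names and document fields are invented for the example, and the "entity extractor" is a trivial capitalized-word matcher standing in for a real extraction product.

```python
import re
from typing import Callable, Dict, List

Document = Dict[str, object]
Stage = Callable[[Document], Document]


def normalize_whitespace(doc: Document) -> Document:
    """Collapse runs of whitespace in the document text."""
    text = " ".join(str(doc["text"]).split())
    return {**doc, "text": text}


def extract_capitalized_terms(doc: Document) -> Document:
    """Toy stand-in for entity extraction: collect capitalized words."""
    terms = re.findall(r"\b[A-Z][a-z]+\b", str(doc["text"]))
    return {**doc, "entities": sorted(set(terms))}


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    """Apply each stage in order, passing the result to the next."""
    for stage in stages:
        doc = stage(doc)
    return doc


if __name__ == "__main__":
    raw = {"text": "Alice   met Bob\nin Paris."}
    print(run_pipeline(raw, [normalize_whitespace, extract_capitalized_terms]))
```

Because every stage shares the same signature, a new step such as machine translation could be added to the list without changing the stages around it.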

Our DWH engine is written entirely in Java and is a hosted solution, so the client does not need to set up or install anything within their infrastructure. The REST API comes with Python and Java examples, but clients are not required to use either language.

Our DWH solution has no third-party dependencies. We have fully integrated with the Rosoka Entity Extraction platform to provide comprehensive content extraction and analysis, but the solution itself has no external dependencies.

The only requirement to leverage our harvested content is a modern web browser and access to the Internet. There are no browser plugins; all dashboards are rendered in HTML and use JavaScript to improve dynamic content.

The Powerhouse of Data Harvesting

BrightPlanet has developed an extremely robust architecture that supports large-scale harvesting through a distributed solution that leverages virtual servers within a server cluster. Since this infrastructure is developed around the AWS architecture, those virtual servers can be launched with zero lead-time.

Document content processing has been decoupled from the harvesting engine so that the two processes can scale independently of each other. The result is a flexible architecture that can quickly accommodate anything from small- to large-scale solutions.
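The decoupling described above is commonly achieved by placing a queue between the two sides. The following is a minimal single-machine sketch of that general pattern using Python threads, not a description of BrightPlanet's actual implementation; the function names, the fake "harvest" step, and the document fields are all assumptions made for illustration.

```python
import queue
import threading
from typing import List

SENTINEL = None  # signals a processor that no more work is coming


def harvest(urls: List[str], out: queue.Queue) -> None:
    """Producer: pretend to fetch each URL and enqueue the raw content."""
    for url in urls:
        out.put({"url": url, "raw": f"content of {url}"})


def process(inbox: queue.Queue, results: list, lock: threading.Lock) -> None:
    """Consumer: pull raw documents and record a processed version."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            break
        processed = {"url": item["url"], "length": len(item["raw"])}
        with lock:
            results.append(processed)


def run(urls: List[str], num_processors: int) -> list:
    """Run one harvester and a configurable number of processors."""
    work = queue.Queue()
    results = []
    lock = threading.Lock()
    workers = [
        threading.Thread(target=process, args=(work, results, lock))
        for _ in range(num_processors)
    ]
    for worker in workers:
        worker.start()
    harvest(urls, work)
    for _ in workers:
        work.put(SENTINEL)  # one stop signal per processor
    for worker in workers:
        worker.join()
    return results


if __name__ == "__main__":
    print(run(["a", "b", "c"], num_processors=2))
```

The harvester only knows about the queue, so the number of processors can be changed through `num_processors` without touching the harvesting code. In a distributed deployment the in-memory queue would be replaced by a networked queue service.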

If you have any questions about our Deep Web search and/or data harvesting, tell us what you’re working on. One of our expert Data Acquisition Engineers will work with you to help discover your custom data solution.
