How We Do Data Harvesting

Here’s the scenario: you’re faced with a data problem that is negatively affecting your business. You don’t know where to find the data, let alone how you’ll be able to obtain it. That’s where we can help.

Data harvesting is a meticulous process and at the core of all we do. In order to discover and gather data in a way that’s useful for you, our Data Acquisition Engineers (DAEs) pair our advanced technology with a variety of unique collection methods.

Our Data Harvesting Technology

BrightPlanet’s technology includes a Deep Web Harvester (DWH) platform used to set up and configure web harvesting.

DWH is a fully hosted service that our DAEs operate on behalf of our clients. It is not a client-facing solution; our engineers use it to harvest, tag, and store data for our clients.

Once the data has been harvested with our DWH, the content is available through our Portal Dashboard, where clients can view it and perform additional research and analysis. The portal includes a simple-to-use, robust interface that supports features such as full keyword searching, saved searches, faceted searching, and exports.

Data acquisition through our DWH, operated by our DAEs on behalf of our clients, supports all techniques necessary to harvest modern-day open source websites. This includes complex queries into databases (assuming that the source itself supports complex queries), Deep Web sources, and all Surface Web websites. Harvest events are set up by our DAEs and, once configured, are fully automated.

The Portal Dashboard

The Portal Dashboard supports faceted searching with fully customizable entity extraction, which allows the facets to be tailored to each client. Default facets include people, companies, domain names, and source types.
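To make the idea of faceted searching concrete, here is a minimal sketch of how facet counts and facet filters can be computed over a set of documents. It is an illustration only, not the portal's implementation: the document fields (`people`, `companies`, `domain`, `source_type`) and the sample records are assumptions chosen to mirror the default facets listed above.

```python
from collections import Counter

# Hypothetical harvested documents; the field names are assumptions for illustration.
DOCS = [
    {"id": 1, "people": ["Ada Lovelace"], "companies": ["Acme"],
     "domain": "example.com", "source_type": "news"},
    {"id": 2, "people": ["Ada Lovelace", "Alan Turing"], "companies": [],
     "domain": "example.org", "source_type": "blog"},
    {"id": 3, "people": [], "companies": ["Acme"],
     "domain": "example.com", "source_type": "news"},
]


def _values(doc, field):
    """Return a facet field as a list, whether it is stored as a string or a list."""
    value = doc.get(field, [])
    return [value] if isinstance(value, str) else list(value)


def facet_counts(docs, field):
    """Count how many documents carry each value of the given facet field."""
    counts = Counter()
    for doc in docs:
        counts.update(set(_values(doc, field)))  # count each value once per document
    return counts


def filter_by_facet(docs, field, value):
    """Keep only the documents whose facet field contains the given value."""
    return [doc for doc in docs if value in _values(doc, field)]
```

Selecting a facet value in a search interface corresponds to calling something like `filter_by_facet(DOCS, "companies", "Acme")` and then recomputing the counts on the narrowed result set.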

Each user can create saved queries for future research. These queries can include full Boolean syntax, keyword wildcards, date range searching, and faceted keywords. Saved queries are unique to each user and can be used to export data in CSV format. When a saved query is re-run, it picks up all new content harvested since the last time it was executed. Saved queries can also be used with an RSS feed or our REST API.
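The re-run behavior can be pictured as a query that remembers when it last ran. The sketch below is a simplified illustration under stated assumptions: the `SavedQuery` class, the `text`, `url`, and `harvested_at` fields, and the "all terms must match" rule are hypothetical stand-ins, and the rule is far simpler than full Boolean syntax.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class SavedQuery:
    owner: str
    required_terms: list  # simplified stand-in for Boolean syntax: every term must match
    last_run: Optional[datetime] = None

    def run(self, docs, now):
        """Return matching documents harvested since the previous run, then move the checkpoint."""
        hits = [
            d for d in docs
            if all(t.lower() in d["text"].lower() for t in self.required_terms)
            and (self.last_run is None or d["harvested_at"] > self.last_run)
        ]
        self.last_run = now
        return hits


def to_csv(docs):
    """Export the URL and harvest date of each document as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["url", "harvested_at"])
    for d in docs:
        writer.writerow([d["url"], d["harvested_at"].isoformat()])
    return buf.getvalue()
```

The first call to `run` returns every match; later calls return only documents with a harvest date newer than the stored checkpoint, which is the property that makes a saved query usable as a feed.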

As documents are harvested, duplicates are automatically removed. If documents have been modified since their last harvest, each version is stored in the database—although only the most recent version is available within the Portal Dashboard.

Each document gets a unique hash value that is used to detect duplicate content from multiple sources (for example, the same page at two different URLs). This hash value can be used externally as a universally unique identifier across multiple data feeds.
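As an illustration of hash-based duplicate detection, the sketch below computes a content hash and keeps one URL per distinct hash. The choice of SHA-256 and the whitespace normalization step are assumptions made for this example; the text above does not say which hash function or normalization the harvester uses.

```python
import hashlib
import re


def content_hash(text):
    """Hash whitespace-normalized text with SHA-256 (both choices are assumptions)."""
    normalized = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(pages):
    """Map each content hash to the first URL seen with that content.

    `pages` is an iterable of (url, text) pairs.
    """
    seen = {}
    for url, text in pages:
        seen.setdefault(content_hash(text), url)  # later duplicates are ignored
    return seen
```

Because the hash depends only on the content, two feeds that harvest the same page independently arrive at the same identifier, which is what allows it to serve as a key across data feeds.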

Portal searches rank results by relevance to the query, although results can also be sorted by harvest date. Harvest date sorting becomes very valuable when the content is consumed through our REST API for external analysis.

Data Acquisition Engineers and Deep Web Harvesting

BrightPlanet provides each client with their own DAE and Project Manager to set up, configure, and maintain their harvest sites and activities. Each DAE is highly trained and skilled in harvesting web content efficiently and accurately.

All harvest events will be set up for the client by our DAE team, including any spidering or web crawling.

The DWH contains an anonymity proxy server cluster that is leveraged for all web harvesting. This is done for two reasons:

Many sites will automatically block a user agent that makes too many requests within a given period of time. Routing requests through proxy servers with rotating IP addresses resolves this issue with most websites, although best-practice data harvesting techniques are always used to avoid bot detection.
It is best to harvest from servers whose IP addresses are generally anonymous, meaning that a reverse lookup does not easily attribute them to BrightPlanet or the client. In our case, these IP addresses are owned and managed by Amazon Web Services (AWS), not BrightPlanet.
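The combination of rotating proxies and request pacing described above can be sketched as a small scheduler. This is a hypothetical illustration, not the DWH's logic: the `RequestPlanner` class, the round-robin rotation, and the fixed per-host delay are all assumptions, and the code only plans requests without making any network calls.

```python
import itertools
from urllib.parse import urlsplit


class RequestPlanner:
    """Pick a proxy for each request and space out requests to the same host."""

    def __init__(self, proxies, min_delay_seconds):
        self._proxies = itertools.cycle(proxies)  # simple round-robin rotation
        self._min_delay = min_delay_seconds
        self._next_allowed = {}  # host -> earliest time the next request may start

    def plan(self, url, now):
        """Return (proxy, start_time) for a request to `url` issued at time `now` in seconds."""
        host = urlsplit(url).hostname
        start = max(now, self._next_allowed.get(host, now))
        self._next_allowed[host] = start + self._min_delay
        return next(self._proxies), start
```

With two proxies and a two-second delay, a second request to the same host is pushed back to the two-second mark, while a request to a different host can start immediately.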

Although our harvest engine operates anonymously, end users who click links that leave our portal and go directly to a source website will expose the IP address of their local desktop client. Clients should therefore deploy additional safeguards at their own sites to ensure that this behavior complies with internal policies.

All common structured and unstructured text content file formats are fully supported by DWH, including, but not limited to, XLS, DOC, and PDF. We do not process non-text formats like images, audio, or video.

In addition to file format conversions, metadata is collected for these file formats, including title, date published (if available), domain name, URL, and author (if available). Other data can be identified using entity extraction.
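For an HTML page, collecting these fields might look like the sketch below, which uses only Python's standard library. It is an assumption-laden illustration: it presumes the page exposes its author in a `meta name="author"` tag and its publication date in an Open Graph `article:published_time` property, and it returns `None` when a field is missing, matching the "if available" wording above. The `MetaExtractor` and `extract_metadata` names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class MetaExtractor(HTMLParser):
    """Collect the <title> text and <meta> name/property pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content") is not None:
                self.meta[key.lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = (self.title or "") + data


def extract_metadata(url, html):
    """Return the metadata fields for one page; missing fields are None."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "domain": urlsplit(url).hostname,
        "title": parser.title.strip() if parser.title else None,
        "author": parser.meta.get("author"),
        "date_published": parser.meta.get("article:published_time"),
    }
```

Other formats such as PDF or DOC would need their own extractors, which are outside the scope of this sketch.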

DWH can access most open source websites, including the Deep Web, the Dark Web, and Surface Web websites. This includes, but is not limited to, open subscription sites, RSS feeds, sites that return XML or JSON, open social media platforms (including Facebook and Twitter), news websites, blogs, patent documents and databases, websites with conference data, contract awards, or other web content.

Data Harvesting with the Experts

Beyond the types of websites we can access, we can harvest the raw content they contain. Such content includes company blogs, product listings, product literature, press releases, aggregated news providers, classified listings, competitor research, market research, people research, social media, patent databases, security auditing, threat assessment, and even job boards.

Anything available on open websites, the Deep Web, and the Dark Web can be harvested, tagged, and stored in a database by our technology. Data that is private or otherwise not publicly available cannot be harvested. Such data, which includes content like internal email and Raytheon repositories, would need to be loaded manually.

If you have any additional questions about our technology, data harvesting, or discovery and gathering methods, tell us what you’re working on. One of our expert DAEs can help identify where we fit in with your business.