BrightPlanet provides terabytes of data for various analytic projects across many industries. Our role is to locate open-source web data, harvest the relevant information, curate the data into semi-structured content, and provide a stream of data feeding directly into analytic engines, data visualizations, or reports. In this blog series, we dive into the techniques and processes we bring to each project.
In our previous post named “We harvest a lot of websites for our clients, but how do we know which sites to harvest in the first place?”, we talked specifically about finding valuable websites. Today, we are going to discuss how we decide the best ways to harvest content for our clients.
Determining What Content to Harvest
Before we set up any harvests, we always sit down with our clients and perform a client onboarding process, or harvest audit, to determine exactly which data makes the most sense to harvest. It is not sufficient to know only the website domain; we need to know which information to extract from those websites.
Think of a large news website, such as CNN. It is not practical, nor is it useful, to harvest the entire website. For example, if our client is looking for North Korean threats, we will target only the sections within CNN that focus on that topic, such as world news, while excluding sports, weather, lifestyle, etc.
Once we have defined our target content, each site may need to be reviewed and processed individually, depending on how targeted the data must be. Each project is different; sometimes a broad harvest with sufficient filtering is enough.
Each harvest event can be defined with term filters, URL filters, domain filters, depth filters, and more. Filters can be single items, multiple items, regular expressions, or even massive lists of keywords. There are no practical limits to our filtering system, and many filters are applied as the harvest runs, further optimizing harvest efficiency.
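To illustrate how these filter types can combine at harvest time, here is a simplified sketch in Python. The field names, patterns, and the `passes_filters` function are invented for this example; they are not our production configuration schema.

```python
import re

# Hypothetical filter set for one harvest event; the names and
# patterns are illustrative only, not a real configuration.
FILTERS = {
    "terms": re.compile(r"north korea|missile|pyongyang", re.I),    # term filter
    "url_allow": re.compile(r"^https?://(www\.)?cnn\.com/world/"),  # URL filter
    "domain_allow": {"cnn.com", "www.cnn.com"},                     # domain filter
    "max_depth": 3,                                                 # depth filter
}

def passes_filters(url: str, domain: str, depth: int, text: str) -> bool:
    """Apply each filter in turn as the harvest runs; a page is
    kept only if every filter accepts it."""
    if depth > FILTERS["max_depth"]:
        return False
    if domain not in FILTERS["domain_allow"]:
        return False
    if not FILTERS["url_allow"].match(url):
        return False
    return bool(FILTERS["terms"].search(text))
```

Because the checks run cheapest-first and short-circuit, irrelevant pages are discarded before the more expensive term search ever executes.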
Choosing the Right Harvest Techniques
BrightPlanet’s harvest solution contains many different techniques and harvesters, allowing us to pick and choose the most efficient way to harvest content for each website. We are not limited to a simple web crawl. This allows us to spend less time harvesting content and more time processing the data, a topic we previously covered here.
Since we utilize multiple harvest engines, we can easily choose the correct harvesting techniques for the website and client’s content needs.
For example, returning to CNN, say we are interested only in North Korean missile testing. Instead of harvesting thousands of irrelevant world news documents looking for the needle in a haystack, we can leverage our Deep Web harvest engine to run a customized search of the CNN website for specific keywords, ordered by publication date and filtered by sub-section. Now we are harvesting only extremely relevant documents.
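A sketch of what composing such a targeted search looks like in Python. The parameter names (`q`, `section`, `sort`) and the search endpoint are hypothetical; real site-search endpoints differ per website.

```python
from urllib.parse import urlencode

def build_search_url(base: str, keywords: list, section: str = None,
                     sort: str = "newest") -> str:
    """Compose a site-search URL instead of crawling the whole site.
    Parameter names here are illustrative, not a real site's API."""
    params = {"q": " ".join(keywords), "sort": sort}
    if section:
        params["section"] = section
    return f"{base}?{urlencode(params)}"

# Hypothetical example: search one section, newest first
url = build_search_url(
    "https://www.cnn.com/search",
    ["north", "korea", "missile", "testing"],
    section="world",
)
```

Querying the site's own search, rather than crawling it, means the site does most of the relevance filtering before we fetch a single document.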
Our Data Acquisition Engineers may even leverage different harvest techniques for the same website, if necessary. Perhaps we need to perform an initial harvest of archived content and then perform daily updates of new content. It is not necessary to constantly harvest old data; we would create one harvest to grab all content on the website. A second harvest (typically pointed at an RSS feed or home page) would monitor for new documents.
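The daily-update side of that pattern can be sketched as follows: parse the site's RSS feed and keep only links we have not already captured. This is a minimal illustration using Python's standard library, with a made-up feed; it is not our actual monitoring implementation.

```python
import xml.etree.ElementTree as ET

def new_items(rss_xml: str, seen: set) -> list:
    """Return links from an RSS feed that have not been harvested yet,
    recording them so the daily pass never re-fetches old documents."""
    root = ET.fromstring(rss_xml)
    fresh = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link not in seen:
            seen.add(link)
            fresh.append(link)
    return fresh

# Made-up sample feed and state for illustration
FEED = """<rss><channel>
  <item><link>https://example.com/a</link></item>
  <item><link>https://example.com/b</link></item>
</channel></rss>"""

seen = {"https://example.com/a"}  # already captured by the initial full harvest
```

Here `new_items(FEED, seen)` would return only `https://example.com/b`; the archived article is never touched again.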
Advanced Deep Web and Dark Web Tips
Targeting makes a huge difference when it comes to Dark Web content. Dark Web websites often cover many topics, many of them irrelevant to our client’s needs. Having the ability to quickly filter only relevant content channels allows us to be more efficient and also prevents our harvesters from being blocked – critical when harvesting Dark Web sites.
Another technique we often leverage involves multiple passes over the same website to pre-process content. Once an initial pass is performed, we process the data that was harvested to build intelligence into how we should monitor the site over a longer period of time. This pre-processed data is typically thrown away, since its value lies in the intelligence it provides rather than the content itself.
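One kind of intelligence the first pass can yield is posting cadence: how often the site publishes tells us how often to revisit it. A simplified sketch, with made-up publication dates and an invented `suggested_poll_days` heuristic:

```python
from datetime import date

# Publication dates gathered on the first pass; the harvested content
# itself is discarded, but its cadence drives the monitoring schedule.
# Dates below are made-up sample values.
first_pass_dates = [date(2017, 8, d) for d in (1, 3, 5, 9, 11, 15, 19, 21)]

def suggested_poll_days(dates: list) -> int:
    """Average gap between posts, clamped between daily and weekly polling."""
    dates = sorted(dates)
    gaps = [(b - a).days for a, b in zip(dates, dates[1:])]
    avg = sum(gaps) / len(gaps)
    return max(1, min(7, round(avg)))
```

For the sample dates above, the average gap works out to roughly three days, so a three-day polling interval would capture new content without wasting harvester time.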
Deep Web and Dark Web content may also be scattered with irrelevant material, or with content meant to obfuscate valuable data. We often use a broad harvest without filtering to identify relevant content, which may then be re-harvested using additional filters in a new harvest event. This allows us to curate higher-quality data with little additional work.
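The broad-then-filtered pattern can be sketched as a two-step pipeline: score every page from the unfiltered pass, then re-harvest only the pages worth a second look. The keyword set, scoring function, and sample pages below are all invented for illustration; real relevance scoring is considerably richer.

```python
# Hypothetical relevance terms for this example
RELEVANT = {"sanctions", "missile", "launch"}

def score(text: str) -> int:
    """Count relevant keyword hits; a stand-in for real scoring."""
    return len(set(text.lower().split()) & RELEVANT)

def urls_to_reharvest(broad_pass: dict, threshold: int = 2) -> list:
    """From an unfiltered broad pass ({url: text}), keep only the
    pages that justify a second, filtered harvest event."""
    return [url for url, text in broad_pass.items() if score(text) >= threshold]

# Made-up sample of a broad, unfiltered first pass
pages = {
    "https://example.onion/a": "missile launch announced amid sanctions",
    "https://example.onion/b": "forum chatter about music and games",
}
```

Only the first page clears the threshold, so the follow-up harvest event touches a fraction of the site while yielding a much cleaner data stream.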
BrightPlanet is the leader in providing deep Data-as-a-Service to our customers, built on open-source web content and delivered through a simple-to-use service. Our customers do not need to worry about the complexities and details of harvesting, curating, and preparing data for analytics. Instead, they can focus on what they do best — creating intelligence.