Website structures are constantly changing. You might be surprised how often websites swap formatting, themes, or their entire layout. These changes will typically break custom harvest scripts from inexpensive or roll-your-own harvesting solutions.
BrightPlanet’s harvest engine and quality assurance solution are designed to be robust and tolerant of these types of website changes. Our harvest engine continues to harvest and provide value without human intervention because we typically rely on unstructured keyword rules instead of hardwired rules.
In our previous post, named “All websites are not created equal”, we talked specifically about our techniques for harvesting the best quality content from websites. Today, we are going to discuss how we ensure that clients’ harvests continue to operate even after the websites change.
Keeping Your Harvests Simple
If there is one thing we have learned over the last 18 years of web scraping, it is to keep your harvests simple. I don’t mean harvesting simple websites. Instead of worrying about which text nodes to process and which to ignore while harvesting data, let your analytics and unstructured text rules do the heavy lifting.
Sites requiring a large number of custom steps, extractions, and hardwired rules are the ones that break most easily. BrightPlanet takes the approach of loosening harvest rules to ensure we’re pulling content, and then tightening restrictions on the analytics, qualifications, and post-harvest filtering.
For example, we often restrict a harvest based on URL path, using either substring matching or regular expressions, instead of defining a series of rules to determine which links should be followed. It may sound simple, but it is very effective. If you are still picking up false-positive links, use unstructured keyword filters to polish the data set.
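As a rough illustration of URL path filtering, the sketch below keeps a link if its path contains a given substring or matches a regular expression. The substrings, the pattern, the function name `should_follow`, and the example URLs are all hypothetical, chosen only for this example; they are not BrightPlanet's actual rules or API.

```python
import re
from urllib.parse import urlparse

# Hypothetical rules for illustration only.
PATH_SUBSTRINGS = ("/news/", "/blog/")
PATH_PATTERN = re.compile(r"^/articles/\d{4}/")

def should_follow(url: str) -> bool:
    """Return True if the URL path passes the substring or regex rule."""
    path = urlparse(url).path
    # Substring rule: cheap check that survives most layout changes.
    if any(fragment in path for fragment in PATH_SUBSTRINGS):
        return True
    # Regex rule: for paths that need a slightly stricter shape.
    return PATH_PATTERN.search(path) is not None

links = [
    "https://example.com/news/site-redesign",
    "https://example.com/articles/2023/harvest-tips",
    "https://example.com/login",
]
followed = [url for url in links if should_follow(url)]
```

Because the rule looks only at the URL path, it keeps working when a site changes its theme or page markup, as long as the URL scheme stays the same.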
Leaving Rules As Unstructured
There is only so much you can do with harvest filters; sometimes you need to jump into the unstructured content to finalize content quality.
In its simplest form, we leverage large keyword lists and require webpages to contain one or more of the keywords for the content to be kept. Our list processing system allows us to easily process keyword lists containing thousands of entries.
A good example of this filtering is determining whether a webpage is selling a product, such as a pharmaceutical drug. We can provide a list of tens of thousands of drug names and require that at least one of them appears on the page; otherwise, the page is rejected.
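A minimal sketch of this kind of keyword qualification might look like the following. It assumes the page text has already been extracted as a plain string; the function names `build_keyword_pattern` and `keep_page`, the `min_hits` parameter, and the two-entry drug list are made up for the example and do not describe BrightPlanet's list processing system.

```python
import re

def build_keyword_pattern(keywords):
    """Compile one case-insensitive pattern from a keyword list."""
    cleaned = {k.strip() for k in keywords if k.strip()}
    if not cleaned:
        raise ValueError("keyword list is empty")
    # Longest first, so longer names are tried before shorter ones.
    escaped = sorted((re.escape(k) for k in cleaned), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)

def keep_page(text, pattern, min_hits=1):
    """Keep the page if it mentions at least min_hits distinct keywords."""
    hits = {match.group(0).lower() for match in pattern.finditer(text)}
    return len(hits) >= min_hits

# Hypothetical two-entry list; a real list could hold many thousands.
drug_pattern = build_keyword_pattern(["ibuprofen", "amoxicillin"])
```

The filter never inspects the page structure, so a redesign of the site does not affect it. For very large lists, a dedicated multi-pattern matcher such as an Aho-Corasick implementation may perform better than a single large regular expression.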
Quality Assurance Checks
After the harvest events are defined for a project, we integrate ongoing monitoring to ensure harvests are producing an expected number of documents over time. This will vary from project to project, source to source, and day to day. Using a standard deviation calculation and historic trending data, we can easily tell if a harvest needs to be reviewed or is operating within its expected boundaries.
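One way to sketch such a check is to compare the latest document count against the mean and standard deviation of recent counts. The function name `needs_review`, the threshold of two standard deviations, and the sample counts are assumptions made for this example, not the values or the exact calculation BrightPlanet uses.

```python
from statistics import mean, stdev

def needs_review(history, latest, threshold=2.0):
    """Flag a harvest whose latest count is far from its historic mean.

    history:   past document counts for one source (hypothetical data)
    latest:    the most recent document count
    threshold: allowed distance from the mean, in standard deviations
    """
    if len(history) < 2:
        return False  # too little data to estimate a spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly stable history: any change is worth a look.
        return latest != mu
    return abs(latest - mu) / sigma > threshold

daily_counts = [100, 110, 95, 105, 90, 102, 98]
```

With this sample history, a day that yields only 3 documents would be flagged for review, while a day that yields 104 would fall within the expected range. Since volumes vary from source to source, the history and threshold would be tracked per harvest rather than globally.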
BrightPlanet is the leader in providing deep Data-as-a-Service to our customers, delivering open-source web content through a simple-to-use service. Our customers do not need to worry about the complexities and details of harvesting, curating, and preparing data for analytics. Instead, they can focus on what they do best: creating intelligence.