We harvest a lot of websites for our clients, but how do we know which sites to harvest in the first place?

BrightPlanet has provided terabytes of data for various analytic projects across many industries over the years. Our role is to locate open-source web data, harvest the relevant information, curate the data into semi-structured content, and provide a stream of data feeding directly into analytic engines or final reports.

The first phase of all projects – we have three phases total – is harvesting. It is not uncommon for a single project to have 10,000-20,000 websites from which we harvest content. While a large volume of websites is impressive, the more important element is knowing which websites are the most appropriate for the current task. This is where our Data Acquisition Engineers become a critical part of the solution.

Techniques for Identifying Websites

BrightPlanet has developed several techniques for locating websites; some are trade secrets while others are just common sense. We are going to use the term “website” to refer to a single site under one domain name, such as “brightplanet.com” or “cnn.com”. As you can imagine, a single website may have a few pages or several billion pages; it all depends on the site. A useful trick for estimating the number of pages on a website is to run a Google search using only the site: operator (ex: site:brightplanet.com).

Occasionally clients arrive with a complete list of websites they want harvested; more typically, they have only a partial list. Either way, these lists usually require some type of curation or quality check.

Most often, our Data Acquisition Engineers will work with a client to define what a ‘good’ website looks like. This helps identify and filter out websites that may be loosely relevant but are not on target with the client’s needs.

As you might have guessed, one way to identify new sites is to “search” for them using surface-web search engines like Google or DuckDuckGo. Unlike an analyst who might need to search, iterate, search again, and then review each website by hand, we can do that quickly using our Deep Web harvesting technology and some filter rules. This is extremely effective because we can iterate so fast with our harvest engine.
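To make the idea concrete, here is a minimal sketch of applying keyword filter rules to a list of candidate sites pulled from search results. The keyword set and example URLs are purely illustrative, and this is not BrightPlanet’s harvest engine; it only shows the shape of the filtering step.

```python
# Minimal sketch: score candidate sites from a search-result export against
# simple keyword filter rules. Keywords and URLs below are hypothetical examples.
import requests

FILTER_KEYWORDS = {"pharmaceutical", "distributor", "wholesale"}  # example rules

def score_site(url: str) -> int:
    """Fetch a site's homepage and count how many filter keywords appear."""
    try:
        html = requests.get(url, timeout=10).text.lower()
    except requests.RequestException:
        return 0
    return sum(1 for kw in FILTER_KEYWORDS if kw in html)

candidates = ["https://example-pharma-news.com", "https://example-blog.net"]
relevant = [u for u in candidates if score_site(u) >= 2]
print(relevant)
```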

Locating online lists or directories of similar sources is another effective technique. If you have ever tried to find the “best video editing software” online, you know people love to make lists. Using known entities, we can quickly locate other sites referencing related sources. From there, we can easily harvest and qualify the sites listed.
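A simple sketch of that step, assuming a known “top N” list page: pull every outbound link from the page so the linked sites can be queued for harvest and qualification. The URL is a placeholder, and the code uses the common requests/BeautifulSoup pairing rather than any proprietary tooling.

```python
# Sketch: collect the distinct external domains linked from a list/directory page.
from urllib.parse import urlparse
import requests
from bs4 import BeautifulSoup

def outbound_domains(list_url: str) -> set[str]:
    """Return the external domains linked from a list or directory page."""
    html = requests.get(list_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    source_domain = urlparse(list_url).netloc
    domains = set()
    for a in soup.find_all("a", href=True):
        netloc = urlparse(a["href"]).netloc
        if netloc and netloc != source_domain:
            domains.add(netloc)
    return domains

# Hypothetical list page; each returned domain becomes a harvest candidate.
print(outbound_domains("https://example.com/best-video-editing-software"))
```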

Advanced Techniques Used For Some Deep Web Harvesting

Those are pretty straightforward, and probably obvious, ways to locate websites. Some projects require much greater depth to locate valid websites. Here are a few creative techniques we have used on projects.

Monitoring newly purchased domains. Each day, a list of newly purchased (and newly expired) domains is generated. Since we can harvest and validate many sites at a time, we can use these lists in combination with good filtering and validation rules.
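As a rough illustration, a daily new-domains file can be screened against keyword rules before anything is harvested. The filename and keyword pattern below are assumptions for the sake of the example, not the actual feed or rules used on a project.

```python
# Sketch: screen a daily newly-registered-domains file (one domain per line)
# so only plausible candidates move on to harvesting and validation.
import re

KEYWORD_RULES = re.compile(r"pharma|rx|meds|tablets", re.IGNORECASE)  # example rules

def screen_domains(path: str) -> list[str]:
    """Return domains from the list whose names match the keyword rules."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if KEYWORD_RULES.search(line)]

candidates = screen_domains("newly_registered_domains.txt")  # hypothetical daily feed
print(len(candidates), "domains queued for validation")
```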

Diving deep into topical blogs or message boards is a great way to find hidden gems. Often, people will post links that get buried deep within these sites. Being able to crawl deep and validate those links allows us to locate sites that might otherwise be missed.
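One way to picture this is a small, bounded crawl of a single forum that surfaces the external links buried in its threads. This is only a sketch under assumed limits (page cap, single starting URL); a production crawler would handle politeness, robots.txt, and scale quite differently.

```python
# Sketch: breadth-first crawl within one forum domain, collecting the external
# domains linked from its pages. Depth is bounded by a simple page limit.
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup

def buried_links(start_url: str, max_pages: int = 50) -> set[str]:
    """Crawl pages on the forum's own domain and return external domains found."""
    forum_domain = urlparse(start_url).netloc
    queue, seen, external = [start_url], set(), set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        except requests.RequestException:
            continue
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            netloc = urlparse(link).netloc
            if netloc == forum_domain:
                queue.append(link)    # keep digging inside the forum
            elif netloc:
                external.add(netloc)  # a potential new source to validate
    return external
```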

Social media contains a wealth of links, but they typically need to be “exploded” since they are shortened for tracking purposes. Again, our harvest engine and social connectors make it possible to search, locate, harvest, extract, and validate these links.
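The “exploding” step itself is straightforward to sketch: follow a shortened link’s redirect chain to recover the final destination. The shortened URL below is a made-up example; some shorteners reject HEAD requests, so a fallback GET may be needed in practice.

```python
# Sketch: expand a shortened social-media link by following its redirects.
import requests

def expand(short_url: str) -> str:
    """Follow redirects via a HEAD request and return the final resolved URL."""
    try:
        resp = requests.head(short_url, allow_redirects=True, timeout=10)
        return resp.url
    except requests.RequestException:
        return short_url  # leave unresolved links as-is for manual review

print(expand("https://bit.ly/3abcdef"))  # hypothetical shortened link
```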

Conclusion

BrightPlanet is the leader in providing Data-as-a-Service built on open-source web content, delivered through a simple-to-use service. Our customers do not need to worry about the complexities and details of harvesting, curating, and preparing data for analytics. Instead, they can focus on what they do best: creating intelligence.
