Understanding Bot Blockers and Web Harvesting the Smart Way

Bot blockers are tools that monitor your website traffic and look for patterns indicating a robot is harvesting your site’s content. When a bot blocker detects a robot, the bot blocker usually drops the request, so the robot thinks that the server is not there and will discontinue making data requests.

Bot blocker algorithms work because robots built to grab every page as quickly as possible leave common, recognizable patterns. At BrightPlanet, we often deal with bot blockers, but we have adopted a reactive position instead of a defensive position. In this post, you’ll learn how BrightPlanet harvests data and handles bot blockers.

Whatever we do to get around a bot blocker, we will be detected eventually, and the owner of the website we harvest will continue to adjust their rules to block our automatic harvest engine. These blocking adjustments often happen in real time. Most platforms, like WordPress, have easy-to-install plugins to meter traffic and block robots. Some bot blockers are sophisticated and evaluate specific request patterns, HTTP headers, source addresses, page request behaviors, and more.
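To make the idea of a request pattern concrete, here is a minimal sketch of one signal a bot blocker might use: too many requests from a single source address inside a short time window. The class name, the thresholds, and the example addresses are all assumptions made for illustration; this is not the algorithm of any particular plugin, and real bot blockers combine many more signals.

```python
from collections import defaultdict, deque


class RateBasedBotDetector:
    """Hypothetical detector: flags a source that sends too many requests
    within a sliding time window."""

    def __init__(self, max_requests=10, window_seconds=5.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)  # source address -> request timestamps

    def looks_like_bot(self, source_address, timestamp):
        """Record one request and report whether the source exceeds the limit."""
        recent = self._recent[source_address]
        recent.append(timestamp)
        # Discard timestamps that have fallen out of the sliding window.
        while recent and timestamp - recent[0] > self.window_seconds:
            recent.popleft()
        return len(recent) > self.max_requests


detector = RateBasedBotDetector(max_requests=10, window_seconds=5.0)

# A crawler requesting a page every 0.1 seconds from one address.
crawler_flags = [detector.looks_like_bot("203.0.113.7", i * 0.1) for i in range(20)]

# A visitor requesting a page every 10 seconds from another address.
visitor_flags = [detector.looks_like_bot("198.51.100.4", i * 10.0) for i in range(20)]
```

In this sketch the crawler is flagged once its eleventh request arrives inside the window, while the slower visitor is never flagged. That is the same trade-off site owners face: block aggressive robots without getting in the way of valid customers.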

Just like with CAPTCHA, content providers want to avoid causing further pain to their valid customers while blocking those interpreted as a threat. This makes CAPTCHA and bot blockers fairly uncommon. Additionally, the top traffic generator for most websites today is Google, the biggest robot on the Web, and everyone wants to make sure their site is prominently represented there.

Avoiding Abuse of Bot Blockers

As long as we do not abuse a website, such as by repeatedly harvesting an entire site, bot blockers won’t block our harvesters. Our harvest engine is very site-friendly and we always avoid abusing websites, which is why we insist that our Data Acquisition Engineers are the ones who configure harvest events for our clients. We monitor the quantity of Web page changes and adjust each harvest event accordingly to hit the Web server as little as possible while still meeting our customers’ needs.
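One way to picture adjusting a harvest schedule to how often pages change is a simple back-off rule: wait longer before the next visit when nothing changed, and come back sooner when something did. The sketch below is only an illustration of that idea. The function name, the doubling and halving rule, and the hour limits are assumptions, not a description of BrightPlanet’s harvest engine.

```python
def next_interval_hours(current_hours, page_changed, min_hours=1.0, max_hours=168.0):
    """Hypothetical rule: halve the wait after a change, double it otherwise,
    and keep the result between min_hours and max_hours."""
    if page_changed:
        proposed = current_hours / 2
    else:
        proposed = current_hours * 2
    return max(min_hours, min(max_hours, proposed))


# Start with a daily harvest and feed in four observations:
# unchanged, unchanged, changed, unchanged.
interval = 24.0
history = []
for changed in [False, False, True, False]:
    interval = next_interval_hours(interval, changed)
    history.append(interval)
```

With these inputs the interval moves from 24 hours to 48, 96, 48, and then 96 again, so a site that rarely changes is visited less and less often, up to the weekly cap.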

One area where we see a lot of bot blocker implementations is the sites that we harvest for law enforcement. With these projects, we harvest content from sites that often operate illegally and tend to be more cautious of robots. We find these sites will simply block entire IP address blocks or even traffic originating from specific countries.

Bypassing bot blockers may create legal exposure if the content is used in a way that the owner sees as violating their intellectual property. By honoring bot blockers, you are honoring the site owner’s requests and are only harvesting information similar to what search engines (like Google) could collect.

In late 2013, the company 3Taps fought a legal battle for navigating around Craigslist’s bot blocking services and republishing data for use by developers. Craigslist’s main argument was that republishing data from their site reduced the amount of ad revenue generated by individuals visiting the site. The case is still ongoing in court.

At BrightPlanet, we developed a harvest engine to easily control how we collect data from sites in a bot-friendly way. While we want to be able to collect data for our clients, we also do not want to violate an author’s rights to their content. To address this, we always attribute all content back to the original source with a direct Web link and never republish the content itself.

Download our white paper on Google search versus Deep Web harvesting to learn how they work and how they are different.