Proxy Servers: Defying the Data-Collection Odds

At BrightPlanet, data harvesting is our specialty. But data harvesting is anything but simple. That’s where proxy servers come in.

We have an anonymous proxy server and a TOR proxy cluster that we leverage for most of our harvesting.

Friendly Crawling: A Harvesting Necessity

Today, many servers automatically block crawlers (like ours) based on simple patterns of activity. Our proxy server provides a buffer for our harvest engine, allowing it to do its job unhindered.

This proxy allows us to extend our collection ability while remaining a friendly crawler. We do not use our proxy for complex anonymization, however. For that, we use a separate TOR proxy cluster.
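To make the idea of friendly crawling through a proxy concrete, here is a minimal Python sketch. It is an illustration only and not a description of the BrightPlanet harvest engine: the proxy address, the `PoliteFetcher` class, the user agent string, and the delay value are all hypothetical, and honoring robots.txt and spacing out requests are common politeness conventions assumed for the example.

```python
import time
import urllib.request
import urllib.robotparser


def build_proxy_opener(proxy_url):
    """Return an opener that sends HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


class PoliteFetcher:
    """Hypothetical helper: obeys robots.txt rules and waits between requests."""

    def __init__(self, opener, robots_lines, user_agent, min_delay_seconds=1.0):
        self.opener = opener
        self.user_agent = user_agent
        self.min_delay_seconds = min_delay_seconds
        # Parse robots.txt rules that were already downloaded as a list of lines.
        self.robots = urllib.robotparser.RobotFileParser()
        self.robots.parse(robots_lines)
        self._last_request = None

    def allowed(self, url):
        """True if the parsed robots.txt rules permit fetching this URL."""
        return self.robots.can_fetch(self.user_agent, url)

    def seconds_to_wait(self, now):
        """How long to pause so requests are at least min_delay_seconds apart."""
        if self._last_request is None:
            return 0.0
        return max(0.0, self.min_delay_seconds - (now - self._last_request))

    def fetch(self, url):
        """Fetch url through the proxy, or return None if robots.txt disallows it."""
        if not self.allowed(url):
            return None
        time.sleep(self.seconds_to_wait(time.monotonic()))
        self._last_request = time.monotonic()
        request = urllib.request.Request(url, headers={"User-Agent": self.user_agent})
        with self.opener.open(request, timeout=30) as response:
            return response.read()
```

A caller would build the opener once, for example with `build_proxy_opener("http://127.0.0.1:3128")` (a placeholder address), and reuse one `PoliteFetcher` per site so the delay between requests is tracked in one place.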

While we were expanding our TOR proxy cluster a few weeks ago to include specific in-country exit nodes, we reviewed some of the standard traffic statistics. While the numbers are not as gigantic as Google’s, they are impressive nonetheless.

Proxy Server Statistics

Over the last two years, our proxy cluster has harvested just under 22TB (21,897,304,284,302 bytes) of data and has sent just under 545GB (543,973,470,158 bytes) in request headers.

Over that same two-year period, our proxy servers had a total downtime of only 55 minutes, with one of the servers accounting for 49 of those minutes. That’s less than an hour, folks!

On average, we harvested roughly 30GB of data every day, 7 days a week.
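The arithmetic behind these figures is easy to check. The short Python sketch below recomputes the daily average and the availability from the totals quoted above. It assumes a two-year period of exactly 730 days and decimal gigabytes (one billion bytes each); the function names are ours for illustration.

```python
# Totals quoted in the statistics above.
TOTAL_HARVESTED_BYTES = 21_897_304_284_302
TOTAL_REQUEST_HEADER_BYTES = 543_973_470_158
DOWNTIME_MINUTES = 55

# Assumptions: "two years" is taken as 730 days, and 1GB is 10**9 bytes.
DAYS_IN_PERIOD = 2 * 365
BYTES_PER_GB = 10**9


def daily_average_gb(total_bytes, days=DAYS_IN_PERIOD):
    """Average gigabytes per day over the whole period."""
    return total_bytes / days / BYTES_PER_GB


def availability_percent(downtime_minutes, days=DAYS_IN_PERIOD):
    """Share of the period, in percent, during which the proxies were up."""
    total_minutes = days * 24 * 60
    return 100.0 * (1 - downtime_minutes / total_minutes)


print(f"Harvested per day: {daily_average_gb(TOTAL_HARVESTED_BYTES):.1f} GB")
print(f"Request headers per day: {daily_average_gb(TOTAL_REQUEST_HEADER_BYTES):.2f} GB")
print(f"Availability: {availability_percent(DOWNTIME_MINUTES):.3f}%")
```

Under these assumptions the harvested volume rounds to 30GB per day, and 55 minutes of downtime over two years works out to an availability above 99.99%.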

What Could You Do With 30GB in One Day?

We couldn’t help but think about how 30GB could otherwise be used in one day, so we looked up a few numbers. If you had 30GB to use in one day, you could store roughly:

  • 1.5 Blu-ray movies (2 hours each)
  • 3 HD movies (2 hours each)
  • 8 DVD movies (2 hours each)
  • 7,500 MP3 files (4MB each)
  • 10,000 digital photos (3MB each)
  • 30,000 ebooks (1MB each)
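
For anyone who wants to check the list, here is a small Python sketch of the arithmetic. It treats 30GB as 30,000MB (decimal units, an assumption), uses the per-item sizes given in the list, and works backward from the movie counts to the file size each one implies. The names are illustrative only.

```python
DAILY_BUDGET_MB = 30_000  # 30GB expressed in decimal megabytes (assumption)

# Per-item sizes stated in the list above, in megabytes.
STATED_SIZES_MB = {"MP3 file": 4, "digital photo": 3, "ebook": 1}

# Movie counts stated in the list above; the list gives no file sizes for them.
STATED_MOVIE_COUNTS = {"Blu-ray movie": 1.5, "HD movie": 3, "DVD movie": 8}


def items_that_fit(item_size_mb, budget_mb=DAILY_BUDGET_MB):
    """Number of whole items of the given size that fit in the daily budget."""
    return budget_mb // item_size_mb


def implied_size_gb(count, budget_mb=DAILY_BUDGET_MB):
    """File size in GB that each item must have for `count` items to fill the budget."""
    return budget_mb / 1000 / count


for name, size_mb in STATED_SIZES_MB.items():
    print(f"{items_that_fit(size_mb):,} x {name} ({size_mb}MB each)")

for name, count in STATED_MOVIE_COUNTS.items():
    print(f"{count} x {name} implies about {implied_size_gb(count):.2f}GB per movie")
```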

Structured Harvesting

Our harvest engine performs directed harvesting: we only harvest data that specifically meets our clients’ needs. We never “just harvest” data in the hope that someone, somewhere will need it. Every document we collect is collected for a specific purpose.

While we may not harvest the most data, we still think 30GB each day is a lot of data.

BrightPlanet’s Data Harvesting Services

Are you in need of structured data harvesting? We’d love to help!

We want to help you find the best way to leverage open source intelligence tools for your specific needs. So, tell us what you’re working on, and one of our Data Acquisition Engineers will be in touch with you to schedule a free consultation.