In this post and the included video, we’re going to talk about how BrightPlanet harvests TOR network data.
Harvesting TOR Data
BrightPlanet built its web data harvesting engine to operate directly through a standard proxy server. A proxy is a computer that sits between two machines and relays their traffic, which helps mask where a request originates.
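To make the proxy idea concrete, here is a minimal Python sketch of routing page requests through a proxy. It is an illustration under stated assumptions, not BrightPlanet's actual engine: the proxy address `127.0.0.1:8118` and the function names are hypothetical. Tor clients usually expose a SOCKS port, so the sketch assumes a local HTTP proxy (for example Privoxy) has been set up to forward traffic into Tor.

```python
import urllib.request

# Hypothetical address of a local HTTP proxy that forwards into Tor.
# This value is an assumption for illustration; it does not come from the post.
PROXY_URL = "http://127.0.0.1:8118"


def build_proxied_opener(proxy_url: str = PROXY_URL) -> urllib.request.OpenerDirector:
    """Return an opener that sends HTTP and HTTPS requests via proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch(url: str, opener: urllib.request.OpenerDirector, timeout: float = 30.0) -> bytes:
    """Download one page through the proxied opener."""
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Because only the opener knows about the proxy, the code that requests pages looks the same whether the target is an ordinary website or a TOR site.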
Because our harvest engine can operate through a proxy, we are able to harvest data from the TOR network the same way we harvest data from the rest of the Internet. This means we can crawl the Dark Web at Big Data scale, discovering web pages by following clickable hyperlinks. In addition to link crawling, we can also perform Deep Web searches within the Dark Web. This may sound confusing, but it simply means we can automate queries into the search boxes on Dark Web sites.
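The two techniques described above, link crawling and automated search-box queries, can be sketched in a few lines of Python. This is a simplified illustration, not a description of BrightPlanet's harvester: `LinkExtractor`, `crawl`, `search_url`, and the `fetch_page` callback are all hypothetical names, and the search helper assumes the site's search box submits a plain GET request.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, List
from urllib.parse import urlencode, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url: str, html: str) -> List[str]:
    """Return absolute http(s) links found in html, resolved against base_url."""
    parser = LinkExtractor()
    parser.feed(html)
    absolute = (urljoin(base_url, href) for href in parser.links)
    return [u for u in absolute if urlparse(u).scheme in ("http", "https")]


def crawl(seed: str, fetch_page: Callable[[str], str], max_pages: int = 100) -> List[str]:
    """Breadth-first crawl from seed; returns URLs in the order they were fetched."""
    seen = {seed}
    queue = deque([seed])
    visited: List[str] = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)  # caller decides how pages are downloaded
        except Exception:
            continue  # skip pages that fail to load
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


def search_url(form_action: str, field: str, query: str) -> str:
    """Build the GET URL a search box would submit for query (assumes a GET form)."""
    return f"{form_action}?{urlencode({field: query})}"
```

Passing the page downloader in as `fetch_page` keeps the crawl logic independent of the network path, so the same loop could run over a direct connection or through a proxy.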
What Resides on TOR
At first glance, most people tend to overestimate the total size of the TOR network. To put things in perspective, consider what you already know: the Internet. The Internet that you search and access every day is made up of hundreds of millions of registered domains, and that number is continuously growing.
The exact size of the TOR network is almost impossible to measure, but estimates put it at tens of thousands of domains, not hundreds of millions. The domains on the TOR network typically host content similar to what is found on the rest of the Internet; chat forums, message boards, and blogs are the most common. Alongside these standard websites is the more nefarious activity often associated with anonymity, including e-commerce forums that sell drugs, credit card numbers, hacking services, and weapons, to name a few.
Interested in learning more about the Deep Web vs. the Dark Web?