Deep Web: Advanced
Can you use Surface Web sites to find Deep Web content?
For all practical purposes, no. Surface Web search results are links based on “relevancy by popularity”, ranked by how often documents link to each other (page rank). Thus, the first results you see are the ones that other documents reference most often, and not necessarily the most relevant or recent data. Popular results are typically what you want when searching for a good place to eat, the name of a company that you just heard about, or the capital of South Dakota (Pierre).
On occasion, some Deep Web documents will show up in surface Web search results, but typically only when they have been referenced by other surface Web documents. However, the majority of these documents have a very low page rank and are rarely seen in the results of an average surface Web query.
What is the difference between a BrightPlanet “harvest” and a search engine “search”?
With a standard surface Web search, you do not have access to the content within a document. The only results are the links that the engine returns based on your query. This means that a page could be out of date, or have irrelevant or misleading data.
However, conducting a harvest will provide the actual content that can be processed with analytics, reporting or visualization tools. With a harvest, you don’t have to worry about the content being out of date, irrelevant, or misleading. You can see it all directly in the harvest results.
Additionally, when content is harvested, it resides within one of our Silos. So even if a web page goes offline, or is deleted by its owner, that harvested content is still available to you for your analytical and storage needs. Essentially, you own the content, and we’re able to store it for you or help deliver it to your analytics tools.
BrightPlanet has patented the technology to automate custom queries that target Deep Web sites. These queries reflect your explicit needs and provide highly qualified, relevant results. Well-crafted queries will quickly narrow in on a specific answer without a lot of poking and jabbing (clicking).
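The distinction between a search result (a link) and a harvest (the stored content) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: InMemorySilo is a hypothetical stand-in and not BrightPlanet's Silo, and the URL and text are invented.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class SearchResult:
    """What a search returns: a link only, with no content attached."""
    url: str


@dataclass(frozen=True)
class HarvestedDocument:
    """What a harvest keeps: the content itself plus when it was captured."""
    url: str
    content: str
    harvested_at: datetime


class InMemorySilo:
    """Hypothetical stand-in for a content store, keyed by URL."""

    def __init__(self):
        self._docs = {}

    def store(self, url, content):
        doc = HarvestedDocument(url, content, datetime.now(timezone.utc))
        self._docs[url] = doc
        return doc

    def get(self, url):
        return self._docs.get(url)


# A toy "live web": one invented page.
live_web = {"http://site.example/report": "Quarterly report text"}

# Harvest: copy the content of every live page into the store.
silo = InMemorySilo()
for url, content in live_web.items():
    silo.store(url, content)

# A search result only points at the page.
link = SearchResult("http://site.example/report")

# The page goes offline; the link is now dead, the harvested copy is not.
live_web.clear()
still_available = silo.get(link.url).content
print(still_available)
```

The sketch only shows why a stored copy outlives its source page. It says nothing about how content is actually collected or kept.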
What impact is Social Media having on the Deep Web?
This question taps into the fundamental changes taking place in the character of the Deep Web. Not only is the Deep Web tied to dynamic content housed in databases, it is also tied to the widely dispersed, ever-growing amount of content pushed out from social media sites like Facebook, Twitter, and LinkedIn.
With the number of people participating in social media, the growth of online content is staggering. Think about Twitter. With spikes of over 20,000 Tweets every second, the potential for 1.2 million new individual web pages per minute is very real, and actually a conservative estimate. Now add in Facebook and LinkedIn, and you’re looking at content numbers that were impossible to imagine a few short years ago.
Therefore, social media is having a dramatic impact on the accessibility of information. What’s more, the content is harder to find, just because there’s so much more data to wade through. But at BrightPlanet, we’ve patented the technology to navigate this incredible volume of data, harvest only the relevant content, and keep the actual content (not just the link) readily available in a Silo for your analytical needs.
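The Twitter arithmetic above is easy to check. This short sketch uses only the peak rate quoted in the text; the function and variable names are ours.

```python
PEAK_TWEETS_PER_SECOND = 20_000  # spike rate quoted in the text
SECONDS_PER_MINUTE = 60


def items_per_minute(per_second: int) -> int:
    """Convert a per-second rate into a per-minute rate."""
    return per_second * SECONDS_PER_MINUTE


tweets_per_minute = items_per_minute(PEAK_TWEETS_PER_SECOND)
print(f"{tweets_per_minute:,} Tweets per minute")  # 1,200,000 Tweets per minute
```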
How does Deep Web harvesting reconcile the daily addition and deletion of registered domains?
In terms of raw numbers, the U.S. is seeing about 100,000 new web domains registered daily. This includes domains under .com, .info, .biz, .us, and more. Additionally, anywhere from 40,000 to 70,000 domains go offline daily. So while some attrition happens overall, there is still a net gain of at least 30,000 domains daily.
And remember, we’re only talking about the U.S. Global domain registration represents another huge contributor to the overall volume of content that is added to the Web, and it’s growing daily.
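The net-gain figure follows directly from the U.S. numbers quoted above. A small sketch of the calculation, using only those numbers (the names are ours):

```python
NEW_DOMAINS_PER_DAY = 100_000   # registrations quoted in the text
OFFLINE_PER_DAY_LOW = 40_000    # low end of daily attrition
OFFLINE_PER_DAY_HIGH = 70_000   # high end of daily attrition


def net_daily_gain(registered: int, dropped: int) -> int:
    """Domains added per day after subtracting those that went offline."""
    return registered - dropped


# Worst case for growth: the most domains go offline.
min_gain = net_daily_gain(NEW_DOMAINS_PER_DAY, OFFLINE_PER_DAY_HIGH)
# Best case for growth: the fewest domains go offline.
max_gain = net_daily_gain(NEW_DOMAINS_PER_DAY, OFFLINE_PER_DAY_LOW)

print(f"Net gain per day: {min_gain:,} to {max_gain:,} domains")
# Net gain per day: 30,000 to 60,000 domains
```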
Our harvesting capabilities have been engineered to adapt and work within this dynamic, growing framework. Thanks to our Deep Web Harvester and OpenPlanet platform, we are able to scale with the increasing domain numbers and harvest relevant content as it comes online. Additionally, we can still maintain content in a Silo once a domain is offline, ensuring you still have access to it even though the source is no longer available.
So what can be done to stay on top of these new domains?
One of our pharmaceutical clients was facing the problem of monitoring new domain registrations, specifically new domains that contained references to their proprietary drug brand names. You may well face the same problem in your industry.
We were able to work with this client to monitor and alert them any time a new domain was registered that contained their trademarked drug names. We also monitored for any mentions of drug names in existing domains, so the company was able to track down trademark infringement as it happened, helping them protect their intellectual property and enforce the purity and value of their proprietary formulas.
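The core of this kind of monitoring, checking each newly registered domain against a list of brand names, can be sketched in a few lines. This is an illustration under stated assumptions, not the system described above: the brand names (Zyntrex, Lumavil) and the domains are invented, and a real feed of new registrations would have to come from a registry or a data provider.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def find_brand_domains(new_domains, brand_names):
    """Return (domain, brand) pairs where the domain contains a brand name."""
    brands = [(brand, normalize(brand)) for brand in brand_names]
    hits = []
    for domain in new_domains:
        squashed = normalize(domain)
        for brand, key in brands:
            # Skip empty keys so a blank brand name cannot match everything.
            if key and key in squashed:
                hits.append((domain, brand))
    return hits


# Hypothetical input: today's newly registered domains and the names to watch.
new_domains = [
    "buy-zyntrex-online.example",
    "gardening-tips.example",
    "Cheap-Lumavil.example",
]
brands = ["Zyntrex", "Lumavil"]

alerts = find_brand_domains(new_domains, brands)
for domain, brand in alerts:
    print(f"ALERT: {domain} mentions {brand}")
```

Stripping hyphens and dots before matching catches variants such as "zyn-trex", at the cost of occasional false positives when a brand name happens to span two parts of a domain. Misspellings and look-alike characters would need fuzzier matching than this substring test.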
What is the difference between structured and unstructured content?
Structured content is that which is contained in table format, with columns and rows that correspond to direct and discrete pieces of data.
Unstructured content, in contrast, is that which is not contained in table format. This can be a PDF file, an entire web page, emails, a Word document, or really any type of content that does not have fields identifying specific data points.
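A small sketch may make the contrast concrete. The company names and years are invented for illustration. The same facts appear twice: once as rows with named fields, once as free text.

```python
import re

# Structured: each row has named fields, so a value is read by its column.
structured_rows = [
    {"company": "Acme Corp", "city": "Pierre", "founded": 1999},
    {"company": "Globex", "city": "Sioux Falls", "founded": 2004},
]
cities = [row["city"] for row in structured_rows]

# Unstructured: the same facts as prose, with no fields to look up.
unstructured_text = (
    "Acme Corp, founded in 1999, is based in Pierre. "
    "Globex opened in Sioux Falls in 2004."
)
# Without fields, values have to be recovered by pattern matching,
# here a simple pattern for four-digit years starting with 19 or 20.
years = re.findall(r"\b(?:19|20)\d{2}\b", unstructured_text)

print(cities)  # ['Pierre', 'Sioux Falls']
print(years)   # ['1999', '2004']
```

Reading the city is a direct lookup in the structured case. In the unstructured case even the year needs a pattern, and nothing in the text says which company each year belongs to.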
What is “enrichment” of content?
At BrightPlanet, enrichment is the process of taking unstructured data and making it semi-structured, thereby making it easier to push through analytical tools.
Our enrichment process involves augmenting the unstructured data with metadata, which is then delivered in whatever format our customers need, including structured databases. These metadata fields can be used as filters for faceted content searching, and with our third-party integrations we can expand that capability to any specific criteria you may have (names, companies, places, dates, etc.).
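As a rough illustration of the idea, and not of BrightPlanet's actual process, the sketch below attaches metadata to raw text and then filters on it. The lookup lists, the year pattern, and every name in the code are assumptions. A production system would use proper entity extraction rather than fixed lists.

```python
import re
from dataclasses import dataclass, field

# Hypothetical lookup lists standing in for real entity extraction.
KNOWN_COMPANIES = ["Acme Corp", "Globex"]
KNOWN_PLACES = ["Pierre", "Sioux Falls"]


@dataclass
class EnrichedDocument:
    """Original text plus the metadata derived from it."""
    text: str
    metadata: dict = field(default_factory=dict)


def enrich(text: str) -> EnrichedDocument:
    """Augment unstructured text with simple metadata fields."""
    metadata = {
        "companies": [c for c in KNOWN_COMPANIES if c in text],
        "places": [p for p in KNOWN_PLACES if p in text],
        "years": re.findall(r"\b(?:19|20)\d{2}\b", text),
    }
    return EnrichedDocument(text=text, metadata=metadata)


def filter_by_facet(docs, facet, value):
    """Faceted filtering: keep documents whose metadata field contains value."""
    return [d for d in docs if value in d.metadata.get(facet, [])]


raw_documents = [
    "Acme Corp opened a new office in Pierre in 2012.",
    "Globex reported record sales in Sioux Falls.",
    "A general note with no recognizable names.",
]
enriched = [enrich(text) for text in raw_documents]

pierre_docs = filter_by_facet(enriched, "places", "Pierre")
print(len(pierre_docs))  # 1
print(pierre_docs[0].metadata)
```

The original text stays intact and the metadata sits alongside it, which is what makes the result semi-structured: the fields can be filtered like columns, while the full content remains available for analysis.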