How Deep Web Harvesting Isn’t Your Traditional Web Extraction

At BrightPlanet, we receive a number of questions about how BrightPlanet’s technology differs from that of our biggest competitors. People commonly see companies like Kapow and Connotate and assume that our technologies are in direct competition. In reality, though, they’re not.

In this post, we hope to give you an understanding of why extraction companies and BrightPlanet’s harvesting technology don’t compete, contrary to what one might think, and to explore the advantages of each technology.

Extraction Companies

Job Posting Page from Microsoft

Let’s start by explaining what extraction technology is and who’s competing in the space. We would classify both Connotate and Kapow Software as extraction companies; they are much closer to being competitors of each other than either company is to BrightPlanet. Both Connotate and Kapow have built user interfaces that allow their end-users to automate the collection of specific components from a page. In Kapow’s Katalyst you build robots, while with Connotate’s technology you construct agents. The key thing to note about extraction companies is that they use the structure of the web page to help deliver you results.

To understand how these extraction engines work, consider this job posting page from Microsoft for a Senior Software Development Engineer. Connotate’s and Kapow’s technologies both treat the page as a structured web page, much as you would see it in a browser. To extract items from this page, you use the extraction engine to specify which portions of text you want to access. You’ll only be able to extract clean data that is considered structured on the page, so you’ll likely highlight the job title, the product, the job category, the location, and the division. Your output from an extraction engine would look like this:

Output of Extraction from Page.

Note that the only items included in the output are the items you specify to be taken from the page. It’s also very difficult for extraction engines like this to work with full paragraphs of text.
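
To make the idea concrete, here is a minimal Python sketch of structure-based extraction. It is not how Connotate or Kapow work internally; it only illustrates the general approach of picking out named elements from a page. The markup, the element ids, and the location and division values are all made up for the example.

```python
from html.parser import HTMLParser

# Hypothetical markup and values: a real job page uses different tags and ids.
SAMPLE_PAGE = """
<html><body>
  <h1 id="job-title">Senior Software Development Engineer</h1>
  <span id="location">Example City</span>
  <span id="division">Example Division</span>
  <div id="description"><p>Several paragraphs of free text ...</p></div>
</body></html>
"""


class FieldExtractor(HTMLParser):
    """Collect the text of elements whose id is in a configured set."""

    def __init__(self, wanted_ids):
        super().__init__()
        self.wanted_ids = set(wanted_ids)
        self.current = None   # id of the element currently being captured
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        if element_id in self.wanted_ids:
            self.current = element_id
            self.fields[element_id] = ""

    def handle_data(self, data):
        if self.current is not None:
            self.fields[self.current] += data

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the capture, so only
        # flat, non-nested fields are captured completely.
        self.current = None


def extract_fields(html, wanted_ids):
    """Return only the fields the user explicitly asked for."""
    parser = FieldExtractor(wanted_ids)
    parser.feed(html)
    return {key: value.strip() for key, value in parser.fields.items()}


print(extract_fields(SAMPLE_PAGE, ["job-title", "location", "division"]))
```

Every field has to be configured by hand for each site, and the free-text description is left behind unless you ask for it explicitly.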

Extraction Versus Harvesting

Now let’s take a look at what harvesting this page would look like through BrightPlanet’s harvesting process. When BrightPlanet sees a page, we focus on extracting all of the relevant text of the page. End-users don’t have to specify which components to extract because we pull all of the relevant text from each page. This saves you a tremendous amount of time, as no customization is needed for each source. Once we harvest the text, it appears like this to our machines:

Unstructured Text

You’ll see the document appears as completely unstructured text. To get it into a usable format, we then extract the key terms of interest using a rule-based entity extraction engine that analyzes the full body of text instead of the structure of the HTML page. It is this concept of harvesting all the text from the page and analyzing the text itself that separates us from traditional extraction companies.
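
The two steps described above, harvesting all of the text and then running rules over it, can be sketched roughly as follows. This is an illustrative Python sketch, not BrightPlanet’s engine: the `harvest_text` and `tag_entities` helpers and the two regular-expression rules are assumptions chosen for the example.

```python
import re
from html.parser import HTMLParser


class TextHarvester(HTMLParser):
    """Keep all visible text and ignore the page structure."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def harvest_text(html):
    """Return the page as one block of unstructured text."""
    harvester = TextHarvester()
    harvester.feed(html)
    harvester.close()
    return " ".join(harvester.chunks)


# Hypothetical rules: each label maps to a pattern run over the full text.
RULES = {
    # The lookarounds keep "Java" from matching inside "JavaScript".
    "programming_language": re.compile(
        r"(?<![\w+#])(Java|Python|C\+\+|C#)(?![\w+#])"
    ),
    "degree": re.compile(r"\b(Bachelor's|Master's|PhD)\b"),
}


def tag_entities(text, rules=RULES):
    """Apply every rule to the text and return the sorted unique matches."""
    return {
        label: sorted({match.group(1) for match in pattern.finditer(text)})
        for label, pattern in rules.items()
    }


page = (
    "<html><body><h1>Senior Software Development Engineer</h1>"
    "<p>Requires a Bachelor's degree and experience with C++ or Java.</p>"
    "</body></html>"
)
print(tag_entities(harvest_text(page)))
```

Because the rules run over plain text, the same rule set applies to any source without per-site configuration.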

The output for the end-user can look exactly the same as the previous output shown for extraction engines, but we can also include all of the text extracted from the page, as well as develop our own rules to tag items such as programming languages, degree requirements, etc. Imagine this additional tagging applied to all of Microsoft’s job postings. You’d be able to answer the following questions about Microsoft.

  • What are the most common programming languages that Microsoft has been hiring for in the last month?
  • What new initiatives are underway by Microsoft?
  • Where is Microsoft listing the most new jobs?
  • Where is Microsoft looking at expanding?

Now imagine having access to data across all Fortune 500 companies, not limiting yourself to simply one company. You’d be able to answer questions industry wide, such as:

  • What degrees/qualifications/certifications/languages are the most sought after by Fortune 500 Companies?
  • Who’s hiring the most Java Programmers, Data Scientists, etc.?
  • Who is lagging behind in new initiatives and who’s ahead of the curve?
  • What’s the average number of years of experience required for job postings by Fortune 500 companies?
  • Who is looking for entry-level employees, and who may be losing seasoned employees?
  • What trends can we see when comparing job listings to stock price?
  • Can we use job postings for financial intelligence to predict the success of a company?

It’s practically impossible to answer these questions with a traditional web extraction engine, but possible with a full harvest engine like BrightPlanet.
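
Once every posting carries tags like the ones described earlier, many of these questions reduce to simple counting. The following Python sketch shows the idea under stated assumptions: the record layout, the company names, and the data are all made up for illustration and do not reflect any real hiring figures.

```python
from collections import Counter

# Made-up tagged postings; a real data set would come from the harvest.
POSTINGS = [
    {"company": "Company A", "languages": ["Java", "C#"]},
    {"company": "Company A", "languages": ["Java", "Python"]},
    {"company": "Company B", "languages": ["Java"]},
    {"company": "Company B", "languages": ["Python"]},
    {"company": "Company A", "languages": ["Java"]},
]


def most_common_languages(postings, top_n=3):
    """Count how many postings mention each language."""
    counts = Counter(
        language for posting in postings for language in posting["languages"]
    )
    return counts.most_common(top_n)


def top_hirer_for(postings, language):
    """Return the company with the most postings that mention the language."""
    per_company = Counter(
        posting["company"]
        for posting in postings
        if language in posting["languages"]
    )
    return per_company.most_common(1)[0][0] if per_company else None


print(most_common_languages(POSTINGS))
print(top_hirer_for(POSTINGS, "Java"))
```

The same pattern extends to other tags, such as degrees or years of experience, as long as a rule exists to pull them out of the text.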

Additional Advantages of Whole Text

Completing extraction through analysis of the whole text instead of the HTML has some other major advantages, which include:

  • Accessing and storing all the text from harvested web pages rather than specific items.
    • BrightPlanet stores all the text from each of the web pages it harvests; this allows for version control among web pages and also allows us to reanalyze the stored data for items we may have missed without having to harvest the pages again.
  • Scaling more easily across multiple web sources or changing pages.
    • Relying on the page structure to complete the extraction (Kapow/Connotate’s process) requires end-users to manually configure each site and choose the specific items on the page that require extraction. By simply extracting all of the text, BrightPlanet can harvest content at much larger scale without custom scripting for each individual source.
  • Harvesting unstructured content from the web.
    • BrightPlanet can harvest content from pages that deliver primarily unstructured data. Examples include news stories and blogs.
  • Analyzing and extracting from multiple file types.
    • Extraction engines that rely on extracting key components from web pages will struggle with accessing and collecting content from other online file types such as Word documents, PowerPoint presentations, text files, etc.
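
The first advantage in the list, storing all of the text so it can be versioned and reanalyzed, can be sketched as follows. This is a minimal in-memory Python illustration, not BrightPlanet’s storage design: the `TextStore` class, the example URL, and the certification rule are hypothetical.

```python
import hashlib
import re


class TextStore:
    """Keep every distinct version of each page's harvested text."""

    def __init__(self):
        self.versions = {}  # url -> list of (sha256 digest, text)

    def save(self, url, text):
        """Store the text; return True only if it differs from the last version."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        history = self.versions.setdefault(url, [])
        if history and history[-1][0] == digest:
            return False  # page unchanged since the last harvest
        history.append((digest, text))
        return True

    def reanalyze(self, pattern):
        """Run a new rule over the latest stored text without re-harvesting."""
        return {
            url: sorted(set(pattern.findall(history[-1][1])))
            for url, history in self.versions.items()
        }


store = TextStore()
url = "https://example.com/jobs/1"  # placeholder URL
store.save(url, "Requires Java. PMP preferred.")
store.save(url, "Requires Java and Python. PMP preferred.")

# A rule written after the harvest still works on the stored text.
certifications = re.compile(r"\b(?:PMP|CISSP)\b")
print(store.reanalyze(certifications))
```

An engine that kept only the selected fields would have to revisit every page to answer a question nobody thought to configure in advance.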

The additional data points and the sophistication of the harvesting process BrightPlanet has developed allow us to offer our Data-as-a-Service model, which helps answer tougher questions than a traditional extraction engine can.

Are you looking to learn how to tap into BrightPlanet’s harvesting and enrichment process through Data-as-a-Service? Request a free demo from our Data Acquisition Engineers today.


Featured Image: Bruce Fingerhood