Welcome to the second post in BrightPlanet’s three-part series, which follows completely unformatted, unstructured web pages through the three-step process that turns data into actionable intelligence.
- Stage 1 – Harvesting
- Stage 2 (this post) – Normalization / Enrichment
- Stage 3 – Reporting and Analytics
In our last post, we covered the first stage, harvesting, and described how BrightPlanet harvested over 100,000 news articles from the top 50 newspapers using the Deep Web Harvester. In this post we’ll talk about the second stage: normalization and enrichment.
Two issues arise when trying to extract information from harvested web data for analysis:
- Today’s web content exists in numerous formats
- Machines cannot make sense of unstructured textual data on their own
The solution to both of these issues can be found in BrightPlanet’s normalization and enrichment process.
The normalization stage is the first step in preparing harvested data for analysis. Data online exists in many different formats (PDF, PowerPoint, HTML, etc.). To compare and store these documents, all of the relevant text needs to be extracted from the harvested data types and stored in one uniform database.
Normalization is complex because data on the web today is so varied. Hundreds of textual formats exist on the web, and they appear in numerous character encodings (UTF-8, ASCII, etc.). BrightPlanet’s process takes these hundreds of formats and encodings and converts them into one uniform format (normalizes them) for analysis.
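As a rough illustration of the encoding side of this problem, here is a minimal Python sketch. It is not BrightPlanet’s implementation: the function name `normalize_text`, the fallback order of encodings, and the choice of NFC normalization are all assumptions made for the example.

```python
import unicodedata
from typing import Optional


def normalize_text(raw: bytes, declared_encoding: Optional[str] = None) -> str:
    """Decode harvested bytes and return one uniform Unicode string.

    Hypothetical helper: the fallback order below is an assumption.
    """
    # Try the declared encoding first (if any), then common fallbacks.
    candidates = [declared_encoding, "utf-8", "cp1252"]
    text = None
    for encoding in candidates:
        if not encoding:
            continue
        try:
            text = raw.decode(encoding)
            break
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown encoding name: try the next candidate.
            continue
    if text is None:
        # latin-1 maps every byte to a character, so this cannot fail.
        text = raw.decode("latin-1")
    # NFC makes visually identical strings compare equal.
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of whitespace (including newlines) into single spaces.
    return " ".join(text.split())
```

Whatever encoding a page arrived in, the caller gets back the same kind of string, which is what makes documents comparable and storable in one place. A production system would also need format-specific text extraction for PDF, PowerPoint, and so on, which this sketch does not attempt.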
Once data is normalized, BrightPlanet’s enrichment stage gives it additional structure in the form of metadata. Metadata is data that describes other data; it provides information about an item’s content.
HTML pages already carry a good deal of metadata about the site itself in their code (domain name, keywords, etc.). This information is useful, but on its own it cannot tell a compelling story. To supplement it, BrightPlanet performs additional extraction on the harvested text.
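To make the idea of page-level metadata concrete, the sketch below pulls a few fields out of an HTML page using only Python’s standard library. The class `MetaExtractor`, the function `extract_metadata`, and the choice of fields are hypothetical, and taking the domain from the page URL is an assumption of the example rather than a description of BrightPlanet’s pipeline.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class MetaExtractor(HTMLParser):
    """Collects the <title> text and named <meta> tags from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            # Meta names are case-insensitive in practice, so store them lowercased.
            self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(url, html):
    """Return a small metadata record for one harvested page."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    keywords = parser.meta.get("keywords", "")
    return {
        "domain": urlparse(url).netloc,
        "title": parser.title.strip(),
        "keywords": [k.strip() for k in keywords.split(",") if k.strip()],
        "description": parser.meta.get("description", ""),
    }
```

A record like this describes where a page came from and how its publisher labeled it, but says little about what the text actually discusses, which is the gap that extraction on the text itself is meant to fill.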
To perform entity extraction, BrightPlanet partners with the entity extraction technology company IMT Holdings and its Rosoka platform. Rosoka not only allows BrightPlanet to automatically tag names of people, companies, and places mentioned within documents, but also lets BrightPlanet write its own custom tagging rules.
BrightPlanet has written rules to automatically tag over 50 different entity types. The ability to write custom entity extraction rules is especially important when trying to help companies answer very specific questions. Some examples of custom rules created by BrightPlanet include:
- job titles
- prescription drug names
- prescription drug dosage amounts
- job qualifications
- chemical numbers
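To give a feel for what a custom tagging rule does, here is a simplified Python sketch built on regular expressions. It does not use Rosoka or its rule syntax; the patterns, the entity labels, the function `tag_entities`, and the reading of “chemical numbers” as CAS Registry Numbers are all assumptions made for illustration.

```python
import re

# Assumed pattern: a number followed by a common dosage unit, e.g. "500 mg".
DOSAGE_RE = re.compile(r"\b\d+(?:\.\d+)?\s?(?:mg|mcg|ml|g|iu)\b", re.IGNORECASE)
# CAS Registry Number layout: 2-7 digits, 2 digits, 1 check digit.
CAS_RE = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")


def valid_cas(match):
    """Check the CAS check digit to reduce false positives."""
    digits = match.group(1) + match.group(2)
    # Weight digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def tag_entities(text):
    """Return (entity_type, matched_text, start, end) tuples in text order."""
    tags = []
    for m in DOSAGE_RE.finditer(text):
        tags.append(("dosage", m.group(0), m.start(), m.end()))
    for m in CAS_RE.finditer(text):
        if valid_cas(m):
            tags.append(("cas_number", m.group(0), m.start(), m.end()))
    return sorted(tags, key=lambda t: t[2])
```

Calling `tag_entities("Take 500 mg of aspirin (CAS 50-78-2) twice daily.")` tags `500 mg` as a dosage and `50-78-2` as a CAS number. Real extraction rules have to cope with far more variation in wording and context than two patterns can capture, so this should be read only as the general shape of the idea.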
A Visual Example of Normalization and Enrichment
The images below show a web page (top) and what it looks like once it has been normalized and its entities extracted (bottom). The bottom image shows the web page in Rosoka’s Document viewer; the highlighted text marks the entities that were extracted from the text of the job posting.
Final Stage: Reporting and Analytics
Stay tuned next week to learn more about analytics and reporting of harvested and enriched data.
Need Actionable Intelligence?
Interested in how BrightPlanet can help you turn data from the Deep Web into actionable intelligence? Check out our Deep Web solutions that can be tailored for you.