When making sense of a web page’s raw text, one of the most useful pieces of metadata is the “publish date.” Assigning dates to web content attributes each document, and all other pieces of intelligence found within that document, to a specific time period. This helps the data analyst quickly drill down into the data by date without having to manually analyze a web page.
The problem is there are often many dates on any given web page. How do you find the publish date?
Most news and blog websites have a publish date at the top of every posted article, under the author’s name. Message board websites usually have many dates on a single web page: one publish date (and often also a user join date) for each post. Academic articles vary, with the date at either the beginning or the end of the document. Many general web pages have no date at all within the text.
Attributing the Harvest Date
Assigning dates to web content is easy when you have a known target list of websites and continually refreshing data harvests. At BrightPlanet, we populate a ‘harvest date’ field in our database for each document harvested. As long as harvests are refreshed at least every 24 hours to capture new documents, the harvest date and publish date will have the same value.
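As a rough illustration of the harvest date idea, here is a minimal Python sketch. The `record_harvest` function and the in-memory `seen` dictionary are hypothetical stand-ins for a real document store, not BrightPlanet’s actual schema. The sketch only shows the rule that a document keeps the date on which it was first captured.

```python
from datetime import datetime, timezone

def record_harvest(seen, url, now=None):
    """Return the harvest date for url, storing it the first time the URL is seen.

    `seen` is a hypothetical in-memory store mapping URL -> first harvest time.
    """
    now = now or datetime.now(timezone.utc)
    # setdefault only writes when the URL is new, so re-harvesting a known
    # document does not overwrite the date it was first captured.
    return seen.setdefault(url, now)
```

With daily harvests, the stored value for a newly published page should fall within about a day of its publish date. For pages that already existed before the first harvest, the stored value says nothing about when they were published.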
The challenge with attributing publish dates rears its ugly head when harvesting mass quantities of old data. If you started a data scrape from the archives of a news website and crawled all of the links, how would you attribute a publish date to each of those links?
This issue is brought up at some point in almost every BrightPlanet project. If a client only requires data from a few different websites, it’s easy to manually analyze each site. This allows you to determine the correct date location in the web page and tag that date with a custom property. A harvesting scope spanning more than 10 websites (many BrightPlanet projects harvest 100+ sources) requires a more automated process for finding publish dates.
The Challenge with Assigning Publish Dates to Web Content
The actual tagging of a date entity isn’t difficult; there are many examples of text parsing software or regular expressions which can detect date/time syntax. Knowing which date is the correct publish date is the challenge.
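As a minimal sketch of what regular-expression date tagging can look like, the Python below recognizes just two formats (ISO dates such as 2015-03-07 and long English dates such as March 5, 2015). The pattern, the `find_dates` name, and the choice of formats are assumptions made for illustration. Real pages use many more formats.

```python
import re
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
MONTH_NUM = {name: i for i, name in enumerate(MONTHS, start=1)}

# Two illustrative formats only: 2015-03-07 and March 5, 2015.
DATE_RE = re.compile(
    r"\b(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})\b"
    r"|\b(?P<mon>" + "|".join(MONTHS) + r") (?P<day>\d{1,2}), (?P<year>\d{4})\b"
)

def find_dates(text):
    """Return (offset, date) pairs for every valid date, in document order."""
    found = []
    for m in DATE_RE.finditer(text):
        try:
            if m.group("y"):
                d = date(int(m["y"]), int(m["m"]), int(m["d"]))
            else:
                d = date(int(m["year"]), MONTH_NUM[m["mon"]], int(m["day"]))
        except ValueError:
            continue  # matched the syntax but is not a real date, e.g. 2015-13-45
        found.append((m.start(), d))
    return found
```

Returning the character offset alongside each date keeps the document order available, which matters once you have to decide which of several dates is the publish date.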
We’ve found that if there is a date within the sanitized text (the text left after unrelated page elements are stripped), the first date found is the correct publish date about 70 percent of the time. If we instead analyze the full text on the page, the percentage decreases significantly.
This occurs because errant dates frequently appear in web page headers, footers, sidebars, and other boilerplate content unrelated to the article. The most common example of an incorrect first date is a web page that displays the current time at the top of the page.
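To make the effect concrete, here is a small Python sketch. The page text is invented, and the `first_date` helper and its ISO-only pattern are assumptions for the example rather than a description of BrightPlanet’s pipeline. It applies the same "first date wins" rule to the full page text and to a version with the header and footer removed.

```python
import re
from datetime import date

ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")

def first_date(text):
    """Return the first valid ISO date in text, or None if there is none."""
    for m in ISO_DATE.finditer(text):
        try:
            return date(*map(int, m.groups()))
        except ValueError:
            continue  # skip strings like 2016-13-45 that are not real dates
    return None

# Hypothetical page: a clock in the header precedes the article byline.
full_text = (
    "Current time: 2016-04-18 09:30\n"
    "Home | News | Sports\n"
    "City council approves budget\n"
    "By Jane Doe | 2016-04-12\n"
    "The council voted on Tuesday to approve the budget.\n"
    "Copyright 2016"
)

# The same page with header and footer lines stripped (done by hand here).
sanitized_text = (
    "City council approves budget\n"
    "By Jane Doe | 2016-04-12\n"
    "The council voted on Tuesday to approve the budget."
)

print(first_date(full_text))       # picks up the header clock
print(first_date(sanitized_text))  # picks up the byline date
```

On the full text, the rule returns the header clock’s date. On the stripped text, it returns the byline date.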
For this and other reasons, unrelated text needs to be stripped. For example, the “Top Stories” sidebar on a news website features the same story titles on every web page. This creates many false positive results in a keyword search.
BrightPlanet uses the ‘boilerpipe’ Java library, implemented in a custom text-processing stage, to extract only the relevant text. Once irrelevant text is purged, the first date in the remaining text can be taken as the publish date with a high degree of accuracy. Unfortunately, the ‘boilerpipe’ stage will occasionally omit the publish date from what it deems relevant text.
Get Date Information from Web Data
Is the date web content was created important to you? We can help you find and analyze that data. Talk to one of our data acquisition engineers to discuss the web data you’re using and how we can help extract the dates.