Data Harvesting From News Stories in Near Real-Time

Recently, a client asked us if we could pull details about car accidents from news stories for a quick proof-of-concept project. They wanted to see if we could extract the names and ages of those involved, the location of the accident, and a time value for when the accident occurred.

Because we hadn’t tagged the entities (key words and information) they were looking for before, we grabbed a few random news articles mentioning car accidents and ran them through our entity extraction solution, Rosoka, to see if we could deliver what they were looking for. Learn what we were able to find and deliver for this client.

The Obstacles in Pulling Details from News Stories

Carrying out this task presented some obstacles. Take the following two news stories for example:

A person can easily scan documents like these and spot entity data without much trouble. Doing this for 10,000 documents per day, however, is far more than any one person can handle.

For this project, there were two main issues to address:

  1. How do you zero in on the exact documents of interest in near real-time?
  2. How do you tag entity data like a person’s name, location, and severity of the accident as written by news media?
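To make the two issues concrete, here is a deliberately naive sketch in Python. It is not how Rosoka or our data feed works; the keyword list, the name-and-age pattern, and the function names are all hypothetical and chosen only for illustration. It filters stories with a keyword match and pulls "Name, age," mentions with a regular expression.

```python
import re
from typing import Dict, List

# Hypothetical keyword filter; a real feed would use far richer signals.
ACCIDENT_TERMS = re.compile(r"\b(crash|collision|accident|wreck)\b", re.IGNORECASE)

# Naive pattern for "First Last, 34" style mentions common in news copy.
NAME_AGE = re.compile(r"\b([A-Z][a-z]+(?: [A-Z][a-z]+)+), (\d{1,3})\b")


def is_accident_story(text: str) -> bool:
    """Issue 1: decide whether a story is a document of interest."""
    return ACCIDENT_TERMS.search(text) is not None


def extract_people(text: str) -> List[Dict[str, object]]:
    """Issue 2: tag person names and ages as written by the news media."""
    return [
        {"name": m.group(1), "age": int(m.group(2))}
        for m in NAME_AGE.finditer(text)
    ]


story = "Police said John Smith, 34, was injured in a two-car crash on Route 9."
if is_accident_story(story):
    print(extract_people(story))  # [{'name': 'John Smith', 'age': 34}]
```

A pattern like this breaks quickly on real copy (hyphenated names, ages written as words, names without ages), which is why a dedicated entity extraction engine is the better tool at this volume.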

Managing Data with the Global News Data Feed

Our Global News Data Feed is the perfect fit to address both of these issues.

The data feed continuously and efficiently harvests around 10,000 news sources from around the world. By default, we extract over 15 unique entity types from every harvested story. For this project, we also set up new custom entities to tag.

After setting up a few new harvest configurations and performing entity tagging, we found that the results were unique and impressive. The project specifics are confidential, but we can say that the entity data needed is extremely time-sensitive, so combining our depth in harvesting with Rosoka’s entity extraction was the perfect fit for this client’s needs.

We had a base system up and running for our client within hours. Now the client can take that data and apply their own “secret sauce” to leverage a brand new way to process open source news data.

We will share more details on this solution as the project moves forward and is commercialized.