Harvesting and Enriching Web Data in Multiple Languages

Because we are in the business of harvesting content from the web and delivering it in a usable format, we are often asked whether we can deliver data in multiple languages. The short answer is yes: we can harvest and enrich online content in any language. Today we’ll look more closely at how we harvest data in any language and then work with our partners to enrich that content and make it usable for analysis.

Let’s get started!


The first step in making usable data from the web is harvesting the data itself. We’ve discussed our different harvest types in previous white papers. The process for harvesting text from a page in English is the same as the process for harvesting text in any other language.

Consider these two news articles about the recent polio outbreaks, one from the Arabian Business News Journal in Arabic and the other from the Trade Arabia site in English. The harvesting process for both is exactly the same: we navigate to each page, store all of its text, and archive it. This works for any language that can be written online using characters of some sort.
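To make the point concrete, here is a minimal sketch of a language-agnostic harvest step in Python. It is not our production harvester: the `TextExtractor` class, the `harvest` and `archive` helpers, the sample HTML, and the example.com URLs are all hypothetical, and the pages are supplied as strings rather than fetched. The thing to notice is that nothing in the code branches on language.

```python
from html.parser import HTMLParser
import json


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def harvest(url, html):
    """Build one archive record; the code path is identical for every language."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {"url": url, "text": " ".join(parser.chunks)}


def archive(records):
    # ensure_ascii=False keeps Arabic (or any other script) readable in the archive.
    return json.dumps(records, ensure_ascii=False)


# Hypothetical sample pages standing in for the two articles.
EN_HTML = (
    "<html><body><h1>Polio outbreak</h1><script>var x = 1;</script>"
    "<p>The Horn of Africa is experiencing an outbreak.</p></body></html>"
)
AR_HTML = "<html><body><p>عودة ظهور مرض شلل الأطفال</p></body></html>"

records = [
    harvest("https://example.com/en", EN_HTML),
    harvest("https://example.com/ar", AR_HTML),
]
archived = archive(records)
```

Both pages pass through the same `harvest` function, and the archived JSON keeps the Arabic text intact because Python strings are Unicode throughout.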

The more challenging question is how to analyze two documents in two different languages and compare them.


We’ve discussed in previous blog posts how we typically enrich unstructured data through entity extraction. You’ll find that the process for extracting entities from multiple languages is quite similar. We’ll use the two articles mentioned above as examples.

The original text from the Trade Arabia site:

The Horn of Africa (northeast Africa) is currently experiencing an outbreak of wild polio virus type 1 (WPV1), according to a statement.

The text from the Arabian Business site in Arabic:

وسجل في العام  2013  قرابة 85 حالة أي بزيادة 40 عن العام الذي سبقه،  وانتشرت أنباء عن عودة ظهور مرض شلل الأطفال في دول عديدة مثل كل من سوريا والعراق، بعد انقراض هذا المرض منذ 14 عاما وذلك عن طريق مقاتلين م طالبان قدموا من باكستان وأفغانستان بحسب صحيفة الغارديان

You’ll notice that the disease polio is mentioned in both excerpts. If your Arabic is a little rusty, the mentions are “polio” in the English text and “شلل الأطفال” in the Arabic text. Our third-party natural language processing (NLP) technology, Rosoka, detects key entities in multiple languages and can extract both mentions of the disease without performing any full machine translation. Not surprisingly, this leads to major performance and scalability advantages when working with very large data sets across multiple languages.

The other unique characteristic of the technology is its ability to normalize multiple extracted tags into one instance, regardless of the language they were extracted from. For example, even though polio is mentioned in English in one excerpt and in Arabic in the other, our extraction technology knows that the two tags refer to the same disease. It creates a common link between the two articles by normalizing both mentions to simply polio rather than keeping a separate tag for each language. You’ll see both harvested documents below with the disease normalized to polio.


This concept greatly increases the accuracy of the data across languages. It also empowers analysts who may have limited language capabilities to analyze data across multiple data sets spanning multiple languages, all without ever having to introduce full machine translation into the workflow.
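The normalization idea can be illustrated with a toy Python sketch. This is not Rosoka’s API or algorithm: it swaps in a simple lookup table (a gazetteer) in place of real NLP, and the `GAZETTEER` table, the function names, and the document ids are all hypothetical. It only shows how mentions in different languages can resolve to one shared tag without translating the documents.

```python
from collections import defaultdict

# Hypothetical gazetteer: surface form in any language -> canonical entity.
GAZETTEER = {
    "polio": "polio",
    "شلل الأطفال": "polio",
}


def extract_entities(text):
    """Return the canonical entities whose surface forms appear in the text."""
    lowered = text.lower()
    return {
        canonical
        for surface, canonical in GAZETTEER.items()
        if surface in lowered
    }


def link_documents(docs):
    """Map each canonical entity to the sorted ids of documents mentioning it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for entity in extract_entities(text):
            index[entity].add(doc_id)
    return {entity: sorted(ids) for entity, ids in index.items()}


# Excerpts from the two articles quoted earlier in this post.
docs = {
    "trade_arabia_en": (
        "The Horn of Africa (northeast Africa) is currently experiencing "
        "an outbreak of wild polio virus type 1 (WPV1), according to a statement."
    ),
    "arabian_business_ar": (
        "وانتشرت أنباء عن عودة ظهور مرض شلل الأطفال في دول عديدة مثل كل من سوريا والعراق"
    ),
}

links = link_documents(docs)
```

Here `links` maps the single tag `polio` to both documents, so an analyst searching for polio would find the Arabic article without reading Arabic. A real extraction engine does far more than substring matching, such as handling inflection and ambiguity, but the linking step it enables looks much like this.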

Want to learn more about how you can harvest multiple data sources in any language? Request a demo today from one of our Data Acquisition Engineers and learn how to tap into Big Data-as-a-Service.



Photo: Borkur Sigurbjornsson