Scalable Web Data Extraction for Online Market Intelligence

Authors: 
Baumgartner, Robert; Gottlob, Georg; Herzog, Marcus
Year: 
2009
Citations: 
21
Citations range: 
10 - 49
Attachment: 
vldb09-1075.pdf (1.02 MB)

Online market intelligence (OMI), in particular competitive
intelligence for product pricing, is a very important application
area for Web data extraction. However, OMI presents
non-trivial challenges to data extraction technology. Sophisticated
and highly parameterized navigation and extraction
tasks are required. On-the-fly data cleansing is necessary in
order to identify identical products from different suppliers.
It must be possible to smoothly define data flow scenarios
that merge and filter streams of extracted data stemming
from several Web sites and store the resulting data
into a data warehouse, where the data is subjected to market
intelligence analytics. Finally, the system must be highly
scalable, so that it can extract and process massive
amounts of data in a short time. Lixto (www.lixto.com),
a company offering data extraction tools and services, has
been providing OMI solutions for several customers. In this
paper we show how Lixto has tackled each of the above challenges
by improving and extending its original data extraction
software. Most importantly, we show how high scalability
is achieved through cloud computing. This paper also
features a case study from the computers and electronics
market.