Web Data Extraction from Flat and Nested Data Records

- Automatic Wrapper Generation

Textbook, 2nd Edition: Web Data Mining - Exploring Hyperlinks, Contents and Usage Data, 622 pages

This project studies automatic (and semi-automatic) extraction of structured data from Web pages. Each such page may contain one or more groups of structured data records, which may be flat or nested. These data records are usually retrieved from underlying databases and displayed on Web pages using fixed templates. Examples of such data records include products sold online, research publications, job postings, customer reviews, and many more. The objective of this project is to design and experiment with novel algorithms that extract such data and store them in database tables. We have proposed several fully automatic approaches based on tree matching and a semi-automatic, instance-based learning approach.
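To give a feel for what "tree matching" means here, the sketch below implements Simple Tree Matching, a classic dynamic-programming way to count how many nodes of two tag trees can be matched while preserving order. It is only an illustration under stated assumptions, not necessarily the exact procedure used in the systems: the `Node` class, the sample records, and the normalization in `similarity` are hypothetical choices made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A hypothetical, minimal tag-tree node (tag name plus ordered children)."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the number of node pairs in a maximum order-preserving matching."""
    # Trees whose roots carry different tags do not match at all.
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # table[i][j] = best matching between the first i subtrees of a
    # and the first j subtrees of b.
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = max(
                table[i][j - 1],
                table[i - 1][j],
                table[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    # Add 1 for the matched roots.
    return table[m][n] + 1


def size(node: Node) -> int:
    """Count the nodes in a tree."""
    return 1 + sum(size(child) for child in node.children)


def similarity(a: Node, b: Node) -> float:
    """Matching score divided by the average tree size (an assumed normalization)."""
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))


# Two hypothetical product records rendered from the same template;
# the second one lacks the optional <span>.
record_1 = Node("tr", [Node("td", [Node("img")]),
                       Node("td", [Node("a"), Node("span")])])
record_2 = Node("tr", [Node("td", [Node("img")]),
                       Node("td", [Node("a")])])

print(simple_tree_matching(record_1, record_2))  # 5
print(round(similarity(record_1, record_2), 3))  # 0.909
```

A high similarity between adjacent subtrees suggests that they are data records generated from the same template; an extractor built on this idea would compare such subtrees and group the similar ones before aligning their fields into table columns.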


Download the MDR system here. Two new systems are DEPTA and NET.

Techniques and ideas in the systems and the papers below have been used by several companies.


Created on Oct 3, 2003 by Bing Liu and Yanhong Zhai.