Web Data Extraction from Flat and Nested Data Records

- Automatic Wrapper Generation

Textbook, 2nd Edition: Web Data Mining - Exploring Hyperlinks, Contents and Usage Data, 622 pages

This project studies automatic (and semi-automatic) extraction of structured data from Web pages. Each such page may contain one or several groups of structured data records, which may be flat or nested. These data records are usually retrieved from underlying databases and displayed on Web pages following some fixed templates. Examples of such data records include products sold online, research publications, job postings, customer reviews, and many more. The objective of this project is to design and experiment with novel algorithms to extract such data and put them in database tables. We have proposed several fully automatic approaches based on tree matching and a semi-automatic instance-based learning approach.
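To give a feel for what "tree matching" means here, the following is a minimal sketch of one well-known formulation, often called Simple Tree Matching, applied to two small tag trees. It is an illustration only and not necessarily the exact procedure used in MDR, DEPTA, or NET. The `Node` class, the function names, the normalization in `similarity`, and the two toy records are all assumptions made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical, minimal stand-in for an HTML tag-tree node."""
    tag: str
    children: list[Node] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the number of node pairs in a maximum order-preserving matching."""
    # Roots with different tags cannot be matched at all.
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # dp[i][j] = best matching between the first i subtrees of a
    # and the first j subtrees of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(
                dp[i - 1][j],
                dp[i][j - 1],
                dp[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    # Add 1 for the matched pair of roots.
    return dp[m][n] + 1


def size(node: Node) -> int:
    """Count the nodes in a tree."""
    return 1 + sum(size(c) for c in node.children)


def similarity(a: Node, b: Node) -> float:
    """Matching size normalized by the average tree size (an assumed choice)."""
    return simple_tree_matching(a, b) / ((size(a) + size(b)) / 2)


# Two toy "data records": table rows that follow nearly the same template.
rec1 = Node("tr", [Node("td", [Node("img")]), Node("td", [Node("a")]), Node("td")])
rec2 = Node("tr", [Node("td", [Node("img")]), Node("td", [Node("a")])])

print(simple_tree_matching(rec1, rec2))   # 5
print(f"{similarity(rec1, rec2):.3f}")    # 0.909
```

Records generated from the same template tend to score close to 1 under a measure like this, which is what makes tree matching useful for recognizing repeated data records on a page.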

Download

Download the MDR system here. Two new systems are DEPTA and NET.

Techniques and ideas in the systems and the papers below have been used by several companies.

E.g., the instance-based learning method in (Zhai and Liu, 2005 and 2007) has been implemented by github.com and is in commercial use. The company said "with some tweaks and extensions it has been used on thousands of websites, extracting millions of items." If you are interested, you can download their code.
DEPTA has been reimplemented by Sigit Dewanto (with some changes); you can download it from here. The source is available here.

Publications

Yanhong Zhai and Bing Liu. "Structured Data Extraction from the Web based on Partial Tree Alignment." Accepted for publication in IEEE Transactions on Knowledge and Data Engineering, 2006. [PDF].
Yanhong Zhai and Bing Liu. "Extracting Web Data Using Instance-Based Learning." Journal of World Wide Web, Volume 10 Issue 2, June 2007 (Journal version of the paper below).
Yanhong Zhai and Bing Liu. "Extracting Web Data Using Instance-Based Learning." Proceedings of 6th International Conference on Web Information Systems Engineering (WISE-05), 2005. [PDF] - best paper award
Bing Liu and Yanhong Zhai. "NET - A System for Extracting Web Data from Flat and Nested Data Records." Proceedings of 6th International Conference on Web Information Systems Engineering (WISE-05), 2005. [PDF]
Yanhong Zhai and Bing Liu. "Web Data Extraction Based on Partial Tree Alignment." To appear in Proceedings of the 14th International World Wide Web Conference (WWW-2005), May 10-14, 2005, in Chiba, Japan. [PDF]
Bing Liu, Robert Grossman, Yanhong Zhai. "Mining Data Records in Web Pages." Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-2003), Washington, DC, USA, August 24 - 27, 2003. [PDF - conference version], [Full version]

Created on Oct 3, 2003 by Bing Liu and Yanhong Zhai.