This project studies automatic (and semi-automatic) extraction of
structured data from Web pages.
Each such page may contain one or several groups of structured data records, which
may be flat or nested. These data records are usually retrieved from underlying databases and displayed on Web pages following fixed templates.
Examples of such data records include products
sold online, research publications, job postings, customer reviews, and many more.
The objective of this project is to design and experiment with novel algorithms
that extract such data and put them into database tables. We have proposed
several fully automatic approaches based on tree matching and a semi-automatic instance-based learning approach.
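To give a feel for what "tree matching" means here, the following is a minimal sketch of the generic simple tree matching algorithm, a classic dynamic program that counts the largest order-preserving, top-down mapping between two tag trees. It is an illustration only and is not taken from the MDR, DEPTA, or NET code. The names `Node`, `simple_tree_match`, `tree_size`, and `similarity`, the normalization used in `similarity`, and the two toy records are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tag-tree node: an HTML tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_match(a: Node, b: Node) -> int:
    """Return the number of node pairs in the largest top-down,
    order-preserving mapping between trees a and b."""
    if a.tag != b.tag:
        return 0  # roots differ, so nothing below them can be matched
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best score using the first i children of a and first j of b
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(
                dp[i - 1][j],  # skip child i of a
                dp[i][j - 1],  # skip child j of b
                dp[i - 1][j - 1]
                + simple_tree_match(a.children[i - 1], b.children[j - 1]),
            )
    return dp[m][n] + 1  # +1 for the matched roots


def tree_size(t: Node) -> int:
    """Total number of nodes in the tree."""
    return 1 + sum(tree_size(c) for c in t.children)


def similarity(a: Node, b: Node) -> float:
    """Matched pairs normalized by the average tree size (assumed choice)."""
    return 2 * simple_tree_match(a, b) / (tree_size(a) + tree_size(b))


# Two toy "data records" rendered from the same template;
# the second one lacks the optional <b> field.
rec1 = Node("tr", [Node("td", [Node("img")]), Node("td", [Node("a"), Node("b")])])
rec2 = Node("tr", [Node("td", [Node("img")]), Node("td", [Node("a")])])

score = simple_tree_match(rec1, rec2)
sim = similarity(rec1, rec2)
```

Records generated from the same template tend to score close to 1 under such a measure even when optional fields are missing, which is the intuition behind grouping similar subtrees into data records.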
Download
Download the MDR system here. Two new systems are DEPTA and NET.
Techniques and ideas in the systems and the papers below have been used by
several companies.
For example, the instance-based learning method in (Zhai and Liu, 2005 and 2007) has been implemented by github.com and is in commercial use. The company said "with some tweaks and extensions it has been used on thousands of websites, extracting millions of items." If you are interested, you can download their code.
DEPTA has been reimplemented by Sigit Dewanto (with some changes); you can download it from here. The source is available here.
Publications
Yanhong Zhai and Bing Liu. "Structured Data Extraction from the Web based on Partial Tree Alignment." Accepted for publication in IEEE Transactions on Knowledge and Data Engineering, 2006. [PDF].
Yanhong Zhai and Bing Liu. "Extracting Web Data Using Instance-Based
Learning." Journal of World Wide Web, Volume 10 Issue 2, June 2007 (Journal version of the paper below).
Yanhong Zhai and Bing Liu. "Extracting Web Data Using Instance-Based
Learning." Proceedings of the 6th International Conference on Web
Information Systems Engineering (WISE-05), 2005. [PDF] - best paper award
Bing Liu and Yanhong Zhai. "NET - A System for Extracting Web Data from Flat and Nested Data Records." Proceedings of the 6th International Conference on Web Information Systems Engineering (WISE-05), 2005. [PDF]
Yanhong Zhai and Bing Liu. "Web Data Extraction Based on Partial Tree Alignment." To appear in Proceedings of the 14th International World Wide Web Conference (WWW-2005), May 10-14, 2005, in Chiba, Japan. [PDF]
Bing Liu, Robert Grossman, Yanhong Zhai. "Mining Data Records in Web Pages." Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-2003), Washington, DC, USA, August 24 - 27, 2003. [PDF - conference version], [Full version]