This project studies automatic (and semi-automatic) extraction of
structured data from Web pages.
Each such page may contain one or several groups of structured data records, which
may be flat or nested. These data records are usually retrieved from underlying databases and displayed on Web pages following fixed templates.
Examples of such data records include products
sold online, research publications, job postings, customer reviews, and many more.
The objective of this project is to design and experiment with novel algorithms
to extract such data and put them in database tables. We have proposed
several fully automatic approaches based on tree matching and a semi-automatic instance-based learning approach.
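To give a feel for what "tree matching" means when comparing data records, here is a minimal sketch using Simple Tree Matching, a standard dynamic-programming measure over ordered tag trees. This page does not state which matching algorithm the systems use, so the choice of algorithm, the `Node` class, and the function names below are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tag-tree node: an HTML tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the number of node pairs in a maximum top-down matching.

    Roots must carry the same tag; children are then aligned in order
    with a dynamic program similar to longest common subsequence.
    """
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best score aligning the first i children of a
    # with the first j children of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + pair)
    return dp[m][n] + 1  # +1 for the matched roots


def tree_size(t: Node) -> int:
    """Count the nodes in a tree."""
    return 1 + sum(tree_size(c) for c in t.children)


def similarity(a: Node, b: Node) -> float:
    """Normalized match score in [0, 1]; 1.0 means identical structure."""
    return 2 * simple_tree_matching(a, b) / (tree_size(a) + tree_size(b))


# Two made-up product records: the second lacks the last field.
rec1 = Node("tr", [Node("td", [Node("img")]),
                   Node("td", [Node("a")]),
                   Node("td", [Node("b")])])
rec2 = Node("tr", [Node("td", [Node("img")]),
                   Node("td", [Node("a")])])

print(simple_tree_matching(rec1, rec2))  # 5
print(round(similarity(rec1, rec2), 3))  # 0.833
```

Two subtrees with a high similarity score are likely to be rendered from the same template, which is the intuition behind grouping them as data records of the same kind.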
Download
Download the MDR system here. Drop me an email if you want DEPTA. NET is not ready for release yet.
Techniques and ideas in the systems and the papers below have been used by
several companies.
Publications
Yanhong Zhai and Bing Liu. "Structured Data Extraction from the Web Based on Partial Tree Alignment." Accepted for publication in IEEE Transactions on Knowledge and Data Engineering, 2006. [PDF].
Yanhong Zhai and Bing Liu. "Extracting Web Data Using Instance-Based
Learning." Proceedings of the 6th International Conference on Web
Information Systems Engineering (WISE-05), 2005. [PDF] - Best Paper Award
Bing Liu and Yanhong Zhai. "NET - A System for Extracting Web Data from Flat and Nested Data Records." Proceedings of the 6th International Conference on Web Information Systems Engineering (WISE-05), 2005. [PDF]
Yanhong Zhai and Bing Liu. "Web Data Extraction Based on Partial Tree Alignment." To appear in Proceedings of the 14th International World Wide Web Conference (WWW-2005), May 10-14, 2005, Chiba, Japan. [PDF]
Bing Liu, Robert Grossman, Yanhong Zhai. "Mining Data Records in Web Pages." Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-2003), Washington, DC, USA, August 24 - 27, 2003. [PDF - conference version], [Full version]