DEPTA: Data Extraction based on Partial Tree Alignment
DEPTA is a Web mining system that automatically identifies regularly structured data records (e.g., products and data tables) in Web pages, then aligns and extracts their data items (data fields). See the paper below for details:
Yanhong Zhai and Bing Liu. "Web Data Extraction Based on Partial Tree Alignment." In Proceedings of the 14th International World Wide Web Conference (WWW-2005), May 10-14, 2005, Chiba, Japan. [pdf]
We only provide an executable (.exe) version of the system (without source),
which runs on Windows PCs. The program is free for scientific use. Please
contact us by email if you are planning
to use the software for commercial purposes. The software must not be
distributed without prior permission of the authors.
Please send us an email if you are interested
in the DEPTA system. We will send you the compressed files and add you to
our mailing list to inform you of any new versions and bug fixes.
How to use
Click on "depta.exe". You will get a small interface window. An IE Web
browser is embedded in the interface.
After the page is fully loaded into the embedded browser, a message box
will pop up to confirm the extraction. Click "OK".
Example output1: identified data records
Example output2: extracted data items
Only show the data regions with "$" sign: When dealing with e-commerce websites, most data records of interest are merchandise. If this option is checked, DEPTA outputs only those data regions in which the data records are merchandise. (Here we assume that every merchandise item has a price with a "$" sign.) As a result, some data regions that also contain regularly patterned data records will not be displayed.
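The effect of this option can be illustrated with a small Python sketch. This is not DEPTA's code (the source is not distributed); it only mirrors the filtering rule described above. The representation of a data region as a list of records, each record being a list of text items, and the names has_price and regions_with_prices are assumptions made for illustration. The sketch also assumes that a region is kept only when every record in it contains a "$" sign, which is one possible reading of the option.

```python
from typing import List

# Hypothetical representation (not DEPTA's internal format):
# a data region is a list of data records, and each record is a
# list of extracted text items.
Record = List[str]
Region = List[Record]


def has_price(record: Record) -> bool:
    """Return True if any item in the record contains a "$" sign."""
    return any("$" in item for item in record)


def regions_with_prices(regions: List[Region]) -> List[Region]:
    """Keep only non-empty regions in which every record has a "$" sign.

    Assumption: "every merchandise item has a price" is read as
    "every record in a merchandise region contains a dollar sign".
    """
    return [
        region
        for region in regions
        if region and all(has_price(record) for record in region)
    ]


# Example: a product listing region and a navigation-like region.
sample_regions = [
    [["Laptop", "$999.00"], ["Phone", "$499.00"]],
    [["Home", "About"], ["Contact", "Help"]],
]
kept = regions_with_prices(sample_regions)
print(len(kept))  # 1: only the product listing region is kept
```

The second region has a regular structure but no prices, so it is dropped, which matches the note above that some regularly patterned regions are not displayed when the option is checked.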
Last Updated: February 8, 2006