DEPTA: Data Extraction based on Partial Tree Alignment
DEPTA is a Web mining system that automatically identifies regularly structured data records (e.g., products and data tables) in Web pages and aligns and extracts their data items (data fields). See the paper below for details:
Yanhong Zhai and Bing Liu. "Web Data Extraction Based on Partial Tree Alignment." In Proceedings of the 14th International World Wide Web Conference (WWW-2005), May 10-14, 2005, Chiba, Japan. [pdf]
Executable (.exe)
We provide only an executable (.exe) version of the system (without source code), which runs on Windows PCs. The program is free for scientific use. Please contact us by
Please send us an email [
How to use
1. Click on "depta.exe". A small interface window will open. An IE Web browser is embedded in the interface.
3. After the page is fully loaded into the embedded browser, a message box will pop up to confirm the extraction. Click "OK".
Example output1: identified data records
Example output2: extracted data items
Some Notes:
Options
Only show the data regions with "$" sign: When dealing with e-commerce websites, most data records of interest are merchandise. If this option is checked, DEPTA outputs only those data regions in which the data records are merchandise. (Here we assume that every merchandise item has a price with a "$" sign.) As a result, some data regions that also contain regularly patterned data records will not be displayed.
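The effect of the "$" option can be illustrated with a minimal sketch. This is not DEPTA's code (the source is not distributed); it only assumes that a data region can be modeled as a list of data records and a data record as a list of text items. The names `has_price` and `filter_price_regions` are hypothetical, and requiring every record in a region to contain "$" is one reading of the assumption stated above.

```python
from typing import List

# Hypothetical model, not DEPTA's internal representation:
# a data record is a list of text items, a data region is a list of records.
Record = List[str]
Region = List[Record]


def has_price(record: Record) -> bool:
    """Return True if any text item of the record contains a "$" sign."""
    return any("$" in item for item in record)


def filter_price_regions(regions: List[Region]) -> List[Region]:
    """Keep only non-empty regions in which every record contains a "$" sign.

    Assumption: a region counts as merchandise only if all of its
    records carry a price, as every merchandise item is assumed to.
    """
    return [
        region
        for region in regions
        if region and all(has_price(record) for record in region)
    ]


if __name__ == "__main__":
    example = [
        [["Camera", "$199.00"], ["Lens", "$89.50"]],  # merchandise region
        [["Home", "About"], ["Contact", "Help"]],     # navigation-like region
    ]
    print(len(filter_price_regions(example)))  # prints 1
```

With this filter, a regularly structured region such as a navigation list is dropped even though its records follow a repeating pattern, which matches the behavior described for the option.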
Last Updated: February 8, 2006