DEPTA: Data Extraction based on Partial Tree Alignment

 

DEPTA is a Web mining system that automatically identifies regularly structured data records (e.g., products and data tables) in Web pages and aligns and extracts their data items (data fields). See the paper below for details:

Yanhong Zhai and Bing Liu. "Web Data Extraction Based on Partial Tree Alignment." In Proceedings of the 14th International World Wide Web Conference (WWW-2005), May 10-14, 2005, Chiba, Japan. [pdf]

 

Executable (.exe)

We provide only an executable (.exe) version of the system (without source), which runs on Windows PCs. The program is free for scientific use. Please contact us if you plan to use the software for commercial purposes. The software must not be distributed without prior permission of the authors.

 

Please send us an email if you are interested in the DEPTA system. We will send you the compressed files and add you to our mailing list to inform you of new versions and bug fixes.

 
How to use

1. Click on "depta.exe". A small interface window opens, with the IE Web browser embedded in it.
2. DEPTA can run in two modes:

  • To extract data from a single Web page: type or paste a URL (including http://) or a local path (e.g., C:\work\DEPTA\testpage\01_gateway.htm) into the text box and click "Load".
  • To extract data from a set of Web pages: first download (crawl) the pages and save them on your local disk, then type or paste the path of the directory containing the saved pages (e.g., C:\work\DEPTA\testpage) into the text box and click "Batch".

3. After the page is fully loaded into the embedded browser, a message box will pop up to confirm the extraction. Click "OK".
4. After the execution, an HTML file and an Excel file will be written to the same directory as DEPTA.exe. The HTML file contains the extracted data records. The Excel file contains the aligned/extracted data items.
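
Batch mode expects the pages to be on the local disk already. The following is a minimal Python sketch of one way to prepare such a directory; it is not part of DEPTA. The function names (fetch_url, save_pages) and the NN_page.htm file naming are assumptions made for illustration, and a real crawl may need error handling and politeness delays.

```python
import os
import urllib.request


def fetch_url(url):
    """Download one page and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def save_pages(urls, out_dir, fetch=fetch_url):
    """Save each URL as NN_page.htm in out_dir and return the saved paths.

    `fetch` is injectable so the function can be tried without network access.
    """
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for i, url in enumerate(urls, start=1):
        # Hypothetical naming scheme: 01_page.htm, 02_page.htm, ...
        path = os.path.join(out_dir, f"{i:02d}_page.htm")
        with open(path, "wb") as f:
            f.write(fetch(url))
        saved.append(path)
    return saved
```

For example, calling save_pages(list_of_urls, r"C:\work\DEPTA\testpage") would fill the directory used in the batch example above; that directory path can then be pasted into the text box before clicking "Batch".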

Example output1: identified data records

Example output2: extracted data items

Some Notes:

  • You will notice that the output window contains some uninteresting records. Simple cleanup could remove them, but we have not done so, as it is not much of a research issue.
  • Please send us the pages on which you encounter problems with MDR. We hope to improve it over time. Unsuccessful extractions are sometimes due to reasons other than the algorithm itself.

Options

Only show the data regions with "$" sign: When dealing with e-commerce websites, most data records of interest are merchandise. If this option is checked, DEPTA outputs only those data regions in which the data records are merchandise. (Here we assume that every merchandise item has a price with a "$" sign.) As a result, some data regions that also contain regularly patterned data records will not be displayed.
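
The idea behind this option can be illustrated with a short Python sketch. It is not DEPTA's implementation, which is not published here. It assumes a data region is a list of records and a record is a list of text fields, and it keeps a region only if every record contains a "$" somewhere. The names has_dollar and filter_price_regions are hypothetical.

```python
def has_dollar(record):
    """Return True if any text field of the record contains a "$" sign."""
    return any("$" in field for field in record)


def filter_price_regions(regions):
    """Keep only non-empty regions in which every record has a "$" sign."""
    return [
        region
        for region in regions
        if region and all(has_dollar(record) for record in region)
    ]


# Two toy regions: a product list and a navigation menu.
regions = [
    [["Camera", "$199.99"], ["Lens", "$89.00"]],
    [["Home", "About"], ["Contact", "Help"]],
]
kept = filter_price_regions(regions)  # only the product list remains
```

The navigation menu is a regular pattern too, but it is dropped because none of its records carries a price, which mirrors the behavior described for the option.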

 
 
 
 
Last Updated: February 8, 2006