MDR: Mining Data Records from Web Pages

MDR Download and Help

MDR is a Web mining system that identifies and extracts regularly structured data records (e.g., products and data tables) from Web pages automatically. See the paper below for details:


Executable (.exe)

We provide only an executable (.exe) version of the system (without source), which runs on Windows PCs. The program is free for scientific use. Please contact us if you plan to use the software for commercial purposes. The software must not be distributed without prior permission of the authors.


Download and Install

  1. Download the MDR program here
  2. Extract the files in the zip file to a directory. You are on your way.
Note: A more robust and efficient implementation of the algorithm is in production use at an e-commerce company.

If you have downloaded MDR, please send us an email so that we can add you to our mailing list and inform you of any new versions and bug fixes.


How to use

  1. Click on "mdr.exe". You will get a small interface window.
  2. Type or paste a URL (including http://) or a local path into the Combo Box. The Combo Box contains a list of the URLs you have added; at the beginning it may be empty.
  3. If you are interested in extracting tables (or other data laid out in rows and columns), click on "Extract" in the Table section.
  4. If you are interested in extracting other types of data records, click on "Extract" in the "Data Records (other types)" section. We separate the two functions for efficiency reasons.
  5. After execution, the output file is displayed in an Internet Explorer (IE) window. It contains the extracted tables, or the extracted data regions and their data records.

Some Notes

Options

Only show the data regions with "$" sign: When dealing with e-commerce websites, most data records of interest are merchandise. If this option is checked, MDR outputs only those data regions in which the data records are merchandise. (Here we assume that every merchandise item has a price with a "$" sign.) As a result, other data regions that also contain regularly patterned data records will not be displayed.
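
The effect of this option can be illustrated with a minimal Python sketch. This is not MDR's code: the representation of a data region as a list of record texts and the function names has_dollar_sign and filter_dollar_regions are assumptions made for illustration only. The sketch also assumes a region is kept only when every one of its records contains a "$" sign.

```python
# Hypothetical sketch, not MDR's implementation.
# Assumption: a data region is a list of data records, and each data
# record is represented by its plain text.

def has_dollar_sign(record_text):
    """Return True if the record text contains a "$" sign."""
    return "$" in record_text

def filter_dollar_regions(regions):
    """Keep only non-empty regions in which every record has a "$" sign."""
    return [
        region
        for region in regions
        if region and all(has_dollar_sign(record) for record in region)
    ]

regions = [
    ["Widget $9.99", "Gadget $19.50"],   # merchandise records with prices
    ["Home", "About", "Contact"],        # regular pattern, but no prices
]
kept = filter_dollar_regions(regions)
```

Here the navigation-style region is dropped even though its records follow a regular pattern, which mirrors the behavior described above.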


Parameter

There is one threshold parameter in the algorithm, which may affect the extraction results of the system.

Similarity Threshold: Two sub-trees are considered to have a similar pattern only when the similarity value of the two tag strings that represent them is higher than the Similarity Threshold. Only then can the data records represented by these two tag strings be extracted. The default value of the Similarity Threshold is 60%, which was obtained from pilot studies and works well for most Web pages. If you do not get the expected data records after clicking "Extract", try changing the Similarity Threshold to a larger value.
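
To make the threshold test concrete, here is a small Python sketch. It is an illustration under stated assumptions, not the code inside mdr.exe: it assumes a tag string is a list of tag names and that the similarity value is one minus the edit distance divided by the length of the longer tag string. The names edit_distance, similarity, is_similar and SIMILARITY_THRESHOLD are hypothetical.

```python
# Hypothetical sketch, not MDR's implementation.
# Assumption: similarity = 1 - (edit distance / length of longer tag string).

SIMILARITY_THRESHOLD = 0.60  # the 60% default mentioned above

def edit_distance(a, b):
    """Levenshtein distance between two sequences of tag names."""
    prev = list(range(len(b) + 1))
    for i, tag_a in enumerate(a, start=1):
        curr = [i]
        for j, tag_b in enumerate(b, start=1):
            cost = 0 if tag_a == tag_b else 1
            curr.append(min(prev[j] + 1,          # delete tag_a
                            curr[j - 1] + 1,      # insert tag_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Similarity in [0, 1]; 1.0 means the tag strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest

def is_similar(a, b, threshold=SIMILARITY_THRESHOLD):
    """Sub-trees count as similar only if similarity is strictly higher."""
    return similarity(a, b) > threshold

row_1 = ["tr", "td", "img", "td", "a"]
row_2 = ["tr", "td", "img", "td", "a", "b"]
menu = ["ul", "li", "li"]
```

Under these assumptions, row_1 and row_2 differ by one inserted tag and pass the 60% test, while row_1 and menu share no tags and fail it. Raising the threshold makes the comparison stricter.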




Created on Oct 3, 2003 by Bing Liu and Yanhong Zhai.