MDR: Mining Data Records from Web Pages

MDR Download and Help

MDR is a Web mining system that identifies and extracts regularly structured data records (e.g., products and data tables) from Web pages automatically. See the paper below for details:

Bing Liu, Robert Grossman, Yanhong Zhai. "Mining Data Records in Web Pages." Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-2003), Washington, DC, USA, August 24 - 27, 2003. [PDF - conference version], [Full version]

Executable (.exe)

We provide only an executable (.exe) version of the system (without source code), which runs on Windows PCs. The program is free for scientific use. Please contact us if you plan to use the software for commercial purposes. The software must not be distributed without prior permission of the authors.

Download and Install

Download the MDR program here
Extract the files in the zip file to a directory. You are on your way.

Note: A more robust and efficient implementation of the algorithm is in production use at an e-commerce company.

If you have downloaded MDR, please send us an email so that we can add you to our mailing list and inform you of any new versions and bug fixes.

How to use

Click on "mdr.exe". You will get a small interface window.
You can type or paste a URL (including http://) or a local path into the Combo Box. The Combo Box contains a list of the URLs that you have added; at the beginning it may be empty.
If you are interested in extracting tables (or other data with rows and columns), click on "Extract" in the Table section.
If you are interested in extracting other types of data records, click on "Extract" in the "Data Records (other types)" section. We separate the two functions for efficiency reasons.
After the execution, the output file will be displayed in an IE window. It contains the extracted tables, or the extracted data regions and data records.

Some Notes:

You will notice that the output window has some uninteresting records. Simple cleanup could remove them, but we have not done it because it is of little research interest.
There are a few output files in the directory that are for our debugging purposes. You don't have to worry about them or delete them. If you want to understand them, please send us an email.
If MDR cannot successfully extract the data records in a page, one reason could be that the tags in the page are not well formed enough for MDR to build a correct tag tree. Although we tried to fix some of these errors, we did not spend enough time on this. Most of the pages that we tested have reasonably well-formed tags. Please send us any pages with which MDR has problems. We hope to improve it over time.
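To get a rough idea of whether a page's tags are balanced before sending it to MDR, a check like the following may help. This is only a sketch using Python's standard html.parser module; it is not how MDR builds its tag tree, and the names TagBalanceChecker and find_tag_problems are hypothetical. It is deliberately strict: it reports omitted end tags (e.g. a missing </td>) even where HTML permits them.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Hypothetical helper: tracks open tags on a stack and records mismatches."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open (non-void) tags
        self.problems = []   # human-readable problem descriptions

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            # The closing tag does not match the innermost open tag.
            self.problems.append(f"unexpected </{tag}>")


def find_tag_problems(html):
    """Return a list of tag-balance problems found in an HTML string."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    return checker.problems + [f"unclosed <{t}>" for t in checker.stack]
```

An empty result suggests the tags are balanced; a non-empty one lists closing tags that did not match and tags that were never closed.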

Options

Only show the data regions with "$" sign: When dealing with e-commerce websites, most data records of interest are merchandise. If this option is checked, MDR outputs only those data regions in which the data records are merchandise. (Here we assume every merchandise item has a price with a "$" sign.) As a result, some data regions that also contain regularly patterned data records will not be displayed.
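The effect of this option can be pictured with a small sketch. It assumes each data region is a list of record strings and keeps a region if at least one of its records contains "$". That rule and the function name keep_priced_regions are assumptions made for illustration; this page does not say exactly how MDR applies the test.

```python
def keep_priced_regions(regions):
    """Keep only the data regions that contain a "$" sign.

    regions: list of data regions, each a list of record text strings.
    Assumption: one record with "$" is enough to keep the whole region.
    """
    return [
        region for region in regions
        if any("$" in record for record in region)
    ]


# Illustrative input: one product region and one navigation-link region.
example_regions = [
    ["Camera A $199.00", "Camera B $249.00"],
    ["Home", "About us", "Contact"],
]
priced = keep_priced_regions(example_regions)
```

Here the navigation links form a regular pattern too, but they are dropped because none of them carries a price.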

Parameter

There is one threshold parameter in the algorithm, which may affect the extraction results of the system.

Similarity Threshold: Two sub-trees are considered to have a similar pattern only when the similarity value of the two tag strings that represent them is higher than the Similarity Threshold. The data records represented by these tag strings can then be extracted. The default value of the Similarity Threshold is 60%, which was obtained from pilot studies and works well for most Web pages. If you cannot get the expected data records after clicking "Extract", try changing the Similarity Threshold to a larger value.
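To make the threshold concrete, here is a minimal sketch of a tag-string comparison. It assumes similarity is computed as 1 minus the edit distance divided by the length of the longer tag string; this page does not state MDR's exact formula, so treat that measure and the function names as illustrative assumptions. Only the 60% default comes from the text above.

```python
def edit_distance(a, b):
    """Levenshtein distance between two tag strings (sequences of tag names)."""
    prev = list(range(len(b) + 1))
    for i, tag_a in enumerate(a, start=1):
        curr = [i]
        for j, tag_b in enumerate(b, start=1):
            cost = 0 if tag_a == tag_b else 1
            curr.append(min(prev[j] + 1,          # delete tag_a
                            curr[j - 1] + 1,      # insert tag_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed measure: 1 - edit distance / length of the longer tag string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def is_similar(a, b, threshold=0.60):
    """True when the similarity is strictly higher than the threshold."""
    return similarity(a, b) > threshold


row_1 = ["tr", "td", "img", "td", "a"]
row_2 = ["tr", "td", "img", "td", "a", "b"]  # same layout plus one extra tag
nav = ["ul", "li", "li"]
```

Under this assumed measure, row_1 and row_2 differ by one tag out of six, so they pass the default 60% threshold but would fail a stricter 90% one, while row_1 and nav share no tags and fail at any threshold.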


Created on Oct 3, 2003 by Bing Liu and Yanhong Zhai.