MDR is a Web mining system that identifies and extracts regularly structured data records (e.g., products and data tables) from Web pages automatically. See the paper below for details:
Bing Liu, Robert Grossman, Yanhong Zhai. "Mining Data Records in Web Pages." Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-2003), Washington, DC, USA, August 24-27, 2003. [PDF - conference version], [Full version]
Executable (.exe)
We provide only an executable (.exe) version of the system (without source), which runs on Windows PCs. The program is free for scientific use. Please contact us if you are planning to use the software for commercial purposes. The software must not be distributed without prior permission of the authors.
Extract the files in the zip file to a directory. You are on your way.
Note: A more robust and efficient implementation of the algorithm is in production use at an e-commerce company.
If you have downloaded MDR, please send us an email so that we can add you to our mailing list and inform you of new versions and bug fixes.
How to use
Click on "mdr.exe". You will get a small interface window.
You can type or paste a URL (including http://) or a local path into the Combo Box. The Combo Box lists the URLs that you have added; at the beginning it may be empty.
If you are interested in extracting tables (or other data laid out in rows and columns), click on "Extract" in the Table section.
If you are interested in extracting other types of data records, click on "Extract" in the "Data Records (other types)" section. We separate the two functions for efficiency reasons.
After execution, the output file is displayed in an IE window. It contains the extracted tables, or the extracted data regions and data records.
Some Notes:
You will notice that the output window contains some uninteresting records. Simple cleanup could remove them, but we have not done it because it is not of much research interest.
There are a few output files in the directory that are for our debugging purposes. You don't have to worry about them or delete them, but if you want to understand them, please send us an email.
If MDR cannot successfully extract the data records in a page, one reason could be that the tags in the page are not well formed enough for MDR to build a correct tag tree. Although we tried to fix some of these errors, we did not spend enough time on this. Most of the pages that we tested have reasonably well-formed tags. Please send us any pages on which MDR has problems. We hope to improve it over time.
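To illustrate what building a tag tree from imperfect markup involves, here is a minimal Python sketch. It is not MDR's code (the source is not distributed); the names Node, TagTreeBuilder, build_tag_tree and tag_string are hypothetical, and the recovery rules (close the nearest matching open tag, ignore stray closing tags) are assumptions chosen for the example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they are added as leaves.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One node of the tag tree (hypothetical helper type)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Builds a tag tree while tolerating some malformed markup."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Assumed recovery rule: close the nearest open ancestor with
        # this tag name; if there is none, ignore the stray closing tag.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def tag_string(node):
    """Flatten a subtree into its tags in pre-order."""
    tags = [node.tag]
    for child in node.children:
        tags.extend(tag_string(child))
    return tags
```

With an unclosed <b> and a stray </span>, build_tag_tree("<ul><li><b>A</li><li>B</li></ul></span>") still yields a <ul> node with two <li> children. Pages that are broken in ways these simple rules do not cover would still produce a wrong tree, which is the failure mode described above.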
Options
Only show the data regions with "$" sign: When dealing with e-commerce websites, most data records of interest are merchandise. If this option is checked, MDR outputs only those data regions in which the data records are merchandise. (Here we assume that every merchandise item has a price with a "$" sign.) As a result, data regions that contain regularly patterned data records but no "$" sign will not be displayed.
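The effect of this option can be pictured with a short Python sketch. It is only an illustration, not MDR's implementation: the function name filter_price_regions and the representation of a data region as a list of record strings are hypothetical, and requiring every record in a region to contain "$" is one possible reading of the rule.

```python
def filter_price_regions(regions):
    """Keep only the data regions whose records all contain a "$" sign.

    `regions` is assumed to be a list of data regions, each given as a
    list of record text strings. Empty regions are dropped.
    """
    return [
        region
        for region in regions
        if region and all("$" in record for record in region)
    ]
```

For example, a region such as ["Widget $9.99", "Gadget $19.50"] would be kept, while a navigation-like region such as ["Home", "About us"] would be filtered out even though its records follow a regular pattern.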
Parameter
There is one threshold parameter in the algorithm, which may affect the extraction results of the system.
Similarity Threshold: Two sub-trees are considered to have a similar pattern only when the similarity value of the two tag strings that represent them is higher than the Similarity Threshold. The data records represented by these two tag strings can then be extracted. The default value of the Similarity Threshold is 60%, which was obtained from pilot studies and works well for most Web pages. If you do not get the expected data records after clicking "Extract", try changing the Similarity Threshold to a larger value.
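As a rough illustration of a threshold test on tag strings, the Python sketch below scores two tag sequences with a normalized edit distance and compares the score to 60%. The exact similarity formula used by the executable is not stated here, so the normalization (one minus edit distance divided by the longer length) and the names edit_distance, similarity and is_similar are assumptions made for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of tags."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed score: 1.0 for identical tag strings, 0.0 for fully different."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def is_similar(a, b, threshold=0.60):
    """True only when the similarity is strictly higher than the threshold."""
    return similarity(a, b) > threshold
```

Under this assumed measure, ["tr", "td", "img", "td", "a"] and ["tr", "td", "td", "a"] differ by one deletion out of five tags, giving a similarity of 0.8, which passes the default threshold. Raising the threshold makes the test stricter, so fewer pairs of sub-trees are treated as similar.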