Web Page Template Detection and Cleaning for Web Data Mining

Project description

A commercial Web page typically contains many information blocks. Apart from the main content blocks, it usually has blocks such as navigation panels, copyright and privacy notices, and advertisements, which are included for business purposes and for easy user access. We call the blocks that are not main content blocks of the page noisy blocks. We show that the information contained in these noisy blocks can seriously harm Web data mining, so eliminating this noise is of great importance. In this research, we propose two noise elimination techniques, both based on the following observation: in a given Web site, noisy blocks usually share common contents and presentation styles, while the main content blocks of the pages are often diverse in their actual contents and/or presentation styles. Our techniques first detect different templates from a set of Web pages and then identify the main content blocks. The proposed techniques have been evaluated with two data mining tasks, Web page clustering and classification. Experimental results show that our noise elimination techniques are able to improve the mining results significantly. The papers are as follows:
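
As a rough illustration of the observation in the project description (noisy blocks recur across the pages of a site, while main content blocks vary), the following sketch flags blocks that appear on a large fraction of pages and removes them. It is not the method from the papers: it compares block text only, ignores presentation styles and template detection, and assumes each page has already been split into text blocks. The names `normalize`, `find_noisy_blocks`, `clean_page`, the `min_fraction` threshold, and the sample pages are all hypothetical.

```python
from collections import Counter


def normalize(block: str) -> str:
    """Collapse whitespace and lowercase so near-identical blocks compare equal."""
    return " ".join(block.split()).lower()


def find_noisy_blocks(pages, min_fraction=0.5):
    """Return normalized blocks that appear on at least min_fraction of the pages.

    Assumption: a block repeated across many pages of one site is noise
    (navigation, copyright notice, ...), not main content.
    """
    if not pages:
        return set()
    page_counts = Counter()
    for blocks in pages:
        # Use a set so each block is counted at most once per page.
        page_counts.update({normalize(b) for b in blocks})
    threshold = min_fraction * len(pages)
    return {block for block, n in page_counts.items() if n >= threshold}


def clean_page(blocks, noisy):
    """Keep only the blocks that were not flagged as noisy."""
    return [b for b in blocks if normalize(b) not in noisy]


# Hypothetical pages from one site, each already segmented into blocks.
pages = [
    ["Home | Products | Contact", "Review of camera A", "Copyright 2004 Example Corp"],
    ["Home | Products | Contact", "Specs for printer B", "Copyright 2004 Example Corp"],
    ["Home | Products | Contact", "Guide to laptop C", "Copyright 2004 Example Corp"],
]
noisy = find_noisy_blocks(pages)
cleaned = [clean_page(p, noisy) for p in pages]
```

In this toy input the navigation bar and copyright line occur on every page and are dropped, leaving one main content block per page. A frequency threshold is the simplest way to act on the observation; it would miss noisy blocks whose text changes from page to page (for example rotating advertisements), which is one reason to also consider presentation styles.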

Created on Feb 5, 2004 by Bing Liu.