Automatic Web News Extraction Using Tree Edit Distance
• Their approach is based on the concept of tree edit distance and
allows the identification of the pages of interest and the extraction of
the relevant text passages, discarding non-useful material such as banners,
menus, and links.
• The basic idea is:
1. evaluating the structural similarities between pages in a target site
(similarity measure: tree edit distance, i.e., a minimum-cost mapping
between two trees);
Note: DOM trees are labeled, ordered, rooted trees.
2. applying a hierarchical clustering technique
to group together pages with similar structures; a constant threshold (T=80%)
is adopted to decide whether two given clusters should be merged;
3. finding a generic representation of the structure of the pages within
a cluster (an ne-pattern, or node extraction pattern, which can be regarded
as a regular expression for trees);
4. matching the set of generated ne-patterns to the set of newly crawled pages.
5. using simple heuristics to select and label the title
and the body of the news article.
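A minimal sketch of step 2, assuming single-link agglomerative clustering with the fixed similarity threshold; the `similarity` function stands in for the tree-edit-distance-based measure from step 1, and all names here are illustrative, not the paper's:

```python
def cluster_pages(pages, similarity, threshold=0.8):
    """Greedily merge the most similar pair of clusters while their
    similarity stays above the threshold (single linkage)."""
    clusters = [[p] for p in pages]
    while len(clusters) > 1:
        # Find the most similar pair of clusters; single linkage takes
        # the similarity of the closest pair of members.
        best_sim, best_pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = max(similarity(a, b)
                          for a in clusters[i] for b in clusters[j])
                if sim > best_sim:
                    best_sim, best_pair = sim, (i, j)
        if best_sim < threshold:
            break  # no pair of clusters is similar enough to merge
        i, j = best_pair
        clusters[i].extend(clusters.pop(j))
    return clusters
```

With T=80%, merging stops as soon as every remaining pair of clusters falls below 0.8 similarity, which is where the noise-sensitivity question below comes from.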
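For step 4, a toy illustration of what "regular expression for trees" means: a pattern node labeled "*" matches any subtree, otherwise labels must agree and children match in order. The tuple representation and names are assumptions for the sketch, not the paper's formalism:

```python
def matches(pattern, tree):
    """pattern and tree are (label, [children]) tuples; a pattern node
    labeled '*' matches any subtree at that position."""
    p_label, p_children = pattern
    t_label, t_children = tree
    if p_label == "*":
        return True  # wildcard node absorbs the whole subtree
    if p_label != t_label or len(p_children) != len(t_children):
        return False
    return all(matches(pc, tc) for pc, tc in zip(p_children, t_children))

# Example: the pattern tolerates any subtree in the second body slot.
page = ("html", [("body", [("h1", []), ("p", [])])])
pat  = ("html", [("body", [("h1", []), ("*", [])])])
matches(pat, page)  # → True
```

Real ne-patterns are richer (e.g. optional nodes); this sketch covers only the wildcard idea.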
• Calculating the edit distance between two general trees is a hard problem; it is NP-complete for unordered trees. To reduce the complexity, their approach is based on a restricted version of the top-down mapping (the RTDM algorithm).
In fact, it also uses the bottom-up distance.
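A simplified sketch of a top-down tree edit distance in the spirit of RTDM (not the exact algorithm from the paper): only mappings that include the roots and preserve ancestry are considered, so once two subtrees' roots differ, one whole subtree is replaced by the other. Nodes are (label, [children]) tuples:

```python
def size(t):
    """Number of nodes in the subtree rooted at t."""
    return 1 + sum(size(c) for c in t[1])

def top_down_dist(a, b):
    if a[0] != b[0]:
        return size(a) + size(b)  # replace one whole subtree by the other
    # Roots match at no cost; align the child sequences with string-style
    # edit distance, where substituting two children recurses into them.
    xs, ys = a[1], b[1]
    m, n = len(xs), len(ys)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(xs[i - 1])   # delete a subtree
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(ys[j - 1])   # insert a subtree
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(xs[i - 1]),
                d[i][j - 1] + size(ys[j - 1]),
                d[i - 1][j - 1] + top_down_dist(xs[i - 1], ys[j - 1]))
    return d[m][n]
```

The restriction is what makes the problem tractable: the recursion never considers mapping a node whose parent is unmapped, unlike general tree edit distance.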
Questions:
1. Since a threshold is adopted, there is no guarantee that the input pages are grouped 100% correctly. If some noisy pages (not generated by the same template) are grouped into a cluster, what ne-pattern will be generated? If this ne-pattern is supposed to accept all the pages in the cluster, it might be over-generalized.
2. After a set of ne-patterns has been generated, a new tree must be compared with each of them, and the one with the minimum cost is chosen as the appropriate ne-pattern. Is there a more efficient way?
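One possible direction for question 2 (an assumption, not something the paper proposes): bucket the ne-patterns by a cheap structural signature, such as the root label plus the sorted depth-1 child labels, and run the expensive tree comparison only against patterns in the same bucket:

```python
from collections import defaultdict

def signature(tree):
    """Cheap structural key: root label + sorted labels of its children.
    Trees are (label, [children]) tuples, as above."""
    label, children = tree
    return (label, tuple(sorted(c[0] for c in children)))

def build_index(patterns):
    index = defaultdict(list)
    for p in patterns:
        index[signature(p)].append(p)
    return index

def candidate_patterns(index, page):
    # Only these candidates need the full edit-distance comparison.
    return index.get(signature(page), [])
```

Wildcard or optional nodes near the root would complicate the signature, so this only prunes cleanly when the top levels of the patterns are concrete.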