Automatic Web News Extraction Using Tree Edit Distance
• Their approach is based on the concept of tree edit distance and
allows the identification of the pages of interest and the extraction of
the relevant text passages, discarding non-useful material such as banners,
menus, and links.
• The basic idea is:
1. evaluating the structural similarities between pages in a target site
(similarity measure: tree edit distance, i.e., a minimum-cost mapping
between two trees);
Note: DOM trees are labeled, ordered, rooted trees.
2. applying a hierarchical clustering technique
to group together pages with similar structures; a constant threshold (T=80%)
is adopted to decide whether two given clusters should be merged;
3. finding a generic representation of the structure of the pages within
a cluster (an ne-pattern, or node extraction pattern, which can be regarded
as a regular expression for trees);
4. matching the set of generated ne-patterns to the set of newly crawled pages.
5. using simple heuristics to select and label the title
and the body of the news article.
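A minimal sketch of step 2, assuming single-link agglomerative clustering with the fixed similarity threshold; the `similarity` function stands in for the tree-edit-distance-based measure from step 1, and all names here are illustrative, not the paper's:

```python
def cluster_pages(pages, similarity, threshold=0.8):
    """Greedily merge the most similar pair of clusters while their
    similarity stays above the threshold (single linkage)."""
    clusters = [[p] for p in pages]
    while len(clusters) > 1:
        # Find the most similar pair of clusters; single linkage takes
        # the similarity of the closest pair of members.
        best_sim, best_pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = max(similarity(a, b)
                          for a in clusters[i] for b in clusters[j])
                if sim > best_sim:
                    best_sim, best_pair = sim, (i, j)
        if best_sim < threshold:
            break  # no pair of clusters is similar enough to merge
        i, j = best_pair
        clusters[i].extend(clusters.pop(j))
    return clusters
```

With T=80%, merging stops as soon as every remaining pair of clusters falls below 0.8 similarity, which is where the noise-sensitivity question below comes from.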
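For step 4, a toy illustration of what "regular expression for trees" means: a pattern node labeled "*" matches any subtree, otherwise labels must agree and children match in order. The tuple representation and names are assumptions for the sketch, not the paper's formalism:

```python
def matches(pattern, tree):
    """pattern and tree are (label, [children]) tuples; a pattern node
    labeled '*' matches any subtree at that position."""
    p_label, p_children = pattern
    t_label, t_children = tree
    if p_label == "*":
        return True  # wildcard node absorbs the whole subtree
    if p_label != t_label or len(p_children) != len(t_children):
        return False
    return all(matches(pc, tc) for pc, tc in zip(p_children, t_children))

# Example: the pattern tolerates any subtree in the second body slot.
page = ("html", [("body", [("h1", []), ("p", [])])])
pat  = ("html", [("body", [("h1", []), ("*", [])])])
matches(pat, page)  # → True
```

Real ne-patterns are richer (e.g. optional nodes); this sketch covers only the wildcard idea.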
• Calculating the edit distance between two general trees is a hard problem; it is NP-complete for unordered trees. To reduce the complexity, their approach is based on a restricted version of the top-down mapping (the RTDM algorithm).
In fact, it also uses the bottom-up distance.
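A simplified sketch of a top-down tree edit distance in the spirit of RTDM (not the exact algorithm from the paper): only mappings that include the roots and preserve ancestry are considered, so once two subtrees' roots differ, one whole subtree is replaced by the other. Nodes are (label, [children]) tuples:

```python
def size(t):
    """Number of nodes in the subtree rooted at t."""
    return 1 + sum(size(c) for c in t[1])

def top_down_dist(a, b):
    if a[0] != b[0]:
        return size(a) + size(b)  # replace one whole subtree by the other
    # Roots match at no cost; align the child sequences with string-style
    # edit distance, where substituting two children recurses into them.
    xs, ys = a[1], b[1]
    m, n = len(xs), len(ys)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(xs[i - 1])   # delete a subtree
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(ys[j - 1])   # insert a subtree
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(xs[i - 1]),
                d[i][j - 1] + size(ys[j - 1]),
                d[i - 1][j - 1] + top_down_dist(xs[i - 1], ys[j - 1]))
    return d[m][n]
```

The restriction is what makes the problem tractable: the recursion never considers mapping a node whose parent is unmapped, unlike general tree edit distance.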
Questions:
1. Since a threshold is adopted, there is no guarantee that the input pages are grouped 100% correctly. If some noisy pages (not generated by the same template) are grouped into a cluster, what ne-pattern will be generated? If this ne-pattern is supposed to accept all the pages in the cluster, it might be over-generalized.
2. After a set of ne-patterns has been generated, a new tree must be compared with each of them, and the one with the minimum cost is chosen as the appropriate ne-pattern. Is there a more efficient way?
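One possible direction for question 2 (an assumption, not something the paper proposes): bucket the ne-patterns by a cheap structural signature, such as the root label plus the sorted depth-1 child labels, and run the expensive tree comparison only against patterns in the same bucket:

```python
from collections import defaultdict

def signature(tree):
    """Cheap structural key: root label + sorted labels of its children.
    Trees are (label, [children]) tuples, as above."""
    label, children = tree
    return (label, tuple(sorted(c[0] for c in children)))

def build_index(patterns):
    index = defaultdict(list)
    for p in patterns:
        index[signature(p)].append(p)
    return index

def candidate_patterns(index, page):
    # Only these candidates need the full edit-distance comparison.
    return index.get(signature(page), [])
```

Wildcard or optional nodes near the root would complicate the signature, so this only prunes cleanly when the top levels of the patterns are concrete.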