Using the Structure of Web Sites for Automatic Segmentation of Tables


• The approach requires multiple pages. The method works as follows:

1. Given multiple list pages, derive a template and use it to identify the tables inside the pages.
a. A template-finding algorithm (e.g., RoadRunner) is used to derive the template.
b. The sections of a page that are not part of the template are called "slots".
c. A heuristic identifies the table containing records: "the table in the slot that contains the largest number of text tokens is the table containing records".
2. Next, extract data from the table.
They give two models for this problem: a constraint satisfaction problem (CSP) and a probabilistic model.
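
The slot heuristic in step 1c can be illustrated with a minimal Python sketch. It assumes the slots have already been separated from the template and are available as HTML strings. The names `count_text_tokens` and `pick_record_slot`, the regex-based tag stripping, and whitespace tokenization are assumptions made for illustration, not details taken from the paper.

```python
import re
from typing import List

# Crude tag matcher; an assumption for this sketch, not a full HTML parser.
TAG = re.compile(r"<[^>]+>")


def count_text_tokens(slot_html: str) -> int:
    """Count whitespace-separated tokens that remain after removing tags."""
    text = TAG.sub(" ", slot_html)
    return len(text.split())


def pick_record_slot(slots: List[str]) -> int:
    """Return the index of the slot with the most text tokens.

    Ties go to the earliest slot, because max() keeps the first maximum.
    """
    if not slots:
        raise ValueError("no slots to choose from")
    return max(range(len(slots)), key=lambda i: count_text_tokens(slots[i]))


# Hypothetical slots from one list page: a heading, the record table, a link.
slots = [
    "<b>Search results</b>",
    "<tr><td>John Smith</td><td>221 Main St</td></tr>"
    "<tr><td>Jane Doe</td><td>5 Oak Ave</td></tr>",
    "<a>Next</a>",
]
record_slot = pick_record_slot(slots)
```

Here the second slot is selected because it holds far more text tokens than the heading or the navigation link.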

Data items are extracted using the following rule: an attribute value belongs to a record only if it appears on the detail page corresponding to that record.
After record segmentation, they use a probabilistic model to predict which column each attribute value should be assigned to.
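
The detail-page rule can be sketched as a candidate-generation step. This is a minimal illustration and not the CSP or probabilistic model from the paper: it assumes each detail page is available as plain text and uses exact substring matching, and the name `candidate_records` is hypothetical.

```python
from typing import List


def candidate_records(extracts: List[str], detail_pages: List[str]) -> List[List[int]]:
    """For each extracted value, list the records whose detail page contains it.

    detail_pages[j] is assumed to be the text of the detail page for record j.
    """
    return [
        [j for j, page in enumerate(detail_pages) if value in page]
        for value in extracts
    ]


# Hypothetical values extracted from the list-page table, in document order.
extracts = ["John Smith", "221 Main St", "Jane Doe", "(555) 123-4567"]

# Hypothetical detail pages, one per record.
detail_pages = [
    "Name: John Smith Address: 221 Main St Phone: (555) 123-4567",
    "Name: Jane Doe Phone: (555) 123-4567",
]

candidates = candidate_records(extracts, detail_pages)
```

A value that appears on only one detail page is assigned unambiguously. A value shared by several detail pages, such as the phone number here, keeps several candidate records, which is the kind of ambiguity the two models are meant to resolve.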

last update: April 21, 2005