Using the Structure of Web Sites for
Automatic Segmentation of Tables
• They require multiple pages. Their method works as follows:
1. Given multiple list pages, derive a template and use it to identify
the tables inside the pages.
a. A template-finding algorithm (e.g. RoadRunner) is used to derive the
template.
b. The sections of a page that are not part of the template are called
"slots".
c. They use a heuristic to find the table containing records: "the
table in the slot that contains the largest number of text tokens is the
table containing records".
2. Next, extract data from the table.
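
The slot-selection heuristic in step 1(c) can be illustrated with a minimal sketch. It assumes the slots have already been obtained from a template-finding step and are available as HTML strings; the names `text_tokens` and `pick_record_slot`, the regex-based tag stripping, and the sample slots are all hypothetical and are not taken from the paper.

```python
import re

# Assumption: a "text token" is any whitespace-separated string left
# after HTML tags are removed. The paper's tokenizer may differ.
TAG = re.compile(r"<[^>]+>")


def text_tokens(slot):
    """Return the text tokens of a slot, ignoring HTML tags."""
    return TAG.sub(" ", slot).split()


def pick_record_slot(slots):
    """Return the index of the slot holding the most text tokens."""
    if not slots:
        raise ValueError("no slots to choose from")
    return max(range(len(slots)), key=lambda i: len(text_tokens(slots[i])))


# Hypothetical slots from one list page.
slots = [
    "<b>Search results</b>",
    "<tr><td>John Smith</td><td>221 Main St</td></tr>"
    "<tr><td>Jane Doe</td><td>15 Oak Ave</td></tr>",
    "<a href='next.html'>Next</a>",
]

record_slot = pick_record_slot(slots)
print(record_slot)
```

Here the second slot is chosen because it holds ten text tokens, against two and one for the others.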
They give two models for the extraction step: a constraint satisfaction
problem (CSP) and a probabilistic model.
They extract data items using the following rule: an attribute value
belongs to a record only if it appears on the detail page corresponding to
that record.
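
As a rough illustration of this rule, the sketch below computes, for each string extracted from the list page, the set of records it could belong to. It assumes that detail pages are available as plain text in record order and that "appears on" means a case-sensitive substring match. The function name `candidate_records` and the sample data are hypothetical.

```python
def candidate_records(extracts, detail_pages):
    """For each extract, list the indices of detail pages that contain it.

    Assumption: membership is a plain substring test. An extract can only
    be assigned to a record whose index appears in its candidate list.
    """
    return [
        [i for i, page in enumerate(detail_pages) if extract in page]
        for extract in extracts
    ]


# Hypothetical extracts from a list page, in document order.
extracts = ["John Smith", "221 Main St", "Jane Doe", "15 Oak Ave", "(555)"]

# Hypothetical detail pages, one per record.
detail_pages = [
    "Name: John Smith Address: 221 Main St Phone: (555) 111",
    "Name: Jane Doe Address: 15 Oak Ave Phone: (555) 222",
]

candidates = candidate_records(extracts, detail_pages)
for extract, records in zip(extracts, candidates):
    print(extract, records)
```

Extracts with a single candidate are unambiguous. An extract such as "(555)", which appears on both detail pages, stays ambiguous, and this is the kind of case the CSP or probabilistic model has to resolve.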
After record segmentation, they use a probabilistic model to predict which
column an attribute should be assigned to.