RoadRunner


• The algorithm works on two input pages. One is used as the wrapper, w, and the other is used as the sample, s. The algorithm tries to generalize the wrapper by matching it with the sample; this is done by parsing the sample using the wrapper, and by trying to solve all mismatches that are encountered during the parsing. The algorithm has exponential time complexity with respect to the input lengths (token number in pages).
click for large size

• Can handle nested structures;

• Union-free regular expressions: It's one of the limitations of their technique. "Union-free" means they don't handle disjunction cases. As an example of a disjunction, an address information in a web page could be in one of two formats, based on whether the address is a US address or not, in which case the wrapper is supposed to contain a disjunction of the RE for US addresses and the RE for non-US addresses.

They mentioned that introducing disjunctions, i.e., union operators would strongly increase the complexity of their wrapper inference process.

Also, it can not be used to describe context-free language, such as S->aSb|c (which can infer (a^n)c(b^n));


• Matching as an AND-OR tree
In order to limit the complexity, they introduced several pruning techniques (which reduced the expressiveness of the formalism and affected the accuracy of the wrapper accordingly), in order to skip subtrees that most likely don't correspond to meaningful solutions.
last update: Mar.30.2005