|Page-level Template Detection via Isotonic Smoothing|
posted by shuri on 2007-05-10 10:00:26
|Cute work on template detection; a short summary follows.|
Previous work was site-based and two-phase. That approach has limitations: pages may not be processed in site order, new sites may be a problem, and processing may be inefficient.
Features they use include placement on the screen, background color, runs of links that are likely to be part of the template, and average sentence size. They then use a classifier to separate the template parts of a page from the content.
- obtain site-specific training data
- learn site-specific templates
- try to learn a global detector for templateness
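To make the feature-plus-classifier idea concrete, here is a minimal sketch in Python. It is not the paper's classifier: it uses only two of the features mentioned above (link density and average sentence size) and a toy nearest-centroid rule. The `Block` type, its `link_words` field, and all function names are hypothetical, and the labelled blocks are assumed to come from the site-specific step.

```python
import re
from dataclasses import dataclass


@dataclass
class Block:
    """A chunk of a page (hypothetical type, for illustration only)."""
    text: str        # visible text of the block
    link_words: int  # how many of its words sit inside links


def features(block):
    """Return (link fraction, average sentence size in words)."""
    words = block.text.split()
    sentences = [s for s in re.split(r"[.!?]+", block.text) if s.strip()]
    link_fraction = block.link_words / len(words) if words else 0.0
    avg_sentence_len = len(words) / len(sentences) if sentences else 0.0
    return (link_fraction, avg_sentence_len)


def centroid(vectors):
    """Component-wise mean of a non-empty list of feature vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))


def train(labelled):
    """labelled: list of (Block, is_template) pairs, e.g. produced per site.

    Returns (template_centroid, content_centroid).
    """
    tmpl = [features(b) for b, y in labelled if y]
    cont = [features(b) for b, y in labelled if not y]
    return centroid(tmpl), centroid(cont)


def is_template(block, model):
    """Assign the block to whichever class centroid is closer."""
    tmpl_c, cont_c = model
    f = features(block)

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return dist(f, tmpl_c) < dist(f, cont_c)


training = [
    (Block("Home About Contact Login", 4), True),
    (Block("Privacy Terms Sitemap", 3), True),
    (Block("The paper proposes a page-level detector. "
           "It avoids processing pages in site order.", 0), False),
    (Block("Results show that duplicate detection improves once "
           "templates are removed from pages.", 1), False),
]
model = train(training)
nav = Block("News Sports Weather", 3)
body = Block("We evaluate the classifier on pages from many "
             "different sites and report accuracy.", 0)
print(is_template(nav, model), is_template(body, model))
```

The features are not rescaled here, so average sentence size dominates the distance; a real detector would normalize features and use more of them.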
In the results they show that shingling after template detection works better than shingling without template detection.
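A quick way to see why template removal helps shingling: two unrelated pages from the same site share every shingle that falls inside the template, which inflates their similarity. The sketch below assumes word-level shingles of size 3 and Jaccard similarity, and the page text is made up for illustration; the paper's exact setup may differ.

```python
def shingles(text, k=3):
    """Set of all k-word windows (shingles) in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Made-up template shared by both pages, followed by unrelated content.
TEMPLATE = "home about contact login copyright example site"
content_a = "the cat sat on the mat today"
content_b = "stock prices fell sharply in early trading"

page_a = TEMPLATE + " " + content_a
page_b = TEMPLATE + " " + content_b

# Similarity when the template is left in the page.
with_template = jaccard(shingles(page_a), shingles(page_b))
# Similarity when only the content survives template detection.
without_template = jaccard(shingles(content_a), shingles(content_b))

print(with_template, without_template)
```

With the template left in, the two pages look partly alike even though their content has nothing in common; with it stripped, the similarity drops to zero.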