U-REST: An Unsupervised Record Extraction SysTem
We demonstrate a system that extracts record sets from record-list web pages with no direct human supervision. Our system, U-REST, reframes unsupervised record extraction as a two-phase machine learning problem: a clustering phase, in which structurally similar regions are discovered, and a record cluster detection phase, in which the discovered groupings of regions are ranked by their likelihood of being records. This framework simplifies the record extraction task and allows the algorithms and the underlying features to be analyzed independently. In our work, we survey a large set of features under this simplified framework. We conclude with a preliminary comparison of U-REST against similar systems and show improvements in extraction accuracy.
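
To make the two-phase framing concrete, the following is a minimal illustrative sketch and not U-REST's actual algorithm or feature set. It assumes each page region is given as a hypothetical (tag path, text) pair, uses exact tag-path equality as a stand-in for structural clustering, and uses a hand-written score as a stand-in for the learned record cluster detector. All function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical region representation: (tag_path, text).
Region = Tuple[str, str]


def cluster_regions(regions: List[Region]) -> Dict[str, List[Region]]:
    """Phase 1 (stand-in): group regions that share an identical tag path."""
    clusters: Dict[str, List[Region]] = defaultdict(list)
    for tag_path, text in regions:
        clusters[tag_path].append((tag_path, text))
    return dict(clusters)


def record_score(cluster: List[Region]) -> float:
    """Phase 2 (stand-in): score a cluster by how record-like it looks.

    Assumption: clusters with many regions that mostly carry text are
    more likely to be records. A real system would learn this from features.
    """
    non_empty = sum(1 for _, text in cluster if text.strip())
    return non_empty * (non_empty / len(cluster))


def rank_clusters(regions: List[Region]) -> List[Tuple[str, List[Region]]]:
    """Run both phases and return clusters ordered from most to least record-like."""
    clusters = cluster_regions(regions)
    return sorted(clusters.items(), key=lambda kv: record_score(kv[1]), reverse=True)
```

Because the two phases only communicate through the cluster dictionary, either stand-in can be swapped for a different clustering method or scoring model without touching the other, which mirrors the independent analysis that the framework is meant to allow.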