This repository will contain all of the datasets used to obtain the results presented in the paper. The datasets will be made available for public use after the completion of the review procedure.
We analyze two corpora of web pages annotated with multiple web genres:
the 20-genres corpus and SPIRIT.
We consider the following types of features:
We exploit the following hierarchy construction methods:
We used the CLUS tool, a decision tree and rule induction system that implements the predictive clustering framework. This framework unifies unsupervised clustering and predictive modeling and extends naturally to more complex prediction settings such as multi-task learning and multi-label classification. While most decision tree learners induce classification or regression trees, CLUS generalizes this approach by learning trees that are interpreted as cluster hierarchies. Such trees are called predictive clustering trees (PCTs). Depending on the learning task at hand, different criteria are optimized while creating the clusters, and different heuristics are suitable for doing so. CLUS has been successfully applied to many different tasks, including multi-task learning (multi-target classification and regression), structured output learning, multi-label classification, hierarchical classification, and time series prediction. Besides these supervised learning tasks, PCTs are also applicable to semi-supervised learning, subgroup discovery, and clustering.
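To illustrate how a predictive clustering tree can treat a split as a clustering step, the following minimal Python sketch scores candidate thresholds on one numeric feature by how much they reduce the summed variance of a multi-label target. This is an illustration only, not CLUS code: the variance-reduction heuristic is assumed here as one common choice, and the function names `multi_target_variance` and `best_threshold` are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def multi_target_variance(rows: Sequence[Sequence[float]]) -> float:
    """Sum of the per-target (population) variances over a set of target vectors."""
    n = len(rows)
    if n == 0:
        return 0.0
    total = 0.0
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        mean = sum(column) / n
        total += sum((v - mean) ** 2 for v in column) / n
    return total


def best_threshold(
    x: Sequence[float], y: Sequence[Sequence[float]]
) -> Tuple[Optional[float], float]:
    """Return (threshold, variance_reduction) for the best test `x <= threshold`.

    Assumption: the split heuristic is the size-weighted reduction of the
    summed target variance. Returns (None, 0.0) if no split improves on the parent.
    """
    n = len(x)
    parent = multi_target_variance(y)
    best_t: Optional[float] = None
    best_gain = 0.0
    values = sorted(set(x))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(values, values[1:]):
        t = (low + high) / 2
        left = [y[i] for i in range(n) if x[i] <= t]
        right = [y[i] for i in range(n) if x[i] > t]
        gain = (
            parent
            - (len(left) / n) * multi_target_variance(left)
            - (len(right) / n) * multi_target_variance(right)
        )
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain


if __name__ == "__main__":
    # Toy data: one feature, two binary genre labels per page.
    feature = [1.0, 2.0, 3.0, 4.0]
    labels = [[1, 0], [1, 0], [0, 1], [0, 1]]
    print(best_threshold(feature, labels))
```

Because the score is computed over the whole label vector, a single tree can predict all genres of a page at once, which is the sense in which each leaf acts as a cluster with a prototype.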
CLUS is co-developed by the Declarative Languages and Artificial Intelligence group of the Katholieke Universiteit Leuven, Belgium, and the Department of Knowledge Technologies of the Jožef Stefan Institute, Ljubljana, Slovenia. It is written in Java and is open-source software licensed under the GPL.