Data repository

Datasets

This repository will contain all of the datasets used to obtain the results presented in the paper. The datasets will be made available for public use after the completion of the review procedure.

Corpora

We analyze two corpora of web pages annotated with multiple web genres: the 20-genres corpus and SPIRIT.

Feature sets

We consider the following types of features:

- Linguistic and presentational features for web genres
  - Surface features (bag-of-words and tf-idf representations)
  - Structural features
  - Presentation features
  - Context features

- Paragraph vector features
  - Paragraph vectors on raw pages
  - Paragraph vectors on clean pages (HTML tags removed)
  - Paragraph vectors using the GloVe pre-trained model on raw pages
  - Paragraph vectors using the GloVe pre-trained model on clean pages (HTML tags removed)

- Character n-grams
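
To make the character n-gram representation concrete, here is a minimal sketch in Python. It is an illustration only, not the extraction code used for the paper: the function names, the default n = 3, and the relative-frequency weighting are all assumptions.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return a Counter of the overlapping character n-grams in `text`."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Slide a window of width n over the string; texts shorter than n yield nothing.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_vector(text, vocabulary, n=3):
    """Map `text` to relative n-gram frequencies over a fixed vocabulary.

    `vocabulary` is a hypothetical, pre-selected list of n-grams; how it is
    chosen (for example, the most frequent n-grams in a corpus) is left open.
    """
    counts = char_ngrams(text, n)
    total = sum(counts.values())
    return [counts[g] / total if total else 0.0 for g in vocabulary]
```

Each page then becomes a fixed-length numeric vector, one component per n-gram in the vocabulary, which can be used alongside the other feature sets.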

Web genre hierarchies

We exploit the following hierarchy construction methods:

- Data-driven hierarchy construction
  - Balanced k-means clustering (BkM)
  - Predictive clustering trees (PCTs)
  - Clustering with complete linkage (CL)
  - Clustering with single linkage (SL)

- Random hierarchy construction

- Expert/manual hierarchy construction
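
The difference between the single-linkage and complete-linkage variants can be sketched as follows. This is a toy illustration under stated assumptions, not the procedure used for the paper: each genre is assumed to be described by a numeric vector, Euclidean distance is assumed, and the genre names and values in `toy_vectors` are invented.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def leaves(tree):
    """Collect the genre names at the leaves of a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def agglomerate(vectors, linkage=max):
    """Build a binary hierarchy over the keys of `vectors`.

    linkage=max gives complete linkage, linkage=min gives single linkage.
    The result is a nested tuple whose leaves are genre names.
    """
    if not vectors:
        raise ValueError("need at least one genre")
    clusters = list(vectors)  # every genre starts as its own cluster

    def distance(c1, c2):
        # Cluster distance is the max (or min) over all cross-cluster pairs.
        return linkage(
            euclidean(vectors[a], vectors[b])
            for a in leaves(c1)
            for b in leaves(c2)
        )

    while len(clusters) > 1:
        # Merge the two closest clusters into a new internal node.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: distance(*pair))
        clusters = [c for c in clusters if c != c1 and c != c2]
        clusters.append((c1, c2))
    return clusters[0]


# Invented one-dimensional genre descriptions, for illustration only.
toy_vectors = {"blog": (0.0,), "diary": (2.0,), "shop": (5.0,), "faq": (9.0,)}
single_tree = agglomerate(toy_vectors, linkage=min)
complete_tree = agglomerate(toy_vectors, linkage=max)
```

On this toy input, single linkage chains the genres one at a time into a deep tree, while complete linkage pairs them into two balanced groups. The shape of the resulting hierarchy therefore depends on the linkage choice, not only on the data.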

Methods

We used CLUS, a decision tree and rule induction system that implements the predictive clustering framework. This framework unifies unsupervised clustering and predictive modeling, and it extends naturally to more complex prediction settings such as multi-task learning and multi-label classification. While most decision tree learners induce classification or regression trees, CLUS generalizes this approach by learning trees that are interpreted as cluster hierarchies. Such trees are called predictive clustering trees (PCTs). Depending on the learning task, different criteria are optimized while the clusters are created, and different heuristics are suitable for doing so. CLUS has been successfully applied to many tasks, including multi-task learning (multi-target classification and regression), structured output learning, multi-label classification, hierarchical classification, and time series prediction. Besides these supervised learning tasks, PCTs are also applicable to semi-supervised learning, subgroup discovery, and clustering.
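
As a rough illustration of how a tree learner can treat a split as a clustering step, the sketch below scores a candidate binary split by how much it reduces the summed variance of several target columns at once. It is written in Python for brevity and is not CLUS code: the variance-based criterion, the function names, and the toy data are assumptions, since the text above does not say which heuristic was used.

```python
def variance(rows):
    """Sum of per-target variances over `rows` (equal-length numeric tuples)."""
    if not rows:
        return 0.0
    n = len(rows)
    total = 0.0
    for column in zip(*rows):  # iterate over target dimensions
        mean = sum(column) / n
        total += sum((v - mean) ** 2 for v in column) / n
    return total


def variance_reduction(targets, mask):
    """Score a binary split: parent variance minus size-weighted child variance.

    `mask[i]` is True when example i goes to the left branch.
    """
    left = [t for t, m in zip(targets, mask) if m]
    right = [t for t, m in zip(targets, mask) if not m]
    n = len(targets)
    weighted = (len(left) / n) * variance(left) + (len(right) / n) * variance(right)
    return variance(targets) - weighted


# Toy multi-label targets: each tuple marks membership in two hypothetical genres.
toy_targets = [(1, 0), (1, 0), (0, 1), (0, 1)]
informative = variance_reduction(toy_targets, [True, True, False, False])
uninformative = variance_reduction(toy_targets, [True, False, True, False])
```

A split that separates examples with similar label vectors scores higher than one that mixes them. This is the sense in which each internal node of such a tree defines a cluster of examples.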

CLUS is co-developed by the Declarative Languages and Artificial Intelligence group of the Katholieke Universiteit Leuven, Belgium, and the Department of Knowledge Technologies of the Jožef Stefan Institute, Ljubljana, Slovenia. It is written in Java and is open-source software licensed under the GPL.