Master's Degree Program in
Computer Engineering
Data Mining (6 CFU - 48 hours)
Academic Year 2016/2017



The dataset is a collection of English Wikipedia pages. Each page is characterized by a unique ID, a title, a text (a sequence of words separated by spaces), and a list of at most 279 categories drawn from a set of 1102644 categories. We will make two datasets available (both links work only from within the department's network):

We expect projects to use the medium-size dataset (or even a smaller sample of it), but the large-size dataset is available in case a group wants to explore a more challenging computational scenario. (Note: there is no need to decompress the datasets, since Spark is able to decompress bzip2 files on the fly.)


The project aims at implementing and testing clustering strategies on the dataset. The following two alternative goals can be pursued (however, a group may decide to explore other avenues).

Goal A. Choose a specific type of clustering (e.g., k-center) and explore the quality-performance tradeoffs that can be achieved, where quality is measured by the objective function of the chosen type of clustering, while performance is measured by running time, scalability, and maximum data size it can handle.
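As an illustration of what "quality measured by the objective function" means for k-center, here is a minimal sketch in plain Python (no Spark). It uses the greedy farthest-first traversal, which is only one possible k-center strategy, and Euclidean distance, which is an assumption made for the example. The names farthest_first and kcenter_objective are hypothetical and do not come from the project stub.

```python
import math
import random


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def farthest_first(points, k, dist=euclidean, seed=0):
    """Greedy k-center: each new center is the point farthest from the centers chosen so far."""
    rng = random.Random(seed)
    # The first center is an arbitrary point (here: chosen at random).
    centers = [points[rng.randrange(len(points))]]
    # closest[i] is the distance from points[i] to its nearest chosen center.
    closest = [dist(p, centers[0]) for p in points]
    while len(centers) < k:
        i = max(range(len(points)), key=closest.__getitem__)
        centers.append(points[i])
        closest = [min(c, dist(p, points[i])) for c, p in zip(closest, points)]
    return centers


def kcenter_objective(points, centers, dist=euclidean):
    """k-center objective: the largest distance from any point to its nearest center."""
    return max(min(dist(p, c) for c in centers) for p in points)


# Toy data (assumed): two well-separated groups of two points each.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
centers = farthest_first(points, k=2)
radius = kcenter_objective(points, centers)
```

In a quality-performance study, the value of the objective (radius above) could be recorded together with the running time for different values of k, input sizes, and numbers of workers.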

Goal B. Try to assess to what extent a clustering is consistent with the categories attached to the Wikipedia pages. To this end, you will have to decide how to measure the consistency of a clustering with the categories, and find out which combination of preprocessing, document representation, and clustering type yields the best results.
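One possible way to quantify consistency with the categories is a purity-style score, sketched below in plain Python. This is only an example of such a measure, not the one the project prescribes: since a page may carry several categories, the sketch counts a page as "matched" if it carries the most frequent category of its cluster. The function name and the dictionary-based input format are assumptions made for illustration.

```python
from collections import Counter


def purity(cluster_of, categories_of):
    """Fraction of pages that carry the most frequent category of their cluster.

    cluster_of:    dict mapping page ID -> cluster label
    categories_of: dict mapping page ID -> iterable of category names (may be empty)
    """
    if not cluster_of:
        return 0.0
    # Group page IDs by cluster label.
    members = {}
    for page, cluster in cluster_of.items():
        members.setdefault(cluster, []).append(page)

    matched = 0
    for pages in members.values():
        # Each page contributes at most once to each of its categories.
        counts = Counter(cat for p in pages for cat in set(categories_of.get(p, ())))
        if counts:
            matched += counts.most_common(1)[0][1]
    return matched / len(cluster_of)


# Toy example (assumed data): cluster 0 is dominated by "Math", cluster 1 is mixed.
cluster_of = {1: 0, 2: 0, 3: 1, 4: 1}
categories_of = {1: ["Math"], 2: ["Math", "Physics"], 3: ["Art"], 4: ["Music"]}
score = purity(cluster_of, categories_of)
```

A score of this kind tends to grow with the number of clusters (it reaches 1 when every page with at least one category is its own cluster), so it is best compared across clusterings with the same number of clusters.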


The following sequence of steps will need to be executed.

  1. Choose a suitable representation for the Wikipedia pages, for example bag-of-words/tf-idf (Spark implementation here) or word2vec (Spark implementation here).
  2. Write code to transform the input file into the chosen representation, possibly using functions in the Spark library. Code examples for this task will be provided.
  3. Choose and implement a suitable distance function between Wikipedia pages.
  4. Choose and implement the clustering algorithm(s) you want to test.
  5. Prepare a plan of experiments.
  6. Run the experiments and collect the results.
  7. Summarize the work done in the report.
When developing the code, we encourage you to use the functions (e.g., transformations, distance functions, clustering algorithms) available in the Spark library.
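To make steps 1 to 3 concrete, the following minimal sketch computes a tf-idf representation and a cosine distance in plain Python, so that it can be tried on a handful of pages without a cluster. It is not the Spark implementation and not the code from the project stub: the function names are hypothetical, the smoothed weight log((n + 1) / (df + 1)) is one common tf-idf variant chosen for the example, and lowercasing the text is an assumed preprocessing step.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Map each document (a string of space-separated words) to a sparse {word: weight} dict."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: number of documents in which each word occurs.
    df = Counter(word for words in tokenized for word in set(words))
    vectors = []
    for words in tokenized:
        tf = Counter(words)  # raw term counts within this document
        vectors.append({w: c * math.log((n + 1) / (df[w] + 1)) for w, c in tf.items()})
    return vectors


def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors; 1.0 if either vector is all zeros."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


# Toy documents (assumed): the first two share vocabulary, the third does not.
docs = ["spark cluster data", "spark cluster mining", "football match goal"]
vectors = tf_idf(docs)
d_similar = cosine_distance(vectors[0], vectors[1])
d_unrelated = cosine_distance(vectors[0], vectors[2])
```

Storing each page as a dictionary of non-zero weights mirrors the idea of a sparse vector: most words of the vocabulary do not occur in any given page, so only the words that do occur are kept.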

Remark: Some groups (e.g., those pursuing Goal A) may decide to devote more effort to implementing and optimizing a type of clustering from scratch, while other groups (e.g., those pursuing Goal B) may decide to aim at a deeper data analysis, relying mostly on code (including clustering) already available in the Spark library.


A project stub, already configured with Spark, is available here. It includes some utilities for input and output, as well as the implementation of some algorithms and some usage examples. You can use it as a starting point for developing your own code.

Last updated: 4 May 2017