Master's Degree Programme in
Computer Engineering
Data Mining (6 CFU - 48 hours)
A.Y. 2016/2017

SUGGESTED PROJECT

DATASET

The dataset consists of a set of Wikipedia pages (in English), where each page is characterized by a unique ID, a title, a text (a sequence of words separated by spaces), and a list of at most 279 categories out of a set of 1,102,644 possible categories. We make available two datasets, a medium-size one and a large-size one (the download links work only from within the department's network).

We expect projects to use the medium-size dataset (or even a smaller sample of it), but we also make the large-size dataset available in case a group wants to explore a more challenging computational scenario. (Note: there is no need to uncompress the datasets: Spark is able to uncompress bzip2 files on the fly.)
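
For instance, here is a minimal Scala sketch of how the compressed dump can be read directly (the path is a placeholder, not the real location; since bzip2 is a splittable format, Spark will also partition the file across workers):

    import org.apache.spark.{SparkConf, SparkContext}

    object ReadDataset {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("WikiClustering"))
        // Placeholder path: Spark uncompresses the .bz2 file on the fly,
        // and, since bzip2 is splittable, it also partitions the input.
        val pages = sc.textFile("hdfs:///data/wiki-medium.txt.bz2")
        println(s"Input lines: ${pages.count()}")
        sc.stop()
      }
    }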

OBJECTIVES

The project aims at implementing and testing clustering strategies on the dataset. The following two alternative goals can be pursued (however, a group may decide to explore other avenues).

Goal A. Choose a specific type of clustering (e.g., k-center) and explore the quality-performance tradeoffs that can be achieved, where quality is measured by the objective function of the chosen type of clustering, while performance is measured by running time, scalability, and the maximum data size that can be handled.
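
For reference, the k-center objective is the largest distance of any point from its closest center (lower is better). A minimal Scala sketch of how it could be evaluated, assuming Euclidean distance and a set of centers small enough to ship to every task:

    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.rdd.RDD

    // k-center objective: the maximum, over all points, of the distance
    // to the nearest center. Assumes Euclidean distance and a small set
    // of centers (it is captured in the closure of the map task).
    def kCenterObjective(points: RDD[Vector], centers: Seq[Vector]): Double = {
      val maxSqDist = points
        .map(p => centers.map(c => Vectors.sqdist(p, c)).min)  // nearest center
        .max()                                                 // farthest point
      math.sqrt(maxSqDist)
    }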

Goal B. Try to assess to what extent a clustering is consistent with the categories attached to the Wikipedia pages. For this purpose, you will have to decide how to measure the consistency of a clustering with the categories, and to find out which combination of preprocessing, document representation, and clustering type yields the best results.
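
Purely as an illustration (choosing the measure is part of the project), one simple option is a multi-label variant of cluster purity: assign each cluster the category shared by the largest number of its pages, and report the overall fraction of pages covered. A Scala sketch, where the input format and function name are hypothetical:

    import org.apache.spark.rdd.RDD

    // Purity-like score in [0,1]: for each cluster, take the category
    // that occurs in the largest number of its pages, count those pages
    // as "consistent", and divide by the total number of pages.
    // Input: one (clusterId, categoriesOfPage) pair per page.
    def purity(assignments: RDD[(Int, Seq[String])]): Double = {
      val total = assignments.count()
      val consistent = assignments
        .flatMap { case (cluster, cats) => cats.map(c => ((cluster, c), 1L)) }
        .reduceByKey(_ + _)                    // pages per (cluster, category)
        .map { case ((cluster, _), n) => (cluster, n) }
        .reduceByKey(math.max)                 // best category per cluster
        .values
        .sum()
      consistent / total
    }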

STEPS

The following sequence of steps will need to be executed.

  1. Choose a suitable representation for the Wikipedia pages: for example, bag-of-words/tf-idf or word2vec (the Spark library provides implementations of both).
  2. Write code to transform the input file into the chosen representation, possibly using functions from the Spark library. Code examples for this task will be provided.
  3. Choose and implement a suitable distance function between Wikipedia pages.
  4. Choose and implement the clustering algorithm(s) you want to test (a sketch covering steps 1, 3, and 4 is given below).
  5. Prepare a plan of experiments.
  6. Run the experiments and collect the results.
  7. Summarize the work done in the report.
For developing the code, we encourage you to exploit functions (e.g., transformations, distance functions, clustering algorithms) available in the Spark library.
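
As a concrete (non-prescriptive) starting point, the following Scala sketch shows how steps 1, 3, and 4 could be carried out with MLlib primitives; the feature dimension, the number of clusters, and the number of iterations are placeholder values to be tuned, and the tokenization relies on the fact that words in the dataset are separated by spaces:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.mllib.linalg.{SparseVector, Vector, Vectors}
    import org.apache.spark.rdd.RDD

    // Step 1: tf-idf representation (one Seq[String] of words per page).
    // 2^18 hashed features is a placeholder value.
    def tfIdf(texts: RDD[Seq[String]]): RDD[Vector] = {
      val tf = new HashingTF(1 << 18).transform(texts)
      tf.cache()  // IDF.fit and IDF.transform both traverse tf
      new IDF().fit(tf).transform(tf)
    }

    // Step 3: cosine distance between two sparse tf-idf vectors
    // (merge-join on the sorted index arrays of the two vectors).
    def dot(a: SparseVector, b: SparseVector): Double = {
      var i = 0; var j = 0; var s = 0.0
      while (i < a.indices.length && j < b.indices.length) {
        if (a.indices(i) == b.indices(j)) { s += a.values(i) * b.values(j); i += 1; j += 1 }
        else if (a.indices(i) < b.indices(j)) i += 1
        else j += 1
      }
      s
    }

    def cosineDistance(a: SparseVector, b: SparseVector): Double =
      1.0 - dot(a, b) / (Vectors.norm(a, 2) * Vectors.norm(b, 2))

    // Step 4: k-means from MLlib (k = 20 and 20 iterations are
    // placeholders). Note that MLlib's k-means optimizes the
    // squared-Euclidean objective, not cosine distance.
    def cluster(vectors: RDD[Vector]): Unit = {
      vectors.cache()  // k-means makes several passes over the data
      val model = KMeans.train(vectors, 20, 20)
      println(s"Within-set sum of squared distances: ${model.computeCost(vectors)}")
    }

For the word2vec representation, MLlib's Word2Vec yields one vector per word; a page-level vector can then be derived, for example, by averaging the vectors of the page's words.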

Remark: Some groups (e.g., those pursuing Goal A) may decide to devote more effort to the from-scratch implementation and optimization of a type of clustering, while other groups (e.g., those pursuing Goal B) may decide to aim at a deeper data analysis, relying mostly on code (including clustering) already available in the Spark library.

PROJECT STUB

A project stub, already configured with Spark, is made available. It includes some utilities for input and output, as well as implementations of some algorithms and some usage examples. You can use it as a starting point for developing your own code.


Last updated: 4 May 2017