Master's Degree Program in
Data Mining (6 CFU, 48 hours)
SUGGESTED PROJECT
The dataset consists of a set of English-language Wikipedia pages. Each page is characterized by a unique ID, a title, a text (a sequence of words separated by spaces), and a list of at most 279 categories drawn from a set of 1,102,644 categories. We will make available two datasets (both links work only from within the department's network):
The project aims at implementing and testing clustering strategies on the dataset. The following two alternative goals can be pursued (however, a group may decide to explore other avenues).
Goal A. Choose a specific type of clustering (e.g., k-center) and explore the quality-performance tradeoffs that can be achieved. Quality is measured by the objective function of the chosen type of clustering, while performance is measured by running time, scalability, and the maximum data size the implementation can handle.
Goal B. Assess to what extent a clustering is consistent with the categories attached to the Wikipedia pages. To this end, you will have to decide how to measure the consistency of a clustering with the categories, and find out which combination of preprocessing, document representation, and clustering type yields the best results.
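To make the two goals more concrete, here is a minimal single-machine sketch in plain Python. It is only an illustration, not the project stub's code and not a prescribed method. For Goal A it uses farthest-first traversal, a classical greedy 2-approximation for k-center, and reports the k-center objective (the largest distance from a point to its nearest center). For Goal B it uses purity as one possible consistency measure. The toy pages, the Jaccard distance on word sets, the single category per page, and all function names are assumptions made for the example; real pages carry up to 279 categories, so a group would still have to decide how to compare clusters with category lists.

```python
from collections import Counter

def jaccard_distance(a, b):
    """Jaccard distance between two sets of words (0 = identical, 1 = disjoint)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def farthest_first(points, k, dist):
    """Greedy farthest-first traversal for k-center.

    Returns the indices of the chosen centers; the first center is points[0].
    """
    centers = [0]
    # closest[i] = distance from point i to its nearest center chosen so far
    closest = [dist(points[0], p) for p in points]
    while len(centers) < min(k, len(points)):
        # The next center is the point farthest from all current centers.
        nxt = max(range(len(points)), key=closest.__getitem__)
        centers.append(nxt)
        closest = [min(c, dist(points[nxt], p)) for c, p in zip(closest, points)]
    return centers

def assign(points, centers, dist):
    """Map each point to the position (within `centers`) of its nearest center."""
    return [
        min(range(len(centers)), key=lambda j: dist(p, points[centers[j]]))
        for p in points
    ]

def kcenter_objective(points, centers, dist):
    """k-center objective: largest distance from a point to its nearest center."""
    return max(min(dist(p, points[c]) for c in centers) for p in points)

def purity(assignment, labels):
    """Fraction of points whose label is the majority label of their cluster."""
    by_cluster = {}
    for cluster, label in zip(assignment, labels):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    return sum(max(counts.values()) for counts in by_cluster.values()) / len(labels)

# Toy pages invented for illustration: each text becomes a set of words,
# and each page gets exactly one (hypothetical) category.
docs = [
    "spark cluster computing engine",
    "spark cluster computing framework",
    "roman empire ancient history",
    "roman empire ancient rome",
]
labels = ["computing", "computing", "history", "history"]
points = [set(d.split()) for d in docs]

centers = farthest_first(points, 2, jaccard_distance)
assignment = assign(points, centers, jaccard_distance)
print("centers:", centers)
print("objective:", kcenter_objective(points, centers, jaccard_distance))
print("purity:", purity(assignment, labels))
```

The sketch recomputes distances over the whole input for every new center, so it takes time proportional to k times the number of pages and keeps everything in memory. Measuring how far such an approach scales, and what has to change to run it on Spark, is exactly the kind of tradeoff Goal A asks about.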
The following sequence of steps will need to be executed.
Remark: Some groups (e.g., those pursuing Goal A) may decide to devote more effort to implementing and optimizing a type of clustering from scratch, while other groups (e.g., those pursuing Goal B) may decide to aim at a deeper data analysis, relying mostly on code (including clustering) already available in the Spark library.
A project stub, already configured with Spark, is available here. It includes some utilities for input and output, as well as the implementation of some algorithms and some usage examples. You can use it as a starting point for developing your own code.
Last updated: 4 May 2017