Corso di Data Mining

Corso di Laurea Magistrale in
Ingegneria Informatica

Data Mining (6 CFU - 48 ore)
A.A. 2016/2017

LEZIONI GIORNALIERE

Introduction to the course

28 febbraio: Introduction to data mining and organization of the course.

Material:

Slides (draft): Introduction

Association Analysis

28 febbraio: Market basket analysis: frequent itemsets and association rules.

1 marzo: Mining of frequent itemsets and (interesting association rules): definition of the computational problem; max number of possible patterns; anti-monotonicity of support; lattice of frequent itemsets; algorithm A-Priori (pseudocode and example).

7 marzo: Analysis of A-Priori: correctness, polynomiality with respect to input and output sizes. Association rules: anti-monotonicity of confidence; algorithm for generating all association rules with respect to give support and confidence thresholds, and its analysis (correctness, polynomiality with respect to input and output sizes).

8 marzo: Exercises from the slides (Part 1). Notions of closed, frequent closed, and maximal itemset. Representation of frequent itemsets and their supports, through maximal itemsets and their supports (lossy) and frequent closed itemsets and their supports. Closure of an itemset and its properties.

14 marzo: Closed/Maximal itemsets: derivation of the frequent itemsets and their supports from the frequent closed itemsets and their support; example where maximal/closed frequent itemsets are exponential in the input size. Top-K frequent (closed) itemsets: definition; control of K on the output size. Significance of itemsets/rules.

15 marzo: Significance of itemsets/rules: lift, conviction and their comparison with confidence. Mining of frequent patterns and rules from non-transactional datasets. Examples of theory questions. Exercise from the slides (Part 2).

21 marzo: Exercises from the slides (Part 2). Case study: Survey on 2000 US Elections.

Material:

Slides (final): Association Analysis (Part 1)
Slides (final): Association Analysis (Part 2)
Exercises on Association Analysis
J. Leskovec, A. Rajaraman and J. Ullman. Mining Massive Datasets. Cambridge University Press, 2014. Chapter 6.
P.N.Tan, M.Steinbach, V.Kumar. Introduction to Data Mining. Addison Wesley, 2006. Chapter 6

Computational Frameworks: MapReduce

21 marzo: Motivation for the MapReduce framework

22 marzo: Specification of MapReduce algorithms and their execution on a distributed system with a fault-tolerant distributed file system. Key performance indicators and complexity measure for MapReduce algorithms. Examples: word count; matrix-vector multiplication; matrix-matrix multiplication. Observations on tradeoffs between number of rounds and space requirements.

28 marzo: Exercises: detailed specification of matrix-vector multiplication algorithm; computation of an inverted index. MapReduce Sample Sort: algorithm and example. Chernoff bound.

29 marzo (Pietracaprina): Analysis of MapReduce Sample Sort.

29 marzo (Ceccarello): Introduction to functional programming in Java. Lambda functions and collections transformations.

4 aprile (Ceccarello): Introduction to Spark. Examples: computing primes smaller than a given integer, two different versions of word-count.

5 aprile (Ceccarello): Optimization of Spark Programs: local aggregation and broadcast variables.

11 aprile: Exercise 2.3.1 (points (b) and (d)) from the book by Leskovec, Rajaraman and Ullman. MapReduce implementation of SON algorithm for frequent itemsets: algorithm and analysis.

12 aprile: Sampling approach by Riondato and Upfal for mining frequent itemsets. Presentation of project.

18 aprile: Exercise: computation a sample of size K out of a set of N transactions. Examples of theory questions.

Material:

Slides (final): MapReduce
Dr. Ceccarello: Laboratory 1
Dr. Ceccarello: Laboratory 2 code and slides
Dr. Ceccarello: Laboratory 3 code
Exercises on MapReduce
Article: M. Riondato, E. Upfal: Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees. ACM Trans. on Knowledge Discovery from Data, 2014.

Clustering

18 aprile: Introduction to clustering: definition and some applications. Metric space: definition; relevant distances (L_r-norm, Jaccard, Cosine, Edit, Hamming)

19 aprile: Exercise on L_r, Cosine and Edit distances. Curse of dimensionality. Types of clustering. Center-based k-clustering: k-center, k-means, k-median. k-center clustering: farthest-first traversal algorithm (complexity).

26 aprile: k-center clustering: farthest-first traversal algorithm (approximation factor); MapReduce algorithm (complexity and approximation factor). k-means clustering: notion of centroid and its properties. Discussion on projects.

3 maggio: k-means clustering: Lloyd's algorithm; convergence; discussion on worst-case performance; k-means++; exercise on repeated application of k-means++.

9 maggio: k-median clustering: PAM algorithm (pseudocode, example approximation factor). Choice of the number of clusters. Exercises on the properties of k-center clustering and on the analysis of the MR-Farthest-First Traversal. Cluster evaluation: tendency (Hopkins statistic); unsupervised (cohesion, separation, silhouette coefficien); supervised (entropy of a cluster, entropy of a class, rand statistic, Jaccard coefficient).

10 maggio: Case study: clustering of deputies based on their key votes. Hierarchical clustering: introduction; general agglomerative strategy; merging critera (single linkage, complete linkage, average linkage, Ward's method); stopping conditions; time complexity; accuracy. Case study: hierarchical clustering of amendments to draft laws. Examples of theory questions.

Material:

Slides (final): Clustering (Part 1)
Slides (final): Clustering (Part 2)
Exercises on Clustering
P.N.Tan, M.Steinbach, V.Kumar. Introduction to Data Mining. Addison Wesley, 2006. Chapter 8
Article: D. Arthur, S. Vassilvitskii. k-means++: the advantages of careful seeding. Proc. of ACM-SIAM SODA 2007: 1027-1035.
Article: B. Bahmani et al. Scalable K-Means++. PVLDB 5(7):622-633, 2012
Article: F. Murtagh, P. Contreras. Algorithms for hierarchical clustering: an overview. Wiley Interdisc. Rew.: Data Mining and Knowledge Discovery 2(1):86-97, 2012
Article: C. Jin et al. A Scalable Hierarchical Clustering Algorithm Using Spark. BigDataService 2015:418-426, 2015.

Graph Analytics

16 maggio: Data network and graphs. Objectives of graph analytics, examples and applications. Review of basic graph primitives (BFS, DFS, SSSP, MST, Connected Components) and their sequential complexities. MapReduce algorithm for MST and Connected Components.

17 maggio: Average degrees of separation and diameter: trivial solutions; inapplicability to very large graphs (e.g., Facebook); main ideas of HyperANF.

23 maggio: Diameter: exact computation; approximation algorithms (1 BFS, 4-SWEEP). Centralities: notion of node centrality; types of centrality measures; closeness centrality (for connected and disconnected graphs); harmonic centrality; betwenness centrality; page rank. Computation of closeness centrality: exact; Eppstein-Wang's unbiased estimation. Exercises: diameter approximation with several BFSes; proof that Eppstein-Wang's estimation of centralities is unbiased.

30 maggio: Ripasso: esercizio sul calcolo di una minimum spanning forest in MapReduce; analisi dell'algoritmo Sample Sort per MapReduce; domande teoriche varie.

Material:

Projects

6 giugno (mattina): Presentations of early groups

6 giugno (pomeriggio): Presentations of early groups

7 giugno (pomeriggio): Presentations of early groups

Ultimo aggiornamento: 19 luglio 2017

Vai alla pagina iniziale