Introduction to the course
 28 February: Introduction to data mining and
organization of the course.
Material:
 Slides (draft):
Introduction
Association Analysis
 28 February: Market basket analysis:
frequent itemsets and association rules.
 1 March:
Mining of frequent itemsets and interesting association rules:
definition of the computational problem;
max number of possible patterns; antimonotonicity of support;
lattice of frequent itemsets; algorithm APriori (pseudocode and
example).
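The lecture's A-Priori pseudocode is in the slides; the following is only a minimal in-memory Python sketch of the level-wise scheme (names and structure are illustrative, not the course's reference code):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return all frequent itemsets (frozensets) with relative support >= minsup.

    Level-wise scheme: candidates of size k are built from frequent
    (k-1)-itemsets and pruned via the antimonotonicity of support.
    """
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]
    # F1: frequent single items
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c / n for s, c in counts.items() if c / n >= minsup}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # candidate generation: unions of pairs of frequent (k-1)-itemsets
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # prune any candidate with an infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: cnt / n for c, cnt in counts.items() if cnt / n >= minsup}
        result.update(frequent)
        k += 1
    return result
```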
 7 March:
Analysis of APriori: correctness, polynomiality with respect to input
and output sizes. Association rules: antimonotonicity of confidence;
algorithm for generating all association rules with respect to given
support and confidence thresholds, and its analysis (correctness,
polynomiality with respect to input and output sizes).
 8 March: Exercises from the slides (Part 1).
Notions of closed, frequent closed, and maximal itemset.
Representation of frequent itemsets and their supports, through
maximal itemsets and their supports (lossy) and
frequent closed itemsets and their supports.
Closure of an itemset and its properties.
 14 March: Closed/Maximal itemsets:
derivation of the frequent itemsets
and their supports from the frequent closed itemsets and their supports;
example where maximal/closed frequent itemsets are exponential
in the input size. Top-K frequent (closed) itemsets:
definition; control of K on the output size. Significance
of itemsets/rules.
 15 March: Significance
of itemsets/rules: lift, conviction and their comparison with
confidence. Mining of frequent patterns and rules from
nontransactional datasets. Examples of theory questions.
Exercise from the slides (Part 2).
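The significance measures above reduce to simple arithmetic on supports; a small Python sketch (function name and argument order are illustrative):

```python
def rule_measures(sup_xy, sup_x, sup_y):
    """Confidence, lift and conviction of a rule X -> Y,
    given the relative supports of X u Y, X and Y."""
    conf = sup_xy / sup_x
    lift = conf / sup_y                       # > 1: positive correlation between X and Y
    conv = float('inf') if conf == 1 else (1 - sup_y) / (1 - conf)
    return conf, lift, conv
```

For example, with sup(X u Y) = 0.4, sup(X) = 0.5, sup(Y) = 0.5 the rule has confidence 0.8, lift 1.6 and conviction 2.5.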
 21 March: Exercises from the slides (Part 2).
Case study: Survey on 2000 US Elections.
Material:
 Slides (final):
Association Analysis (Part 1)
 Slides (final):
Association Analysis (Part 2)

Exercises on Association Analysis

J. Leskovec, A. Rajaraman and J. Ullman. Mining of Massive Datasets. Cambridge
University Press, 2014. Chapter 6.

P.N. Tan, M. Steinbach, V. Kumar.
Introduction to Data Mining. Addison Wesley, 2006.
Chapter 6.
Computational Frameworks: MapReduce
 21 March: Motivation for the MapReduce framework.
 22 March: Specification of MapReduce algorithms
and their execution on a distributed system with a fault-tolerant
distributed file system. Key performance indicators
and complexity measure for MapReduce algorithms.
Examples: word count; matrix-vector multiplication; matrix-matrix
multiplication. Observations on tradeoffs between number of rounds
and space requirements.
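The round structure of a MapReduce algorithm can be illustrated with the word-count example in plain Python (a sequential simulation of the map, shuffle and reduce phases, not Spark code):

```python
from collections import defaultdict

def map_phase(docs):
    # map: each document emits a (word, 1) pair per occurrence
    for doc in docs:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # shuffle: group all values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: sum the counts gathered for each word
    return {word: sum(vals) for word, vals in groups.items()}

def word_count(docs):
    return reduce_phase(shuffle(map_phase(docs)))
```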
 28 March: Exercises: detailed specification of the
matrix-vector multiplication algorithm; computation of an inverted index.
MapReduce Sample Sort: algorithm and example. Chernoff bound.
 29 March (Pietracaprina): Analysis of
MapReduce Sample Sort.
 29 March (Ceccarello):
Introduction to functional programming in Java.
Lambda functions and collection transformations.
 4 April (Ceccarello):
Introduction to Spark.
Examples: computing the primes smaller than a given integer; two different versions of word count.
 5 April (Ceccarello):
Optimization of Spark Programs: local aggregation and broadcast variables.
 11 April:
Exercise 2.3.1 (points (b) and (d)) from the book by Leskovec,
Rajaraman and Ullman. MapReduce implementation of the SON algorithm for
frequent itemsets: algorithm and analysis.
 12 April:
Sampling approach by Riondato and Upfal
for mining frequent itemsets. Presentation of the project.
 18 April: Exercise: computation of a sample
of size K out of a set of N transactions. Examples of
theory questions.
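One standard one-pass solution to the exercise (drawing a uniform sample of size K from N transactions) is reservoir sampling; the exercise's intended method may differ, so this is only a sketch:

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Uniform random sample of k items from a stream of unknown length:
    one pass, O(k) memory (reservoir sampling)."""
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)          # fill the reservoir with the first k items
        else:
            j = rng.randrange(i + 1)  # item i survives with probability k/(i+1)
            if j < k:
                sample[j] = x
    return sample
```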
Material:
 Slides (final):
MapReduce
 Dr. Ceccarello:
Laboratory 1
 Dr. Ceccarello: Laboratory 2
code and
slides
 Dr. Ceccarello: Laboratory 3
code

Exercises on MapReduce

Article:
M. Riondato, E. Upfal:
Efficient Discovery of Association Rules and Frequent Itemsets through
Sampling with Tight Performance Guarantees. ACM
Trans. on Knowledge Discovery from Data, 2014.
Clustering
 18 April: Introduction to clustering: definition
and some applications. Metric space: definition; relevant distances
(L_r norm, Jaccard, Cosine, Edit, Hamming).
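Some of these distances in a few lines of Python (conventions vary; here cosine distance is taken as 1 minus cosine similarity, and Jaccard distance as 1 minus the Jaccard coefficient):

```python
import math

def l_r(x, y, r):
    # L_r norm distance between two equal-length numeric vectors
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

def jaccard_dist(a, b):
    # 1 - |A intersect B| / |A union B|, for sets
    return 1 - len(a & b) / len(a | b)

def cosine_dist(x, y):
    # 1 - cosine similarity of two vectors
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return 1 - dot / (nx * ny)

def hamming(x, y):
    # number of positions where two equal-length sequences differ
    return sum(a != b for a, b in zip(x, y))
```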
 19 April: Exercise on the L_r, Cosine and Edit distances.
Curse of dimensionality. Types of clustering.
Center-based k-clustering: k-center, k-means, k-median.
k-center clustering: farthest-first traversal algorithm (complexity).
 26 April: k-center clustering:
farthest-first traversal algorithm (approximation factor); MapReduce
algorithm (complexity and approximation factor). k-means clustering:
notion of centroid and its properties. Discussion on projects.
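A compact Python sketch of the farthest-first traversal for k-center (starting from the first point for determinism; the algorithm allows an arbitrary start):

```python
def farthest_first(points, k, dist):
    """Farthest-first traversal: a 2-approximation for k-center.
    Repeatedly add the point farthest from the current centers."""
    centers = [points[0]]
    # d[j] = distance of points[j] from its closest center so far
    d = [dist(p, centers[0]) for p in points]
    while len(centers) < k:
        i = max(range(len(points)), key=lambda j: d[j])
        centers.append(points[i])
        d = [min(d[j], dist(points[j], points[i])) for j in range(len(points))]
    return centers
```

For points 0, 1, 8, 9 on a line and k = 2, the traversal picks 0 and then 9.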
 3 May: k-means clustering: Lloyd's algorithm; convergence;
discussion on worst-case performance; k-means++; exercise on
repeated application of k-means++.
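The k-means++ seeding step (D^2 sampling) in a minimal one-dimensional Python sketch; the points are taken as scalars for simplicity, and the sampling loop is an illustration rather than the course's reference code:

```python
import random

def kmeanspp_seed(points, k, rng=random):
    """k-means++ seeding: first center uniform at random, each later
    center drawn with probability proportional to its squared distance
    from the nearest center chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # squared distance of each point from its closest current center
        d2 = [min((p - c) ** 2 for c in centers) for p in points]
        total = sum(d2)
        r = rng.uniform(0, total)
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:       # p selected with probability w / total
                centers.append(p)
                break
    return centers
```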
 9 May: k-median clustering: PAM algorithm
(pseudocode, example, approximation factor). Choice of the number of
clusters. Exercises on the properties of k-center clustering and
on the analysis of the MR Farthest-First Traversal.
Cluster evaluation: tendency (Hopkins statistic); unsupervised (cohesion,
separation, silhouette coefficient); supervised (entropy of a cluster,
entropy of a class, Rand statistic, Jaccard coefficient).
 10 May: Case study: clustering of deputies
based on their key votes. Hierarchical clustering:
introduction; general agglomerative strategy; merging criteria
(single linkage, complete linkage, average linkage, Ward's method);
stopping conditions; time complexity; accuracy. Case study:
hierarchical clustering of amendments to draft laws.
Examples of theory questions.
Material:
 Slides (final):
Clustering (Part 1)
 Slides (final):
Clustering (Part 2)

Exercises on Clustering

P.N. Tan, M. Steinbach, V. Kumar.
Introduction to Data Mining. Addison Wesley, 2006.
Chapter 8.
 Article:
D. Arthur, S. Vassilvitskii.
k-means++: the advantages of careful seeding.
Proc. of ACM-SIAM SODA 2007: 1027-1035.
 Article:
B. Bahmani et al.
Scalable K-Means++. PVLDB 5(7): 622-633, 2012.

Article:
F. Murtagh, P. Contreras.
Algorithms for hierarchical clustering: an overview.
Wiley Interdisc. Rev.: Data Mining and Knowledge Discovery 2(1): 86-97,
2012.

Article:
C. Jin et al.
A Scalable Hierarchical Clustering Algorithm
Using Spark. BigDataService 2015: 418-426.
Graph Analytics

16 May: Data networks and graphs. Objectives of graph analytics,
examples and applications. Review of basic graph primitives (BFS, DFS,
SSSP, MST, Connected Components) and their sequential complexities.
MapReduce algorithms for MST and Connected Components.

17 May: Average degrees of separation and diameter:
trivial solutions; inapplicability to very large graphs (e.g., Facebook);
main ideas of HyperANF.

23 May: Diameter: exact computation; approximation
algorithms (1 BFS, 4-SWEEP). Centralities: notion of node centrality;
types of centrality measures; closeness centrality (for connected and
disconnected graphs); harmonic centrality; betweenness centrality;
PageRank. Computation of closeness centrality: exact; Eppstein-Wang's
unbiased estimation. Exercises: diameter approximation with several BFSes;
proof that Eppstein-Wang's estimation of centralities is unbiased.
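A Python sketch of the sampling idea behind the Eppstein-Wang estimator on a connected unweighted graph: BFS from a few random sources, average the observed distances, and take the reciprocal as estimated closeness. The scaling factor here is my reconstruction of the standard unbiased form, not a transcription of the lecture's formula:

```python
import random
from collections import deque

def bfs_dist(adj, s):
    # distances from s via BFS on an unweighted graph (adjacency lists)
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def closeness_estimate(adj, k, rng=random):
    """Estimate closeness centrality of every node from k sampled BFSes:
    average distance to v is estimated as (n / (k (n-1))) * sum of the
    sampled distances; closeness is its reciprocal."""
    nodes = list(adj)
    n = len(nodes)
    sources = [rng.choice(nodes) for _ in range(k)]
    farness = {v: 0.0 for v in nodes}
    for s in sources:
        dist = bfs_dist(adj, s)
        for v in nodes:
            farness[v] += dist[v]
    return {v: (k * (n - 1)) / (n * farness[v]) if farness[v] > 0 else 0.0
            for v in nodes}
```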

30 May: Review: exercise on the computation of a minimum spanning forest
in MapReduce; analysis of the Sample Sort algorithm for MapReduce;
various theory questions.
Material:
 Slides (final):
Graph Analytics

Exercises on Graph Analytics
 Article:
S. Lattanzi et al.
Filtering: a method for solving graph problems in MapReduce.
ACM SPAA 2011: 85-94.

 Article:
S. Bhagat et al. Three and a half degrees of separation.
 Article:
P. Boldi, M. Rosa, S. Vigna.
HyperANF: approximating the neighbourhood function of very large graphs on a
budget. WWW 2011: 625-634.
 Article:
P. Boldi, S. Vigna.
Axioms for Centrality. Internet Mathematics 10(3-4): 222-262 (2014).
 Article:
D. Eppstein, J. Wang:
Fast Approximation of Centrality. J. Graph Algorithms Appl. 8: 39-45 (2004).
Projects

6 June (morning): Presentations of early groups

6 June (afternoon): Presentations of early groups

7 June (afternoon): Presentations of early groups
