Introduction to the course
 26/02/2018 Introduction and
organization of the course.
Material:
 Slides:
Introduction
Computational Frameworks: MapReduce/Spark
 26/02/2018 Big data challenges. MapReduce:
introduction to the framework; typical platform architecture;
MapReduce computation.
 27/02/2018 MapReduce:
Specification of MapReduce algorithms
and their execution on a distributed system with a fault-tolerant
distributed file system; Key performance indicators
for the analysis of MapReduce algorithms.
Basic MapReduce primitives/techniques: word count.
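For illustration, a minimal single-machine sketch of the word count primitive, with plain Python functions standing in for the Map and Reduce phases (this is illustrative, not the course's actual Spark code):

```python
from collections import defaultdict

def map_phase(docs):
    """Map: emit an intermediate pair (word, 1) for each word occurrence."""
    return [(w, 1) for doc in docs for w in doc.split()]

def reduce_phase(pairs):
    """Shuffle by key, then Reduce: sum the counts of each word."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

docs = ["the cat sat", "the cat ran"]
counts = reduce_phase(map_phase(docs))
# counts == {"the": 2, "cat": 2, "sat": 1, "ran": 1}
```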
 06/03/2018 Basic MapReduce primitives/techniques:
partitioning (improved Word count and Category counting);
trading accuracy for efficiency (Maximum pairwise distance).
Tools: Chernoff bound.
 07/03/2018 Exercise: exact maximum
pairwise distance in constant rounds, sublinear
local space and quadratic aggregate space.
Basic MapReduce primitives/techniques: sampling
(SampleSort)
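A toy sketch of the SampleSort idea: splitters drawn from a random sample partition the input into buckets, and each bucket plays the role of one reducer's local sort. Names and parameters here are illustrative:

```python
import random

def sample_sort(points, k, seed=0):
    """Toy SampleSort: k-1 sampled splitters define k buckets;
    each bucket is sorted locally and the results are concatenated."""
    rng = random.Random(seed)
    splitters = sorted(rng.sample(points, k - 1))
    buckets = [[] for _ in range(k)]
    for x in points:
        # bucket index = number of splitters not exceeding x (linear scan for clarity)
        j = sum(1 for s in splitters if s <= x)
        buckets[j].append(x)
    return [x for b in buckets for x in sorted(b)]  # concatenate sorted buckets

out = sample_sort(list(range(100, 0, -1)), k=4)
# out == sorted(range(1, 101))
```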
 13/03/2018 Basic MapReduce primitives/techniques:
analysis of SampleSort. Description of
Homework 1.
 14/03/2018 Introduction to Apache Spark:
software architecture; notions of driver/executor processes,
context, cluster manager; RDD (partitioning, lineage, caching,
operations). Exercises 2.3.1.(b) and 2.3.1.(d) from
J. Leskovec, A. Rajaraman and J. Ullman. Mining Massive Datasets. Cambridge
University Press, 2014.
 20/03/2018 Basic MapReduce primitives/techniques:
matrix-vector multiplication; matrix-matrix
multiplication; observations on trade-offs between number of rounds
and space requirements.
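A single-machine sketch of the matrix-vector primitive in MapReduce style (illustrative only): the Map phase emits one partial product per nonzero entry, keyed by row index, and the Reduce phase sums each row's values.

```python
from collections import defaultdict

def mr_matvec(entries, x):
    """Toy MapReduce matrix-vector product.

    entries: nonzero matrix entries as triples (i, j, a_ij).
    Map: emit (i, a_ij * x_j); Reduce: sum the values of each row i."""
    mapped = [(i, a * x[j]) for (i, j, a) in entries]
    acc = defaultdict(float)
    for i, v in mapped:
        acc[i] += v
    return dict(acc)

A = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]  # sparse 2x2 matrix
y = mr_matvec(A, x=[10.0, 1.0])
# y == {0: 12.0, 1: 3.0}
```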
Material:
 Slides:
Computational Frameworks - MapReduce (final with errata)
 Slides:
Apache Spark Fundamentals (final with errata)
 Set of
Exercises with solutions

Article:
J. Dean, S. Ghemawat. MapReduce: simplified data processing
on large clusters. Commun. ACM 51(1): 107-113 (2008)

Article:
A. Pietracaprina, G. Pucci, M. Riondato, F. Silvestri, E. Upfal:
Space-round tradeoffs for MapReduce computations. ACM ICS 2012: 235-244.

Article:
M. Zaharia et al.
Resilient Distributed Datasets: A Fault-Tolerant Abstraction
for In-Memory Cluster Computing. NSDI 2012: 15-28

Article:
A Gentle Introduction to Apache Spark. From
B. Chambers, M. Zaharia. Spark: The Definitive Guide, Databricks 2018.
Clustering
 20/03/2018 Introduction to clustering: definition
and some applications.
 21/03/2018 Metric space: definition; relevant distances
(L_r norm, Jaccard, Cosine, Edit, Hamming);
Exercise on L_1 and Edit distances.
Curse of dimensionality: general characterization of the problem;
analysis of pairwise distance of N points with t random
coordinates in [0,1].
 27/03/2018 Clustering Types.
Center-based k-clustering: k-center, k-means, k-median.
k-center clustering: farthest-first traversal (algorithm
and analysis of the approximation factor); MapReduce (MR)
farthest-first traversal (algorithm).
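A minimal sequential sketch of farthest-first traversal (the greedy 2-approximation for k-center); the point set, distance function, and tie-breaking are illustrative:

```python
def farthest_first(points, k, dist):
    """Farthest-first traversal: start from an arbitrary point and
    repeatedly add the point farthest from the centers chosen so far."""
    centers = [points[0]]
    # d[j] = distance of points[j] from its closest center so far
    d = [dist(p, centers[0]) for p in points]
    while len(centers) < k:
        i = max(range(len(points)), key=lambda j: d[j])
        centers.append(points[i])
        d = [min(d[j], dist(points[j], points[i])) for j in range(len(points))]
    return centers

pts = [(0, 0), (10, 0), (0, 10), (1, 1)]
euclid = lambda p, q: ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
centers = farthest_first(pts, 2, euclid)
# centers == [(0, 0), (10, 0)]
```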
 28/03/2018 MapReduce (MR)
farthest-first traversal (analysis). Exercises: 1-round
MapReduce implementation of Partition(P,S); relation between
optimal clustering of a pointset P and of a subset T of P.
Brief outline of homework 2.
 04/04/2018 Discussion of Homework 2
and live demo on the Spark Web interface.
k-means clustering: notion of centroid and its properties;
Lloyd's algorithm; observations on the quality of
the clustering returned by Lloyd's algorithm.
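A compact sketch of Lloyd's algorithm (data and initialization are illustrative): the assignment step and the centroid step alternate for a fixed number of iterations.

```python
def lloyd(points, centers, iters=10):
    """Lloyd's algorithm: alternate (1) assigning each point to its
    closest center with (2) moving each center to its cluster's centroid."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[j].append(p)
        # centroid step; an empty cluster keeps its old center
        centers = [tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl
                   else centers[j] for j, cl in enumerate(clusters)]
    return centers

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
final = lloyd(pts, [(0.0, 0.0), (1.0, 0.0)])
# final == [(0.0, 0.5), (10.0, 0.5)]
```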
 10/04/2018: k-means clustering:
convergence of Lloyd's algorithm; k-means++.
k-median clustering: PAM algorithm
(pseudocode, example, convergence, approximation factor).
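A sketch of the k-means++ seeding rule (function name and data are illustrative): after a uniformly random first center, each new center is drawn with probability proportional to its squared distance from the centers chosen so far.

```python
import random

def kmeanspp_seed(points, k, seed=0):
    """k-means++ seeding: sample each new center with probability
    proportional to the squared distance from the current centers."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # squared distance of each point from its closest current center
        d2 = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
              for p in points]
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
seeds = kmeanspp_seed(pts, 2)
```

Points already chosen get weight zero, so the k centers are distinct whenever the input contains at least k distinct points.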
 11/04/2018: k-median clustering:
adaptation of Lloyd's algorithm strategy; identification
of a suitable value for k; weighted k-median clustering
(definition of the problem and adaptation of the PAM algorithm);
MapReduce MR-PAM algorithm (description and analysis).
Exercises: MapReduce implementation of one iteration of
Lloyd's algorithm; proof that log_2 N independent runs
of k-means++ provide the same quality guarantees as k-means++
but with high probability.
 17/04/2018:
Cluster evaluation: tendency (Hopkins statistic); unsupervised (cohesion,
separation, silhouette coefficient); supervised (entropy of a cluster,
entropy of a class). Case study: clustering of deputies
based on their key votes. Presentation by
Dr. Francesco Silvestri.
 18/04/2018:
Hierarchical clustering:
introduction; general agglomerative strategy; merging criteria
(single linkage, complete linkage, average linkage);
stopping conditions; time complexity; discussion of
accuracy.
Examples of theory questions.
Presentation of
Homework 3.
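The general agglomerative strategy with single-linkage merging can be sketched as follows (data, distance, and the stopping condition "k clusters remain" are illustrative):

```python
def single_linkage(points, k, dist):
    """Agglomerative clustering: repeatedly merge the two clusters at
    minimum single-linkage distance (closest pair of points), stopping
    when k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: min(dist(p, q)
                                      for p in clusters[ij[0]]
                                      for q in clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))
    return clusters

pts = [0.0, 0.1, 0.2, 5.0, 5.1]
out = single_linkage(pts, 2, lambda a, b: abs(a - b))
# out == [[0.0, 0.1, 0.2], [5.0, 5.1]]
```

The quadratic scan over cluster pairs keeps the sketch short; as discussed in class, practical implementations maintain a distance matrix to avoid recomputation.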
Material:
 Slides:
Clustering (Part 1) (final with errata)
 Slides:
Clustering (Part 2) (final with errata)
 Slides:
Clustering (Part 3) (final with errata)
 Set of
Exercises with solutions

P.N. Tan, M. Steinbach, V. Kumar.
Introduction to Data Mining. Addison Wesley, 2006.
Chapter 8
 Article:
D. Arthur, S. Vassilvitskii.
k-means++: the advantages of careful seeding.
Proc. of ACM-SIAM SODA 2007: 1027-1035.
 Article:
A. Ene, S. Im, B. Moseley.
Fast clustering using MapReduce. Proc. of KDD 2011: 681-689.
Association Analysis
 24/04/2018: Market basket analysis:
frequent itemsets and association rules.
Mining of frequent itemsets and of interesting association rules:
definition of the computational problem;
max number of patterns; anti-monotonicity of support;
lattice of frequent itemsets; algorithm A-Priori (pseudocode).
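A compact sketch of the level-wise A-Priori strategy (dataset and threshold are illustrative; candidate generation is done by joining frequent sets, as in the pseudocode seen in class):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """A-Priori: level-wise mining of frequent itemsets; a size-k
    candidate is kept only if all its size-(k-1) subsets are frequent
    (anti-monotonicity of support)."""
    freq = {}
    k = 1
    candidates = {frozenset([i]) for t in transactions for i in t}
    while candidates:
        # one pass over the dataset counts all current candidates
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c for c, n in counts.items() if n >= minsup}
        freq.update({c: counts[c] for c in frequent})
        k += 1
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k
                      and all(frozenset(s) in frequent
                              for s in combinations(a | b, k - 1))}
    return freq

T = [frozenset("abc"), frozenset("abd"), frozenset("ab"), frozenset("cd")]
F = apriori(T, minsup=3)
# frequent itemsets: {a}, {b}, {a, b}, each with support 3
```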
 02/05/2018:
Analysis of A-Priori: correctness, polynomiality with respect to input
and output sizes. Exercise on an alternative candidate generation.
Association rules: anti-monotonicity of confidence;
algorithm for generating all association rules with respect to given
support and confidence thresholds
(analysis of correctness and polynomiality omitted).
 08/05/2018:
Mining frequent itemsets from large datasets:
optimization by Park, Chen and Yu to filter out
infrequent candidate itemsets of size 2;
MapReduce implementation of the SON algorithm;
sampling approach by Riondato and Upfal
(definition of epsilon-approximate frequent itemsets,
sampling algorithm, bound on the sample size).
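A toy sketch of the two rounds of SON (illustrative: for brevity the local miner handles single items only, whereas the real algorithm runs a full frequent-itemset miner on each chunk):

```python
def son(transactions, minsup, num_chunks=2):
    """Toy SON: Round 1 mines each chunk with a proportionally lowered
    support threshold (any globally frequent itemset is locally frequent
    in at least one chunk); Round 2 counts the candidates exactly."""
    n = len(transactions)
    chunks = [transactions[i::num_chunks] for i in range(num_chunks)]
    candidates = set()
    for chunk in chunks:
        local_minsup = minsup * len(chunk) / n  # scaled threshold
        items = {i for t in chunk for i in t}
        candidates |= {i for i in items
                       if sum(1 for t in chunk if i in t) >= local_minsup}
    # second pass: exact global counts of the surviving candidates
    counts = {i: sum(1 for t in transactions if i in t) for i in candidates}
    return {i: c for i, c in counts.items() if c >= minsup}

T = [{"a", "b"}, {"a"}, {"b"}, {"a", "c"}]
out = son(T, minsup=2)
# out == {"a": 3, "b": 2}
```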
 09/05/2018:
Description of
Homework 4.
Notions of closed, frequent closed, and maximal itemsets.
 15/05/2018:
Representation of frequent itemsets and their supports, through
maximal itemsets and their supports (lossy) and
frequent closed itemsets and their supports.
Closure of an itemset and its properties.
Example where maximal/closed frequent itemsets are exponential
in the input size. Top-K frequent (closed) itemsets:
definition.
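The closure of an itemset and the closedness test can be sketched directly from the definitions (dataset is illustrative):

```python
def closure(itemset, transactions):
    """Closure of an itemset: the intersection of all transactions that
    contain it, i.e., the largest superset with the same support."""
    containing = [t for t in transactions if itemset <= t]
    return frozenset.intersection(*containing) if containing else itemset

def is_closed(itemset, transactions):
    """An itemset is closed iff it equals its own closure (no proper
    superset has the same support)."""
    return closure(itemset, transactions) == itemset

T = [frozenset("abc"), frozenset("ab"), frozenset("ac")]
# "b" never occurs without "a", so {b} is not closed: its closure is {a, b}
```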
 16/05/2018:
Top-K frequent (closed) itemsets:
control of K on the output size. Exercises:
example showing that the Top-K frequent itemsets
can be exponential in the output size even for small K; exercise from the
example of written test.
 22/05/2018: Exercises: exercise on the intersection of
closed itemsets; exercise on the (anti)monotonicity property of
a new metric for itemsets.
N.B.: Slides 19-23 of Part 2 have not been covered in class and
are not required for the exam.
Material:
 Slides:
Association Analysis (Part 1) (final with errata)
 Slides:
Association Analysis (Part 2) (final with errata)
 Set of
Exercises with solutions

Article:
M. Riondato, E. Upfal:
Efficient Discovery of Association Rules and Frequent Itemsets through
Sampling with Tight Performance Guarantees. ACM
Trans. on Knowledge Discovery from Data, 2014.
Graph Analytics
 22/05/2018: Data networks and graphs. Objectives of graph analytics,
examples and applications. Review of basic graph primitives (BFS, DFS,
SSSP, MST, Connected Components) and their sequential complexities.
 23/05/2018:
MapReduce algorithm for MST and Connected Components.
 29/05/2018:
Average Degrees of Separation (ADS) and Diameter:
straightforward algorithms and their impracticality for very large graphs;
Neighborhood function; exact computation of ADS and Diameter
from the Neighborhood function; HyperANF (main ideas and definition of
the sketches, based on HyperLogLog counters, used to represent sets of nodes);
approximation of the diameter using 1 pivot (1 execution of BFS/SSSP)
or 4 pivots (using 4 executions of BFS/SSSP, i.e., the 4-SWEEP algorithm).
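The 1-pivot approximation can be sketched as a single BFS (graph and pivot are illustrative): the pivot's eccentricity lies between half the diameter and the diameter of a connected graph.

```python
from collections import deque

def bfs_dist(adj, s):
    """BFS distances from s in an unweighted graph (adjacency lists)."""
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def diameter_bound_1pivot(adj, pivot):
    """1-pivot estimate: one BFS gives the pivot's eccentricity, which
    is at least diameter/2 and at most the diameter."""
    return max(bfs_dist(adj, pivot).values())

path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}  # path graph, true diameter 3
# from vertex 1 the bound is 2; from endpoint 0 it is exact (3)
```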
 30/05/2018: Exercises: estimation of a graph's diameter
with l pivots; generalization of the 2-round MapReduce algorithm for
MSF with local space M = n^{1+epsilon}. Review: notion of local and
aggregate space in MapReduce; 2nd exercise from the
set of exercises on MapReduce; 5th exercise from the set of exercises on
Clustering; theory questions from the
example of written test.
N.B.: Slides 57-68 have not been covered in class and
are not required for the exam.
Material:
 Slides:
Graph Analytics (final with errata)
 Set of
Exercises with solutions

Article:
P. Boldi, M. Rosa, S. Vigna.
HyperANF: Approximating the Neighbourhood Function of
Very Large Graphs on a Budget. Proc. WWW, pages 625-634, 2011.

Article:
P. Crescenzi, R. Grossi, M. Habib, L. Lanzi, A. Marino.
On computing the diameter of realworld undirected graphs.
Theor. Comput. Sci. 514:84-95, 2013.
