Introduction to the course
 09/03/2020 (online) Introduction and
organization of the course (videos 01, 02).
Material:
 Slides:
Presentation1920.pdf
Computational Frameworks: MapReduce/Spark
 09/03/2020 (online) MapReduce (MR): motivating scenario;
introduction to MR; platforms; software frameworks; MR computation (videos 03, 04).
 11/03/2020 (online) MapReduce (MR): Implementation of a MR round;
Specification and analysis of a MR algorithm; Word count example;
design goals for MR algorithms; pros and cons of MR (videos 05, 06, 07).
 16/03/2020 (online)
Random partitioning: Chernoff bound; Improved Word count (algorithm and analysis);
Deterministic partitioni: Class count algorithm (videos 08, 09, 10).
 19/03/2020 (online)
Accurcay vs efficiency: approximation algorithm for Maximum pairwise distance
(Algorithm and analysis).
Sampling: SampleSort algorithm (videos 11, 12, 13, 14).
 23/03/2020 (online) Exercises 1 (2.3.1(b) and 2.3.1(d)
from [LRU14]) and 3 from the slides on MapReduce (videos 15, 16, 16,
18).
 25/03/2020 (online) Exercise 4 from the slides on
MapReduce (video 18). Introduction to Apache Spark: software
architecture; notions of driver/executor processes, context, cluster
manager; RDD (partitioning, lineage, caching, operations) (Video 19).
 30/03/2020 (online) Presentation of Homework 1 e description
of TemplateHW1 (Videos 20, 21ab, 22).
Material:
 Slides (final with errata, update 21/03/2020):
MapReduce1920.pdf
 Slides:
MRExercises1920.pdf
 Slides (final):
Spark1920.pdf
 Set of
Exercises with solutions

Chapter 2 (Sections 2.12.3) from
J. Leskovec, A. Rajaraman and J. Ullman. Mining Massive Datasets. Cambridge
University Press, 2014

Article:
J. Dean, S. Ghemawat. MapReduce: simplified data processing
on large clusters. Commun. ACM 51(1): 107113 (2008)

Article:
A. Pietracaprina, G. Pucci, M. Riondato, F. Silvestri, E. Upfal:
Spaceround tradeoffs for MapReduce computations. ACM ICS 2012: 235244.
Clustering
 01/04/2020 (online)
Introduction to clustering. Metric spaces: Minkowski, Cosine, Hamming,
Jaccard, and Edit distances. Curse of dimensionality and
combinatorial optimization problems. (Videos 23, 24, 25, 26, 27)
 06/04/2020 (online)
Exercise on metric spaces. Centerbased clustering: definitions and notations.
kcenter clustering:
FarthestFirst Traversal (algorithm and analysis). (Videos 28, 29, 30)
 08/04/2020 (online)
kcenter clustering: MRFarthestFirst Traversal (algorithm and analysis).
Exercises on clustering (from Slides 40 and 42) (Videos 31, 32, 33)
 15/04/2020 (online)
kmeans clustering: notion of centroid and its properties; Lloyd's algorithm;
examples and critical assessment of Lloyd's algorithm with respect to quality
of solution; kmeans++ (algorithm, implementation of center selection,
quality guarantee) (Videos 34, 35, 36).
 20/04/2020 (online)
Kmeans clustering for big data: critical assessment of
MapReduce implementation of Lloyd's and kmeans++ algorithm;
coresetbased approach for kmeans clustering (general idea,
weighted variant of the problem, MapReduce algorithm, analysis
of local and aggregate space, analysis of approximation factor
with some
proofs) (Videos 37, 38)
 22/04/2020 (online)
kmeans clustering in Spark.
kmedian clustering: PAM algorithm (pseudocode, example, approximation factor);
(general idea,
weighted variant of the problem, MapReduce algorithm, approximation
factor with no proofs.). Comparison between the kcenter, kmeans and kmedian clustering problems (Videos 39, 40, 41).
 27/04/2020 (online)
Presentation of Homework 2 and of the Aircraft/Uber datasets (videos 42, 43).
Exercises on the MapReduce implementation of LLoyd's algorithm
and on making approximation guarantee of kmeans++ stronger (videos 44, 45).
 29/04/2020 (online)
Cluster evaluation. Cluster tendency: Hopkins statistics.
Unsupervised evaluation: cohesion and separation, silhouette coefficient.
Supervised evaluation: cluster and class entropy. (Videos 4649)

04/05/2020 (online): Exercises. (Videos 50, 51)
Material:
 Slides (final):
Clustering Part 1
 Slides (final):
Clustering Part 2
 Slides:
CLUSTExercises11920.pdf
 Slides (final with errata, update 22/04/20):
Clustering Part 3
 Slides:
CLUSTExercises21920.pdf
 Slides:
Clustering Part 4
 Set of
Exercises with solutions

Chapter 7 (Sections 7.17.3) from
A. Blum, J. Hopcroft, and R. Kannan. Foundations of Data Science.
Manuscript, January 2018

Chapter 7 (Sections 7.1, 7.3.17.3.3) from
J. Leskovec, A. Rajaraman and J. Ullman. Mining Massive Datasets. Cambridge
University Press, 2014
 Article:
D. Arthur, S. Vassilvitskii.
kmeans++: the advantages of careful seeding.
Proc. of ACMSIAM SODA 2007: 10271035.

P.N.Tan, M.Steinbach, V.Kumar.
Introduction to Data Mining. Addison Wesley, 2006.
Chapter 8
Association Analysis

04/05/2020 (online): Market basket analysis: itemsets and association
rules. Mining of frequent itemsets and of interesting association
rules: definition of the computational problem; max number of
patterns; antimonotonicity of support; lattice of frequent
itemsets. (Videos 52, 53)

06/05/2020 (online):
Algorithm APriori for mining frequent itemsets:
description, example, correctness, efficiency.
(Videos 5457)

11/05/2020 (online):
Optimizations of the APriori algorithm.
Association rules: antimonotonicity of confidence; algorithm for generating all
association rules with respect to give support and confidence thresholds (analysis
of correctness and polynomiality omitted). (Videos 5859)

13/05/2020 (online):
Mining frequent itemsets from large datasets: MapReduceimplementation of the
SON algorithm; sampling approach by Riondato and Upfal (definition of epsilonapproximate
frequent itesemts, sampling algorithm, bound on the sample size). (Videos 6063)

18/05/2020 (online):
Exercises: (anti)monotonicity of
a measure for itemsets based on the minimum confidence of rules with
antecedent of size 1; mining of frequent substrings.
Closed, maximal and frequent closed itemsets: definitions, exemples, and
properties. Representation of frequent itemsets through maximal and
frequent closed itemsets. (Videos 6466)

20/05/2020 (online):
Closure of an itemset. Exercise showing the exponentiality of the
number of maximal/closed frequent itemsets. TopK frequent (closed) itemsets:
definition; control of K on the output size. (Videos 6769)

25/05/2020 (online):
Exercises from Slides 30, 32 and 33 on Association Analysis (Part 3).
Videos (7073). Presentation of Homework 3: access to CloudVeneto;
specification of the homework, datasets. (Videos 7476)
Material:
 Slides:
Association Analysis Part 1
 Slides:
Association Analysis Part 2
 Slides:
Association Analysis: CaseStudy
 Slides (update 25/05/20):
AAExercises11920.pdf
 Slides:
Association Analysis Part 3
 Set of
Exercises with solutions

P.N.Tan, M.Steinbach, V.Kumar.
Introduction to Data Mining. Addison Wesley, 2006.
Chapter 6

Article:
M. Riondato, E. Upfal:
Efficient Discovery of Association Rules and Frequent Itemsets through
Sampling with Tight Performance Guarantees. ACM
Trans. on Knowledge Discovery from Data, 2014.
 Article:
M. MacDougall: Shopping for Voters: Using Association
Rules to Discover Relationships in Election Survey Data, 2003.
Miscellanea

02/06/2020 (online):
Exercises: Ex. n.4 of Exercises on MapReduce; two exercises on Association
Analysis. Tips on Spark. (Videos 77, 78, 79)

10/06/2020 (online)
Presenation of two mock exams and their solutions (videos 80, 81, 82).
Exam simulation via MoodleEsami and Zoom.
Material:
 Slides:
Mix of Exercises
 Slides:
Spark Tips

Solutions to Mock Exams 1 and 2, Part 1

Solutions to Mock Exam 1, Part 2

Solutions to Mock Exam 2, Part 2
