Big Data Computing (6 CFU - 48h)
Class 1: ID's last digit 5-9
A.A. 2019/2020
Francesco Silvestri

Lectures

Lectures from A.A. 18/19

Introduction to the course

  • 09/03/2020 (online) Introduction and organization of the course (videos 01, 02).
Material:
  1. Slides: Presentation1920.pdf

Computational Frameworks: MapReduce/Spark

  • 09/03/2020 (online) MapReduce (MR): motivating scenario; introduction to MR; platforms; software frameworks; MR computation (videos 03, 04).
  • 11/03/2020 (online) MapReduce (MR): Implementation of a MR round; Specification and analysis of a MR algorithm; Word count example; design goals for MR algorithms; pros and cons of MR (videos 05, 06, 07).
  • 16/03/2020 (online) Random partitioning: Chernoff bound; Improved Word count (algorithm and analysis); Deterministic partitioni: Class count algorithm (videos 08, 09, 10).
  • 19/03/2020 (online) Accurcay vs efficiency: approximation algorithm for Maximum pairwise distance (Algorithm and analysis). Sampling: SampleSort algorithm (videos 11, 12, 13, 14).
  • 23/03/2020 (online) Exercises 1 (2.3.1(b) and 2.3.1(d) from [LRU14]) and 3 from the slides on MapReduce (videos 15, 16, 16, 18).
  • 25/03/2020 (online) Exercise 4 from the slides on MapReduce (video 18). Introduction to Apache Spark: software architecture; notions of driver/executor processes, context, cluster manager; RDD (partitioning, lineage, caching, operations) (Video 19).
  • 30/03/2020 (online) Presentation of Homework 1 e description of TemplateHW1 (Videos 20, 21a-b, 22).
Material:
  1. Slides (final with errata, update 21/03/2020): MapReduce1920.pdf
  2. Slides: MRExercises1920.pdf
  3. Slides (final): Spark1920.pdf
  4. Set of Exercises with solutions
  5. Chapter 2 (Sections 2.1-2.3) from J. Leskovec, A. Rajaraman and J. Ullman. Mining Massive Datasets. Cambridge University Press, 2014
  6. Article: J. Dean, S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM 51(1): 107-113 (2008)
  7. Article: A. Pietracaprina, G. Pucci, M. Riondato, F. Silvestri, E. Upfal: Space-round tradeoffs for MapReduce computations. ACM ICS 2012: 235-244.

Clustering

  • 01/04/2020 (online) Introduction to clustering. Metric spaces: Minkowski, Cosine, Hamming, Jaccard, and Edit distances. Curse of dimensionality and combinatorial optimization problems. (Videos 23, 24, 25, 26, 27)
  • 06/04/2020 (online) Exercise on metric spaces. Center-based clustering: definitions and notations. k-center clustering: Farthest-First Traversal (algorithm and analysis). (Videos 28, 29, 30)
  • 08/04/2020 (online) k-center clustering: MR-Farthest-First Traversal (algorithm and analysis). Exercises on clustering (from Slides 40 and 42) (Videos 31, 32, 33)
  • 15/04/2020 (online) k-means clustering: notion of centroid and its properties; Lloyd's algorithm; examples and critical assessment of Lloyd's algorithm with respect to quality of solution; k-means++ (algorithm, implementation of center selection, quality guarantee) (Videos 34, 35, 36).
  • 20/04/2020 (online) K-means clustering for big data: critical assessment of MapReduce implementation of Lloyd's and k-means++ algorithm; coreset-based approach for k-means clustering (general idea, weighted variant of the problem, MapReduce algorithm, analysis of local and aggregate space, analysis of approximation factor with some proofs) (Videos 37, 38)
  • 22/04/2020 (online) k-means clustering in Spark. k-median clustering: PAM algorithm (pseudocode, example, approximation factor); (general idea, weighted variant of the problem, MapReduce algorithm, approximation factor with no proofs.). Comparison between the k-center, k-means and k-median clustering problems (Videos 39, 40, 41).
  • 27/04/2020 (online) Presentation of Homework 2 and of the Aircraft/Uber datasets (videos 42, 43). Exercises on the MapReduce implementation of LLoyd's algorithm and on making approximation guarantee of kmeans++ stronger (videos 44, 45).
  • 29/04/2020 (online) Cluster evaluation. Cluster tendency: Hopkins statistics. Unsupervised evaluation: cohesion and separation, silhouette coefficient. Supervised evaluation: cluster and class entropy. (Videos 46-49)
  • 04/05/2020 (online): Exercises. (Videos 50, 51)
Material:
  1. Slides (final): Clustering Part 1
  2. Slides (final): Clustering Part 2
  3. Slides: CLUSTExercises-1-1920.pdf
  4. Slides (final with errata, update 22/04/20): Clustering Part 3
  5. Slides: CLUSTExercises-2-1920.pdf
  6. Slides: Clustering Part 4
  7. Set of Exercises with solutions
  8. Chapter 7 (Sections 7.1-7.3) from A. Blum, J. Hopcroft, and R. Kannan. Foundations of Data Science. Manuscript, January 2018
  9. Chapter 7 (Sections 7.1, 7.3.1-7.3.3) from J. Leskovec, A. Rajaraman and J. Ullman. Mining Massive Datasets. Cambridge University Press, 2014
  10. Article: D. Arthur, S. Vassilvitskii. k-means++: the advantages of careful seeding. Proc. of ACM-SIAM SODA 2007: 1027-1035.
  11. P.N.Tan, M.Steinbach, V.Kumar. Introduction to Data Mining. Addison Wesley, 2006. Chapter 8

Association Analysis

  • 04/05/2020 (online): Market basket analysis: itemsets and association rules. Mining of frequent itemsets and of interesting association rules: definition of the computational problem; max number of patterns; anti-monotonicity of support; lattice of frequent itemsets. (Videos 52, 53)
  • 06/05/2020 (online): Algorithm A-Priori for mining frequent itemsets: description, example, correctness, efficiency. (Videos 54-57)
  • 11/05/2020 (online): Optimizations of the A-Priori algorithm. Association rules: anti-monotonicity of confidence; algorithm for generating all association rules with respect to give support and confidence thresholds (analysis of correctness and polynomiality omitted). (Videos 58-59)
  • 13/05/2020 (online): Mining frequent itemsets from large datasets: MapReduceimplementation of the SON algorithm; sampling approach by Riondato and Upfal (definition of epsilon-approximate frequent itesemts, sampling algorithm, bound on the sample size). (Videos 60-63)
  • 18/05/2020 (online): Exercises: (anti)monotonicity of a measure for itemsets based on the minimum confidence of rules with antecedent of size 1; mining of frequent substrings. Closed, maximal and frequent closed itemsets: definitions, exemples, and properties. Representation of frequent itemsets through maximal and frequent closed itemsets. (Videos 64-66)
  • 20/05/2020 (online): Closure of an itemset. Exercise showing the exponentiality of the number of maximal/closed frequent itemsets. Top-K frequent (closed) itemsets: definition; control of K on the output size. (Videos 67-69)
  • 25/05/2020 (online): Exercises from Slides 30, 32 and 33 on Association Analysis (Part 3). Videos (70-73). Presentation of Homework 3: access to CloudVeneto; specification of the homework, datasets. (Videos 74-76)
Material:
  1. Slides: Association Analysis Part 1
  2. Slides: Association Analysis Part 2
  3. Slides: Association Analysis: Case-Study
  4. Slides (update 25/05/20): AAExercises-1-1920.pdf
  5. Slides: Association Analysis Part 3
  6. Set of Exercises with solutions
  7. P.N.Tan, M.Steinbach, V.Kumar. Introduction to Data Mining. Addison Wesley, 2006. Chapter 6
  8. Article: M. Riondato, E. Upfal: Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees. ACM Trans. on Knowledge Discovery from Data, 2014.
  9. Article: M. MacDougall: Shopping for Voters: Using Association Rules to Discover Relationships in Election Survey Data, 2003.

Miscellanea

  • 02/06/2020 (online): Exercises: Ex. n.4 of Exercises on MapReduce; two exercises on Association Analysis. Tips on Spark. (Videos 77, 78, 79)
  • 10/06/2020 (online) Presenation of two mock exams and their solutions (videos 80, 81, 82). Exam simulation via Moodle-Esami and Zoom.
Material:
  1. Slides: Mix of Exercises
  2. Slides: Spark Tips
  3. Solutions to Mock Exams 1 and 2, Part 1
  4. Solutions to Mock Exam 1, Part 2
  5. Solutions to Mock Exam 2, Part 2

Last update: 10/06/2020 Back to home page