Big Data Computing (6 CFU - 48h)
2017/2018

Lectures

Introduction to the course

  • 27/02/2018 Introduction and organization of the course.
Material:
  1. Slides: Introduction

Computational Frameworks: MapReduce/Spark

  • 27/02/2018 Big data challenges. MapReduce: introduction to the framework; typical platform architecture; MapReduce computation.
  • 28/02/2018 MapReduce: specification of MapReduce algorithms and their execution on a distributed system with a fault-tolerant distributed file system; key performance indicators for the analysis of MapReduce algorithms. Basic MapReduce primitives/techniques: word count (a Spark sketch follows this list).
  • 06/03/2018 Basic MapReduce primitives/techniques: partitioning (improved word count and category counting); trading accuracy for efficiency (maximum pairwise distance). Tools: Chernoff bound.
  • 07/03/2018 Exercise: exact maximum pairwise distance in constant rounds, sublinear local space and quadratic aggregate space. Basic MapReduce primitives/techniques: sampling (SampleSort).
  • 13/03/2018 Basic MapReduce primitives/techniques: analysis of SampleSort. Description of Homework 1.
  • 14/03/2018 Introduction to Apache Spark: software architecture; notions of driver/executor processes, context, cluster manager; RDD (partitioning, lineage, caching, operations). Exercises 2.3.1.(b) and 2.3.1.(d) from J. Leskovec, A. Rajaraman and J. Ullman. Mining Massive Datasets. Cambridge University Press, 2014.
  • 20/03/2018 Basic MapReduce primitives/techniques: matrix-vector multiplication (see the second sketch after this list); matrix-matrix multiplication; observations on the tradeoffs between number of rounds and space requirements.
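
As a companion to the word count primitive and the Spark fundamentals above, the following is a minimal PySpark sketch (a local Spark installation is assumed, and the input file name is a placeholder):

    # Word count with Spark RDDs: a minimal illustrative sketch.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("WordCount").setMaster("local[*]")
    sc = SparkContext(conf=conf)

    counts = (sc.textFile("docs.txt")                # one RDD element per line
                .flatMap(lambda line: line.split())  # map phase: emit words
                .map(lambda word: (word, 1))         # intermediate pairs (word, 1)
                .reduceByKey(lambda a, b: a + b))    # reduce phase: sum per key

    for word, count in counts.collect():
        print(word, count)
    sc.stop()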
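
A possible one-round implementation of matrix-vector multiplication in the same style, under the assumption that the vector fits in the local space of each executor (the sample matrix and vector are illustrative):

    # Matrix-vector multiplication: the sparse matrix A is an RDD of
    # triples (i, j, a_ij); the vector x is broadcast to all executors.
    # Map each entry to (i, a_ij * x_j), then reduce by row index.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "MatVec")

    A = sc.parallelize([(0, 0, 2.0), (0, 1, 1.0), (1, 1, 3.0)])
    x = sc.broadcast([1.0, 2.0])  # assumes x fits in local space

    y = (A.map(lambda e: (e[0], e[2] * x.value[e[1]]))  # (i, a_ij * x_j)
          .reduceByKey(lambda a, b: a + b))             # y_i = sum over j

    print(sorted(y.collect()))  # [(0, 4.0), (1, 6.0)]
    sc.stop()
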
Material:
  1. Slides: Computational Frameworks - MapReduce (final with errata)
  2. Slides: Apache Spark Fundamentals (final with errata)
  3. Set of Exercises with solutions
  4. Article: J. Dean, S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM 51(1): 107-113 (2008)
  5. Article: A. Pietracaprina, G. Pucci, M. Riondato, F. Silvestri, E. Upfal: Space-round tradeoffs for MapReduce computations. ACM ICS 2012: 235-244.
  6. Article: M. Zaharia et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012: 15-28
  7. Article: A Gentle Introduction to Apache Spark. From B. Chambers, M. Zaharia. Spark: The Definitive Guide, Databricks 2018.

Clustering

  • 20/03/2018 Introduction to clustering: definition and some applications.
  • 21/03/2018 Metric spaces: definition; relevant distances (L_r-norm, Jaccard, Cosine, Edit, Hamming); exercise on L_1 and Edit distances. Curse of dimensionality: general characterization of the problem; analysis of the pairwise distances of N points with t random coordinates in [0,1].
  • 27/03/2018 Clustering types. Center-based k-clustering: k-center, k-means, k-median. k-center clustering: farthest-first traversal (algorithm and analysis of the approximation factor; a sketch follows this list); MapReduce (MR) farthest-first traversal (algorithm).
  • 28/03/2018 MapReduce (MR) farthest-first traversal (analysis). Exercises: 1-round MapReduce implementation of Partition(P,S); relation between optimal clustering of a pointset P and of a subset T of P. Brief outline of homework 2.
  • 04/04/2018 Discussion of Homework 2 and live demo on the Spark Web interface. k-means clustering: notion of centroid and its properties; Lloyd's algorithm; observations on the quality of the clustering returned by Lloyd's algorithm.
  • 10/04/2018: k-means clustering: convergence of Lloyd's algorithm; k-means++ (a sketch of Lloyd's algorithm with k-means++ seeding follows this list). k-median clustering: PAM algorithm (pseudocode, example, convergence, approximation factor).
  • 11/04/2018: k-median clustering: adaptation of Lloyd's algorithm strategy; identification of a suitable value for k; weighted k-median clustering (definition of the problem and adaptation of the PAM algorithm); MapReduce MR-PAM algorithm (description and analysis). Exercises: MapReduce implementation of one iteration of Lloyd's algorithm; proof that log_2 N independent runs of k-means++ provide the same quality guarantees as k-means++ but with high probability.
  • 17/04/2018: Cluster evaluation: tendency (Hopkins statistic); unsupervised (cohesion, separation, silhouette coefficient); supervised (entropy of a cluster, entropy of a class). Case study: clustering of deputies based on their key votes. Presentation by Dr. Francesco Silvestri.
  • 18/04/2018: Hierarchical clustering: introduction; general agglomerative strategy; merging criteria (single linkage, complete linkage, average linkage); stopping conditions; time complexity; discussion of accuracy. Examples of theory questions. Presentation of Homework 3.
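
An illustrative plain-Python sketch of the sequential farthest-first traversal for k-center (the choice of the first center is arbitrary; function and variable names are ours):

    import math

    def farthest_first_traversal(points, k):
        # points: list of coordinate tuples, with k <= len(points)
        centers = [points[0]]                 # arbitrary first center
        dist = [math.dist(p, centers[0]) for p in points]
        for _ in range(k - 1):
            # the point farthest from the current centers becomes a center
            i = max(range(len(points)), key=lambda j: dist[j])
            centers.append(points[i])
            # maintain each point's distance to its closest center
            dist = [min(dist[j], math.dist(p, points[i]))
                    for j, p in enumerate(points)]
        return centers

    print(farthest_first_traversal([(0, 0), (1, 0), (10, 0), (11, 0)], 2))
    # [(0, 0), (11, 0)]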
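
A sketch of Lloyd's algorithm with k-means++ seeding, again in plain Python for readability (a fixed number of iterations stands in for a convergence test; names are ours):

    import math
    import random

    def kmeans_pp_init(points, k):
        # k-means++: first center uniform at random; each subsequent center
        # is drawn with probability proportional to the squared distance
        # from the current set of centers.
        centers = [random.choice(points)]
        for _ in range(k - 1):
            d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
            centers.append(random.choices(points, weights=d2)[0])
        return centers

    def lloyd(points, k, iterations=10):
        centers = kmeans_pp_init(points, k)
        for _ in range(iterations):
            # assign each point to its closest center
            clusters = [[] for _ in range(k)]
            for p in points:
                i = min(range(k), key=lambda j: math.dist(p, centers[j]))
                clusters[i].append(p)
            # recompute each center as the centroid of its cluster
            for i, cl in enumerate(clusters):
                if cl:
                    centers[i] = tuple(sum(c) / len(cl) for c in zip(*cl))
        return centers
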
Material:
  1. Slides: Clustering (Part 1) (final with errata)
  2. Slides: Clustering (Part 2) (final with errata)
  3. Slides: Clustering (Part 3) (final with errata)
  4. Set of Exercises with solutions
  5. P.N. Tan, M. Steinbach, V. Kumar. Introduction to Data Mining. Addison Wesley, 2006. Chapter 8.
  6. Article: D. Arthur, S. Vassilvitskii. k-means++: the advantages of careful seeding. Proc. of ACM-SIAM SODA 2007: 1027-1035.
  7. Article: A. Ene, S. Im, B. Moseley. Fast clustering using MapReduce. Proc. of KDD 2011: 681-689.

Association Analysis

  • 24/04/2018: Market basket analysis: frequent itemsets and association rules. Mining of frequent itemsets and of interesting association rules: definition of the computational problem; maximum number of patterns; anti-monotonicity of support; lattice of frequent itemsets; the A-Priori algorithm (pseudocode; a sketch follows this list).
  • 02/05/2018: Analysis of A-Priori: correctness, polynomiality with respect to input and output sizes. Exercise on an alternative candidate generation. Association rules: anti-monotonicity of confidence; algorithm for generating all association rules with respect to given support and confidence thresholds (analysis of correctness and polynomiality omitted).
  • 08/05/2018: Mining frequent itemsets from large datasets: optimization by Park, Chen and Yu to filter out infrequent candidate itemsets of size 2; MapReduce implementation of the SON algorithm; sampling approach by Riondato and Upfal (definition of epsilon-approximate frequent itemsets, sampling algorithm, bound on the sample size).
  • 09/05/2018: Description of Homework 4. Notions of closed, frequent closed, and maximal itemsets.
  • 15/05/2018: Representation of frequent itemsets and their supports through maximal itemsets and their supports (lossy) and through frequent closed itemsets and their supports. Closure of an itemset and its properties. Example where the maximal/closed frequent itemsets are exponential in the input size. Top-K frequent (closed) itemsets: definition.
  • 16/05/2018: Top-K frequent (closed) itemsets: control of K on the output size. Exercises: example showing that the Top-K frequent itemsets can be exponential in the input size even for small K; exercise from the sample written test.
  • 22/05/2018: Exercises: exercise on the intersection of closed itemsets; exercise on the (anti)monotonicity property of a new metric for itemsets.
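
An illustrative Python sketch of A-Priori, with a union-based candidate generation followed by pruning via the anti-monotonicity of support (names are ours; the extraction of association rules from the frequent itemsets is omitted):

    from itertools import combinations

    def apriori(transactions, minsup):
        transactions = [frozenset(t) for t in transactions]
        items = {i for t in transactions for i in t}
        # frequent itemsets of size 1
        freq = {frozenset([i]) for i in items
                if sum(i in t for t in transactions) >= minsup}
        result, size = set(freq), 1
        while freq:
            size += 1
            # candidates: unions of frequent itemsets of the previous level,
            # pruned if any (size-1)-subset is infrequent (anti-monotonicity)
            cands = {a | b for a in freq for b in freq if len(a | b) == size}
            cands = {c for c in cands
                     if all(frozenset(s) in freq
                            for s in combinations(c, size - 1))}
            # support counting over the whole dataset
            freq = {c for c in cands
                    if sum(c <= t for t in transactions) >= minsup}
            result |= freq
        return result

    print(apriori([{"a", "b"}, {"a", "c"}, {"a", "b", "c"}], 2))
    # frequent itemsets: {a}, {b}, {c}, {a,b}, {a,c}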

    N.B.: Slides 19-23 of Part 2 have not been covered in class and are not required for the exam.

Material:
  1. Slides: Association Analysis (Part 1) (final with errata)
  2. Slides: Association Analysis (Part 2) (final with errata)
  3. Set of Exercises with solutions
  4. Article: M. Riondato, E. Upfal: Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees. ACM Trans. on Knowledge Discovery from Data, 2014.

Graph Analytics

  • 22/05/2018: Data networks and graphs. Objectives of graph analytics, examples and applications. Review of basic graph primitives (BFS, DFS, SSSP, MST, Connected Components) and their sequential complexities.
  • 23/05/2018: MapReduce algorithms for MST and Connected Components.
  • 29/05/2018: Average Degrees of Separation (ADS) and Diameter: straightforward algorithms and their impracticality for very large graphs; Neighborhood function; exact computation of ADS and Diameter from the Neighborhood function; HyperANF (main ideas and definition of the sketch, based on HyperLogLog counters, used to represent sets of nodes); approximation of the diameter using 1 pivot (1 execution of BFS/SSSP) or 4 pivots (4 executions of BFS/SSSP, i.e., the 4-SWEEP algorithm); a BFS-based sketch follows this list.
  • 30/05/2018: Exercises: estimation of a graph's diameter with l pivots; generalization of the 2-round MapReduce algorithm for MSF with local space M = n^{1+epsilon}. Review: notions of local and aggregate space in MapReduce; 2nd exercise from the set of exercises on MapReduce; 5th exercise from the set of exercises on Clustering; theory questions from the sample written test.
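
A plain-Python sketch of the BFS computation underlying the pivot-based diameter approximations: a BFS from a pivot u yields its eccentricity ecc(u), and the diameter of a connected graph lies between ecc(u) and 2*ecc(u) (the adjacency-list representation and the example graph are illustrative):

    from collections import deque

    def eccentricity(graph, u):
        # graph: dict mapping each node to its list of neighbors
        # (assumed undirected and connected); standard BFS from u
        dist = {u: 0}
        queue = deque([u])
        while queue:
            v = queue.popleft()
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        return max(dist.values())

    # Example: a path on 4 nodes; from an endpoint, ecc equals the diameter.
    path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    print(eccentricity(path, 0))  # 3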

    N.B.: Slides 57-68 have not been covered in class and are not required for the exam.

Material:
  1. Slides: Graph Analytics (final with errata)
  2. Set of Exercises with solutions
  3. Article: P. Boldi, M. Rosa, S. Vigna. HyperANF: Approximating the Neighbourhood Function of Very Large Graphs on a Budget. Proc. WWW, pages 625-634, 2011.
  4. Article: P. Crescenzi, R. Grossi, M. Habib, L. Lanzi, A. Marino. On computing the diameter of real-world undirected graphs. Theor. Comput. Sci. 514:84-95, 2013.

Last update: 31/05/2018