## Foundations of Data Science - Cornell Computer Science

A, onto v is |ai Â·v| and the sum of length squared of the projections is |Av|2. ...... format. To access the file afterwards you may need to add the file extension .jpg.
Foundations of Data Science

1

John Hopcroft Ravindran Kannan

Version 11/4/2014

These notes are a first draft of a book being written by Hopcroft and Kannan and in many places are incomplete. However, the notes are in good enough shape to prepare lectures for a modern theoretical course in computer science. Please do not put solutions to exercises online as it is important for students to work out solutions for themselves rather than copy them from the internet. Thanks JEH

1

1

Contents 1 Introduction

7

2 High-Dimensional Space 2.1 Properties of High-Dimensional Space . . . . . . . . . . . . . . . . . 2.2 The Law of Large Numbers . . . . . . . . . . . . . . . . . . . . . . 2.3 The High-Dimensional Sphere . . . . . . . . . . . . . . . . . . . . . 2.3.1 The Sphere and the Cube in High Dimensions . . . . . . . . 2.3.2 Volume and Surface Area of the Unit Sphere . . . . . . . . . 2.3.3 The Volume is Near the Equator . . . . . . . . . . . . . . . 2.3.4 The Volume is in a Narrow Annulus . . . . . . . . . . . . . . 2.3.5 The Surface Area is Near the Equator . . . . . . . . . . . . 2.4 Volumes of Other Solids . . . . . . . . . . . . . . . . . . . . . . . . 2.5 Generating Points Uniformly at Random on the Surface of a Sphere 2.6 Gaussians in High Dimension . . . . . . . . . . . . . . . . . . . . . 2.7 Bounds on Tail Probability . . . . . . . . . . . . . . . . . . . . . . . 2.8 Applications of the tail bound . . . . . . . . . . . . . . . . . . . . . 2.9 Random Projection and Johnson-Lindenstrauss Theorem . . . . . . 2.10 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Best-Fit Subspaces and Singular Value Decomposition (SVD) 3.1 Singular Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2 Singular Value Decomposition (SVD) . . . . . . . . . . . . . . . . . 3.3 Best Rank k Approximations . . . . . . . . . . . . . . . . . . . . . 3.4 Left Singular Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . 3.5 Power Method for Computing the Singular Value Decomposition . . 3.6 Applications of Singular Value Decomposition . . . . . . . . . . . . 3.6.1 Principal Component Analysis . . . . . . . . . . . . . . . . . 3.6.2 Clustering a Mixture of Spherical Gaussians . . . . . . . . . 3.6.3 Spectral Decomposition . . . . . . . . . . . . . . . . . . . . 3.6.4 Singular Vectors and Ranking Documents . . . . . . . . . . 3.6.5 An Application of SVD to a Discrete Optimization Problem 3.7 Singular Vectors and Eigenvectors . . . . . . . . . . . . . . . . . . . 3.8 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 Random Graphs 4.1 The G(n, p) Model . . . . . . . . . . . . . 4.1.1 Degree Distribution . . . . . . . . . 4.1.2 Existence of Triangles in G(n, d/n) 4.2 Phase Transitions . . . . . . . . . . . . . . 4.3 The Giant Component . . . . . . . . . . . 2

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . .

. . . . . . . . . . . . . . . .

10 12 13 15 16 17 20 23 24 26 27 27 33 35 38 41 42

. . . . . . . . . . . . . .

52 53 56 58 60 62 64 64 65 71 71 72 75 76 77

. . . . .

85 85 86 91 93 101