Laurens van der Maaten: Constructing Maps to Visualize Big Data
Abstract:
Visualization techniques are essential tools for every data scientist. Unfortunately, the majority of visualization techniques can only be used to inspect a limited number of variables of interest simultaneously. As a result, these techniques are not suitable for big data that is very high-dimensional.
An effective way to visualize high-dimensional data is to represent each data object by a two-dimensional point in such a way that similar objects are represented by nearby points, and that dissimilar objects are represented by distant points. The resulting two-dimensional points can be visualized in a scatter plot. This leads to a map of the data that reveals the underlying structure of the objects, such as the presence of clusters.
The talk gives an overview of techniques that can be used to construct such maps. In addition, we present a new technique for constructing them, called t-Distributed Stochastic Neighbor Embedding (t-SNE). We demonstrate the value of t-SNE in domains such as computer vision and bioinformatics, and we show how to scale up t-SNE to big data sets with millions of objects.
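
As a rough illustration of what "nearby points" means in a t-SNE map, the sketch below computes the pairwise similarities that t-SNE assigns to points in the two-dimensional map using a heavy-tailed Student-t kernel. It covers only this one ingredient: the high-dimensional similarities and the optimization that moves the map points are omitted. The function name `student_t_similarities` and the toy coordinates are assumptions made for this example and do not come from the talk.

```python
def student_t_similarities(points):
    """Pairwise map similarities q[i][j] under a Student-t kernel (one degree of freedom).

    q[i][j] is proportional to 1 / (1 + squared distance between points i and j),
    normalized so that all off-diagonal entries sum to 1.
    """
    n = len(points)
    weights = [[0.0] * n for _ in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a point has no similarity to itself
            squared_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            weights[i][j] = 1.0 / (1.0 + squared_dist)
            total += weights[i][j]
    return [[w / total for w in row] for row in weights]


# Hypothetical toy map: two points close together and one far away.
map_points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
q = student_t_similarities(map_points)
```

In this toy map, `q[0][1]` is much larger than `q[0][2]`, reflecting that the first two points are neighbors while the third is distant. The heavy tail of the kernel means distant pairs still receive a small, non-zero similarity.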