What is Topological Data Analysis?
Introduction
Topological Data Science(TDA) has been bursting with new applications in machine learning and data science this year.
The central dogma of TDA is that data (even complex, and high dimensional) has an underlying shape, and understanding this shape helps to reveal
the process used to generate it. For a great survey on TDA for data scientists, check out [1].
One of the ways which TDA helps understand this shape is through persistence homology, which will be explained in this article.
For actual computations, giotto-tda is a great toolkit for calculations and examples.
The software is free, and instructions for downloading it are aviailable on the giotto-tda Github page
along with several Jupyter notebooks highlighting some of the giotto-tda features.
Persistent Homology
The goal of persistent homology is to determine the true topological descriptors of a dataset.
As an intuitive motivation, suppose \(X\) a is a set of randomly generated points in two-dimensions, such as the one below.
Although real world data is much more complicated than this, it is helpful to understand how TDA applies to datasets which we already understand. We know that the data shown above, for example, is a single connected component with a single hole in the middle. Hence, we should be able to confirm this using persistent homology. But first, we need a few preliminary concepts from Algebraic Topology.
Simplicial Complexes
Simplicial complexes are the main object of study in TDA. They are a topological space which can be thought of as higher dimensional graphs. They are composed by “gluing together” multiple simplices.
- A \(k\)-simplex \(\sigma\) is a generalization of a triangle in \(k\) dimensions.
More speficically, given a set \(X=\{x_1,\dots,x_d\}\) of affinely independent points in \(\mathbb{R}^d\), a \(k\)-dimensional simplex \(\sigma=[x_1,\dots,x_k]\) is the convex hull of \(X\).
(a) 0-simplex is a vertex, (b) 1-simplex is a line, (c) 2-simplex is a triangle, (d) 3-simplex is a tetrahedron |
Simplicies make up simplicial complexes, that is, in a very informal way, we can say that
- A simplicial complex is a set of simplices which are “glued together”.
I am avoiding mathematical rigor here by saying “glued together”, but I think it is more useful to understand what the general vibe is first. The picture below is an example of a simplicial complex with 18 \(0\)-simplexes, 23 \(1\)-simplexes, 8 \(2\)-simplexes, and a single \(3\)-simplex
Notice the way that simplices are glued together. They must be glued in a certain way. To show what I mean by this without getting too technical, consider the non-example shown below.
The motivation for defining simplicial complexes is that we can think of each one of our data points as a simplex, and thus by
creating a simplicial complex from these simplices, we are able to extract meaningful topological descriptors which can reveal
the shape of our dataset.
But how do know which points in our data represent which simplexes? And
how do we actually obtain a simplicial complex?
It’s crucial to keep these questions in mind as we go along, so that we don’t get lost in the abstractions, and to make use of the definitons.
The Čech Complex
Given a dataset \(X\) (a point cloud), we want to create a simplicial complex.
To do this, the idea is to place a closed ball \(\bar{B}_\epsilon(X)\) of radius \(\epsilon\) around each point, and create an edge between any two points if their closed \(\epsilon\) balls intersect.
- Fix \(\epsilon>0\), and assume \(X=\{x_1,\dots,x_d\}\). The Čech Complex \(Č_\epsilon(X)\) is a simplicial complex whose simplicies are defined as follows:
- For each subset of \(S\subseteq X\), form a closed \(\frac{\epsilon}{2}\) ball around each point. If each closed ball has a nonempty intersection, then include \(S\) as a simplex. (See the image below)
Above we see the simplicial complex has 1 connected component, and 1 hole.
But what if we change the value of \(\epsilon\)? For example, if \(\epsilon\) is equal to the largest distance between any two points,
then we obtain a single simplex.
The natural question to ask at this point, is what is the best choice for the value of \(\epsilon\)? But we are not interested in answering this. Rather, we analyze the topological features which persist as we let \(\epsilon\) vary (hence, the name persistence homology). The features which persist as \(\epsilon\) varies greatly are considered to be the “true” features.
Persistence Diagrams
The way that we record the topological features (betti numbers) is by creating persistence diagrams. These diagrams record the birth time, and death time of a feature, with the birth time on the \(x\) axis, and the death time on the \(y\) axis. Traits which are closer to the diagonal usually represent noise (but this is not always the case, especially when applied to biological data), and traits further from the diagonal represent features that persist.
In the next article, we’ll take a look at some concrete examples and applications of Persistence. But that’s all for this one folks. Stay Vibing or whatever.
References
-
Chazal, Frédéric, and Bertrand Michel. “An Introduction to Topological Data Analysis: Fundamental and Practical Aspects for Data Scientists.” ArXiv.org, 11 Oct. 2017, arxiv.org/abs/1710.04019.
-
https://en.wikipedia.org/wiki/Simplicial_complex
-
https://github.com/giotto-ai/giotto-tda