Event 

Title:
Bayesian identifications of data intrinsic dimensions
When:
02.04.2019
Where:
USI Lugano Campus - Lugano
Category:
ICS Events

Description

Bayesian identifications of data intrinsic dimensions

Tuesday, April 2, 2019, 14:30, aula 402 (main building)

Authors: Michele Allegra, Francesco Denti, Elena Facco, Alessandro Laio, Michele Guindani, Antonietta Mira
Presenter: Antonietta Mira, Università della Svizzera italiana and University of Insubria
Abstract:
Even when defined on a high-dimensional space, data points usually lie on one or more hypersurfaces, or manifolds, with much smaller intrinsic dimensions (IDs). The recent TWO-NN method (Facco et al., 2017, Scientific Reports) estimates the ID when all points lie on a single sub-manifold.
TWO-NN assumes only that the density of points is approximately constant in a small neighborhood around each point. Under this hypothesis, the ratio of the distances from a point to its second and first nearest neighbors follows a Pareto distribution that depends parametrically only on the ID.
We first extend the TWO-NN model to the case in which the data lie on several sub-manifolds, each with its own ID. While the idea behind the extension is simple (the Pareto is replaced by a finite mixture of $K$ Pareto distributions), a non-trivial Bayesian algorithm is required to estimate the model and assign each point to its manifold. Applying this method, which we dub Hidalgo (Heterogeneous Intrinsic Dimension ALGOrithm), we uncover a surprising ID variability in several real-world datasets. We show how this methodology helps discover latent clusters hidden in data of different kinds, ranging from protein folding trajectories to financial indices computed from balance sheets.
Hidalgo obtains remarkable results, but its main limitation is that the number of sub-manifolds, i.e. of components in the mixture, must be fixed a priori. To overcome this issue, we employ a flexible Bayesian nonparametric approach and model the data as an infinite mixture of Pareto distributions using a Dirichlet Process Mixture Model. This framework allows us to evaluate the uncertainty about the number of mixture components and about the assignment of data points to sub-manifolds. Since the posterior distribution has no closed form, we perform inference with a slice sampler. In preliminary analyses on simulated and well-known datasets (e.g. Fisher's Iris dataset), the fully Bayesian nonparametric version of TWO-NN gives promising results: it recovers rich data structure starting from the intrinsic dimension, a purely geometric feature of the data, and requires only the definition of a distance measure.
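As a rough illustration of the single-manifold TWO-NN idea described in the abstract, the sketch below computes the ratio of second to first nearest-neighbor distances for each point and plugs it into the maximum-likelihood estimator of the Pareto shape parameter. This is a minimal sketch and not the authors' code: the function names `two_nn_ratios` and `two_nn_mle`, the synthetic test data, and the choice of a plain maximum-likelihood fit (rather than whatever fitting procedure the paper uses) are all assumptions made for illustration. It does not cover the mixture models (Hidalgo or the Dirichlet process version).

```python
import numpy as np


def two_nn_ratios(points):
    """Ratio r2 / r1 of second to first nearest-neighbor distance per point."""
    x = np.asarray(points, dtype=float)
    # Dense pairwise Euclidean distances (fine for a few thousand points).
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    nearest = np.sort(dist, axis=1)[:, :2]
    r1, r2 = nearest[:, 0], nearest[:, 1]
    keep = r1 > 0  # drop exact duplicates, where the ratio is undefined
    return r2[keep] / r1[keep]


def two_nn_mle(points):
    """Maximum-likelihood ID estimate, assuming the ratios are Pareto(1, d)."""
    mu = two_nn_ratios(points)
    # If mu ~ Pareto(scale=1, shape=d), then log(mu) ~ Exponential(rate=d),
    # so the MLE of d is N / sum(log(mu)).
    return len(mu) / np.log(mu).sum()


# Hypothetical test data: a flat 2-D sheet embedded isometrically in 5-D.
rng = np.random.default_rng(0)
latent = rng.uniform(size=(1000, 2))
basis, _ = np.linalg.qr(rng.normal(size=(5, 2)))  # 5x2, orthonormal columns
data = latent @ basis.T
id_hat = two_nn_mle(data)
print(f"estimated intrinsic dimension: {id_hat:.2f}")
```

On data like this the estimate should land close to 2 even though the points live in a 5-dimensional space, since only distances to the two nearest neighbors enter the computation.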

Venue

Venue:
USI Lugano Campus
Street:
Via G. Buffi 13
ZIP:
6900
City:
Lugano
