BasisVAE: Translation-invariant feature-level clustering with Variational Autoencoders

Variational Autoencoders (VAEs) provide a flexible and scalable framework for non-linear dimensionality reduction. However, in application domains such as genomics, where data sets are typically tabular and high-dimensional, a black-box approach to dimensionality reduction does not provide sufficient insights. Common data analysis workflows additionally use clustering techniques to identify groups of similar features. This usually leads to a two-stage process; instead, it would be desirable to construct a joint modelling framework for simultaneous dimensionality reduction and clustering of features. In this paper, we propose to achieve this through the BasisVAE: a combination of the VAE and a probabilistic clustering prior, which lets us learn a one-hot basis function representation as part of the decoder network. Furthermore, for scenarios where not all features are aligned, we develop an extension to handle translation-invariant basis functions. We show how a collapsed variational inference scheme leads to scalable and efficient inference for BasisVAE, demonstrated on various toy examples as well as on single-cell gene expression data.
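The core modelling idea can be illustrated with a minimal sketch (hypothetical names, not the authors' implementation): each feature is assigned to exactly one shared basis function via a one-hot assignment, with a per-feature scale and, in the translation-invariant extension, a per-feature shift.

```python
# Minimal sketch of a one-hot basis function decoder (illustrative only;
# the actual BasisVAE decoder is a neural network trained with a
# collapsed variational inference scheme).
import math

def reconstruct_features(z, basis_fns, assignments, scales, shifts):
    """Model feature j as scales[j] * basis_fns[k](z - shifts[j]),
    where k = assignments[j] selects a single shared basis function
    (the one-hot assignment) and shifts[j] provides translation invariance."""
    return [scales[j] * basis_fns[k](z - shifts[j])
            for j, k in enumerate(assignments)]

# Two shared basis functions; four features clustered onto them.
basis = [math.sin, lambda t: t * t]
out = reconstruct_features(
    z=0.5,
    basis_fns=basis,
    assignments=[0, 0, 1, 1],   # features 0,1 share sin; 2,3 share quadratic
    scales=[1.0, 2.0, 1.0, 0.5],
    shifts=[0.0, 0.1, 0.0, 0.25],
)
```

Here the "clustering" is simply the grouping of features that share the same basis function; in BasisVAE this grouping is learned jointly with the latent representation.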

Neural Decomposition: Functional ANOVA with Variational Autoencoders

Variational Autoencoders (VAEs) have become a popular approach for dimensionality reduction. However, despite their ability to identify latent low-dimensional structures embedded within high-dimensional data, these latent representations are typically hard to interpret on their own. Due to the black-box nature of VAEs, their utility for healthcare and genomics applications has been limited. In this paper, we focus on characterising the sources of variation in Conditional VAEs. Our goal is to provide a feature-level variance decomposition, i.e. to decompose variation in the data by separating out the marginal additive effects of latent variables z and fixed inputs c from their non-linear interactions. We propose to achieve this through what we call Neural Decomposition – an adaptation of the well-known concept of functional ANOVA variance decomposition from classical statistics to deep learning models. We show how identifiability can be achieved by training models subject to constraints on the marginal properties of the decoder networks. We demonstrate the utility of our Neural Decomposition on a series of synthetic examples as well as high-dimensional genomics data.
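The variance decomposition described above follows the standard functional ANOVA form (notation from the abstract; a sketch, not the paper's exact statement). For a single decoded feature,

```latex
f(z, c) = f_0 + f_z(z) + f_c(c) + f_{zc}(z, c),
```

where $f_0$ is a constant, $f_z$ and $f_c$ are the marginal additive effects of the latent variable $z$ and the fixed input $c$, and $f_{zc}$ captures their non-linear interaction. Identifiability requires constraints of the form $\int f_z(z)\,dz = 0$ and $\int f_c(c)\,dc = 0$ (with analogous marginal constraints on $f_{zc}$), which the paper enforces on the decoder networks during training.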

The repertoire of serous ovarian cancer non-genetic heterogeneity revealed by single-cell sequencing of normal fallopian tube epithelial cells

We used single-cell sequencing to map the repertoire of cell types in the fallopian tube epithelium of ovarian cancer patients. We discovered six cell subtypes, with one mesenchymal-high HGSOC subtype robustly correlated with poor prognosis.

Reporting guidelines for clinical trials evaluating artificial intelligence interventions are needed

As part of the CONSORT-AI and SPIRIT-AI Steering Group, we are developing reporting guidelines for the clinical trial evaluation of AI-driven medical interventions.

Decomposing feature-level variation with Covariate Gaussian Process Latent Variable Models

We propose a structured kernel decomposition in a hybrid Gaussian Process model which we call the Covariate Gaussian Process Latent Variable Model (c-GPLVM). We show how the c-GPLVM can extract low-dimensional structures from high-dimensional data sets whilst allowing a breakdown of feature-level variability that is not present in other commonly used dimensionality reduction approaches.
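One way to sketch such a structured kernel decomposition (an assumed form for illustration, not quoted from the paper) is as a sum of additive and interaction terms over the latent variable $z$ and covariate $c$:

```latex
k\big((z, c), (z', c')\big) = k_z(z, z') + k_c(c, c') + k_z(z, z')\, k_c(c, c'),
```

so that the additive terms account for marginal latent and covariate effects, while the product term captures their interaction, enabling the feature-level breakdown of variability.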

Bayesian statistical learning for big data biology

Uncovering pseudotemporal trajectories with covariates from single cell and bulk expression data

Computational techniques have arisen in single-cell ‘omics and cancer modelling, where pseudotime can be used to learn about cellular differentiation or tumour progression. However, methods to date typically implicitly assume homogeneous genetic, phenotypic or environmental backgrounds, which becomes limiting as data sets grow in size and complexity. We describe a novel statistical framework that learns how pseudotime trajectories can be modulated through covariates that encode such factors.

Probabilistic Boolean Tensor Decomposition

We present a probabilistic treatment of Boolean tensor decomposition which allows us to approximate data consisting of multi-way binary relationships as products of interpretable low-rank binary factors, following the rules of Boolean algebra.
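The Boolean product at the heart of this decomposition can be sketched as follows (an illustrative helper, not the paper's probabilistic inference code): a binary 3-way tensor is reconstructed from three binary factor matrices, replacing sums with logical OR and products with logical AND.

```python
# Boolean tensor reconstruction: x[i][j][k] = OR over r of
# (a[i][r] AND b[j][r] AND c[k][r]), following Boolean algebra.
# The paper's contribution is probabilistic inference of these
# low-rank binary factors; this sketch shows only the forward model.

def boolean_reconstruct(A, B, C):
    """A, B, C are binary factor matrices (lists of 0/1 rows) sharing a
    rank dimension R; returns the I x J x K Boolean product tensor."""
    R = len(A[0])
    return [[[int(any(A[i][r] and B[j][r] and C[k][r] for r in range(R)))
              for k in range(len(C))]
             for j in range(len(B))]
            for i in range(len(A))]

# Rank-2 example: two binary patterns combined with Boolean OR.
A = [[1, 0], [0, 1]]
B = [[1, 0], [1, 1]]
C = [[1, 1], [0, 1]]
X = boolean_reconstruct(A, B, C)
```

Because OR saturates at 1, overlapping patterns never "add up" beyond a binary value, which is what makes the low-rank binary factors directly interpretable as multi-way relationships.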

MixDir: Scalable Bayesian Clustering for High-Dimensional Categorical Data

The Hamming ball sampler

Testing and learning on distributions with symmetric noise invariance

Bayesian Boolean matrix factorisation

Statistical inference in hidden Markov models using k-segment constraints

Order Under Uncertainty

Hamming ball auxiliary sampling for factorial hidden Markov models

A sequential algorithm for fast fitting of Dirichlet process mixture models

A decision-theoretic approach for segmental classification

Bayesian non-parametric hidden Markov models with applications in genomics

Comparing CNV detection methods for SNP arrays

CNV discovery using SNP genotyping arrays

Quantitative image analysis of chromosome dynamics in early Drosophila embryos

Bayesian Hidden Markov Models for Detecting Regions of Deletion and Duplication in the Human Genome using Illumina BeadChip Arrays