Publications

BasisVAE: Translation-invariant feature-level clustering with Variational Autoencoders

Variational Autoencoders (VAEs) provide a flexible and scalable framework for non-linear dimensionality reduction. However, in application domains such as genomics where data sets are typically tabular and high-dimensional, a black-box approach to dimensionality reduction does not provide sufficient insights. Common data analysis workflows additionally use clustering techniques to identify groups of similar features. This usually leads to a two-stage process; however, it would be desirable to construct a joint modelling framework for simultaneous dimensionality reduction and clustering of features. In this paper, we propose to achieve this through the BasisVAE: a combination of the VAE and a probabilistic clustering prior, which lets us learn a one-hot basis function representation as part of the decoder network. Furthermore, for scenarios where not all features are aligned, we develop an extension to handle translation-invariant basis functions. We show how a collapsed variational inference scheme leads to scalable and efficient inference for BasisVAE, demonstrated on various toy examples as well as on single-cell gene expression data.

Neural Decomposition: Functional ANOVA with Variational Autoencoders

Variational Autoencoders (VAEs) have become a popular approach for dimensionality reduction. However, despite their ability to identify latent low-dimensional structures embedded within high-dimensional data, these latent representations are typically hard to interpret on their own. Due to the black-box nature of VAEs, their utility for healthcare and genomics applications has been limited. In this paper, we focus on characterising the sources of variation in Conditional VAEs. Our goal is to provide a feature-level variance decomposition, i.e. to decompose variation in the data by separating out the marginal additive effects of latent variables z and fixed inputs c from their non-linear interactions. We propose to achieve this through what we call Neural Decomposition – an adaptation of the well-known concept of functional ANOVA variance decomposition from classical statistics to deep learning models. We show how identifiability can be achieved by training models subject to constraints on the marginal properties of the decoder networks. We demonstrate the utility of our Neural Decomposition on a series of synthetic examples as well as high-dimensional genomics data.

The repertoire of serous ovarian cancer non-genetic heterogeneity revealed by single-cell sequencing of normal fallopian tube epithelial cells

We used single-cell sequencing to map the repertoire of cell types in the fallopian tube epithelium of ovarian cancer patients. We discovered six cell subtypes, with one mesenchymal-high HGSOC subtype robustly correlated with poor prognosis.

Reporting guidelines for clinical trials evaluating artificial intelligence interventions are needed

As part of the CONSORT-AI and SPIRIT-AI Steering Group, we are developing reporting guidelines for the clinical trial evaluation of AI-driven medical interventions.

Decomposing feature-level variation with Covariate Gaussian Process Latent Variable Models

We propose a structured kernel decomposition in a hybrid Gaussian Process model which we call the Covariate Gaussian Process Latent Variable Model (c-GPLVM). We show how the c-GPLVM can extract low-dimensional structures from high-dimensional data sets whilst allowing a breakdown of feature-level variability that is not present in other commonly used dimensionality reduction approaches.

Bayesian statistical learning for big data biology

Uncovering pseudotemporal trajectories with covariates from single cell and bulk expression data

Computational techniques have arisen from single-cell ‘omics and cancer modelling where pseudotime can be used to learn about cellular differentiation or tumour progression. However, methods to date typically implicitly assume homogeneous genetic, phenotypic or environmental backgrounds, which becomes limiting as data sets grow in size and complexity. We describe a novel statistical framework that learns how pseudotime trajectories can be modulated through covariates that encode such factors.

Probabilistic Boolean Tensor Decomposition

We present a probabilistic treatment of Boolean tensor decomposition which allows us to approximate data consisting of multi-way binary relationships as products of interpretable low-rank binary factors, following the rules of Boolean algebra.
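
As a rough illustration of what "products of binary factors, following the rules of Boolean algebra" means, the sketch below builds a three-way binary tensor from three binary factor matrices. It shows only the deterministic Boolean product, not the probabilistic model or the inference scheme from the paper. The function name `boolean_tensor_product` and the toy factors `A`, `B`, `C` are made up for this example.

```python
import numpy as np


def boolean_tensor_product(A, B, C):
    """Boolean rank-R product: X[i, j, k] = OR_r (A[i, r] AND B[j, r] AND C[k, r])."""
    # Count, for every cell, how many components r are active in all three factors.
    counts = np.einsum(
        "ir,jr,kr->ijk", A.astype(int), B.astype(int), C.astype(int)
    )
    # Boolean algebra: 1 + 1 = 1, so any positive count collapses to True.
    return counts > 0


# Hypothetical toy factors with R = 2 components (columns).
A = np.array([[1, 1], [0, 1]], dtype=bool)
B = np.array([[1, 1], [0, 1]], dtype=bool)
C = np.array([[1, 1], [1, 0]], dtype=bool)

X = boolean_tensor_product(A, B, C)
print(X.astype(int))
```

In this toy case both components cover the cell (0, 0, 0), yet the entry stays 1 rather than becoming 2, which is the difference between a Boolean product and an ordinary low-rank product.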

MixDir: Scalable Bayesian Clustering for High-Dimensional Categorical Data

The Hamming ball sampler

Testing and learning on distributions with symmetric noise invariance

Bayesian Boolean matrix factorisation

Statistical inference in hidden Markov models using k-segment constraints

Order Under Uncertainty

Hamming ball auxiliary sampling for factorial hidden Markov models

A sequential algorithm for fast fitting of Dirichlet process mixture models

A decision-theoretic approach for segmental classification

Bayesian non-parametric hidden Markov models with applications in genomics

Comparing CNV detection methods for SNP arrays

CNV discovery using SNP genotyping arrays

Quantitative image analysis of chromosome dynamics in early Drosophila embryos

Bayesian Hidden Markov Models for Detecting Regions of Deletion and Duplication in the Human Genome using Illumina BeadChip Arrays