While language models have made significant strides in textual reasoning, complex visual reasoning, the analogue of question answering in pixel space, remains a frontier. Most vision models can classify or segment, but they struggle with compositional or counterfactual visual tasks.
This project will explore conditional diffusion models as a backbone for visual reasoning. The goal is to move beyond static perception tasks and develop models that can answer “what if” questions about visual scenes. For example, given an image, the model could be prompted to perform tasks such as: “show me this scene if the car were red,” “realistically remove this object,” or “predict the shadow’s position if the light source moved.” This involves designing novel visual reasoning benchmarks and developing architectures that interpret multimodal prompts (e.g., text + masks) to perform complex, compositional image edits, demonstrating a form of visual understanding.
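To make the conditioning idea concrete, here is a minimal PyTorch sketch of one common recipe for mask-and-text conditioned denoising: the edit mask is concatenated to the noisy image along the channel axis (as in inpainting-style diffusion models), and a text embedding is injected via cross-attention. The class name, tensor shapes, and the toy timestep embedding are all illustrative assumptions, not a prescribed architecture for the project.

```python
import torch
import torch.nn as nn

class MaskTextConditionedDenoiser(nn.Module):
    """Toy epsilon-prediction network for a conditional diffusion model.

    Hypothetical sketch: the binary edit mask is stacked channel-wise with
    the noisy image, and text tokens condition the features through
    cross-attention. Real systems use far larger U-Net/transformer backbones.
    """

    def __init__(self, img_channels=3, hidden=64, text_dim=128):
        super().__init__()
        # Noisy image + 1 mask channel, concatenated along the channel axis.
        self.in_conv = nn.Conv2d(img_channels + 1, hidden, 3, padding=1)
        # Toy learned timestep embedding (real models typically use
        # sinusoidal embeddings followed by an MLP).
        self.t_embed = nn.Sequential(
            nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden)
        )
        # Cross-attention: image tokens query the text tokens.
        self.attn = nn.MultiheadAttention(
            hidden, num_heads=4, kdim=text_dim, vdim=text_dim, batch_first=True
        )
        self.out_conv = nn.Conv2d(hidden, img_channels, 3, padding=1)

    def forward(self, x_t, t, mask, text_tokens):
        # x_t: (B, C, H, W) noisy image; t: (B,) timesteps
        # mask: (B, 1, H, W) edit region; text_tokens: (B, L, text_dim)
        h = self.in_conv(torch.cat([x_t, mask], dim=1))
        h = h + self.t_embed(t.float().unsqueeze(-1))[:, :, None, None]
        B, C, H, W = h.shape
        tokens = h.flatten(2).transpose(1, 2)            # (B, H*W, hidden)
        attended, _ = self.attn(tokens, text_tokens, text_tokens)
        h = (tokens + attended).transpose(1, 2).reshape(B, C, H, W)
        return self.out_conv(h)                          # predicted noise

# Smoke test with the standard denoising objective (placeholder noise
# schedule; stand-in random tensors instead of real CLIP/T5 embeddings).
model = MaskTextConditionedDenoiser()
x0 = torch.randn(2, 3, 32, 32)                           # clean images
mask = (torch.rand(2, 1, 32, 32) > 0.5).float()          # edit regions
text = torch.randn(2, 8, 128)                            # text embeddings
t = torch.randint(0, 1000, (2,))
noise = torch.randn_like(x0)
alpha_bar = torch.rand(2)[:, None, None, None]           # toy schedule
x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * noise
loss = ((model(x_t, t, mask, text) - noise) ** 2).mean()
```

The design choice worth noting is the split between the two conditioning paths: spatially aligned signals (the mask) enter by channel concatenation, while unaligned signals (the text) enter by cross-attention, which is the pattern the multimodal-prompt architectures above would build on.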
Strong proficiency in Python and PyTorch/JAX, a good understanding of deep learning fundamentals, and experience with generative models (prior work with diffusion models is a plus).
Advanced generative modeling, multimodal fusion architectures, novel benchmark design for AI, conditional image generation, and the evaluation of abstract reasoning capabilities in vision models.