Methods

Our current projects:

Cross species analysis and methodology transfer for genomic prediction

Project leads: Saranya Arirangan
Major collaborators: Matthew Tegtmeyer (Purdue, Dept of Biological Sciences), Luiz Brito (Purdue, Dept of Animal Sciences), Mitchell Tuinstra (Purdue, Dept of Agronomy)
Objective: This is a broad study of how genetic variants and epigenetic modifications contribute to withstanding and recovering from disease, with the aim of identifying the mechanisms that support resilience. The research may also extend to how epigenetic modifications contribute to recovery from trauma, stress, and adverse social behaviors. A major part of this work is to uncover similar patterns in other species to promote resilience.

Integration of RNA-Seq and morphology data using a shared autoencoder model

Project leads: Zahra Paylakhi
Major collaborators: Matthew Tegtmeyer (Purdue, Dept of Biological Sciences)
Objective: To develop a deep learning framework that integrates gene expression and cellular morphology profiles through a shared latent space, enabling accurate cross-modal predictions and identification of biologically relevant features.
Summary: This project uses RNA-seq gene expression profiles and high-dimensional Cell Painting morphology data to train a shared autoencoder model for cross-modal prediction. The shared latent space enables translation between the two modalities, allowing morphology features to be predicted accurately from gene expression data. Feature importance analyses and pathway enrichment are used to interpret the model and uncover mechanistic links between genetic and morphological variation.
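
The data flow of a shared-latent model of this kind can be sketched as follows. This is a minimal, untrained NumPy illustration and not the project's implementation: the class name `SharedAutoencoder`, the layer sizes, the single-layer tanh encoders, and the equal weighting of the four loss terms are all assumptions made for the example.

```python
import numpy as np


class SharedAutoencoder:
    """Toy two-modality autoencoder with one shared latent space (hypothetical)."""

    def __init__(self, n_genes, n_morph, n_latent, seed=0):
        rng = np.random.default_rng(seed)
        scale = 0.1  # small random weights; a real model would learn these
        # One encoder and one decoder per modality, all meeting in the same latent space.
        self.enc = {
            "rna": rng.normal(0.0, scale, (n_genes, n_latent)),
            "morph": rng.normal(0.0, scale, (n_morph, n_latent)),
        }
        self.dec = {
            "rna": rng.normal(0.0, scale, (n_latent, n_genes)),
            "morph": rng.normal(0.0, scale, (n_latent, n_morph)),
        }

    def encode(self, x, modality):
        """Map samples of one modality into the shared latent space."""
        return np.tanh(x @ self.enc[modality])

    def decode(self, z, modality):
        """Map latent vectors back out to the requested modality."""
        return z @ self.dec[modality]

    def translate(self, x, source, target):
        """Encode with the source encoder, decode with the target decoder."""
        return self.decode(self.encode(x, source), target)

    def loss(self, x_rna, x_morph):
        """Sum of mean squared errors over all source/target pairs.

        Same-modality pairs are reconstruction terms; mixed pairs are
        cross-modal prediction terms.
        """
        data = {"rna": x_rna, "morph": x_morph}
        total = 0.0
        for source in data:
            for target in data:
                pred = self.translate(data[source], source, target)
                total += float(np.mean((pred - data[target]) ** 2))
        return total


# Random stand-in data: 8 paired samples, 50 genes, 20 morphology features.
rng = np.random.default_rng(1)
x_rna = rng.normal(size=(8, 50))
x_morph = rng.normal(size=(8, 20))

model = SharedAutoencoder(n_genes=50, n_morph=20, n_latent=5)
predicted_morph = model.translate(x_rna, source="rna", target="morph")
```

Training would minimize `loss` over the encoder and decoder weights. That step is omitted here, so `predicted_morph` only demonstrates the shapes and the path from expression to morphology.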

Protein language models for predicting the functional impact of synonymous mutations

Project leads: Zahra Paylakhi
Major collaborators: Michel Nivard (The University of Bristol)
Objective: To explore the application of protein language models for evaluating the structural and functional consequences of synonymous mutations, with a focus on model-driven feature extraction and prediction.
Summary: This project leverages large-scale protein models (e.g., the protein language model ESM-2 and the structure predictor AlphaFold2) to study the effects of synonymous mutations, which can influence mRNA stability, translation efficiency, and co-translational protein folding without altering amino acid sequences. By generating structural and sequence embeddings for wild-type and mutant proteins, we compute mutation impact scores and identify potentially deleterious variants. The approach aims to advance understanding of silent mutation biology and inform variant interpretation in precision medicine.
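
One simple way to turn a pair of embeddings into a mutation impact score is the cosine distance between them. The sketch below assumes that definition purely for illustration, since the project's actual scoring function is not specified here. The function name `mutation_impact_score` is hypothetical, and the embeddings are taken as already computed by whichever model is used.

```python
import numpy as np


def mutation_impact_score(wt_embedding, mut_embedding):
    """Cosine distance between wild-type and mutant embeddings.

    0 means the two vectors point in the same direction; larger values mean
    the mutant representation has moved further from the wild type.
    This definition is an assumption made for the example.
    """
    wt = np.asarray(wt_embedding, dtype=float)
    mut = np.asarray(mut_embedding, dtype=float)
    denom = np.linalg.norm(wt) * np.linalg.norm(mut)
    if denom == 0.0:
        raise ValueError("embeddings must be non-zero vectors")
    return 1.0 - float(np.dot(wt, mut) / denom)


# Stand-in vectors; real embeddings would have hundreds of dimensions.
wild_type = [0.2, 0.9, -0.4]
mutant = [0.1, 0.8, -0.1]
score = mutation_impact_score(wild_type, mutant)
```

Variants can then be ranked by `score`, with any threshold for calling a variant potentially deleterious chosen against labeled data.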

What machine learning teaches us about depression prediction across the life course: An exploratory comparison of predictive models

Project leads: Rafael Geurgas
Major collaborators: Katherine N. Thompson (Purdue, Dept of Sociology), Evalina T. Akimova (Purdue, Dept of Sociology), Felix Tropf (University College London), Saul Newman (University College London)
Objective: To use machine learning models to predict the risk of depression in adolescence and adulthood from early-life environmental factors and genetic predispositions, with a focus on improving prediction accuracy over traditional methods.
Summary: We applied machine learning models to predict depression risk at two life stages, adolescence and adulthood, using early-life environmental factors, genetic data, and self-reported symptoms. Using a nationally representative longitudinal dataset, we compared the performance of four machine learning algorithms (Random Forest, XGBoost, Support Vector Machine, and Neural Networks) with a traditional statistical method (Logistic Regression). XGBoost consistently outperformed the other models in predicting depressive symptoms and clinical diagnoses, but its gains over Logistic Regression were minimal. Early-life data showed substantial predictive value for both adolescent and adult depressive symptoms, as well as for clinical diagnoses, emphasizing adolescence as a critical period for identifying individuals at long-term mental health risk. Notably, polygenic scores added minimal predictive power when combined with environmental data. Feature importance analyses, which identify the most informative survey questions, revealed that questions related to self-perception and physical health were the strongest predictors of depressive symptoms, whereas trauma and life-changing events were more important for predicting clinical depression.
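
A comparison like this comes down to scoring every model on the same held-out labels with a common metric. The sketch below assumes the area under the ROC curve (AUC) as that metric, which is an assumption and not a statement of what the study used. The function names and the toy scores are made up for illustration and are not study results.

```python
import numpy as np


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    y = np.asarray(y_true).astype(bool)
    s = np.asarray(scores, dtype=float)
    pos, neg = s[y], s[~y]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one positive and one negative label")
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (pos.size * neg.size))


def rank_models(y_true, scores_by_model):
    """Return (model name, AUC) pairs sorted from best to worst."""
    aucs = {name: roc_auc(y_true, s) for name, s in scores_by_model.items()}
    return sorted(aucs.items(), key=lambda item: item[1], reverse=True)


# Toy held-out labels and predicted risks from two hypothetical models.
labels = [0, 0, 1, 1]
ranking = rank_models(
    labels,
    {
        "model_a": [0.1, 0.4, 0.35, 0.8],
        "model_b": [0.1, 0.3, 0.35, 0.8],
    },
)
```

Looking at the size of the gap between adjacent entries in `ranking`, and not only their order, is what separates "consistently best" from "meaningfully better".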

AI-driven framework for detecting nonlinear genetic interactions in complex traits

Project leads: Saranya Arirangan
Major collaborators: Boran Gao (Purdue, Dept of Biological Sciences; Dept of Statistics), Geyu Zhou (Purdue, Dept of Statistics; Dept of Biological Sciences), Matthew Tegtmeyer (Purdue, Dept of Biological Sciences), Luiz Brito (Purdue, Dept of Animal Sciences), Mitchell Tuinstra (Purdue, Dept of Agronomy)
Objective: To develop a scalable and interpretable AI-driven framework for identifying disease-associated loci, including SNP-SNP interactions, in complex traits, and to uncover biologically meaningful multi-locus patterns through analysis of feature importance.
Summary: Genome-wide association studies (GWAS) have identified many trait-associated loci, but these explain only a small portion of total heritability, a gap often referred to as missing heritability. Epistasis, or non-linear interaction between multiple genetic variants, may contribute to this missing heritability. The true impact of epistatic effects on human traits is still poorly understood, with few confirmed examples in large-scale studies. We propose a deep learning-based framework for epistasis discovery that uses Variational Autoencoders (VAE) for unsupervised dimensionality reduction of high-dimensional genotype data, followed by Random Forests (RF) for non-linear feature selection and interaction modeling. The VAE learns compressed latent representations that capture underlying structure in the genotype space, while the RF model uses these features to identify SNPs that influence complex phenotypes, both individually and through interactions. Together, this hybrid approach aims to overcome the limitations of marginal testing by accounting for multi-locus dependencies and enhancing interpretability in causal SNP prioritization.
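
The limitation of marginal testing can be seen in a toy example. The sketch below is not the VAE and Random Forest pipeline described above. It only illustrates, under simplifying assumptions (two hypothetical loci coded as binary carrier status, and a phenotype that is a pure XOR of the two), how single-SNP tests can miss an effect that an interaction term recovers.

```python
import numpy as np

# Binary carrier status at two hypothetical loci; the four genotype
# combinations are equally frequent (100 simulated individuals).
snp_a = np.tile([0, 0, 1, 1], 25)
snp_b = np.tile([0, 1, 0, 1], 25)

# Pure epistasis: the phenotype is present only when exactly one locus is carried.
phenotype = np.logical_xor(snp_a, snp_b).astype(float)


def pearson(x, y):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


# Marginal (single-SNP) associations: both are zero in this design.
marginal_a = pearson(snp_a, phenotype)
marginal_b = pearson(snp_b, phenotype)

# Centered product term: perfectly (negatively) correlated with the phenotype.
interaction = pearson((snp_a - 0.5) * (snp_b - 0.5), phenotype)
```

Each SNP looks irrelevant on its own, while the product term explains the phenotype completely. Real epistatic signals are far weaker and noisier, which is why models that can represent multi-locus dependencies are of interest.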

Beyond risk factors: Using large language models for narrative analysis of undetermined intent cases in NVDRS

Project leads: Rafael Geurgas
Major collaborators: Alina Arseniev-Koehler (Purdue, Dept of Sociology)
Summary: Undetermined deaths are difficult to classify as suicide, accident, or other causes, which limits the accuracy of public health data. This project uses the National Violent Death Reporting System (NVDRS) to analyze both structured variables and narrative reports from death investigations. By applying Large Language Models (LLMs) to these narratives, we aim to uncover hidden patterns that shed light on the social and contextual factors surrounding these deaths. The findings will help communities better understand how such cases are classified, improve suicide surveillance, and reveal how social and institutional processes influence the way deaths are recorded.

Additional research areas

Data
