Genotype Embeddings for Pharmacogenomics (Ongoing)
Developing a representation learning framework for high-dimensional genotype data using LD block–based variational autoencoders. Designed to overcome n << d limitations (~1K samples, 339K variants), the approach learns biologically meaningful embeddings for downstream tasks such as patient stratification and drug response modeling.
This framework improves structure discovery compared to traditional LD-pruned PCA (variance explained by PC1: 13% vs. <1%) and is being extended to large-scale cohorts (UK Biobank) to enable scalable genotype representation learning beyond single-variant and PRS-based methods.
Key insights
- LD-aware modeling: Block-based embeddings capture local genetic structure beyond LD-pruned SNP approaches.
- High-dimensional learning: Overcomes n << d constraints, enabling modeling of 300K+ variants with limited samples.
- Improved structure discovery: PC1 of the learned embeddings explains ~13% of variance, compared with <1% for LD-pruned PCA.
- Scalable framework: Designed for extension to UK Biobank-scale data for pharmacogenomic applications.
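As a rough illustration of the block-based preprocessing idea, the sketch below partitions position-ordered variants into contiguous blocks using a greedy rule on r² between adjacent variants. The rule, the `r2_threshold` value, the function names, and the toy dosages are assumptions made for this example; the project's actual LD block definition and the per-block variational autoencoders are not shown.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:  # monomorphic variant: treat LD as zero
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def partition_ld_blocks(variants, r2_threshold=0.5):
    """Greedily group adjacent variants into contiguous blocks.

    variants: list of per-variant dosage lists (0/1/2), ordered by position.
    Returns (start, end) index pairs with end exclusive.
    A new block starts whenever r^2 with the previous variant drops
    below r2_threshold (a hypothetical rule chosen for illustration).
    """
    if not variants:
        return []
    blocks, start = [], 0
    for i in range(1, len(variants)):
        if r_squared(variants[i - 1], variants[i]) < r2_threshold:
            blocks.append((start, i))
            start = i
    blocks.append((start, len(variants)))
    return blocks


# Toy data: 4 variants x 6 samples (made up for the example).
variants = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 1],
    [2, 0, 1, 1, 0, 2],
    [2, 0, 1, 1, 0, 2],
]
blocks = partition_ld_blocks(variants)
```

In a block-based design, each `(start, end)` slice would be fed to its own small encoder, so no single model has to handle all 339K variants at once.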
Proteomics Analysis of Pediatric Long COVID (Ongoing)
Analyzed Olink NPX proteomics data to identify neurocognitive subtype–specific protein signatures in pediatric Long COVID. Applied covariate-adjusted logistic regression across 87 plasma proteins, accounting for cohort heterogeneity and clinical confounding.
Extended the analysis to the UK Biobank adult cohort using PySpark and SQL-based pipelines, enabling scalable integration of large proteomics and phenotype datasets for subtype discovery.
Key insights
- Subtype discovery: Identified protein signatures associated with neurocognitive Long COVID presentations.
- Robust modeling: Covariate-adjusted logistic regression accounted for cohort and clinical heterogeneity.
- Scalable analysis: Leveraged PySpark and SQL pipelines for large-scale UK Biobank data processing.
- Translational relevance: Highlights proteomic markers linked to post-viral neurological outcomes.
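The per-protein modeling described above can be sketched as one logistic regression per protein, each including the same covariates. The example below is a minimal stdlib-only version that assumes a plain gradient-ascent fit and simulated data; the protein names, the single `age` covariate, and the effect sizes are invented for illustration and do not reflect the study's data or software.

```python
import math
import random


def fit_logistic(X, y, lr=0.5, n_iter=500):
    """Fit logistic regression by batch gradient ascent on the mean log-likelihood.

    X: list of feature rows (intercept column included); y: list of 0/1 labels.
    Returns the coefficient list in column order.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        grad = [0.0] * p
        for row, label in zip(X, y):
            z = sum(b * v for b, v in zip(beta, row))
            prob = 1.0 / (1.0 + math.exp(-z))
            for j in range(p):
                grad[j] += (label - prob) * row[j]
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


def scan_proteins(npx, covariates, outcome):
    """Return the covariate-adjusted log-odds coefficient for each protein.

    npx: dict of protein name -> list of NPX values (one per sample).
    covariates: list of per-sample covariate rows.
    outcome: list of 0/1 subtype labels.
    """
    results = {}
    for name, values in npx.items():
        # Design matrix columns: intercept, protein, then covariates.
        X = [[1.0, v] + list(c) for v, c in zip(values, covariates)]
        results[name] = fit_logistic(X, outcome)[1]
    return results


# Simulated toy data: PROT_A drives the outcome, PROT_B is noise.
random.seed(0)
n = 300
age = [random.gauss(0, 1) for _ in range(n)]
prot_a = [random.gauss(0, 1) for _ in range(n)]
prot_b = [random.gauss(0, 1) for _ in range(n)]
outcome = [
    1 if random.random() < 1 / (1 + math.exp(-(2.0 * a + 0.5 * g))) else 0
    for a, g in zip(prot_a, age)
]
effects = scan_proteins(
    {"PROT_A": prot_a, "PROT_B": prot_b}, [[g] for g in age], outcome
)
```

A real analysis would normally use an established statistics package to obtain standard errors and p-values; this sketch only returns point estimates.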
Endophenotype-driven GWAS Framework for Asthma Heterogeneity
Developed a machine learning–driven GWAS framework to model asthma heterogeneity using PCA-derived clinical endophenotypes. Patients were stratified into five ordinal subgroups (Q1–Q5) spanning mild to severe, atopy-rich disease. These subtypes were integrated into a categorical ANCOVA model, enabling simultaneous association testing across biologically coherent patient groups and improving detection of subtype-specific genetic signals.
PCA-based clinical stratification into endophenotypes (Q1–Q5) integrated with categorical GWAS to identify subtype-specific genetic associations.
Key insights
- Subtype-aware GWAS: ANCOVA-based modeling across all endophenotypes increased power, identifying 244 significant SNPs versus limited findings from standard approaches.
- Robust replication: Six loci (e.g., DGKI, MIR99AHG) replicated across independent cohorts, enriched in severe subgroups (Q4–Q5).
- Treatment stratification: Endophenotypes predicted ICS response, with the most severe group showing greatest lung function improvement.
- Biological alignment: Genetic signals mapped to clinical gradients (IgE, lung function), improving interpretability for precision medicine.
Representative publications: PCA-based endophenotype definition, ANOVA for endophenotype-specific GWAS, Asthma pharmacogenetics through subtype-specific associations
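The stratify-then-test idea can be sketched in a few lines. The example assumes that Q1–Q5 are equal-sized rank groups along a single PCA-derived score, and it uses a one-way F statistic of SNP dosage across groups as a covariate-free simplification of the ANCOVA model. Both choices, along with the function names and toy values, are assumptions for illustration rather than the published method.

```python
def assign_quintiles(scores):
    """Label each score 'Q1'..'Q5' by rank (equal-sized groups, ties by input order)."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    labels = [None] * n
    for rank, idx in enumerate(order):
        labels[idx] = f"Q{rank * 5 // n + 1}"
    return labels


def one_way_f(values, labels):
    """One-way ANOVA F statistic of `values` across the groups in `labels`."""
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    ss_within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Toy data: a PCA-derived severity score and one SNP's dosage for 10 patients.
pc_scores = [0.3, -1.2, 2.1, 0.0, 1.5, -0.7, 0.9, -2.0, 1.1, -0.1]
dosage = [1, 0, 2, 0, 2, 0, 1, 0, 2, 1]

endophenotype = assign_quintiles(pc_scores)
f_stat = one_way_f(dosage, endophenotype)
```

Testing all five groups in one model, rather than running a separate case-control GWAS per subgroup, is what allows a single statistic per SNP to reflect differences anywhere along the severity gradient.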
miRNA Modifiers of Inhaled Corticosteroid Response in Asthma
Investigated molecular drivers of variable response to inhaled corticosteroids (ICS) in asthma using circulating microRNA (miRNA) profiles. Applied interaction modeling (Exacerbation ~ ICS × miRNA) across two pediatric cohorts (CAMP discovery and GACRS replication) to identify biomarkers associated with treatment resistance.
Identified miR-584-5p as a novel modifier of ICS response, where higher expression was associated with increased exacerbation risk specifically in treated patients, highlighting its potential as a biomarker for steroid resistance.
Pathway enrichment highlighting TGF-β, NF-κB, and immune signaling pathways linked to miR-584-5p–mediated corticosteroid response.
Key insights
- Interaction-based discovery: ICS × miRNA modeling identified modifiers missed by standard differential expression approaches.
- Novel biomarker: miR-584-5p was associated with increased exacerbation risk specifically in ICS-treated patients.
- Cross-cohort validation: Findings were consistent across independent pediatric cohorts (CAMP, GACRS).
- Biological relevance: Enriched pathways (TGF-β, NF-κB, T cell signaling) link miRNA regulation to airway inflammation and remodeling.
Representative publication: Micro-RNA-584-5p as a key modulator of ICS resistance (Full paper in progress)
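To make the interaction term in Exacerbation ~ ICS × miRNA concrete, the sketch below assumes miRNA expression has been dichotomized into high and low groups, which is a simplification introduced here. With both predictors binary, the interaction coefficient of the saturated logistic model equals the difference between the log odds ratios in treated and untreated patients. All counts are invented for illustration and are not study results.

```python
from math import log


def odds_ratio(high_event, high_no_event, low_event, low_no_event):
    """Odds ratio of exacerbation for high vs. low miRNA from a 2x2 table."""
    return (high_event * low_no_event) / (high_no_event * low_event)


def interaction_log_or(treated_table, untreated_table):
    """Interaction coefficient: log(OR in treated) - log(OR in untreated).

    Each table is (high_event, high_no_event, low_event, low_no_event).
    A value above zero means high miRNA is tied to exacerbations more
    strongly in ICS-treated patients than in untreated patients.
    """
    return log(odds_ratio(*treated_table)) - log(odds_ratio(*untreated_table))


# Hypothetical counts, made up for the example.
treated = (30, 20, 15, 35)    # OR = 3.5 among ICS-treated patients
untreated = (20, 30, 20, 30)  # OR = 1.0 among untreated patients

beta_interaction = interaction_log_or(treated, untreated)
```

A marginal differential-expression test pools treated and untreated patients, so an effect confined to one arm, as in these toy counts, can be diluted; the interaction term targets that difference directly.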
Machine Learning Analysis of Pediatric COVID-19 Clinical Data
Developed a Random Forest classifier to predict COVID-19 infection status in children using a dataset of 2,572 chest X-ray impressions and electronic medical records. By integrating radiological findings, clinical symptoms, and demographic variables, the model achieved an F1-score of 0.79 and an AUC of 0.85.
SHAP summary highlighting radiological, symptom, and demographic features that contributed most to pediatric COVID-19 prediction.
Key insights
- Feature engineering & selection: NLP and NegEx were used to identify 16 radiographic findings, including pneumonia, atelectasis, and small airways disease.
- Model interpretability: SHAP showed that radiological and constitutional features were more predictive than prior medical history.
- Demographic insights: Age, sex, and ethnicity, particularly Hispanic males under age 8, contributed meaningfully to model performance.
- Incremental modeling: Compared five Random Forest models to assess feature contributions, showing radiological features as primary predictors, with symptoms and demographics improving performance.
Representative publication: Case control prediction using clinical data
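The negation-aware extraction of radiographic findings can be illustrated with a much-simplified, NegEx-style rule: a finding counts as negated when a trigger phrase appears within a few tokens before it in the same sentence. The trigger list, the window size, and the function name below are assumptions for this sketch; the real NegEx algorithm uses a larger curated trigger set and additional rules.

```python
import re

# Three of the findings named above; the full list of 16 is not reproduced here.
FINDINGS = ["pneumonia", "atelectasis", "small airways disease"]
# Illustrative subset of negation triggers (assumption, not the NegEx lexicon).
NEGATION_TRIGGERS = ["no evidence of", "negative for", "without", "no"]
WINDOW = 5  # number of tokens before a finding that are searched for a trigger


def extract_findings(impression):
    """Map each mentioned finding to True (affirmed) or False (only negated)."""
    results = {}
    for sentence in re.split(r"[.;]", impression.lower()):
        for finding in FINDINGS:
            pos = sentence.find(finding)
            if pos == -1:
                continue
            window = " ".join(sentence[:pos].split()[-WINDOW:])
            negated = any(
                re.search(r"\b" + re.escape(trigger) + r"\b", window)
                for trigger in NEGATION_TRIGGERS
            )
            # A finding affirmed in any sentence stays affirmed.
            results[finding] = results.get(finding, False) or not negated
    return results


report = "No evidence of pneumonia. Mild atelectasis at the left base."
features = extract_findings(report)
```

The resulting dictionary can be turned into binary feature columns (one per finding) for a tree-based classifier.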
Context-Aware Testing of Mobile Applications
Developed a framework for testing context-aware mobile applications by modeling event-driven behavior using probabilistic and neural approaches. Combined conditional random fields (CRFs), neural network–based component discovery, and combinatorial optimization to capture dependencies between user actions, system states, and environmental context, enabling systematic exploration of large, constrained event spaces in Android applications.
This work established a foundation for modeling structured, dependent data and scalable search strategies, principles that extend to my current work in genomics and clinical machine learning.
Key insights
- Structured sequence modeling: CRFs captured temporal dependencies in event-driven data.
- Neural component discovery: Learned representations identified latent system components.
- Scalable test generation: Combinatorial methods efficiently explored large event spaces.
- System-level impact: Built CATDroid for automated testing of sensor-driven Android apps.
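One common way to explore a large, constrained event space without enumerating it exhaustively is pairwise (2-way) combinatorial coverage. The sketch below is a generic greedy pairwise generator offered as an illustration of that family of methods; it is not CATDroid's algorithm, and the factor names, values, and the GPS constraint are hypothetical.

```python
from itertools import combinations, product


def pairs_of(test):
    """All 2-way (factor, value) combinations covered by one test tuple."""
    return set(combinations(test, 2))


def greedy_pairwise(factors, is_valid):
    """Build a test suite covering every feasible pair of factor values.

    factors: dict of factor name -> list of values.
    is_valid: predicate on a {factor: value} dict encoding the constraints.
    Enumerates valid combinations, so it only suits small models.
    """
    names = sorted(factors)
    candidates = [
        tuple(zip(names, values))
        for values in product(*(factors[n] for n in names))
        if is_valid(dict(zip(names, values)))
    ]
    uncovered = set().union(*(pairs_of(c) for c in candidates))
    suite = []
    while uncovered:
        # Pick the candidate that covers the most still-uncovered pairs.
        best = max(candidates, key=lambda c: len(pairs_of(c) & uncovered))
        suite.append(dict(best))
        uncovered -= pairs_of(best)
    return suite


# Hypothetical context model for a sensor-driven app.
factors = {
    "gps": ["on", "off"],
    "network": ["wifi", "cellular", "none"],
    "event": ["tap", "rotate", "location_update"],
}


def gps_constraint(test):
    """Location updates can only occur while GPS is on."""
    return not (test["event"] == "location_update" and test["gps"] == "off")


suite = greedy_pairwise(factors, gps_constraint)
```

Each test in `suite` satisfies the constraint, and every feasible pair of context values appears in at least one test, using fewer tests than the full set of valid combinations.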