LD Block-Based Genotype Embeddings for Asthma Biology and UK Biobank Extension (Ongoing)

Built a two-stage LD-block representation learning framework for asthma genotype data, combining block-wise VAEs with transformer-based contextualization to learn biologically meaningful subject embeddings.

Genomics Representation Learning VAE Transformer LD Blocks Asthma UK Biobank PyTorch
LD block-based genotype embedding framework showing block VAE encoding, transformer contextualization, emergent genomic structure, and biological interpretation in asthma

Two-stage LD block embedding framework showing independent block encoding, transformer contextualization, emergent genomic structure, and biological interpretation including HLA class II dominance and PDE4D emergence.

Key insights

  • Recovered known structure: HLA class II emerged as the dominant unsupervised axis, showing that the model recovers meaningful asthma biology without using phenotype labels.
  • Found secondary biology: After masking HLA effects, PDE4D appeared as an independent signal, suggesting the framework can reveal additional disease-relevant pathways beyond the strongest locus.
  • Transformer adds value: Cross-block contextualization preserved local LD structure while reorganizing subjects into clearer biological patterns than block-level embeddings alone.
  • Scaling next: The same framework is now being extended to UK Biobank cohort.

Single-Cell Immune State Modeling and Multi-Modal Integration

Built single-cell analysis workflows spanning RNA-only PBMC benchmarking and ongoing paired RNA+ATAC integration. Combined clustering, cell-state scoring, cell–cell communication, chromatin accessibility, and transcription-factor inference to characterize immune-cell states across stimulated and control conditions.

Single Cell scRNA-seq scATAC-seq Multiome Scanpy Seurat WNN Cell-State Scoring Cell-Cell Communication TF Inference
UMAP projection of PBMC single-cell RNA-seq data showing annotated immune cell populations

UMAP of PBMC single-cell RNA-seq data after clustering and annotation, highlighting major immune-cell populations used for downstream cell-state scoring, differential expression, and multi-modal integration analyses.

Highlights

  • RNA-only benchmarking: Compared multiple cell-state scoring strategies across PBMC3k and interferon-stimulated Kang datasets using clustering, marker genes, and immune program scores.
  • Cross-modal integration: Integrated paired RNA+ATAC data with weighted nearest neighbors (WNN) to evaluate how chromatin accessibility refines immune-cell separation.
  • Regulatory modeling: Inferred transcription-factor activity from RNA and validated it against ATAC motif accessibility, with ongoing work on pseudotime and dynamic regulatory programs.

Proteomics Analysis of Long COVID Heterogeneity

Built a proteomics analysis workflow to study neurocognitive Long COVID using Olink NPX data, combining covariate-adjusted modeling in a pediatric cohort with scalable phenotype–proteomics integration in ~50,000 UK Biobank participants.

Proteomics Olink NPX Long COVID UK Biobank PySpark Spark SQL Logistic Regression

Highlights

  • Subtype-focused analysis: Distinguished neurocognitive and non-neurocognitive Long COVID using protein-level signatures.
  • Replication at scale: Extended the workflow to ~50K UK Biobank participants with distributed Spark-based integration.
  • Biological follow-up: Candidate proteins and pathways point to immune and neuroinflammatory mechanisms underlying Long COVID heterogeneity.

Endophenotype-driven GWAS Framework for Asthma Heterogeneity

Developed a subtype-aware GWAS framework for asthma by combining PCA-derived clinical endophenotypes with categorical association modeling, enabling genetic analysis across biologically coherent patient subgroups rather than treating asthma as a single uniform phenotype.

PCA GWAS ANCOVA Statistical Genetics Endophenotypes Asthma ICS Response R
Endophenotype-driven GWAS framework illustrating PCA-based stratification and subtype-specific association modeling

PCA-based clinical stratification into endophenotypes integrated with subtype-aware GWAS to identify asthma-associated loci across disease severity groups.

Highlights

  • Subtype-aware association testing: Endophenotype-based modeling improved detection of genetic signals compared with standard case-control GWAS.
  • Replication across cohorts: Multiple loci, including DGKI and MIR99AHG, replicated in independent asthma datasets.
  • Clinical relevance: Endophenotypes aligned with severity, IgE, lung function, and differential ICS treatment response.

miRNA Modifiers of Inhaled Corticosteroid Response in Asthma

Identified miR-584-5p as a potential biomarker of inhaled corticosteroid resistance in asthma using interaction modeling across two pediatric cohorts.

miRNA Pharmacogenomics Asthma ICS Response RNA-seq

Machine Learning Analysis of Pediatric COVID-19 Clinical Data

Built a Random Forest model integrating chest X-ray impressions, symptoms, and demographics to predict pediatric COVID-19 infection status.

Random Forest Clinical NLP SHAP Chest X-ray EHR

Structured Modeling of Context-Aware Systems

PhD research on modeling event-driven mobile applications using probabilistic sequence models, neural representation learning, and combinatorial optimization.

CRF Neural Networks Sequence Modeling Optimization PhD Dissertation