Dimension reduction in an RNA-Seq-based model

While designing a model to describe a patient's immuno-oncology (IO) status, Ghost encountered a significant dimension reduction problem. The model aims to integrate well-known immune-related factors (listed in the figure below).

The Dimension Reduction Challenge in IO Modeling

Most of these factors can be evaluated using RNA sequencing (RNA-Seq) or whole-exome sequencing (WES)—with the exception of tumor-infiltrating lymphocyte (TIL) localization and possibly PD-L1 staining for CD8+ activation. From a modeling perspective, this process inherently involves dimension reduction, as raw genomic data (e.g., expression levels of >20,000 genes or hundreds of mutations identified via RNA-Seq/WES) must be condensed into a set of biologically meaningful IO-related features, such as:

Immunogenicity
TIL composition
T-cell receptor (TCR) status

These features offer critical biological insights into the patient's immune landscape.

However, epitope presentation and leukocyte recruitment remain underrepresented in standard feature extraction. How can we derive features that effectively capture these processes? Below, we compare common feature extraction methods and their limitations.

Common Feature Extraction Methods and Their Drawbacks

1. Manual Gene Selection

A straightforward approach involves manually curating genes known to influence specific biological processes.

Drawback: This method is inherently arbitrary and often includes irrelevant genes, diluting well-defined features and reducing model robustness.
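To make the manual approach concrete, here is a minimal sketch that collapses a hand-picked gene list into one per-sample score by averaging z-scored expression. The gene list, the toy expression values, and the function names are illustrative assumptions, not taken from any specific model.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize one gene's expression across samples."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        # A gene with no variation carries no information; score it as zero.
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]


def signature_score(expression, genes):
    """Average the z-scores of the curated genes, one score per sample.

    expression: dict mapping gene name -> list of per-sample values.
    genes: hand-curated gene list (hypothetical here).
    """
    present = [g for g in genes if g in expression]
    if not present:
        raise ValueError("none of the curated genes are in the expression table")
    z = [zscore(expression[g]) for g in present]
    n_samples = len(z[0])
    return [mean(row[i] for row in z) for i in range(n_samples)]


# Toy data: 3 samples, values are made up for illustration.
expression = {
    "HLA-A": [5.0, 7.0, 9.0],
    "B2M": [10.0, 12.0, 14.0],
    "TAP1": [3.0, 3.5, 4.0],
    "ACTB": [20.0, 20.0, 20.0],
}
# Illustrative antigen-presentation list; one gene is deliberately missing.
antigen_presentation_genes = ["HLA-A", "B2M", "TAP1", "NOT_MEASURED"]
scores = signature_score(expression, antigen_presentation_genes)
print(scores)
```

The score depends entirely on which genes made it into the list, which is exactly the arbitrariness described above.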

2. Dimensionality Reduction Techniques (e.g., NMF, PCA, SVD)

Another approach is applying well-established dimension reduction methods, such as:

Non-negative matrix factorization (NMF)
Principal component analysis (PCA)
Singular value decomposition (SVD)

Drawback: Each extracted component is a linear combination of many genes. PCA and SVD produce dense loadings with mixed signs, and even NMF, which tends to give sparser, non-negative factors, yields components that do not map cleanly onto a named biological process. As a result, downstream analyses become difficult to interpret in a biological context.
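The interpretability problem is easy to see on a toy matrix. The sketch below runs PCA through an SVD of centered data with NumPy and inspects the first component's loadings. The random matrix stands in for a samples × genes expression table and is purely an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 200))  # 30 samples x 200 genes (synthetic)

# PCA via SVD: center each gene, then decompose.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 5
scores = U[:, :k] * S[:k]  # samples x components (the reduced features)
loadings = Vt[:k]          # components x genes (how each gene contributes)

# Every gene contributes to the first component, with both signs present.
n_nonzero = int(np.count_nonzero(np.abs(loadings[0]) > 1e-12))
has_both_signs = bool((loadings[0] > 0).any() and (loadings[0] < 0).any())
print(scores.shape, n_nonzero, has_both_signs)
```

Because each component draws on all 200 genes with positive and negative weights, there is no obvious way to label it as, say, "epitope presentation".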

Thus, both approaches present challenges:

Manually selected gene-level features sit at a different level of abstraction than the model's other, process-level features, which reduces coherence.
Matrix decomposition-based features lack biological interpretability, making downstream results harder to validate.

A More Elegant Approach to Feature Extraction

Ghost found inspiration in Vesteinn Thorsson’s work, which presents a refined approach to feature selection in IO modeling. Let’s briefly outline their methodology:

Step 1: Gene Signature Selection

A set of 160 IO-related gene expression signatures was manually curated from various sources, including MSigDB.

Step 2: First-Round Feature Reduction via Gene Set Enrichment Analysis (GSEA)

GSEA was performed on tumor samples to generate a sample × signature matrix with enrichment scores.
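As a rough illustration of how a sample × signature matrix can be built, the sketch below computes a simplified, unweighted running-sum enrichment statistic per sample. This is a Kolmogorov-Smirnov-style approximation written for clarity, not the exact GSEA algorithm or the authors' implementation. All gene names, signature names, and values are hypothetical.

```python
def enrichment_score(sample_expr, gene_set):
    """Signed maximum deviation of a running sum over genes ranked by expression.

    sample_expr: dict mapping gene -> expression value for ONE sample.
    gene_set: set of genes in the signature.
    """
    ranked = sorted(sample_expr, key=sample_expr.get, reverse=True)
    hits = [g in gene_set for g in ranked]
    n_hit = sum(hits)
    n_miss = len(ranked) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for is_hit in hits:
        # Step up on signature genes, step down on all others.
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def score_matrix(samples, signatures):
    """Build a nested dict: sample -> signature -> enrichment score."""
    return {
        sample: {name: enrichment_score(expr, genes) for name, genes in signatures.items()}
        for sample, expr in samples.items()
    }


# Toy input: two samples, four genes, two signatures.
samples = {
    "tumor_1": {"g1": 9.0, "g2": 8.0, "g3": 2.0, "g4": 1.0},
    "tumor_2": {"g1": 1.0, "g2": 2.0, "g3": 8.0, "g4": 9.0},
}
signatures = {"sig_A": {"g1", "g2"}, "sig_B": {"g3", "g4"}}
matrix = score_matrix(samples, signatures)
print(matrix)
```

Each row of the result describes one tumor in terms of signatures rather than individual genes, which is the first reduction step.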

Step 3: Second-Round Reduction Using Weighted Gene Co-expression Network Analysis (WGCNA)

WGCNA, which groups co-varying signatures into modules by hierarchical clustering, was applied, reducing the signature set to nine eigen-signatures (top of the figure).
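In WGCNA, a module is summarized by its eigengene, the first principal component of the module's standardized members. The sketch below computes such a summary for one module of signatures with NumPy. It assumes module membership is already known and uses synthetic data, so it does not reproduce the network construction or clustering step.

```python
import numpy as np


def eigen_signature(scores):
    """First principal component of one module, one value per sample.

    scores: samples x signatures array of enrichment scores for a single module.
    """
    Z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    eig = U[:, 0] * S[0]
    # The sign of a principal component is arbitrary; orient it so that it
    # correlates positively with the average of the module's members.
    if np.dot(eig, Z.mean(axis=1)) < 0:
        eig = -eig
    return eig


# Synthetic module: four signatures driven by one shared latent factor.
rng = np.random.default_rng(1)
latent = rng.normal(size=50)
module = np.column_stack([latent + 0.1 * rng.normal(size=50) for _ in range(4)])

eig = eigen_signature(module)
r = np.corrcoef(eig, latent)[0, 1]
print(eig.shape, round(float(r), 3))
```

Unlike a component extracted from the full gene matrix, this summary inherits its meaning from the signatures in the module, so it stays interpretable.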

Step 4: Cluster Validation & Refinement

Predictive strength validation (Gaussian mixture modeling) was used to prevent overfitting, eliminating three unstable clusters.
The final five robust features were selected (highlighted in the figure).

Step 5: Clustering Based on Final Features

Using these five features, scaled by median absolute deviation (MAD), the authors identified six biologically meaningful patient clusters.
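MAD is a robust measure of spread, and a common use is to center each feature on its median and divide by its MAD before clustering, so that outlier samples do not dominate the scale. The sketch below shows that scaling on toy values. Whether the authors applied it in exactly this form is an assumption.

```python
from statistics import median


def mad_scale(values):
    """Center on the median and divide by the median absolute deviation."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; the feature is constant for most samples")
    return [(v - med) / mad for v in values]


# Toy feature with one extreme sample.
raw = [1.0, 2.0, 3.0, 4.0, 100.0]
scaled = mad_scale(raw)
print(scaled)
```

The extreme value stays extreme after scaling, but it does not inflate the scale applied to the other samples the way a standard deviation would.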

Key Advantages of This Approach

Significant dimensionality reduction while maintaining biologically meaningful features

Statistical robustness at every stage of feature extraction

Interpretability—clusters derived from these features align with known immune-biological mechanisms

Feature Selection in Survival Analysis: Regularization via Elastic Net

For downstream survival regression, the authors employed a Cox proportional hazards (CoxPH) model, regularized via elastic net regression—a hybrid of LASSO and ridge regression.

Why Elastic Net?

Elastic net regularization is particularly well-suited for high-dimensional genomic data because:

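It eliminates redundant features by shrinking some coefficients exactly to zero (like LASSO).
It tends to keep or drop groups of correlated variables together, whereas LASSO tends to select one variable from such a group and discard the rest.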

Key Insight:

Ridge regression mitigates multicollinearity but does not eliminate redundant features.
LASSO regression enforces sparsity, setting some coefficients to zero for feature selection.
Elastic net overcomes LASSO’s drawback by grouping related variables, which is particularly useful in pathway-based gene selection.
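The grouping effect can be demonstrated with a small coordinate-descent solver. The sketch below uses ordinary least-squares loss rather than the Cox partial likelihood, which is a simplification for illustration. The penalty has the same form, lam * (alpha * |w|_1 + (1 - alpha) / 2 * |w|_2^2). The function names, data, and parameter values are assumptions.

```python
import numpy as np


def soft_threshold(rho, t):
    """Shrink rho toward zero by t; values within [-t, t] become exactly zero."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)


def elastic_net_cd(X, y, lam, alpha, n_iter=500):
    """Coordinate descent for (1/2n)||y - Xw||^2 + elastic net penalty.

    alpha=1.0 gives LASSO, alpha=0.0 gives ridge.
    """
    n, p = X.shape
    w = np.zeros(p)
    z = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back.
            residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual / n
            w[j] = soft_threshold(rho, lam * alpha) / (z[j] + lam * (1 - alpha))
    return w


# Two perfectly correlated features, standing in for genes in one pathway.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
x = (x - x.mean()) / x.std()
X = np.column_stack([x, x])
y = 2.0 * x

w_lasso = elastic_net_cd(X, y, lam=0.1, alpha=1.0)
w_enet = elastic_net_cd(X, y, lam=0.1, alpha=0.5)
print(w_lasso, w_enet)
```

With the pure LASSO penalty, this solver puts all the weight on whichever duplicate it visits first and zeroes the other. With the mixed penalty, the two duplicates end up with equal coefficients, which is the grouping behavior described above.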

Why Grouping Effect Matters in IO Studies

In biological datasets, genes often function within pathways rather than as independent variables. The grouping effect in elastic net preserves biologically relevant relationships, making it a preferred choice for feature selection in IO modeling.

Conclusion: A Robust Framework for Feature Extraction in IO Modeling

In summary, Thorsson’s methodology provides a statistically sound and biologically meaningful way to reduce feature dimensions while preserving interpretability. It combines:

WGCNA-based hierarchical clustering for feature reduction
Gaussian mixture validation to remove unstable clusters
Elastic net regularization to optimize feature selection

The result is a compact model that captures key immuno-oncology dynamics while remaining biologically interpretable.