Probability distribution in NGS data analysis

In a previous post, we introduced the fundamentals of probability distributions and their role in hypothesis testing. In this discussion, we extend that foundation by exploring the application of probability distributions in next-generation sequencing (NGS) data analysis. Specifically, we address the rationale behind choosing specific probability distributions to model NGS data and the underlying statistical logic guiding these choices. We illustrate this through two common examples in NGS analysis: somatic variant calling and differential expression analysis in RNA-Seq.

Example 1: Probability Distributions in Somatic Variant Calling

Somatic variant calling aims to distinguish true somatic mutations from artifacts introduced during sequencing. Artifacts can arise from various sources, including sequencing errors, PCR amplification biases, and read misalignment. Given a known sequencing error rate, we can define a null hypothesis describing the expected distribution of sequencing errors. Sites where the number of variant-supporting reads deviates significantly from this expected distribution are called as candidate somatic mutations.

A natural starting point for modeling sequencing errors is the Binomial distribution: if each of the n reads covering a site is treated as an independent trial that is erroneous with probability p, the number of erroneous reads follows Binomial(n, p). However, two key factors influence this choice:

Sequencing errors are rare in mature sequencing platforms (~0.1%).
Tumor samples typically have high sequencing depth.

When n (sequencing depth) is large and p (error rate per base) is small, the Binomial distribution can be well approximated by a Poisson distribution with λ = np. In fact, some somatic variant callers employ a Poisson model, where λ is the expected number of sequencing errors per site.
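To make the approximation concrete, here is a minimal sketch that computes the probability of seeing at least k error reads at a site under both models. The depth and the number of variant-supporting reads are hypothetical values chosen for illustration; only the ~0.1% error rate comes from the text above. It is not how any particular variant caller is implemented.

```python
import math

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed as 1 - P(X <= k - 1)."""
    cdf = sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - cdf

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - cdf

depth = 1000        # hypothetical sequencing depth n at one site
error_rate = 0.001  # per-base error rate p (~0.1%)
alt_reads = 6       # hypothetical number of reads supporting the variant

lam = depth * error_rate  # expected number of error reads, lambda = n * p
p_binom = binom_sf(alt_reads, depth, error_rate)
p_pois = poisson_sf(alt_reads, lam)

print(f"Binomial P(X >= {alt_reads}) = {p_binom:.3e}")
print(f"Poisson  P(X >= {alt_reads}) = {p_pois:.3e}")
```

With these inputs the two tail probabilities are nearly identical, and both are small, so under the null hypothesis of "errors only" six supporting reads would be an unlikely observation.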

Several practical considerations refine this basic model:

Since sequencing error rates vary across genomic regions, site-specific error models can be built using base quality scores.
Sequencing artifacts extend beyond simple base-calling errors and include PCR amplification errors and misalignments.
Bayesian approaches enhance accuracy by incorporating prior probabilities that account for additional error sources, such as mappability and sequence context.

Bayesian methods will be discussed in detail in a future post.
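As an illustration of a site-specific error model built from base quality scores, the sketch below converts Phred scores to per-read error probabilities (p = 10^(-Q/10), the standard Phred definition) and uses their sum as the Poisson λ for that site. The quality values and read counts are hypothetical, and treating the sum of unequal per-read error probabilities as a single Poisson rate is a simplifying assumption made here, not a description of a specific caller.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - cdf

# Hypothetical base qualities of the 50 reads covering one site
base_quals = [30] * 40 + [20] * 10

# Site-specific expected number of errors: sum of per-read error probabilities
site_lambda = sum(phred_to_error_prob(q) for q in base_quals)

alt_reads = 3  # hypothetical number of variant-supporting reads
p_value = poisson_sf(alt_reads, site_lambda)

print(f"site lambda = {site_lambda:.3f}")
print(f"P(X >= {alt_reads} | errors only) = {p_value:.3e}")
```

A site covered by lower-quality bases gets a larger λ, so the same number of variant-supporting reads is weaker evidence there than at a site covered by high-quality bases.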

Example 2: Probability Distributions in RNA-Seq Differential Expression Analysis

In RNA-Seq analysis, we aim to identify genes that are differentially expressed between sample groups with distinct phenotypic conditions. The challenge is quantifying how different is "different". To address this, we establish a null hypothesis that defines the expected distribution of a gene's expression under a given condition.

Conceptually, RNA-Seq read counts can be thought of as a random sampling process from a pool of total reads. If we assume:

The total number of sequencing reads (n) is large
The proportion of reads from a specific gene (p) is small

Then the number of reads from that gene follows a Binomial distribution, which under these conditions is well approximated by a Poisson distribution with mean np.

While the Poisson distribution assumes that the variance equals the mean, gene expression data often exhibit a variance larger than the mean due to:

Technical variation from sequencing protocols
Biological variability between individuals

This phenomenon, known as overdispersion, suggests that a negative binomial (NB) distribution better fits RNA-Seq data. The NB distribution generalizes the Poisson with an extra dispersion parameter, so the variance can exceed the mean. The figure below (based on scRNA-Seq expression profiles) shows that a negative binomial model provides a better statistical fit for real-world gene expression data.
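The effect of overdispersion can be sketched on a toy example. The counts below are hypothetical (not real data), the NB is written in the common mean/dispersion form with Var = μ + αμ², and the dispersion α is estimated with a simple method-of-moments formula chosen for brevity. Dedicated RNA-Seq tools estimate dispersion more carefully, so this is only an illustration of the idea.

```python
import math
from statistics import mean, variance

# Hypothetical read counts for one gene across 8 replicates (not real data)
counts = [12, 45, 3, 30, 8, 60, 22, 5]

mu = mean(counts)
var = variance(counts)  # sample variance

# Method-of-moments dispersion from Var = mu + alpha * mu^2 (floored at ~0)
alpha = max((var - mu) / mu**2, 1e-8)

def poisson_logpmf(k, mu):
    """Log-probability of count k under Poisson(mu)."""
    return k * math.log(mu) - mu - math.lgamma(k + 1)

def nb_logpmf(k, mu, alpha):
    """Log-probability of count k under NB with mean mu and dispersion alpha."""
    r = 1.0 / alpha  # NB "size" parameter
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))

ll_pois = sum(poisson_logpmf(k, mu) for k in counts)
ll_nb = sum(nb_logpmf(k, mu, alpha) for k in counts)

print(f"mean = {mu:.2f}, variance = {var:.2f}, alpha = {alpha:.3f}")
print(f"log-likelihood: Poisson = {ll_pois:.1f}, NB = {ll_nb:.1f}")
```

Because the sample variance is far larger than the mean, the Poisson model assigns very low probability to the extreme counts, while the NB model, with its extra parameter, accommodates them and yields a higher log-likelihood.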

In this discussion, we have outlined the statistical reasoning behind selecting probability distributions in somatic variant calling and RNA-Seq differential expression analysis. By making appropriate probabilistic assumptions, we can improve estimation accuracy even when working with limited data. Thoughtful application of statistical models helps keep our findings robust and biologically meaningful.
