Batch Correction for Bounded Variables: A Statistical Conundrum | Ranjan Kumar

Imagine you’re working on a project where you need to analyze drug response data from 30 samples, each with its own set of clinical and genetic data. Sounds straightforward, right? But what if I told you that the samples were tested in two batches, and you noticed significant differences between the two batches for some compounds? That’s exactly the problem I’m facing, and I’m not alone.

The Data

Each sample was tested with 50 compounds, each at 5 concentrations, in duplicate. The raw readout is a fluorescence value related to cell survival, ranging from 0 to a whopping 40,000 units. To make sense of this data, I fit a four-parameter log-logistic function to each dose-response series, then determine the area under the curve (AUC) and express it as a percentage of the maximum theoretical area. This gives me a final AUC% value bounded between 0% (no cells died) and 100% (all cells died).
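To make the AUC% definition concrete, here’s a minimal sketch in Python. It assumes things the description above doesn’t spell out: that the fluorescence has already been normalized to viability between 0 and 1, that the area is taken over the log10 concentration axis, and that the four curve parameters come from an earlier fit. The names `four_pl` and `auc_percent` are made up for illustration.

```python
import numpy as np

def four_pl(log_conc, lower, upper, log_ec50, hill):
    """Four-parameter log-logistic viability curve on a log10 concentration axis."""
    return lower + (upper - lower) / (1.0 + 10.0 ** (hill * (log_conc - log_ec50)))

def auc_percent(lower, upper, log_ec50, hill, log_min, log_max, n_points=1001):
    """Area between full survival (1.0) and the fitted curve, as a percentage
    of the largest possible area over the tested range.

    Assumption: viability is normalized so 1.0 = untreated, 0.0 = all cells dead.
    """
    x = np.linspace(log_min, log_max, n_points)
    viability = np.clip(four_pl(x, lower, upper, log_ec50, hill), 0.0, 1.0)
    killed = 1.0 - viability
    # Trapezoidal rule written out by hand to avoid depending on a NumPy version.
    area = np.sum((killed[1:] + killed[:-1]) / 2.0 * np.diff(x))
    return 100.0 * area / (log_max - log_min)
```

With this definition, a flat curve at full viability gives 0%, a curve that sits at zero viability everywhere gives 100%, and everything else lands in between, which is exactly why the final value is bounded.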

The Problem

The data isn’t normally distributed, and certain weaker compounds never show values above 10% AUC. To make matters worse, I need to account for the batch effect when testing for associations between drug response and genetic alterations. I’ve been using a stratified Wilcoxon-Mann-Whitney test, but I’m not sure if that’s enough.
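For reference, here’s a rough sketch of what a stratified Wilcoxon-Mann-Whitney test (often called the van Elteren test) computes, using only NumPy. It ranks samples within each batch, so a shift between batches can’t masquerade as an alteration effect. The function names and argument layout are my own, and the p-value relies on a normal approximation that is only approximate with 30 samples split across two batches.

```python
import math
import numpy as np

def midranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    v = np.asarray(values, dtype=float)
    less = (v[None, :] < v[:, None]).sum(axis=1)
    equal = (v[None, :] == v[:, None]).sum(axis=1)
    return less + (equal + 1) / 2.0

def van_elteren(values, altered, batch):
    """Stratified Wilcoxon rank-sum test with batch as the stratum.

    values  : AUC% for one compound, one entry per sample
    altered : True where the sample carries the genetic alteration
    batch   : batch label per sample
    Returns (z, two_sided_p) from the normal approximation.
    """
    values = np.asarray(values, dtype=float)
    altered = np.asarray(altered, dtype=bool)
    batch = np.asarray(batch)
    num, var = 0.0, 0.0
    for b in np.unique(batch):
        in_b = batch == b
        n = int(in_b.sum())
        n1 = int(altered[in_b].sum())
        n0 = n - n1
        if n1 == 0 or n0 == 0:
            continue  # a batch containing only one group carries no information
        ranks = midranks(values[in_b])
        weight = 1.0 / (n + 1)  # van Elteren weighting
        rank_sum = ranks[altered[in_b]].sum()
        expected = n1 * (n + 1) / 2.0
        # Tie-corrected variance of the rank sum within this batch.
        _, counts = np.unique(values[in_b], return_counts=True)
        tie_term = (counts ** 3 - counts).sum() / (n * (n - 1))
        variance = n1 * n0 / 12.0 * (n + 1 - tie_term)
        num += weight * (rank_sum - expected)
        var += weight ** 2 * variance
    if var == 0.0:
        return float("nan"), float("nan")
    z = num / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p
```

A positive `z` means altered samples tend to have higher AUC% than unaltered ones within the same batch. Running this across 50 compounds and many alterations still needs a multiple-testing correction on top.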

Batch Harmonization

What I really want to do is harmonize the AUC values across the two batches, so I can perform cluster analysis without worrying about batch effects. But because the values are confined to the 0-100 range, I’m not sure if methods like ComBat, which model batch effects as shifts in location and scale on roughly normal data, will work. And with a large number of sparse clinical and genetic variables, I’m stuck on how to model the data.
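One option that respects the bounds is to move to the logit scale, remove the per-batch shift for each drug there, and transform back. The sketch below is only an illustration of that idea, not ComBat and not something I’ve validated on this data. The function name `harmonize_logit` and the clamping value `eps` are assumptions, and the approach only makes sense if the two batches are biologically comparable, since any real difference between them would be removed along with the batch effect.

```python
import numpy as np

def harmonize_logit(auc_pct, batch, eps=0.5):
    """Location-only batch adjustment on the logit scale.

    auc_pct : array of shape (n_samples, n_drugs), AUC% in [0, 100]
    batch   : batch label per sample
    eps     : hypothetical clamp (in %) that keeps 0 and 100 finite after logit
    Returns adjusted AUC% values, which stay strictly inside (0, 100).
    """
    auc_pct = np.asarray(auc_pct, dtype=float)
    batch = np.asarray(batch)
    p = np.clip(auc_pct, eps, 100.0 - eps) / 100.0
    logit = np.log(p / (1.0 - p))
    overall_mean = logit.mean(axis=0)  # per-drug mean across all samples
    adjusted = logit.copy()
    for b in np.unique(batch):
        in_b = batch == b
        # Shift each batch so its per-drug mean matches the overall mean.
        adjusted[in_b] += overall_mean - logit[in_b].mean(axis=0)
    return 100.0 / (1.0 + np.exp(-adjusted))
```

Because the shift happens on the logit scale, a compound that sits near 0% or 100% can’t be pushed outside the valid range, which is the main worry with applying an unbounded correction directly to AUC%.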

Current Approach

For now, I’m performing clustering without batch harmonization. I remove drugs with low biological activity, rescale the remaining ones to 0-100% of their maximum activity, and convert the result to sample-wise Z-scores. While I do see interesting patterns, I want to make sure I’m doing the right thing.
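In code, that preprocessing looks roughly like the sketch below. The 10% activity cutoff and the function name `prepare_for_clustering` are placeholders for illustration, and I’m assuming "maximum activity" means the highest AUC% each drug reaches across all samples.

```python
import numpy as np

def prepare_for_clustering(auc_pct, min_max_activity=10.0):
    """Filter weak drugs, rescale per drug, then Z-score per sample.

    auc_pct          : array of shape (n_samples, n_drugs), AUC% values
    min_max_activity : hypothetical cutoff; drugs whose best AUC% is below
                       this are dropped as biologically inactive
    Returns (z_scores, keep_mask).
    """
    auc_pct = np.asarray(auc_pct, dtype=float)
    max_per_drug = auc_pct.max(axis=0)
    keep = max_per_drug >= min_max_activity
    # Rescale each remaining drug to 0-100% of its own maximum.
    scaled = 100.0 * auc_pct[:, keep] / max_per_drug[keep]
    # Sample-wise Z-score: center and scale each row across the kept drugs.
    row_mean = scaled.mean(axis=1, keepdims=True)
    row_sd = scaled.std(axis=1, ddof=1, keepdims=True)
    return (scaled - row_mean) / row_sd, keep
```

Note that none of these steps knows about the batch: if one batch reads systematically higher for a given drug, that difference passes straight through the per-drug rescaling and into the clustering.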

Your Feedback

If you’re a statistician or data analyst, I’d love to hear your thoughts on this problem. Am I missing something obvious? Is there a better way to approach batch correction for bounded variables? Let’s discuss!
