I was wondering (1) how does rlog
( )
calculate values from raw counts, (2) the relationship between rlog counts and the effect estimator (βir), and (3) what information rlog counts tell.
In vignette:
 log2(qij) = Xj βi
Q1: Is log2(qij) here represent the values fromrlog
(object)
?
Q2: My understanding from DESeq2 paper is that log2(qij) are the values calculated from Xj βi. Is that correct?
Q3: If Q2 is yes. What's "log qij" values for log qij=∑r Xjr βir? My understanding is: log (raw counts/size factor) instead of log2. Is that correct?
I saw many research made a heatmap of rlog counts. But I'm not sure what kind of information can rlog counts can tell me.  Counts comparisons from my analysis are:

Control_1 Control_2 Control_3 Trt_1 Trt_2 Trt_3 Raw count 135 258 190 0 0 0 count(DESeq, normalized = T)
151.73 249.95 198.23 0 0 0 rlog(object, blind=F) 6.50 6.87 6.69 5.33 5.34 5.35 rlog(object, blind=T) 6.50 6.89 6.70 5.20 5.21 5.22 Size factor 0.89 1.03 0.96 1.17 1.02 0.98  And the log2fold change are:

result(object) lfcShrink(res, type = "normal") lfcShrink(res, type = "apeglm") 10.16 2.68 11.24
It seems either
count(DESeq, normalized = T)
or log2foldchange can illustrate the result better. Why we use rlog count for count heatmap?
Would you recommend to present the result by using rlog count heatmap? or log2foldchange?
Thank you!
Yuya
Thank you for replying! It helps a lot.
I have two more questions:
The log_qij for solving βir is the value from
rlog
()
, is that correct? (I'm not sure is log2 or log10.)2. I also found a formula in the above link: y = log2 ( n + 1). Is this y same as log_qij?
I don't understand why sample with 0 raw counts will have the value greater than 0. Just like the example, you showed
above.
Yes, that is the log q_ij we are after, except the standard X is not used but a different one (see the paper Methods and vignette  the link I gave above  for full details).
We write log in the Methods, but convert to log2 whenever we give output to the user. So you can imagine it is log2 q_ij. DESeq2 provides estimates of log2 q_ij as the rlog output.
No, you can ignore that y=log2(n+1), that y is not relevant here when talking about rlog. That section is talking about the simple transformation f(x)=log(x+1).
Got it!
I found a previous reply A: How does DESeq2 handle zero counts in one condition?. Is this the reason why sample with 0 raw counts will have the value greater than 0?
Not exactly, but it’s a somewhat related point. That post is about estimated LFCs and your question is about the rlog.
Can you give me some information for why rlog makes raw counts from zero to nonzero?
I expected to see huge different in this kind of extremely opposite (i.e. zero vs. some counts) but not. Even the rlog counts were not that different, they still got large LFC and significant pdaj. I'm just curious.
It shrinks the differences between samples: where you have some samples with 0s and some with high counts, they all move toward the middle. Again it's doing something very different than log(x + pseudocount), and the transformation are useful for computing distances across many genes. For visualizing the data for a given gene, we use normalized counts in the package. For a heatmap you can use log normalized counts or the transformed data, this is up to you.
Thank you so much!!
So I can just simply say that this difference between raw count and rlog count due to "counts in all samples were moved toward the center(middle)" and this will help to calculate distances, correct? (also help to calculate coefficient as well?)
Not useful for estimating the coefficient, I said this in my first reply above. “Note that the difference in the vst or rlog transformed counts is not the best estimator for the LFC...”
Thank you for bringing this out again!
I didn't express this clearly. I guess "Not recommend to use vst or rlog for statistical analysis of LFCs per gene" meaning not use the rlog counts for comparing control and treatment.
If rlog is not helpful for solving βir, why formula (2) we put rlog instead of others?
DESeq()
doesn't need to compute distances across genes, right?I don't think we're getting anywhere here, but instead going in circles.
I think that vignette, the workflow, and the paper are the best places to understand the difference between the transformations and the GLM for testing coefficients. All the formula and the motivation for the methods are there.
I will read them again more carefully. Thank you for explaining everything!