Arithmetic with extremely small values in R

I am facing a big issue in computing the cumsum() of a vector. The vector has roughly 10,000 elements, and from around element 2,000 onward the values drop to about 1e-310. To give a feeling of the distribution I am dealing with, here is a plot.
When I apply cumsum() I get lots of ones, which is impossible, and a minimum value of around 10^-2. I am porting code we originally developed in Matlab, where there are no such problems. For some reason, R seems to have trouble working with such small numbers, to the extent that applying standard functions returns unexpected, and wrong, results.
I searched over stack overflow and found these two posts:
R: Number precision, how to prevent rounding?
Controlling number of decimal digits in print output in R
Unfortunately, none of them helped me out.
I also tried the Rcpp cumsum() function, with no luck. I guess the problem comes directly from the precision of my matrix object.
I am not even sure how to reproduce this, so I am happy to share my 9137 x 2 matrix. I am completely stuck.
Looking forward to hearing from you guys!
Thanks
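For reference, values around 1e-310 are below .Machine$double.xmin (~2.2e-308), i.e. they are subnormal doubles. A minimal sketch (not using the original BestPair data) suggests cumsum() itself carries such values through:
.Machine$double.xmin          # smallest positive normalized double, ~2.225074e-308
x <- c(1e-310, 1e-310, 1e-2)  # two subnormal values plus one ordinary value
cumsum(x)                     # the tiny values are retained in the running sum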
Update
Here is a sample of 100 elements from my matrix:
y <- sample( BestPair, 100 )
dput( y )
c(7.74958917761984e-289, 4.19283869686715e-319, 1.52834266884531e-319,
2.22089175309335e-297, 4.93980517192742e-298, 1.37861543592719e-301,
1.47044459800611e-317, 6.49068860911021e-319, 1.83302927898675e-305,
8.39514422452147e-312, 2.88487711616485e-300, 0.000544461085044608,
0.000435738736513519, 1.35649914843994e-309, 4.30826678309556e-310,
2.60728322623343e-319, 0.000544460617547516, 5.28815204888643e-299,
0.00102710912090133, 0.00198425117943324, 1.99711912768841e-304,
8.34594499227505e-306, 7.42055412763084e-300, 5.00039717762739e-311,
1.8750204972032e-305, 1.06513324565406e-310, 5.00487443690634e-313,
3.4890421843663e-319, 7.48945537292364e-310, 1.92948452007191e-310,
1.19840058299897e-305, 0.000532438536688165, 6.53966533658245e-318,
0.000499821676385928, 2.02305525482572e-305, 5.18981575493413e-306,
8.82648276295387e-320, 7.30476057376283e-320, 1.23073492422415e-291,
4.1801705284367e-311, 5.10863383734203e-318, 1.12106998189819e-298,
9.34823978505262e-297, 0.00093615863896107, 5.3667092510628e-311,
3.85094526994501e-319, 1.3693720559483e-313, 3.96230943126873e-311,
2.03293191294298e-319, 2.38607510351427e-291, 0.000544460855322345,
1.74738584846597e-310, 1.41874408662835e-318, 5.73056861298345e-319,
3.28565325597139e-302, 3.5412805275117e-310, 1.19647007227024e-302,
1.71539915106223e-305, 2.10738303243284e-311, 6.47783846432427e-313,
5.0072402480979e-303, 7.7250380240544e-303, 9.75451890703679e-309,
0.000533945755492525, 0.00211359631486468, 1.6612179399628e-312,
0.000521804571338402, 4.12194185271951e-308, 1.12829365794294e-313,
8.89772702908418e-319, 5.092756929242e-312, 7.45208240537024e-311,
6.60385177095196e-298, 0.000544461017733648, 1.62108867188263e-318,
3.95135528339003e-309, 1.8792966379072e-292, 5.98494480819088e-295,
0.00051614492665081, 2.25198141886419e-300, 7.97467977809552e-305,
1.78098757558338e-311, 1.66946525895122e-313, 0.000778442249425894,
6.58100207570114e-312, 0.00120733768329515, 3.44432924341767e-319,
6.38151190880713e-313, 7.1129669216109e-300, 4.11319531475755e-319,
7.21747577033383e-304, 1.48709312807403e-318, 1.39519898470211e-303,
4.58585270141592e-312, 2.16309869205282e-295, 7.55248601743105e-314,
3.16365476892733e-310, 1.96961507010996e-305, 3.21125377499206e-318,
3.66277772043574e-304)
Update 2
Apparently, imposing the following:
BestPair[ BestPair < .Machine$double.eps ] <- 0
does not solve the issue; cumsum() still returns odd results. Here is a plot to better explain what I am dealing with. The cumulative probability has this shape because BestPair has been sorted in decreasing order; I want cumsum() to reach 1 at the top of my vector.
Here is a summary of the object:
> summary(CumProb)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0250  1.0000  1.0000  0.9685  1.0000  1.0000
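Note that .Machine$double.eps (~2.2e-16) is machine epsilon, the relative spacing of doubles near 1, not the underflow limit, so this threshold is much coarser than it may look. A short illustration of the distinction (assumed to be the relevant one here):
.Machine$double.eps    # ~2.220446e-16: relative spacing of doubles near 1
.Machine$double.xmin   # ~2.225074e-308: smallest positive normalized double
# So BestPair[ BestPair < .Machine$double.eps ] <- 0 zeroes every entry below
# ~2.2e-16, i.e. all of the ~1e-300-scale values, keeping only the ~1e-3 ones.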
Update 3. Results from Matlab
Here is the result as computed with Matlab. As you can see, I get a pretty decent distribution, which I cannot achieve in R even after truncating the original matrix.

Related

Is there a way to handle calculations involving exponentials of big values in R?

I have looked around online and on this site but did not find a solution. My problem is relatively simple, so if you could point me to a possible solution, it would be much appreciated.
test_vec <- c(2,8,709,600)
mean(exp(test_vec))
test_vec_bis <- c(2,8,710,600)
mean(exp(test_vec_bis))
exp(709)   # ~8.2e307, still representable
exp(710)   # Inf: overflows the largest double, .Machine$double.xmax (~1.8e308)
How can I calculate the mean of my vector and deal with the Inf values, given that R could probably represent the mean itself but not every term in the numerator of the mean?
There is an edge case where you can solve your problem simply by restating it mathematically, but it requires that your vector be extremely long and/or that your large exponents be close to the numeric limit:
Since the mean sum(x)/n can be written as sum(x/n), and since exp(x)/exp(y) = exp(x - y), you can calculate sum(exp(x - log(n))), which buys you a margin of log(n).
mean(exp(test_vec))
[1] 2.054602e+307
sum(exp(test_vec - log(length(test_vec))))
[1] 2.054602e+307
sum(exp(test_vec_bis - log(length(test_vec_bis))))
[1] 5.584987e+307
While this works for your example, most likely this won't work for your real vector.
In this case, you will have to consult packages like Rmpfr as suggested by #fra.
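If only the logarithm of the mean is needed, or a larger safety margin than log(n), a log-sum-exp style computation avoids the overflow entirely. A minimal sketch (log_mean_exp is an illustrative helper, not from the original answer):
# Compute log(mean(exp(x))) without overflowing: factor out the largest exponent.
log_mean_exp <- function(x) {
  m <- max(x)
  m + log(sum(exp(x - m))) - log(length(x))
}
log_mean_exp(test_vec_bis)    # ~708.6, finite even though exp(710) is Inf
exp(log_mean_exp(test_vec))   # 2.054602e+307, matching mean(exp(test_vec))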
Here's one way where you subset to keep only those terms of test_vec_bis whose exponential is finite (< Inf):
mean(exp(test_vec_bis)[which(exp(test_vec_bis) < Inf)])
[1] 1.257673e+260
t2 <- c(2,8,600)
mean(exp(t2))
[1] 1.257673e+260
This assumes you were looking to exclude values that result in Inf, of course.
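If dropping the overflowed terms is acceptable, is.finite() expresses the same filter a bit more directly (an equivalent rewrite, not from the original answer):
vals <- exp(test_vec_bis)
mean(vals[is.finite(vals)])   # exp(710) is Inf and gets dropped
[1] 1.257673e+260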

Calculate significant digits of a number in R

Maybe a daft question, but why does R drop the significant trailing 0 at the end of a number? For example, 1.250 becomes 1.25, which does not convey the same precision. I have been trying to calculate the number of significant digits of a number by using as.character() in combination with gsub() and regular expressions (following various posts), but I get the wrong result for numbers such as 1.250, since as.character() removes the trailing 0. Therefore the answer for 1.250 comes out as 2 digits rather than the correct 3.
To be more specific why this is an issue for me:
I have long tables in Word consisting of bond lengths in a format such as 1.2450(20):
The number in parentheses is the uncertainty in the measurement, which means the real value is somewhere between 1.2450 - 0.0020 and 1.2450 + 0.0020. I have imported all these data from Word into a large data frame like so:
df <- data.frame(Activity = c(69790, 201420, 17090),
                 WN1 = c(1.7598, 1.759, 1.760),
                 WN1sd = c(17, 15, 3))
My aim is to plot the WN1 values against Activity, with error bars. This means I would have to manually convert WN1sd to WN1sd = c(0.0017, 0.015, 0.003), which is not the R way to go, hence the need to obtain the number of significant decimal digits of WN1. This works fine for the first two WN1 values but not for the third, since R mistakenly treats the trailing 0 as not significant.
You have to prepare the standard deviations at the time you import your data from your Word document.
At some point you should have strings like these:
"1.2345(89)" "4.230(34)" "3.100(7)"
This is a function you can apply to those strings to get the sd right:
split.mean.sd <- function(mean.sd) {
  mean <- gsub("(.*)\\(.*", "\\1", mean.sd)                    # part before the parenthesis
  sd <- gsub(".*\\((.*)\\)", "\\1", mean.sd)                   # digits inside the parentheses
  digits.after.dot <- nchar(gsub(".*\\.(.*).*", "\\1", mean))  # decimal places of the mean
  sd <- as.numeric(sd) * 10^(-digits.after.dot)                # scale the uncertainty accordingly
  mean <- as.numeric(mean)
  c(mean, sd)
}
For example:
v <- c("1.2345(89)","4.230(34)","3.100(7)")
sapply(v, split.mean.sd)
gives you
     1.2345(89) 4.230(34) 3.100(7)
[1,]     1.2345     4.230    3.100
[2,]     0.0089     0.034    0.007
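To feed these into a plot with error bars, the sapply() result can be transposed into a data frame (a small follow-on sketch; the column names are illustrative, not from the original post):
parsed <- t(sapply(v, split.mean.sd))   # one row per measurement
colnames(parsed) <- c("WN1", "WN1sd")   # illustrative column names
df2 <- data.frame(parsed)
df2$WN1 - df2$WN1sd                     # lower ends of the error bars
df2$WN1 + df2$WN1sd                     # upper ends of the error bars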
Most programming languages, R included, do not track the number of significant digits of floating-point values: in most cases it is not needed, and tracking it would slow down computations and require more memory.
You may be interested in libraries for computations with uncertainties, such as the errors package.
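A minimal sketch of what that could look like, assuming the errors package's set_errors() interface:
# install.packages("errors")
library(errors)
wn1 <- set_errors(c(1.7598, 1.759, 1.760), c(0.0017, 0.015, 0.003))
wn1          # values print together with their uncertainties
errors(wn1)  # retrieve the uncertainties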

How to filter genes in matrix based on quantile cutoff?

This is a matrix with some example data:
S1 S2 S3
ARHGEF10L 11.1818 11.0186 11.243
HIF3A 5.2482 5.3847 4.0013
RNF17 4.1956 0 0
RNF10 11.504 11.669 12.0791
RNF11 9.5995 11.398 9.8248
RNF13 9.6257 10.8249 10.5608
GTF2IP1 11.8053 11.5487 12.1228
REM1 5.6835 3.5408 3.5582
MTVR2 0 1.4714 0
RTN4RL2 8.7486 7.9144 7.9795
C16orf13 11.8009 9.7438 8.9612
C16orf11 0 0 0
FGFR1OP2 7.679 8.7514 8.2857
TSKS 2.3036 2.8491 0.4699
I have a matrix "h" with 10,000 genes as row names and 100 samples as columns. I need to select the top 20% most variable genes for clustering, but I'm not sure whether what I did is right.
So, for this filtering I have used genefilter R package.
varFilter(h, var.func=IQR, var.cutoff=0.8, filterByQuantile=TRUE)
Is the command I used correct for selecting the top 20% most variable genes? And can anyone explain how this method works statistically?
I haven't used this package myself, but the help file of the function you're using makes the following remark:
IQR is a reasonable variance-filter choice when the dataset is split
into two roughly equal and relatively homogeneous phenotype groups. If
your dataset has important groups smaller than 25% of the overall
sample size, or if you are interested in unusual individual-level
patterns, then IQR may not be sensitive enough for your needs. In such
cases, you should consider using less robust and more sensitive
measures of variance (the simplest of which would be sd).
Since your data has a number of small groups, it might be wise to follow this advice and change your var.func to var.func = sd.
sd computes the standard deviation, which should be easy to understand.
However, this function expects its data in the form of an ExpressionSet object. The error message you got (Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'exprs' for signature '"matrix"') implies that you don't have one, but just a plain matrix instead.
I don't know how to create an ExpressionSet, but I think doing so is overly complicated anyway. So I would suggest going with the code that you posted in the comments:
vars <- apply(h, 1, sd)            # per-gene standard deviation
h[vars > quantile(vars, 0.8), ]    # keep genes above the 80th percentile of variability
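A quick self-contained check with simulated data (a hedged sketch; h is assumed to be a plain numeric matrix of 10,000 genes by 100 samples) confirms that roughly 20% of the rows are retained:
set.seed(1)
h <- matrix(rnorm(10000 * 100), nrow = 10000,
            dimnames = list(paste0("gene", 1:10000), paste0("S", 1:100)))
vars <- apply(h, 1, sd)                   # per-gene variability
h_top <- h[vars > quantile(vars, 0.8), ]  # genes above the 80th percentile
nrow(h_top) / nrow(h)                     # ~0.2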

R Selecting only up to maximum

I want to calculate the mean of all the acceleration values from a displacement value of 0.01 to max. However, I do not want to include any acceleration values after the maximum value. How is this done?
mean(
  subset(`S1_Intns40_chainno-Sheet1`,
         Displacement > 0.01:max(Displacement),
         select = c("Acceleration"))$Acceleration)
[1] -0.8371687
Like so:
maxDisp <- max(data$Displacement)
limitedAvgAcc <- mean(data$Acceleration[(data$Displacement <= maxDisp) &
                                        (data$Displacement >= 0.01)])
This works because the bracketed expression creates a logical vector, which is used to subset the Acceleration vector, which is then averaged.
I have gotten the following code to work well enough for my purpose:
lift_S1_intns40_chaino <-
  S1_Intns40_chainno[which.min(S1_Intns40_chainno$Displacement < 0.01):
                       which.max(S1_Intns40_chainno$Dis), ]
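A slightly more general hedged sketch of the same idea, with an illustrative data frame dat (the column names Displacement and Acceleration are taken from the question):
dat <- data.frame(Displacement = c(0.005, 0.02, 0.05, 0.10, 0.08, 0.03),
                  Acceleration = c(-0.2, -0.5, -0.9, -1.1, -0.7, -0.4))
start <- which(dat$Displacement >= 0.01)[1]   # first row at or beyond 0.01
stop  <- which.max(dat$Displacement)          # row of the maximum displacement
mean(dat$Acceleration[start:stop])            # mean over that window only
[1] -0.8333333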

acos(1) returns NaN for some values, not others

I have a list of latitude and longitude values, and I'm trying to find the distance between them. Using a standard great circle method, I need to find:
acos(sin(lat1)*sin(lat2) + cos(lat1)*cos(lat2) * cos(long2-long1))
And multiply this by the radius of earth, in the units I am using. This is valid as long as the values we take the acos of are in the range [-1,1]. If they are even slightly outside of this range, it will return NaN, even if the difference is due to rounding.
The issue I have is that sometimes, when two lat/long values are identical, this gives me NaN. It doesn't happen every time, even for the same pair of numbers, but it is always the same entries in a given list. For instance, I have a person stopped on a road in the desert:
Time   | lat      | long
1:00PM | 35.08646 | -117.5023
1:01PM | 35.08646 | -117.5023
1:02PM | 35.08646 | -117.5023
1:03PM | 35.08646 | -117.5023
1:04PM | 35.08646 | -117.5023
When I calculate the distance between consecutive points, the third value, for instance, will always be NaN, even though the others are not. This looks like a weird rounding quirk in R.
Can't tell exactly without seeing your data (try dput), but this is most likely a consequence of FAQ 7.31.
(x1 <- 1)
## [1] 1
(x2 <- 1+1e-16)
## [1] 1
(x3 <- 1+1e-8)
## [1] 1
acos(x1)
## [1] 0
acos(x2)
## [1] 0
acos(x3)
## [1] NaN
That is, even if your values are so similar that their printed representations are the same, they may still differ: some will be within .Machine$double.eps and others won't ...
One way to make sure the input values are bounded by [-1,1] is to use pmax and pmin: acos(pmin(pmax(x,-1.0),1.0))
A simple workaround is to use pmin(), like this:
acos(pmin(sin(lat1)*sin(lat2) + cos(lat1)*cos(lat2) * cos(long2-long1),1))
It now ensures that the precision loss leads to a value no higher than exactly 1.
This doesn't explain what is happening, however.
(Edit: Matthew Lundberg pointed out I need to use pmin to get it to work with vectorized inputs. This fixes the problem of getting it to work, but I'm still not sure why the rounding goes wrong in the first place.)
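Putting the pieces together, a hedged sketch of a clamped great-circle distance in R (inputs in radians; great_circle_km and the 6371 km mean Earth radius are illustrative assumptions, not from the original answers):
great_circle_km <- function(lat1, long1, lat2, long2, R_earth = 6371) {
  cosang <- sin(lat1) * sin(lat2) + cos(lat1) * cos(lat2) * cos(long2 - long1)
  R_earth * acos(pmin(pmax(cosang, -1), 1))   # clamp guards against rounding past 1
}
deg2rad <- function(d) d * pi / 180
great_circle_km(deg2rad(35.08646), deg2rad(-117.5023),
                deg2rad(35.08646), deg2rad(-117.5023))   # identical points: no NaN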
I just encountered this. It is caused by input slightly larger than 1: due to floating-point error, my inner product between unit-norm vectors came out a bit larger than 1 (like 1 + 0.00001), and acos() is only defined on [-1, 1]. So we can clamp the upper bound to exactly 1 to solve the problem.
For numpy: np.clip(your_input, -1, 1)
For Pytorch: torch.clamp(your_input, -1, 1)
