I have a matrix whose entries are all probabilities. Most of the entries are very small, and some are zero. I need to take the log of the matrix, but since there are zeros, R produces -Inf for those entries. My goal is to feed this log(matrix) into image.plot(). When I do, I keep getting this error:
Error in seq.default(minz + binwidth/2, maxz - binwidth/2, by = binwidth) :
invalid (to - from)/by in seq(.)
Is there any solution that can help me get around this?
Here is what the matrix looks like:
0 1 2 3 4 5 6
[1,] -0.0007854138 -8.9132811 -10.011893 -10.705041 -9.606428 -9.318746 -Inf
[2,] -0.3402118357 -1.6137090 -2.742625 -4.215836 -5.721434 -7.121522 -9.606428
[3,] -0.2912175507 -2.0296478 -3.521929 -4.275321 -4.426519 -4.187369 -3.715705
[4,] -1.5244380532 -0.7048802 -2.001368 -3.405243 -3.713864 -3.143919 -3.781412
[5,] -0.7572491288 -0.7487709 -3.981208 -5.110329 -5.228577 -5.095569 -5.293395
[6,] -0.0007629648 -Inf -8.759130 -7.613998 -9.606428 -Inf -Inf
[7,] -0.0020658381 -7.4861648 -7.526987 -7.094123 -9.318746 -Inf -Inf
[8,] -0.0295715883 -6.7160566 -7.208533 -6.610696 -6.485533 -6.813220 -6.387552
[9,] -0.0032128722 -6.7160566 -7.613998 -7.871827 -7.760602 -8.759130 -8.759130
[10,] -0.4869248130 -1.3225132 -2.518576 -3.768698 -5.140520 -6.183252 -7.208533
7 8 9
[1,] -Inf -10.705041 -10.011893
[2,] -Inf -Inf -7.149693
[3,] -4.965248 -5.968842 -6.428374
[4,] -4.696227 -5.091913 -4.669559
[5,] -5.163777 -5.468599 -6.577906
[6,] -Inf -Inf -Inf
[7,] -Inf -Inf -Inf
[8,] -6.627503 -6.456545 -6.400976
[9,] -10.011893 -10.011893 -Inf
[10,] -8.402456 -7.814669 -6.546158
Here is the structure:
structure(c(0.999214894571557, 0.71161956034096, 0.747353073126963,
0.217743382682817, 0.468954688200987, 0.999237326155227, 0.997936294302378,
0.970861372812921, 0.996792283535218, 0.614513234634365, 0.000134589502018843,
0.199147599820547, 0.13138178555406, 0.49416778824585, 0.472947510094213,
0, 0.000560789591745177, 0.00121130551816958, 0.00121130551816958,
0.266464782413638, 4.48631673396142e-05, 0.0644010767160162,
0.0295423956931359, 0.135150291610588, 0.0186630776132795, 0.00015702108568865,
0.00053835800807537, 0.000740242261103634, 0.000493494840735756,
0.0805742485419471, 2.24315836698071e-05, 0.0147599820547331,
0.0139075818752804, 0.0331987438313145, 0.00603409600717811,
0.000493494840735756, 0.000829968595782862, 0.00134589502018843,
0.000381336922386721, 0.0230820995962315, 6.72947510094213e-05,
0.00327501121579183, 0.0119560340960072, 0.0243831314490803,
0.00536114849708389, 6.72947510094213e-05, 8.97263346792284e-05,
0.00152534768954688, 0.000426200089726335, 0.00585464333781965,
8.97263346792284e-05, 0.000807537012113055, 0.0151861821444594,
0.0431135038133692, 0.00612382234185734, 0, 0, 0.00109914759982055,
0.00015702108568865, 0.00206370569762225, 0, 6.72947510094213e-05,
0.0243382682817407, 0.022790489008524, 0.00502467474203679, 0,
0, 0.00168236877523553, 0.00015702108568865, 0.000740242261103634,
0, 0, 0.00697622252131, 0.00912965455361149, 0.00572005383580081,
0, 0, 0.00132346343651862, 4.48631673396142e-05, 0.000224315836698071,
2.24315836698071e-05, 0, 0.00255720053835801, 0.00614625392552714,
0.00421713772992373, 0, 0, 0.0015702108568865, 4.48631673396142e-05,
0.000403768506056528, 4.48631673396142e-05, 0.000785105428443248,
0.00161507402422611, 0.00937640197397936, 0.00139075818752804,
0, 0, 0.00165993719156572, 0, 0.00143562135486765), .Dim = c(10L,
10L), .Dimnames = list(NULL, c("0", "1", "2", "3", "4", "5",
"6", "7", "8", "9")))
If these zeros are caused by a physical measurement which should yield strictly positive results but fails to do so for technical reasons, it might be reasonable to substitute half of the lower limit of detection for the zeros.
M2 <- M                               # work on a copy
print(min(M[M != 0]), digits = 16)    # smallest nonzero entry = detection limit
#[1] 2.24315836698071e-05
M2[M2 == 0] <- 0.5 * min(M[M != 0])   # replace zeros with half that value
image(M2)
image(log(M2))                        # no -Inf values now
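Since the stated goal is image.plot() from the fields package, the same substitution should work there as well; a minimal sketch, assuming M holds the probability matrix from the question:
library(fields)                       # provides image.plot()
M2 <- M
M2[M2 == 0] <- 0.5 * min(M[M != 0])   # half the smallest nonzero probability
image.plot(log(M2))                   # no -Inf left, so seq() no longer fails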
True, a log plot may make "the difference between entries more noticeable". However, if you have zeros in your data, you'd be using it wrong. The point of a logarithmic scale is to illustrate exponential increases in the data. Having zeros, however, means that either:
- the values observed were not produced by a process exhibiting exponential growth, or
- you need to handle your missing values differently.
Either way, what would work a lot better in your case is taking the square root of the values, or the n-th root (n > 2) if you want to accentuate the difference in values even more -- the higher the n, the bigger the difference.
As per @flodel's suggestion, the code that would do this is image.plot(sqrt(x)) or, more generally, image.plot(x^(1/n)) for some n > 1.
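For instance, a quick sketch of the root transform, assuming M is the probability matrix from the question and the fields package is loaded for image.plot():
library(fields)
image.plot(sqrt(M))    # square root: zeros stay 0, small values get spread out
image.plot(M^(1/4))    # a higher root accentuates the differences even more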
Hope this helps.
A simple trick is to add 1, since log(1) = 0, so cells with 0 will still be 0 after the log transformation.
k <- matrix(c(1:8, 0, 0), nrow = 2, ncol = 5)
> k
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 5 7 0
[2,] 2 4 6 8 0
log(k)
[,1] [,2] [,3] [,4] [,5]
[1,] 0.0000000 1.098612 1.609438 1.945910 -Inf
[2,] 0.6931472 1.386294 1.791759 2.079442 -Inf
log(k+1)
[,1] [,2] [,3] [,4] [,5]
[1,] 0.6931472 1.386294 1.791759 2.079442 0
[2,] 1.0986123 1.609438 1.945910 2.197225 0
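One caveat worth noting: base R also has log1p(), which computes log(1 + x) with better numerical accuracy than log(x + 1) when x is tiny, as it is for small probabilities:
log1p(k)   # same result as log(k + 1), but accurate for very small entries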
The exception is thrown by seq(), which cannot take -Inf as any of its arguments. You can get exactly the same type of error with the following code:
> seq(-log(0), 0, 50)
Error in seq.default(-log(0), 0, 50) : invalid (to - from)/by in seq(.)
To avoid it, follow @Metrics's trick, although I would suggest adding a very small value such as 1e-22 instead of 1.0, since your matrix is a matrix of probabilities.
Can't paste multiple lines of code in a comment, but this example shows what I meant:
> m=cbind(c(0,0.88,0.99),c(1,2,1),c(3,4,5))
> m=as.matrix(m)
> log(m)
[,1] [,2] [,3]
[1,] -Inf 0.0000000 1.098612
[2,] -0.12783337 0.6931472 1.386294
[3,] -0.01005034 0.0000000 1.609438
> m
[,1] [,2] [,3]
[1,] 0.00 1 3
[2,] 0.88 2 4
[3,] 0.99 1 5
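Putting this together for the matrix in the question, a minimal sketch (assuming M is the probability matrix and the fields package is loaded for image.plot):
eps <- 1e-22               # far below the smallest nonzero probability
image.plot(log(M + eps))   # zeros map to log(eps), about -50.66, not -Inf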
I have a data.frame containing a column of numeric values (prcp_log).
waterdate PRCP prcp_log
<date> <dbl> <dbl>
1 2007-10-01 0 0
2 2007-10-02 0.02 0.0198
3 2007-10-03 0.31 0.270
4 2007-10-04 1.8 1.03
5 2007-10-05 0.03 0.0296
6 2007-10-06 0.19 0.174
I then pass this data through the Christiano-Fitzgerald band-pass filter using the following command from the mFilter package.
library(mFilter)
US1ORLA0076_cffilter <- cffilter(US1ORLA0076$prcp_log, pl = 180, pu = 365,
                                 root = FALSE, drift = FALSE,
                                 type = "asymmetric", nfix = NULL, theta = 1)
This creates an S3 object containing, among other things, a vector of "trend" values and a vector of "cycle" values, like so:
head(US1ORLA0076_cffilter$trend)
[,1]
[1,] 0.05439408
[2,] 0.07275321
[3,] 0.32150292
[4,] 1.07958965
[5,] 0.07799329
[6,] 0.22082246
head(US1ORLA0076_cffilter$cycle)
[,1]
[1,] -0.05439408
[2,] -0.05295058
[3,] -0.05147578
[4,] -0.04997023
[5,] -0.04843449
[6,] -0.04686915
Plotted:
plot(US1ORLA0076_cffilter)
I then apply the following mathematical operation in an attempt to remove the trend and seasonal components from the original numeric vector:
US1ORLA0076$decomp <- ((US1ORLA0076$prcp_log - US1ORLA0076_cffilter$trend) - US1ORLA0076_cffilter$cycle)
This creates an output of values that includes unexpected elements such as dashes and letters.
head(US1ORLA0076$decomp)
[,1]
[1,] 0.000000e+00
[2,] 0.000000e+00
[3,] 1.387779e-17
[4,] -2.775558e-17
[5,] 0.000000e+00
[6,] 6.938894e-18
What has happened here? What do these additional characters signify? How can I perform this mathematical operation and achieve the desired output of simply $prcp_log minus both the $trend and $cycle values?
I am happy to provide any additional info that will help right away, just ask.
I'm using the R package geigen to solve the generalized eigenvalue problem A*v = lambda*B*v.
This is the code:
geigen(Gamma_chi_0, diag(diag(Gamma_xi_0)),symmetric=TRUE, only.values=FALSE) #GENERALIZED EIGENVALUE PROBLEM
Where:
Gamma_chi_0
[,1] [,2] [,3] [,4] [,5]
[1,] 1.02346 -0.50204 0.41122 -0.73066 0.00072
[2,] -0.50204 0.96712 -0.33526 0.51774 -0.37708
[3,] 0.41122 -0.33526 1.05086 0.09798 0.09274
[4,] -0.73066 0.51774 0.09798 0.99780 -0.51596
[5,] 0.00072 -0.37708 0.09274 -0.51596 1.03354
and
diag(diag(Gamma_xi_0))
[,1] [,2] [,3] [,4] [,5]
[1,] -0.0234 0.0000 0.0000 0.0000 0.0000
[2,] 0.0000 0.0329 0.0000 0.0000 0.0000
[3,] 0.0000 0.0000 -0.0509 0.0000 0.0000
[4,] 0.0000 0.0000 0.0000 0.0022 0.0000
[5,] 0.0000 0.0000 0.0000 0.0000 -0.0335
But I get this error:
> geigen(Gamma_chi_0, diag(diag(Gamma_xi_0)), only.values=FALSE)
Error in .sygv_Lapackerror(z$info, n) :
Leading minor of order 1 of B is not positive definite
In Matlab, using the same two matrices, it works:
opt.disp = 0;
[P, D] = eigs(Gamma_chi_0, diag(diag(Gamma_xi_0)),r,'LM',opt);
% compute first r generalized eigenvectors and eigenvalues
For example I get the following eigenvalues matrix
D =
427.8208 0
0 -38.6419
Of course in Matlab I computed only the first r = 2; in R I want all the eigenvalues and eigenvectors (n = 5), and then I subset the first 2.
Can someone help me to solve this?
geigen has detected a symmetric matrix for Gamma_chi_0; Lapack then encounters an error and cannot continue. Specify symmetric=FALSE in the call of geigen. The manual describes what the argument symmetric does. Do this:
B <- diag(diag(Gamma_xi_0))
geigen(Gamma_chi_0, B, symmetric=FALSE, only.values=FALSE)
The result is (on my computer)
$values
[1] 4.312749e+02 -3.869203e+01 -2.328465e+01 1.706288e-05 1.840783e+01
$vectors
[,1] [,2] [,3] [,4] [,5]
[1,] -0.067535068 1.0000000 0.2249715 -0.89744514 0.05194799
[2,] -0.035746438 0.1094176 0.3273440 0.03714518 1.00000000
[3,] 0.005083806 0.3782606 0.8588086 0.50306323 0.17858115
[4,] -1.000000000 0.2986963 0.4067701 -1.00000000 -0.48314183
[5,] -0.034226056 -0.6075727 1.0000000 -0.53017872 0.06738515
$alpha
[1] 1.365959e+00 -1.152686e+00 -9.202769e-01 4.352770e-07 5.588102e-01
$beta
[1] 0.003167259 0.029791306 0.039522893 0.025510167 0.030357208
This is quite close to what you show for Matlab. I know nothing about Matlab so I cannot help you with that.
Addendum
Matlab seems to use similar methods as geigen when the matrices used are determined to be symmetric or not. Your matrix Gamma_chi_0 may not be exactly symmetric. See the Matlab documentation for the 'algorithm' argument of eig.
More addendum
In actual fact your matrix B is not positive definite: its diagonal contains negative entries. Try the function chol of base R and you'll get a similar error message. In this case you have to force geigen to use the general algorithm.
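A quick check, using B <- diag(diag(Gamma_xi_0)) as defined above (its first diagonal entry, -0.0234, is already negative):
chol(B)
# Error in chol.default(B) :
#   the leading minor of order 1 is not positive definite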
I have this dataset:
dbppre dbppost per1pre per1post per2pre per2post
0.544331824055634 0.426482748529805 1.10388140870983 1.14622255457398 1.007302668 1.489675646
0.44544008292805 0.300746382647025 0.891104906479033 0.876840408251785 0.919450773 0.892276804
0.734783578764543 0.489971007532308 1.02796075709944 0.79655130374748 0.610340504 0.936092006
1.04113077142586 0.386513119551008 0.965359488375859 1.04314173155816 1.122001994 0.638452078
0.333368637355291 0.525460160226716 NA 0.633435747 1.196988457 0.396543005
1.76769244892893 0.726077921840058 1.08060419667991 0.974269083108835 1.245643507 1.292857474
1.41486783 NA 0.910710353033318 1.03435985624106 0.959985314 1.244732938
1.01932795229362 0.624195252685448 1.27809687379565 1.59656046306852 1.076534265 0.848544508
1.3919315726037 0.728230610741795 0.817900465495852 1.24505216554384 0.796182044 1.47318564
1.48912544220417 0.897585509143984 0.878534099910696 1.12148645028777 1.096723799 1.312244217
1.56801709691326 0.816474814896344 1.13655475536592 1.01299018097117 1.226607978 0.863016615
1.34144721808244 0.596169010679233 1.889775937 NA 1.094095173 1.515202105
1.17409999971024 0.626873517936125 0.912837009713984 0.814632450682884 0.898149331 0.887216585
1.06862027138743 0.427855128881696 0.727537839417515 1.15967069522768 0.98168375 1.407271061
1.50406121956726 0.507362673558659 1.780752715 0.658835953 2.008229626 1.231869338
1.44980944220763 0.620658801480513 0.885827192590202 0.651268425772394 1.067548223 0.994736445
1.27975202574336 0.877955236879164 0.595981804265367 0.56002696152466 0.770642278 0.519875921
0.675518080750329 0.38478948746306 0.822745530980815 0.796051785239611 1.16899539 1.16658889
0.839686262472682 0.481534573379965 0.632380676760052 0.656052506855686 0.796504954 1.035781891
.
.
.
As you can see, there are multiple quantitative variables for gene expression data, each gene measured twice, pre and post treatment, with some missing values in some of the variables.
Each row corresponds to one individual, and they are also divided in two groups (0 = control, 1 = really treated).
I would like to compute a correlation (Spearman or Pearson depending on normality), but by group, obtaining the correlation value and the p-value for significance while avoiding the NAs.
Is it possible?
I know how to use the cor.test() function to compare two variables, but I could not find any argument inside this function to take groups into account.
I also discovered the plyr and data.table libraries, which can work by groups, but they return just the correlation value without the p-value, and I haven't been able to make them work for variables with NAs.
Suggestions?
You could use the Hmisc package.
library(Hmisc)
set.seed(10)
dt<-matrix(rnorm(100),5,5) #create matrix
dt[1,1]<-NA #introduce NAs
dt[2,4]<-NA #introduce NAs
corspear<-rcorr(dt, type="spearman") #spearman correlation
corpear<-rcorr(dt, type="pearson") #pearson correlation
> corspear
[,1] [,2] [,3] [,4] [,5]
[1,] 1.0 0.4 0.2 0.5 -0.4
[2,] 0.4 1.0 0.1 -0.4 0.8
[3,] 0.2 0.1 1.0 0.4 0.1
[4,] 0.5 -0.4 0.4 1.0 -0.8
[5,] -0.4 0.8 0.1 -0.8 1.0
n
[,1] [,2] [,3] [,4] [,5]
[1,] 4 4 4 3 4
[2,] 4 5 5 4 5
[3,] 4 5 5 4 5
[4,] 3 4 4 4 4
[5,] 4 5 5 4 5
P
[,1] [,2] [,3] [,4] [,5]
[1,] 0.6000 0.8000 0.6667 0.6000
[2,] 0.6000 0.8729 0.6000 0.1041
[3,] 0.8000 0.8729 0.6000 0.8729
[4,] 0.6667 0.6000 0.6000 0.2000
[5,] 0.6000 0.1041 0.8729 0.2000
For further details see the help section: ?rcorr
rcorr returns a list with elements r, the matrix of correlations, n
the matrix of number of observations used in analyzing each pair of
variables, and P, the asymptotic P-values. Pairs with fewer than 2
non-missing values have the r values set to NA.
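rcorr() computes correlations on pairwise-complete observations, so the NAs are handled per pair. To add the grouping from the question, one hedged option is to split the data by group and run rcorr() on each subset. A sketch, assuming your data frame is df with a 0/1 column named group (the names here are illustrative):
# correlation, n, and p-value matrices for each group separately
by(df[, c("dbppre", "dbppost", "per1pre", "per1post")], df$group,
   function(d) rcorr(as.matrix(d), type = "spearman"))
# for a single pair, cor.test() also drops incomplete pairs automatically
by(df, df$group, function(d) cor.test(d$dbppre, d$dbppost, method = "spearman"))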
Please, I'm trying to solve a 7x2 matrix problem of the form below using R software:
A=array(c(5.54,0.96,1.59,2.07,0.73,10.64,8.28,1.41,3.77,3.11,3.74,2.93,8.29,3.33), c(7,2))
A
# [,1] [,2]
#[1,] 5.54 1.41
#[2,] 0.96 3.77
#[3,] 1.59 3.11
#[4,] 2.07 3.74
#[5,] 0.73 2.93
#[6,] 10.64 8.29
#[7,] 8.28 3.33
b=c(80814.25,34334.75,47921.75,59514.25,26981.25,63010.25,46646.25)
b
#[1] 80814.25 34334.75 47921.75 59514.25 26981.25 63010.25 46646.25
solve (A,b)
Error in solve.default(A, b) : 'a' (7 x 2) must be square
A %*% solve (A,b)
Error in solve.default(A, b) : 'a' (7 x 2) must be square
What do you think I can do to solve the problem? I need a solution for the two variables, x1 and x2, in the 7x2 system stated above.
It seems that you're using solve when it needs a square input. ?solve discusses how you can use qr.solve for non-square matrices; for an overdetermined system like this it returns the least-squares solution.
qr.solve(A,b)
[,1]
[1,] 3741.208
[2,] 6552.174
You might want to check that this is correct for your purposes. There are other ways to solve these types of problems. This might help you though.
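As a cross-check, the least-squares solution that qr.solve returns can be reproduced with lm() by regressing b on the columns of A without an intercept:
coef(lm(b ~ A - 1))   # same two coefficients: 3741.208 and 6552.174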
The corpcor package offers a pseudoinverse function for finding the inverse of a rectangular matrix:
library(corpcor)
pseudoinverse(A)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.06271597 -0.05067830 -0.02922597 -0.03265713 -0.03964039 0.0230086
[2,] -0.05845856 0.08551514 0.05661287 0.06532450 0.06674243 0.0391552
[,7]
[1,] 0.07239133
[2,] -0.05420334
pseudoinverse(A) %*% b
[,1]
[1,] 3741.208
[2,] 6552.174
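The Moore-Penrose pseudoinverse is also available as ginv() in the MASS package (shipped with R), if you would rather avoid an extra dependency:
library(MASS)
ginv(A) %*% b   # same least-squares solution as qr.solve and pseudoinverse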
I read about poly() in R and I think it should produce orthogonal polynomials, so when we use it in a regression model like lm(y ~ poly(x, 2)) the predictors are uncorrelated. However:
poly(1:3, 2)
[1,] -7.071068e-01 0.4082483
[2,] -7.850462e-17 -0.8164966
[3,] 7.071068e-01 0.4082483
I think this is probably a stupid question, but what I don't understand is: the column vectors of the result of poly(1:3,2) do not seem to have inner product zero. That is, -7.07*0.40 - 7.85*(-0.82) + 7.07*0.41 != 0? So how are these uncorrelated predictors for regression?
Your main problem is that you're missing the meaning of the e or "E notation": as commented by @MamounBenghezal above, fffeggg is shorthand for fff * 10^(ggg).
I get slightly different answers than you do (the difference is numerically trivial) because I'm running this on a different platform:
pp <- poly(1:3,2)
## 1 2
## [1,] -7.071068e-01 0.4082483
## [2,] 4.350720e-18 -0.8164966
## [3,] 7.071068e-01 0.4082483
An easier format to see:
print(zapsmall(matrix(c(pp),3,2)),digits=3)
## [,1] [,2]
## [1,] -0.707 0.408
## [2,] 0.000 -0.816
## [3,] 0.707 0.408
sum(pp[,1]*pp[,2]) ## 5.196039e-17, effectively zero
Or to use your example, with the correct placement of decimal points:
-0.707*0.408-(7.85e-17)*(-0.82)+(0.707)*0.408
## [1] 5.551115e-17
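To confirm numerically that the columns of poly() really are orthogonal (and hence uncorrelated as regression predictors), a small check:
pp <- poly(1:3, 2)
zapsmall(crossprod(pp))   # t(pp) %*% pp is the identity matrix up to rounding
cor(pp[, 1], pp[, 2])     # effectively zero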