I have a procedure that takes data and creates a square matrix M, where the elements of M correspond to certain features of the data. I wish to get a confidence interval for each element of the matrix to measure the uncertainty around each feature that I'm estimating. To do this, I want to bootstrap, so I initialize a list in R, resample my data, and store the resulting matrices in the list. How can I then estimate the 95% confidence interval for each element?
For example, say my original matrix M is 10 by 10, and the list of bootstrapped matrices is
boot_matrices <- list()
for(i in 1:1000){
  boot_matrices[[i]] <- matrix(rnorm(10*10, mean=0, sd=1), nrow=10, ncol=10)
}
I wish to calculate a confidence interval around each (i,j) element in M, where i=1...10 and j=1...10 by using the 1000 bootstrap matrices. In this toy example, I know the parametric distribution of each (i,j) element, and I'd expect the 95% confidence interval for each element to be around (-1.96, 1.96). However, in my real data, I don't know as much, and I don't want to assume as much (hence, using bootstrap). Is there a way to automatically calculate confidence intervals around each matrix element in this case?
How about something like this?
Create nBS bootstrap 10x10 matrices (I'm using set.seed(...) to ensure reproducibility of sample data). Resulting matrices are stored in a list.
# List of bootstrap matrices
nBS <- 1000;
set.seed(2017);
lst <- lapply(1:nBS, function(x)
matrix(rnorm(10 * 10, mean = 0, sd = 1), nrow = 10, ncol = 10));
Calculate the mean and standard deviation for every matrix element (i,j) across all bootstrap samples:
# Calculate mean and sd of every matrix element (i,j)
mat.mean <- apply(simplify2array(lst), c(1, 2), mean);
mat.sd <- apply(simplify2array(lst), c(1, 2), sd);
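For reference (not part of the original answer), simplify2array(lst) stacks the bootstrap matrices into a 10 x 10 x 1000 array, so apply(..., c(1, 2), ...) summarises across the third (bootstrap) dimension:
dim(simplify2array(lst));   # 10 10 1000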
Calculate the 95% confidence interval for the mean of each element as mean ± 1.96 * SEM:
# Calculate lower and upper 95% confidence interval
mat.lowerCI <- mat.mean - 1.96 * mat.sd / sqrt(nBS);
mat.upperCI <- mat.mean + 1.96 * mat.sd / sqrt(nBS);
Show mat.mean:
mat.mean;
[,1] [,2] [,3] [,4] [,5]
[1,] -0.011862801 -0.017872385 -2.059780e-02 -0.056602452 -0.077408704
[2,] 0.083863805 -0.057467756 -7.920189e-03 0.001923072 -0.010616517
[3,] -0.021193913 -0.021594100 -3.069827e-03 0.082500345 -0.015010818
[4,] -0.001063529 -0.028606045 6.366336e-02 0.021871973 0.014491280
[5,] -0.042912905 -0.020031203 7.075698e-03 0.032309070 0.051875125
[6,] -0.028336190 -0.055650895 -1.119998e-02 -0.030252861 -0.008670326
[7,] 0.006555878 -0.008686383 -1.928690e-02 -0.027290181 -0.002037219
[8,] 0.001513634 -0.057669094 -6.025466e-03 0.028409560 0.052159330
[9,] 0.044741065 -0.026265301 3.915427e-02 -0.011599341 0.006817949
[10,] 0.035356686 -0.039949595 -5.468612e-05 0.007272050 0.013150241
[,6] [,7] [,8] [,9] [,10]
[1,] 0.054420568 0.0050127337 -0.046358349 -0.029833662 -0.0525282034
[2,] -0.033703118 -0.0623761140 -0.029511715 -0.048816905 -0.0189984349
[3,] -0.013218223 -0.0278959480 -0.036351073 0.028833428 -0.0001538902
[4,] 0.029236408 -0.0046022995 0.019077031 0.069887669 -0.0283910941
[5,] -0.035474785 0.0372263523 0.021329823 0.006252149 0.0395028012
[6,] 0.008978299 0.0266740599 -0.006252266 -0.005793750 0.0072594645
[7,] 0.092958577 0.0047135528 0.019320387 0.011766436 -0.0021045223
[8,] 0.014867452 -0.0001325218 0.014760887 -0.027671024 0.0610503856
[9,] -0.031151561 0.0373095832 0.016197685 -0.050206244 -0.0561044648
[10,] 0.059817479 -0.0669659941 0.020218135 -0.039548025 0.0115156843
Lower and upper 95% confidence intervals are given in mat.lowerCI and mat.upperCI.
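A side note, not part of the original answer: mean ± 1.96 * SEM is an interval for the mean of each element across the bootstrap replicates. If an interval for the element itself is wanted (as in the toy example, where roughly (-1.96, 1.96) is expected), a percentile-style interval can be read straight from the replicates; a minimal sketch:
# Percentile-style 95% interval per element: 2.5% and 97.5% quantiles
# across the 1000 bootstrap replicates
mat.lower95 <- apply(simplify2array(lst), c(1, 2), quantile, probs = 0.025);
mat.upper95 <- apply(simplify2array(lst), c(1, 2), quantile, probs = 0.975);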
I have a 2134 by 2134 matrix of correlation values, and I would like to count the total number of values that are above 0.8 or below -0.8. I have tried
length(TFcoTF[TFcoTF>.8])
but this does not seem to be correct, as it gives me about 50 percent of the values above 0.8, which does not match the histogram I have of the data. Also, when I do
length(TFcoTF[TFcoTF<-.8])
I got 0 as the output. Any help is appreciated.
The data.table package has a function called between(), which returns TRUE/FALSE for each value in your matrix depending on whether it lies between two bounds.
In the example below, I create a random 10x10 matrix with values in [-1, +1] and use length() on the subset of values inside the range [-0.8, +0.8].
library(data.table)
data <- matrix(runif(100,-1,1), nrow = 10, ncol=10)
data
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0.05585901 -0.7497720 -0.8371569 -0.401079424 -0.4130752 -0.788961736 0.2909987 0.48965177 0.4076504 -0.0682856
[2,] -0.42442920 0.7476111 0.8238973 -0.912507391 -0.4450897 -0.001308901 0.5151425 -0.16838841 -0.1648151 0.8370660
[3,] -0.73295874 0.5271986 0.5822628 -0.008554908 -0.2785803 -0.499058508 -0.5661172 0.35957967 0.5807055 0.2350893
[4,] 0.18949338 0.3827603 -0.6112584 0.209209240 -0.5883962 -0.087900052 0.1272227 0.58165922 -0.9950324 -0.9118599
[5,] 0.40862973 0.9496163 0.4996253 0.079538601 0.9839763 -0.119883751 0.3667418 -0.02751815 -0.6724141 0.3217434
[6,] 0.77338548 -0.7698167 -0.5632436 0.223301216 -0.9936610 0.650110638 -0.9400395 -0.47808065 -0.1579283 -0.6896787
[7,] 0.93210326 0.5360980 0.7677325 0.815231731 -0.4320206 0.647954028 0.5180600 -0.09574138 -0.3848389 0.9726445
[8,] -0.66411834 0.1125759 -0.4021577 -0.711363103 0.7161801 -0.071971464 0.7953436 0.40326575 0.6895480 0.7496597
[9,] 0.14118154 0.4775983 0.8966069 0.852880293 0.4715885 -0.542526148 0.5200246 -0.62649677 -0.3677738 0.1961003
[10,] -0.59353193 -0.2358892 0.5769562 -0.287113142 -0.7100862 -0.107092848 -0.8101459 -0.46754146 -0.4082147 -0.4475972
length(data[between(data,-0.8,0.8)])
[1] 84
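Since the original question asks for the count of values above 0.8 or below -0.8, that is the complement of the count above:
length(data) - length(data[between(data, -0.8, 0.8)])
[1] 16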
It's difficult to answer without your dataset; please provide a minimal reproducible example next time.
Your first line of code looks correct.
For the second, the problem is a syntax error: in R you can assign a value with = or <-, so x<-1 assigns 1 to x, whereas x < -1 (note the space) returns a logical.
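A small illustration of the parsing difference, using a throwaway scalar x:
x <- 0.5
x<-.8    # parsed as x <- .8, i.e. assignment: x is now 0.8
x < -.8  # parsed as a comparison: returns FALSE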
You can then combine logical values and run the code below:
set.seed(42)
m <- matrix(runif(25, min = -1, max = 1), nrow = 5, ncol = 5)
m
length(m[m > .8]) + length(m[m < -.8])  # long version of what you did
length(m[m < -.8 | m > .8])             # | means OR: TRUE | FALSE returns TRUE
sum(m > .8 | m < -.8)                   # summing a logical vector counts the TRUEs,
                                        # since TRUE and FALSE are coerced to 1 and 0
sum(abs(m) > .8)                        # the shortest version
I am trying to generate a matrix of joint probabilities. It is a symmetric matrix: the main diagonal elements are interpreted as the probabilities $p(A_i)$ that a binary variable $A_i$ equals 1, and the off-diagonal elements are the probabilities $p(A_i A_j)$ that both $A_i$ and $A_j$ are 1. The matrix should satisfy the following conditions:
$0 \le p(A_i) \le 1$
$\max(0,\ p(A_i) + p(A_j) - 1) \le p(A_i A_j) \le \min(p(A_i), p(A_j)), \quad i \ne j$
$p(A_i) + p(A_j) + p(A_k) - p(A_i A_j) - p(A_i A_k) - p(A_j A_k) \le 1, \quad i \ne j,\ i \ne k,\ j \ne k$
These conditions can be checked with check.commonprob() from the bindata package.
I built a function to generate this matrix while respecting these conditions:
library(bindata)  # for check.commonprob()

# First, a helper to make the matrix symmetric
makeSymm <- function(m) {
  m[upper.tri(m)] <- t(m)[upper.tri(m)]
  return(m)
}

b <- matrix(0, 10, 10)

# The function that generates joint probabilities by rejection sampling
joint <- function(b, x, y, u, z, k, m) {
  repeat {
    diag(b) <- runif(k, min = x, max = y)
    b[lower.tri(b, diag = FALSE)] <- runif(m, min = u, max = z)
    b <- makeSymm(b)
    if (check.commonprob(b)) break
  }
  return(b)
}
Since b is a 10x10 matrix, there are 10 diagonal elements and 45 elements in the lower triangle. I got this result:
b=joint(b,0.4,0.6,0.2,0.4,10,45)
> b
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0.4479626 0.2128775 0.3103472 0.2342798 0.2719423 0.3114339 0.3978305
[2,] 0.2128775 0.4413829 0.2603543 0.2935595 0.2556380 0.2486850 0.2694443
[3,] 0.3103472 0.2603543 0.5170409 0.3003153 0.2651415 0.3410199 0.2321201
[4,] 0.2342798 0.2935595 0.3003153 0.5930984 0.2719581 0.3982266 0.3157343
[5,] 0.2719423 0.2556380 0.2651415 0.2719581 0.4031691 0.2157856 0.3016181
[6,] 0.3114339 0.2486850 0.3410199 0.3982266 0.2157856 0.4042654 0.2595399
[7,] 0.3978305 0.2694443 0.2321201 0.3157343 0.3016181 0.2595399 0.5195244
[8,] 0.3154185 0.3174374 0.2920965 0.3259053 0.2847335 0.3560568 0.2070868
[9,] 0.2892746 0.2510410 0.3232922 0.2970148 0.3070217 0.3445408 0.3180946
[10,] 0.2948818 0.2264481 0.3210267 0.2866854 0.3783635 0.3427585 0.2306935
[,8] [,9] [,10]
[1,] 0.3154185 0.2892746 0.2948818
[2,] 0.3174374 0.2510410 0.2264481
[3,] 0.2920965 0.3232922 0.3210267
[4,] 0.3259053 0.2970148 0.2866854
[5,] 0.2847335 0.3070217 0.3783635
[6,] 0.3560568 0.3445408 0.3427585
[7,] 0.2070868 0.3180946 0.2306935
[8,] 0.5958957 0.2710500 0.2318991
[9,] 0.2710500 0.5003779 0.2512744
[10,] 0.2318991 0.2512744 0.5004233
Up to now everything seems fine, but when I tried to generate a 100x100 matrix I noticed that beyond roughly 20x20 the running time becomes extremely long (hours), and I have to stop the run before getting a result.
Do you have any suggestions to improve this function so I can try it on a 100x100 matrix? Also, can I specify the mean and the standard deviation of the joint probability matrix in advance? Thanks!
If you are simply trying to generate examples of such matrices and don't have any other constraints, you can do so by generating observations from a population that would be implicitly described by such a matrix and then tabulating the observed probabilities. Start by writing a function which does the tabulation:
p.matrix <- function(A){
n <- nrow(A)
k <- ncol(A)
outer(1:n,1:n,Vectorize(function(i,j) sum(A[i,]*A[j,])))/k
}
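As a quick illustration (a made-up 3 x 5 binary matrix, not from the original answer), the diagonal holds each variable's frequency of 1s and the off-diagonals hold the joint frequencies:
A <- rbind(c(1, 0, 1, 1, 0),
           c(1, 1, 0, 1, 0),
           c(0, 0, 1, 0, 1))
p.matrix(A)
#      [,1] [,2] [,3]
# [1,]  0.6  0.4  0.2
# [2,]  0.4  0.6  0.0
# [3,]  0.2  0.0  0.4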
The above function can take any binary matrix and turn it into a matrix of probabilities that will satisfy check.commonprob. To get a matrix of a given size you can do something like:
prob.matrix <- function(n,p = 0.5){
k <- max(1000,10*n^2)
pop <- replicate(k,ifelse(runif(n) < p,1,0))
p.matrix(pop)
}
For example:
> M <- prob.matrix(4,c(0.1,0.9,0.3,0.4))
> M
[,1] [,2] [,3] [,4]
[1,] 0.098 0.090 0.019 0.042
[2,] 0.090 0.903 0.278 0.366
[3,] 0.019 0.278 0.306 0.121
[4,] 0.042 0.366 0.121 0.410
> bindata::check.commonprob(M)
[1] TRUE
For n = 100 this takes about 30 seconds on my machine.
In this function the resulting variables are basically uncorrelated. To get correlated variables, replace the simple ifelse() call by a custom function which e.g. doesn't allow for runs of 3 or more consecutive 1's. If you want finer control on the correlations, you would need to first be clear on just what you would want them to be.
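For illustration only (this is not from the answer above), one simple way to induce positive correlation while keeping the marginal probability p is to mix a shared latent draw with independent draws; the function name below is made up:
prob.matrix.cor <- function(n, p = 0.5, rho = 0.3){
  k <- max(1000, 10 * n^2)
  pop <- replicate(k, {
    u      <- runif(1)
    common <- as.numeric(u < p)          # one draw shared by all n variables
    own    <- as.numeric(runif(n) < p)   # n independent draws
    ifelse(runif(n) < rho, common, own)  # use the shared draw with probability rho
  })
  p.matrix(pop)                          # tabulate as before
}
Each variable keeps marginal probability p, and the pairwise correlation comes out at roughly rho^2, so rho gives coarse control over how strongly the variables are correlated.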
I have spent a fair amount of time searching for an answer to my novice question and am still confused. I am trying to plot the initial magnetization of an FID versus time. My initial magnetizations are in a matrix, and the time values corresponding to each column of the matrix are in a list. How do I run nls for an exponential decay over each column of data with the corresponding value in the list of times? I want nls to take the first time value from the list, use the initial magnetization values column-wise, and return the rates in a matrix of the same dimensions as m0_matrix.
> m0_matrix
[,1] [,2] [,3] [,4]
[1,] 19439311560 15064186946 11602185622 9009147617
[2,] 9437620734 7135488585 5348160563 4156154903
[3,] 11931439242 9584153017 7765094983 6470870180
[4,] 9367920785 7612552829 5927424214 4331819248
[5,] 12077347835 8892705185 6866664357 5530601653
[6,] 20191716524 15729555553 11920147205 8964406945
[7,] 20177137879 15744074858 12364404080 9971845743
[8,] 15990100401 12464163359 9724743390 8294038306
[9,] 19409862926 16085027074 13110425604 10330007806
[10,] 15367044986 11994945813 9565243969 7535061239
r2_from_decay_matrix = matrix(data = NA, nrow = nrow(m0_matrix),
                              ncol = ncol(m0_matrix))
t <- c(0.1, 0.2, 0.3, 0.4)
for (i in seq(1, nrow(m0_matrix))) {
  m0 <- m0_matrix[,i]
  t <- t[i]
  r <- 1
  mCPMG_function <- function(m0, t)
    results <- paste(a = m0, b = t)
  mCPMG_formula <- mCPMG ~ m0*exp(-r*t)
  fit_start <- c(m0 = 19439311560, r = 1)
  fit_data <- list(m0 = m0, t = t)
  r2 <- nls(mCPMG_formula, fit_data, fit_start)
  r2_from_decay_matrix <- r2$m$getPars()["r"][i]
}
Thank you for helping!
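A hedged sketch of one possible approach (not the poster's final code; it assumes each column of m0_matrix was measured at the matching entry of the time vector): fit m0 * exp(-r * t) row by row and collect one rate per row, rather than filling a full matrix:
t_vals <- c(0.1, 0.2, 0.3, 0.4)
rates  <- numeric(nrow(m0_matrix))
for (i in seq_len(nrow(m0_matrix))) {
  d   <- data.frame(y = m0_matrix[i, ], t = t_vals)
  fit <- nls(y ~ m0 * exp(-r * t), data = d,
             start = list(m0 = m0_matrix[i, 1], r = 1))  # starting values may need tuning
  rates[i] <- coef(fit)["r"]
}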
I am trying to get the Hessian matrix from my own data, and I have two results:
using hessian() from library(numDeriv)
using numericHessian() from library(maxLik)
The result from hessian() is very, very small relative to the result from numericHessian().
In this case, which result should I trust?
Specifically, the data I used ranged from 350000 to 1100000 and formed a 9x2 matrix, i.e. 18 data values in total.
I used a sort of standard deviation formula; the result from numericHessian() was a 2x2 matrix with values ranging from 230 to 466, whereas the result from hessian() ranged from -3.42e-18 to 1.34e-17, far smaller than the former.
Which one do you think is the correct calculation for this sort of standard deviation?
The code is as follows:
data=read.table("C:/file.txt", header=T);
data <- as.matrix(data);
library(plyr)
library(MASS)
w1 = tail(data/(rowSums(data)),1)
w2 = t(w1)
f <- function(x){
w1 = tail(x/(rowSums(x)),1)
w2 = t(w1)
r = ((w1%*%cov(cbind(x))%*%w2)^(1/2))
return(r)
}
library(maxLik);
numericHessian(f, t0=rbind(data[1,1], data[1,2]))
library(numDeriv);
hessian(f, rbind(data[1,1], data[1,2]), method="Richardson")
The file.txt is the following:
1 2
137 201
122 342
142 111
171 126
134 123
823 876
634 135
541 214
423 142
The result from the "numericHessian" is:
[,1] [,2]
[1,] 0.007105427 0.007105427
[2,] 0.007105427 0.000000000
Then, the result from the "Hessian" is:
[,1] [,2]
[1,] -3.217880e-15 -1.957243e-16
[2,] -1.957243e-16 1.334057e-16
Thank you very much in advance.
You have not given a reproducible example, but I'll try anyway.
library(bbmle)
x <- 0:10
y <- c(26, 17, 13, 12, 20, 5, 9, 8, 5, 4, 8)
d <- data.frame(x,y)
LL <- function(ymax=15, xhalf=6)
-sum(stats::dpois(y, lambda=ymax/(1+x/xhalf), log=TRUE))
fit <- mle2(LL)
cc <- coef(fit)
Here are the finite-difference estimates of the Hessians (matrices of second derivatives) of the negative log-likelihood function at the MLE: inverting these matrices gives an estimate of the variance-covariance matrices of the parameters.
library(numDeriv)
hessian(LL,cc)
## [,1] [,2]
## [1,] 1.296717e-01 -1.185789e-15
## [2,] -1.185789e-15 4.922087e+00
library(maxLik)
numericHessian(LL, t0=cc)
## [,1] [,2]
## [1,] 0.1278977 0.000000
## [2,] 0.0000000 4.916956
So for this relatively trivial example, numDeriv::hessian and maxLik::numericHessian give very similar results, so there must be something you haven't shown us, or something special about the numerics of your problem. In order to proceed further, we need a reproducible example, please ...
With the data from your file.txt:
dat <- matrix(c(137,201,122,342,142,111,
171,126,134,123,823,876,
634,135,541,214,423,142),
byrow=TRUE,ncol=2)
f <- function(x){
w1 <- tail(x/(rowSums(x)),1)
sqrt(w1%*%cov(cbind(x))%*%t(w1))
}
p <- t(dat[1,1:2,drop=FALSE])
f(p) ## 45.25483
numDeriv::hessian(f,p)
## [,1] [,2]
## [1,] -3.217880e-15 -1.957243e-16
## [2,] -1.957243e-16 1.334057e-16
maxLik::numericHessian(f,t0=p)
## [,1] [,2]
## [1,] 0.007105427 0.007105427
## [2,] 0.007105427 0.000000000
OK, these clearly disagree. I'm not sure why, but in this particular case we can analyze what you're doing and see which one is right:
- since your input matrix is a single column, x/rowSums(x) is a vector of ones, so the last element (w1 <- tail(...,1)) is just 1.
- so your expression reduces to sqrt(cov(cbind(x))). Again, since x is a one-column matrix, cov() is just the variance, and sqrt(cov(.)) is just the standard deviation, or the norm of the vector.
- the variance is a quadratic function of any element's deviation from the mean, and so the standard deviation is more or less linear in the deviation from the mean (except at zero), so we would expect the second derivatives to be zero. So it looks like numDeriv::hessian is giving the right answer.
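A quick numerical check of that argument (assuming, as above, a two-element column): the standard deviation of two numbers a and b is |a - b| / sqrt(2), a piecewise-linear function whose second derivatives vanish away from a = b:
a <- 137; b <- 201
sd(c(a, b))           # 45.25483, matching f(p) above
abs(a - b) / sqrt(2)  # the same value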
We can also confirm this by increasing eps for numericHessian:
maxLik::numericHessian(f,t0=p,eps=1e-3)
## [,1] [,2]
## [1,] 0 0.000000e+00
## [2,] 0 -7.105427e-09
The bottom line is that numDeriv uses a more accurate (but slower) method, but you can get reasonable answers from numericHessian if you're careful.
I have been searching for a long time for how to calculate a spatial average in R, but I have not found anything. I have a binary file of air temperature (float) with 720 columns and 360 rows. I want first to calculate the spatial average over the whole area, then to calculate the spatial average over a sub-area (columns 500-600 and rows 200-250), and then to extract that area.
t <- file("C:annual_Prc2000_without999_1.img", "rb")
e <- readBin(t, double(), size = 4, n = 720*360, signed = TRUE)
From answer 1:
mean(e)
I got
NaN
From answer 2, I got:
spat_mean = apply(e, c(2,1), mean)
Error in apply(e, c(2, 1), mean) : dim(X) must have a positive length
Let me explain. Assume:
m <- matrix(rnorm(10000),4,4)
> m
[,1] [,2] [,3] [,4]
[1,] 0.7930132 2.1973895 0.063493345 -0.5484056
[2,] 0.5222513 0.4331308 -0.002393336 0.1725495
[3,] 1.7462222 -1.5701996 -2.276781240 0.5628531
[4,] -1.2713361 -0.9349057 0.757412225 1.5118180
> mean(m)
[1] 0.134757
Now I want this:
[,1] [,2] [,3] [,4]
[1,] 0.134757 0.134757 0.134757 0.134757
[2,] 0.134757 0.134757 0.134757 0.134757
[3,] 0.134757 0.134757 0.134757 0.134757
[4,] 0.134757 0.134757 0.134757 0.134757
I did this manually, but I want to do it through R.
Assuming you have a regular spatial grid, you probably want something like this:
set.seed(42)
m <- matrix(rnorm(10000),100,100)
#calculate mean of all values
mean(m)
#calculate mean of region 1:10 * 1:10
mean(m[1:10,1:10])
From your comment to the other answer I gather that you want to calculate the mean value per x,y pair. For this you can use apply. Assuming you have a multi-dimensional array with x,y,time dimensions, getting the mean per x,y pair is done like this:
spat_mean = apply(multi_dim_array, c(1,2), mean)
spat_mean is now a 2d array with x,y dimensions, giving the spatial mean.
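Tying this back to the binary file in the question, a minimal sketch under the stated assumptions (360 rows by 720 columns of floats; byrow = TRUE is a guess about how the file was written, and na.rm = TRUE is a precaution since mean(e) returned NaN):
e_mat <- matrix(e, nrow = 360, ncol = 720, byrow = TRUE)
mean(e_mat, na.rm = TRUE)                    # spatial average over the whole area
mean(e_mat[200:250, 500:600], na.rm = TRUE)  # average over rows 200-250, columns 500-600
# a matrix of the same size filled with the overall mean, as in the 4x4 example
filled <- matrix(mean(e_mat, na.rm = TRUE), nrow = 360, ncol = 720)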