spBayes for an offset model in R

I am running spBayes to fit an 'offset' model y ~ 1.
I have a dataframe like this
   ID   lon lat            y
1   A  90.0 5.9  0.957096100
2   A  90.5 6.0  0.991374969
3   A  91.1 6.0  0.991374969
4   A  92.7 6.1  0.913501740
5   A  94.0 6.1  0.896575928
6   A  97.8 5.2  0.631320953
7   A  98.9 4.4 -0.282432556
8   A 101.2 2.8  1.829053879
9   A 102.3 2.0  0.993621826
10  A 105.8 0.5  0.038677216
where the variable ID is a factor with two levels, A and B. I would like to find an offset for the two IDs. However, when I run
fit.by.ALL <- spLM(formula = y ~ ID, data = df, coords = coords,
                   priors = priors, tuning = tuning, starting = starting,
                   cov.model = "exponential", n.samples = n.samples,
                   verbose = TRUE, n.report = 50)
which gives the result
Iterations = 1:251
Thinning interval = 1
Number of chains = 1
Sample size per chain = 251

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

               Mean     SD Naive SE Time-series SE
(Intercept)  1.0736 2.8674  0.18099        0.18099
IDB         -0.9188 0.1922  0.01213        0.01213

2. Quantiles for each variable:

              2.5%    25%     50%     75%   97.5%
(Intercept) -4.952 -0.773  1.1059  3.0165  6.4824
IDB         -1.303 -1.048 -0.9284 -0.7679 -0.5795
The result doesn't look very stable, as it keeps changing every time I run it.
Moreover, to find the final offset for ID B I need to add the (Intercept) mean to the IDB mean; how does that work for the SD?
Would it be better to run spLM separately for the two IDs (with y ~ 1 instead of y ~ ID)?
Thanks

I am unclear what you mean by "fit an offset model y ~ 1". When I read this, I think you want a model that only has an intercept, but reading further it seems you want a model where you can estimate the mean for both groups, which can be done using
y ~ 0 + ID  # manually remove the intercept
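As a small illustration of the difference between the two parameterizations (a toy two-level factor, not your actual df):
ID <- factor(c("A", "A", "B", "B"))
model.matrix(~ ID)       # columns (Intercept) and IDB: IDB is the B-minus-A difference
model.matrix(~ 0 + ID)   # columns IDA and IDB: each coefficient is that group's own mean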
To answer your questions:
The result doesn't look very stable, as it keeps changing every time I run it.
You are not using very many iterations. Try running with more iterations. With enough iterations the results should be stable.
Moreover, to find the final offset for ID B I need to add the (Intercept) mean to the IDB mean; how does that work for the SD?
Again, I'm not sure what you mean by offset, but if you mean you want the difference in means between group A and group B, this is exactly what you have in the line beginning with IDB. That is, -0.9188 is the estimated difference in means between group B and group A, i.e. group B's mean is estimated to be 0.9188 smaller than group A's mean, and the SD is the posterior standard deviation of that difference.
If you are interested in group B's mean, then you are correct that you must add the (Intercept) to the IDB, but you cannot simply add the SDs. You have two options here: 1) use an appropriate design matrix (i.e. the one implied by a formula such as y ~ 0 + ID) that directly gives you the parameter estimates you want, or 2) obtain the MCMC samples and, at each iteration, calculate the sum of the (Intercept) and IDB parameters, then take the mean and standard deviation of these sums.
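As a sketch of option 2, assuming fit.by.ALL is the spLM fit above (spLM's regression coefficients can be recovered with spRecover(); the component name p.beta.recover.samples is from memory, so check str() of the recovered object on your version of spBayes):
library(spBayes)
rec  <- spRecover(fit.by.ALL, start = floor(0.5 * n.samples))   # recover beta samples, discarding burn-in
beta <- rec$p.beta.recover.samples              # MCMC matrix with columns (Intercept) and IDB
muB  <- beta[, "(Intercept)"] + beta[, "IDB"]   # group B's mean at each retained iteration
mean(muB)   # posterior mean for group B
sd(muB)     # posterior SD for group B (not the sum of the two SDs)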
Would it be better to run spLM separately for the two IDs (with y ~ 1 instead of y ~ ID)?
If you ran them separately, then you would be estimating the spatial parameters separately. If the spatial parameters differ between the two groups, running them separately makes a lot of sense. If they are the same (or similar), then it probably makes more sense to fit the two groups together so you can "borrow information" about the spatial parameters between them.


Calculating 95% confidence intervals for a weighted median over grouped data in dplyr

I have a dataset with several groups, where I want to calculate a median value for each group using dplyr. The data are weighted, and the weights need to be taken into account in calculating the median. I found the weighted.median function from spatstat which seems to work fine. Consider the following simplified example:
library(spatstat)   # for weighted.median()
library(dplyr)
tst <- data.frame(group = rep(c(1:5), each = 100))
tst$val <- runif(500) * tst$group
tst$wt  <- runif(500) * tst$val
tst %>%
  group_by(group) %>%
  summarise(weighted.median(val, wt))
# A tibble: 5 × 2
  group `weighted.median(val, wt)`
  <int>                      <dbl>
1     1                      0.752
2     2                      1.36
3     3                      1.99
4     4                      2.86
5     5                      3.45
However, I would also like to add 95% confidence intervals to these values, and this has me stumped. Things I've considered:
Spatstat also has a weighted.var function but there's no documentation, and it's not even clear to me whether this is variance around the median or mean.
This rcompanion post suggests various methods for calculating CIs around medians, but as far as I can tell none of them handle weights.
This blog post suggests a function for calculating CIs and a median for weighted data, and is the closest I can find to what I need. However, it doesn't work with my dplyr groupings. I suppose I could write a loop to do this one group at a time and build the output data frame, but that seems cumbersome. I'm also not totally sure I understand the function in the post and am slightly suspicious of its results: for instance, testing it out I get wider estimates for alpha = 0.1 than for alpha = 0.05, which seems backwards to me.
Edit to add: upon further investigation, I think this function works as intended if I use alpha = 0.95 for 95% CIs rather than alpha = 0.05 (at least, this returns values that feel intuitively about right). I can also make it work with dplyr by editing it to return just a single moe value rather than a pair of high/low estimates. So this may be a good option, but I'm also considering others.
Is there an existing function in some library somewhere that can do what I want, or an otherwise straightforward way to implement this?
There are several approaches.
You could use the asymptotic formula for standard error of the sample median. The sample median is asymptotically normal with standard error 1/sqrt(4 n f(m)) where n is the number of observations, m is the true median, and f(x) is the probability density of the (weighted) random variable. You could estimate the probability density using the base R function density.default with the weights argument. If x is the vector of observed values and w the corresponding vector of weights, then
library(spatstat)   # for weighted.median()
med <- weighted.median(x, w)
f <- density(x, weights = w / sum(w))    # density() expects weights that sum to 1
fmed <- approx(f$x, f$y, xout = med)$y   # estimated density at the median
samplesize <- length(x)
se <- 1 / sqrt(4 * samplesize * fmed)
ci <- med + c(-1, 1) * 1.96 * se
This relies on several asymptotic approximations so it may be inaccurate. Also the sample size depends on the interpretation of the weights. In some cases the sample size could be equal to sum(w).
If there is very little data in each group, you could use the even simpler normal reference approximation,
med <- weighted.median(x, w)
v <- weighted.var(x, w)        # weighted variance (from spatstat)
sdm <- sqrt(pi/2) * sqrt(v)    # SD of the median under a normal reference
samplesize <- length(x)
se <- sdm / sqrt(samplesize)
ci <- med + c(-1, 1) * 1.96 * se
Alternatively you could use bootstrapping: generate random resamples of the input data (by choosing random resamples of the indices 1, 2, ..., n), extract the corresponding weighted observations (x_i, w_i), compute the weighted median of each resampled dataset, and construct the 95% confidence interval from the quantiles of these medians; a sketch is given below.
(This approach implicitly assumes the sample size is equal to n.)
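A minimal sketch of that bootstrap within dplyr groups, using the question's tst data and spatstat's weighted.median(); the helper boot_ci() and the choice of B = 2000 resamples are my own, not anything prescribed above:
library(dplyr)
library(spatstat)   # for weighted.median() (newer spatstat versions ship it in a sub-package)
boot_ci <- function(x, w, B = 2000, level = 0.95) {
  meds <- replicate(B, {
    idx <- sample.int(length(x), replace = TRUE)   # resample the indices 1, 2, ..., n
    weighted.median(x[idx], w[idx])                # weighted median of the resample
  })
  alpha <- 1 - level
  quantile(meds, c(alpha / 2, 1 - alpha / 2))
}
tst %>%
  group_by(group) %>%
  group_modify(~ {
    ci <- boot_ci(.x$val, .x$wt)
    tibble(median = weighted.median(.x$val, .x$wt), lower = ci[1], upper = ci[2])
  })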

How to simulate a matrix based on both row and column parameters

I'm looking into simulating columns of normally distributed data whilst sticking to certain row-based parameters. Specifically, let's say I want to simulate 6 rows of data for 4 columns, where the final column is a sum of the previous 3 columns. Let's say I have the fourth column filled out, and I know what I want for means and standard deviations for the other three columns. Is there a way for me to simulate this?
For a visual representation, my question is essentially how can I fill out the blanks in the following table:
x          y          z          total
?          ?          ?          17.42
?          ?          ?          11.95
?          ?          ?          15.85
?          ?          ?          15.93
?          ?          ?          14.78
?          ?          ?          17.19
--------   --------   --------   --------
mean = 5   mean = 6   mean = 5   mean = 15.5
sd = 1.2   sd = 1.5   sd = 1.3   sd = 2
Simulating each column is of course simple enough with rnorm or something similar, but the row sums are then random, and it's important that I maintain control over the variance of the total column. The final column values don't actually need to be known if there's a way to simulate the 4 columns simultaneously, as long as the approximate mean and sd parameters are maintained, and that the 4th column is a sum of the first 3.
I've fiddled with different things such as mvrnorm or rnorm_multi, which allow me a degree of control over the correlation of the columns, but only indirect, unreliable influence over the variance of the final column, which, again, is a crucial factor.
Any ideas?
EDIT
A bit more of my process, with a brief example. If I simulate a dataset with three variables, e.g. x, y and z, I can make sure these variables stick to certain means and variances. A brief example with rnorm:
library(tibble)
library(dplyr)
dat <- tibble(x = rnorm(200, 5, 1.3),
              y = rnorm(200, 6, 1.5),
              z = rnorm(200, 5, 1.7))
dat2 <- dat %>%
  mutate(total = rowSums(dat))
var(dat2$total)
If you run this piece of code, you can see that the variance of the total column changes pretty considerably between sets of simulations. What I want is to be able to simulate data where I can specify the variance I want for the total column. My idea for this was to create the total column first and then somehow simulate the other columns (through something like rnorm), but giving it a row-wise parameter too. I might've been completely off track here; if so, I'll happily listen to other solutions.
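One way to get direct control over var(total), sketched under the assumption that drawing (x, y, z) jointly from a multivariate normal is acceptable: choose the covariance matrix so that the implied variance of x + y + z equals the target. Spreading the required covariance evenly over the three pairs is my own arbitrary choice, and note that the total's mean is then fixed at 5 + 6 + 5 = 16, so the 15.5 in the table cannot also hold exactly.
library(MASS)   # for mvrnorm()
mu     <- c(x = 5, y = 6, z = 5)
sds    <- c(1.2, 1.5, 1.3)
sd_tot <- 2
# var(x + y + z) = sum of the three variances + 2 * (sum of the pairwise covariances),
# so solve for the covariance needed and spread it evenly over the 3 pairs.
pair_cov <- (sd_tot^2 - sum(sds^2)) / 2 / 3
Sigma <- diag(sds^2)
Sigma[upper.tri(Sigma)] <- pair_cov
Sigma[lower.tri(Sigma)] <- pair_cov
dat <- as.data.frame(mvrnorm(200, mu = mu, Sigma = Sigma))
dat$total <- rowSums(dat)
var(dat$total)   # close to sd_tot^2 = 4, up to sampling noise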

R: Find cutoff point for a continuous variable to assign observations to two groups

I have the following data
Species <- c(rep('A', 47), rep('B', 23))
Value<- c(3.8711, 3.6961, 3.9984, 3.8641, 4.0863, 4.0531, 3.9164, 3.8420, 3.7023, 3.9764, 4.0504, 4.2305,
4.1365, 4.1230, 3.9840, 3.9297, 3.9945, 4.0057, 4.2313, 3.7135, 4.3070, 3.6123, 4.0383, 3.9151,
4.0561, 4.0430, 3.9178, 4.0980, 3.8557, 4.0766, 4.3301, 3.9102, 4.2516, 4.3453, 4.3008, 4.0020,
3.9336, 3.5693, 4.0475, 3.8697, 4.1418, 4.0914, 4.2086, 4.1344, 4.2734, 3.6387, 2.4088, 3.8016,
3.7439, 3.8328, 4.0293, 3.9398, 3.9104, 3.9008, 3.7805, 3.8668, 3.9254, 3.7980, 3.7766, 3.7275,
3.8680, 3.6597, 3.7348, 3.7357, 3.9617, 3.8238, 3.8211, 3.4176, 3.7910, 4.0617)
D <- data.frame(Species, Value)
I have the two species A and B and want to find out which is the best cutoff point for Value to determine the species.
I found the following question:
R: Determine the threshold that maximally separates two groups based on a continuous variable?
and followed the accepted answer to find the best value with the dose.p function from the MASS package. I have several similar values and it worked for them, but not for the one given above (which is also the reason why I needed to include all 70 observations here).
library(MASS)   # for dose.p()
D$Species_b <- ifelse(D$Species == "A", 0, 1)
my.glm <- glm(Species_b ~ Value, data = D, family = binomial)
dose.p(my.glm, p = 0.5)
gives me 3.633957 as threshold:
Dose SE
p = 0.5: 3.633957 0.1755291
This results in 45 correct assignments. However, if I look at the data, it is obvious that this is not the best value; by trial and error I found that 3.8 gives me 50 correct assignments, which is obviously better.
Why does the function work for other values, but not for this one? Am I missing an obvious mistake? Or is there maybe a different/better approach to solving my problem? I have several values I need to do this for, so I really do not want to just test values at random until I find the best one.
Any help would be greatly appreciated.
I would typically use a receiver operating characteristic (ROC) curve for this type of analysis. It gives a visual and numerical assessment of how the sensitivity and specificity of your cutoff change as you adjust the threshold, and lets you select the threshold at which overall accuracy is best. For example, using pROC:
library(pROC)
species_roc <- roc(D$Species, D$Value)
We can get a measure of how good a discriminator Value is for predicting Species by examining the area under the curve:
auc(species_roc)
#> Area under the curve: 0.778
plot(species_roc)
and we can find out the optimum cut-off threshold like this:
coords(species_roc, x = "best")
#> threshold specificity sensitivity
#> 1 3.96905 0.6170213 0.9130435
We see that this threshold correctly identifies 50 cases:
table(Actual = D$Species, Predicted = c("A", "B")[1 + (D$Value < 3.96905)])
#>       Predicted
#> Actual  A  B
#>      A 29 18
#>      B  2 21
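As a quick sanity check of that count, the overall accuracy at the reported threshold can be computed directly (same classification rule as in the table above):
pred <- c("A", "B")[1 + (D$Value < 3.96905)]
mean(pred == D$Species)   # 50/70 ≈ 0.714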

Inputting data to a matrix and predicting the response variable in r

So I'm having a problem combining a vector with a matrix
require(faraway)
x<-lm(lpsa~lcavol+lweight+age+svi+lcp+gleason+pgg45,prostate)
y<-model.matrix(x)
I have been given new data, that I need to predict lpsa with. So I was thinking that I could just add the data in using a vector and go about the regression analysis from there.
z<-c(1.44692,3.62301,65,.30010,0,-.79851,7,15)
rbind(y,z)
Not only does this give me 100 rows, but I'm not sure how to predict lpsa using this method. Can anybody give me advice?
Try:
require(faraway)
x <- lm(lpsa ~ lcavol + lweight + age + svi + lcp + gleason + pgg45, prostate)
z <- c(1.44692, 3.62301, 65, .30010, 0, -.79851, 7, 15)
z <- z[-length(z)]                     # drop the last value (15), which is not one of the predictors
names(z) <- names(x$coefficients)[-1]  # name the values after the model's predictors (drop the intercept)
z <- as.list(z)
predict(x, z)
       1
2.036906
Explanation: once you have created x, you use predict to predict lpsa for new values of your variables. You create a list z with as many variables as there are in the model (except lpsa, since that is what you wish to "find"). You then run the command, and 2.036906 is the predicted value of lpsa for the new variables. As for the last value of z (i.e. 15), I don't know what it is.
unlist(z) # this shows that z is coherent as age is 65 (only value that makes sense for it)
  lcavol  lweight      age      svi      lcp  gleason    pgg45
 1.44692  3.62301 65.00000  0.30010  0.00000 -0.79851  7.00000
If you want to know the coefficients calculated by the regression you can do:
coefficients(x)
 (Intercept)       lcavol      lweight          age          svi          lcp      gleason        pgg45
-0.130150643  0.577486444  0.576247172 -0.014687934  0.698386394 -0.100954503  0.055762175  0.004769619
If you want to be sure that predict is correct, do:
unname(sum(unlist(z)*coefficients(x)[-1])+coefficients(x)[1])
[1] 2.036906 # same estimated value for z
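Equivalently, a sketch using the same predictor values shown by unlist(z) above: you can pass a one-row data.frame as newdata, as long as the column names match the model's predictors.
newdat <- data.frame(lcavol = 1.44692, lweight = 3.62301, age = 65,
                     svi = 0.30010, lcp = 0, gleason = -0.79851, pgg45 = 7)
predict(x, newdata = newdat)                            # 2.036906, as before
predict(x, newdata = newdat, interval = "prediction")   # adds a 95% prediction interval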

Statistical inefficiency (block-averages)

I have a series of data; these are obtained through a molecular dynamics simulation, and are therefore sequential in time and correlated to some extent. I can calculate the mean as the average of the data, and I want to estimate the error associated with the mean calculated in this way.
According to this book I need to calculate the "statistical inefficiency", or roughly the correlation time for the data in the series. For this I have to divide the series into blocks of varying length and, for each block length (t_b), compute the variance of the block averages (v_b). Then, if the variance of the whole series is v_a (that is, v_b when t_b = 1), I have to obtain the limit, as t_b tends to infinity, of (t_b*v_b/v_a), and that is the inefficiency s.
Then the error in the mean is sqrt(v_a*s/N), where N is the total number of points. This means that only one in every s points is effectively uncorrelated.
I assume this can be done with R, and maybe there's some package that does it already, but I'm new to R. Can anyone tell me how to do it? I have already found out how to read the data series and calculate the mean and variance.
A data sample, as requested:
# t(ps) dH/dl(kJ/mol)
0.0000 582.228
0.0100 564.735
0.0200 569.055
0.0300 549.917
0.0400 546.697
0.0500 548.909
0.0600 567.297
0.0700 638.917
0.0800 707.283
0.0900 703.356
0.1000 685.474
0.1100 678.07
0.1200 687.718
0.1300 656.729
0.1400 628.763
0.1500 660.771
0.1600 663.446
0.1700 637.967
0.1800 615.503
0.1900 605.887
0.2000 618.627
0.2100 587.309
0.2200 458.355
0.2300 459.002
0.2400 577.784
0.2500 545.657
0.2600 478.857
0.2700 533.303
0.2800 576.064
0.2900 558.402
0.3000 548.072
... and this goes on until 500 ps. Of course, the data I need to analyze is the second column.
Suppose x is holding the sequence of data (e.g., data from your second column).
v <- var(x)      # variance of the whole series (v_a)
m <- mean(x)
n <- length(x)
si <- c()
for (t in seq(2, 1000)) {                  # t is the block length t_b
  nblocks <- floor(n / t)
  xg <- split(x[1:(nblocks * t)], factor(rep(1:nblocks, rep(t, nblocks))))
  v2 <- sum((sapply(xg, mean) - m)^2) / nblocks   # variance of the block averages (v_b)
  si <- c(si, t * v2 / v)                  # statistical inefficiency estimate t_b * v_b / v_a
}
plot(si)
For some of my own time-series data, the plot of si flattens out; you have your lower limit of t_b when the curve of si becomes approximately flat (slope = 0). See http://dx.doi.org/10.1063/1.1638996 as well.
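As a rough sketch of the final step (which plateau value to read off is your call; here I simply take the value at the largest block length as an illustration):
s_plateau <- si[length(si)]            # or read the plateau value off the plot of si
err_mean  <- sqrt(v * s_plateau / n)   # the question's formula: sqrt(v_a * s / N)
err_mean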
There are a couple of different ways to calculate the statistical inefficiency, or integrated autocorrelation time. The easiest, in R, is with the coda package. It has a function, effectiveSize, which gives you the effective sample size, i.e. the total number of samples divided by the statistical inefficiency. The asymptotic estimator for the standard deviation of the mean is sd(x)/sqrt(effectiveSize(x)).
require('coda')
n_eff <- effectiveSize(x)
se_mean <- sd(x) / sqrt(n_eff)   # asymptotic standard error of the mean
Well, it's never too late to contribute to a question, is it?
As I'm doing some molecular simulation myself, I stumbled upon this problem but did not see this thread earlier. I found that the method proposed by Allen & Tildesley seems a bit outdated compared to modern error-analysis methods. The rest of the book is good enough to be worth a look though.
While Sunhwan Jo's answer is correct concerning the block-averages method, for error analysis you can find other methods like the jackknife and bootstrap methods (closely related to one another) here: http://www.helsinki.fi/~rummukai/lectures/montecarlo_oulu/lectures/mc_notes5.pdf
In short, with the bootstrap method you make a series of random artificial samples from your data and calculate the value you want on each new sample. I wrote a short piece of Python code to work some data out (don't forget to import numpy):
import numpy

def Bootstrap(data):
    B = 100                        # arbitrary number of artificial samplings
    means = numpy.zeros(B)
    sizeB = data.shape[0] // 4     # arbitrary resample size, proportional to the size
                                   # of your sampling (assumes a numpy array)
    for n in range(B):
        for i in range(sizeB):
            # If data is a multi-column array you may have to also select the column
            # you use (e.g. data[idx, 1]), otherwise indexing with a single integer
            # returns a whole row. Check the doc.
            means[n] = means[n] + data[numpy.random.randint(0, high=data.shape[0])]
            # Assuming the desired value is the mean of the values;
            # any other calculation is fine.
        means[n] = means[n] / sizeB
    es = numpy.std(means, ddof=1)  # standard deviation of the bootstrap means
    return es
I know it can be upgraded but it's a first shot. With your data, I get the following:
Mean = 594.84368
Std = 66.48475
Statistical error = 9.99105
I hope this helps anyone stumbling across this problem in the statistical analysis of data. If I'm wrong about anything (first post, and I'm no mathematician), any correction is welcome.
