Interpolate within points in a vector - r

Vector V1 contains 56 observations for X, and vector BS contains a bootstrapped sample of V1 of length 100000. I would like to interpolate linearly within points in BS to fill in any missing values. For example, V1 contains no 0.27 values, and hence neither does BS. But BS would contain a few 0.28 and 0.26. I would like the interpolation to create a few 0.27 values and add those to BS. And so on for any missing values within the two extremes in the vector.
V1 <- c(0.18, 0.2, 0.24, 0.35, -0.22, -0.17, 0.28, -0.28, -0.14, 0.03, 0.87, -0.2, 0.06, -0.1, -0.72, 0.18, 0.01, 0.31, -0.36, 0.61, -0.16, -0.07, -0.13, 0.01, -0.09, 0.26, -0.14, 0.08, -0.62, -0.2, 0.3, -0.21, -0.11, 0.05, 0.06, -0.28, -0.27, 0.17, 0.42, -0.05, -0.15, 0.05, -0.07, -0.22, -0.34, 0.16, 0.34, 0.1, -0.12, 0.24, 0.45, 0.37, 0.61, 0.9, -0.25, 0.02)
BS <- sample(V1, 100000, replace=TRUE)
The approxfun functions do not help as are for interpolating within data sets. Have found a few questions/answers covering interpolating within different data sets, but not within one data set. Thank you for your help.
EDIT: please note I do not want to fit a normal distribution (or any other) to create those points.

You can use approx() (or approxfun()) to do this by treating BS as the y-coordinate and using sequential x-coordinates:
set.seed(1L); BS <- sample(V1,1e5L,T);
res <- approx(seq_along(BS),BS,n=length(BS)*2L-1L)$y;
The specification of n here is important. It ensures that exactly one interpolated value will be produced halfway between each adjacent pair of input values.
Here's a plot of an excerpt of the result, centered around the first occurrence of an adjacent pair of 0.26 and 0.28:
i <- which(BS[-length(BS)]==0.26 & BS[-1L]==0.28)[1L];
j <- i*2L-1L;
xlim <- c(j-6L,j+8L);
ylim <- c(-1,1);
xticks <- seq(xlim[1L],xlim[2L]);
yticks <- seq(ylim[1L],ylim[2L],0.05);
plot(NA,xlim=xlim,ylim=ylim,xlab='res index',ylab='y',axes=F,xaxs='i',yaxs='i');
abline(v=xticks,col='lightgrey');
abline(h=yticks,col='lightgrey');
axis(1L,xticks,cex.axis=0.7);
axis(2L,yticks,sprintf('%.02f',round(yticks,2L)),las=1L,cex.axis=0.7);
x <- seq(xlim[1L],xlim[2L],2L); y <- BS[seq(i-3L,len=8L)];
points(x,y,pch=16L,col='red',xpd=NA);
x <- seq(xlim[1L],xlim[2L]); y <- res[x];
points(x,y,pch=4L,cex=1.2,col='blue',xpd=NA);
text(x+0.24,y+0.03,y,cex=0.7,xpd=NA);
legend(xlim[1L]+1.5,0.87,c('input value','interpolated'),col=c('red','blue'),pch=c(16L,4L));

Related

Optimise function on two criteria in R

I'm trying to optimise an exponential model by minimising the sum of squares, however I can't work out how to optimise using two separate criteria.
I need to find values for "a" and "b" that minimise the output of the function. I have entered estimates in the code below, but need the output of this code to tell me the sum of squares (already printing), but also values for "a" and "b" when the model was correctly optimised.
c <- c(0.08, 0.17, 0.25, 0.33, 0.41, 0.49, 0.57, 0.65, 0.73, 0.81, 0.88, 0.96, 1.04, 1.11, 1.19, 1.26)
my_fun <- function(a, b, c){
predVar1 <- a * (1-exp(-c/b))
sum((predVar1 - c)^2)
}
a <- 9
b <- 1.4
my_fun(a, b, c)
Thanks
Did some rearranging based on how I'm familiar with using optim. Also, I changed c to d because I don't like messing with accidentally overwriting the c() function.
d <- c(0.08, 0.17, 0.25, 0.33, 0.41, 0.49, 0.57, 0.65, 0.73, 0.81, 0.88, 0.96, 1.04, 1.11, 1.19, 1.26)
my_fun <- function(parameters, d){
a <- parameters[1]
b <- parameters[2]
predVar1 <- a * (1-exp(-d/b))
return(sum((predVar1 - d)^2))
}
a <- 9
b <- 1.4
results <- optim(c(a, b), my_fun, d = d)
results$par
#[1] 700.8850 700.4793
results$value
#[1] 4.37461e-07

Bayes Factor values in the R package BayesFactor

I've followed the instructions on how to run a Bayesian 't-test' using default priors in the BayesFactor package in R.
Some of the returned values are astronomical.
Here is an example comparison with a huge Bayes factor:
#install.packages('BayesFactor')
library(BayesFactor)
condition1 <- c(0.94, 0.9, 0.96, 0.74, 1, 0.98, 0.86, 0.92, 0.918367346938776,
0.96, 0.4, 0.816326530612245, 0.8, 0.836734693877551, 0.56, 0.66,
0.605263157894737, 0.836734693877551, 0.84, 0.9, 0.92, 0.714285714285714,
0.82, 0.5, 0.565217391304348, 0.8, 0.62)
condition2 <- c(0.34, 0.16, 0.23, 0.19, 0.71, 0.36, 0.02, 0.83, 0.11, 0.06,
0.27, 0.347368421052632, 0.21, 0.13953488372093, 0.11340206185567,
0.14, 0.142857142857143, 0.257731958762887, 0.15, 0.29, 0.67,
0.0515463917525773, 0.272727272727273, 0.0895522388059701, 0.0204081632653061,
0.13, 0.0612244897959184)
bf = ttestBF(x = condition1, condition2, paired = TRUE)
bf
This returns:
Bayes factor analysis
--------------
[1] Alt., r=0.707 : 144035108289 ±0%
Against denominator:
Null, mu = 0
---
Bayes factor type: BFoneSample, JZS
For the most part the comparisons range from below 1 up to a few hundred. But I'm concerned that this value (144035108289!) is indicative of something erroneous on my part.
FYI: the p-value in the null-hypothesis test on the same data as above = 4.649279e-14.
Any assurances or insights into this returned BF would be much appreciated.
I calculated the BF using manual input of t-value and sample size like this using the same package:
exp(ttest.tstat(t=14.63, n1=27, rscale = 0.707)[['bf']])
It gives the same BF. It seems this is largely due to a relatively big sample size (27). The returned BF appears to be on the up-and-up.

Trying to fit f distribution to a vector

Would anyone know why the following code fails to execute fitdist with error "the function mle failed to estimate the parameters, with the error code 100".
I have encountered this error in the past when working with the normal distribution; the solution in that case was increasing the variance of the vector (by multiplying it by say 100), but that does not help on this case. Please note all elements in the vector are positive. Thank you.
library(fitdistrplus)
VH <- c(0.36, 0.3, 0.36, 0.47, 0, 0.05, 0.4, 0, 0, 0.15, 0.89, 0.03, 0.45, 0.21, 0, 0.18, 0.04, 0.53, 0, 0.68, 0.06, 0.09, 0.58, 0.03, 0.23, 0.27, 0, 0.12, 0.12, 0, 0.32, 0.07, 0.04, 0.07, 0.39, 0, 0.25, 0.28, 0.42, 0.55, 0.04, 0.07, 0.18, 0.17, 0.06, 0.39, 0.65, 0.15, 0.1, 0.32, 0.52, 0.55, 0.71, 0.93, 0, 0.36)
f <- fitdist(na.exclude(VH),"f", start =list(df1=1, df2=2))
The error you get here is actually somewhat informative:
simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data, ddistnam = ddistname, hessian = TRUE, method = meth, lower = lower, upper = upper, ...): function cannot be evaluated at initial parameters
Error in fitdist(na.exclude(VH), "f", start = list(df1 = 1, df2 = 2)) :
the function mle failed to estimate the parameters,
with the error code 100
That means something went wrong right away, not in the middle of the optimization process.
Taking a guess, I looked and saw that there was a zero value in your data (so your statement that all the elements are positive is not technically correct -- they're all non-negative ...). The F distribution has an infinite value at 0: df(0,1,2) is Inf.
If I exclude the zero value, I get an answer ...
f <- fitdist(na.exclude(VH[VH>0]),"f", start =list(df1=1, df2=2))
... the estimated value for the second shape parameter is very large (approx. 6e6, with a big uncertainty), but seems to fit OK ...
par(las=1); hist(VH,freq=FALSE,col="gray")
curve(df(x,1.37,6.45e6),add=TRUE)

Ratio of polynomials approximation

I am trying to fit a polynomial to my dataset, which looks like that (full dataset is at the end of the post):
The theory predicts that the formulation of the curve is:
which looks like this (for x between 0 and 1):
When I try to make a linear model in R by doing:
mod <- lm(y ~ poly(x, 2, raw=TRUE)/poly(x, 2))
I get the following curve:
Which is much different from what I would expect. Have you got any idea how to fit a new curve from this data so that it would be similar to the one, which theory predicts? Also, it should have only one minimum.
Full dataset:
Vector of x values:
x <- c(0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.11, 0.12,
0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.20, 0.21, 0.22, 0.23, 0.24, 0.25,
0.26, 0.27, 0.28, 0.29, 0.30, 0.31, 0.32, 0.33, 0.34, 0.35, 0.36, 0.37, 0.38,
0.39, 0.40, 0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.50, 0.51,
0.52, 0.53, 0.54, 0.55, 0.56, 0.57, 0.58, 0.59, 0.60, 0.61, 0.62, 0.63, 0.64,
0.65, 0.66, 0.67, 0.68, 0.69, 0.70, 0.71, 0.72, 0.73, 0.74, 0.75, 0.76, 0.77,
0.78, 0.79, 0.80, 0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87, 0.88, 0.89, 0.90,
0.91, 0.92, 0.93, 0.94, 0.95)
Vector of y values:
y <- c(4.104, 4.444, 4.432, 4.334, 4.285, 4.058, 3.901, 4.382,
4.258, 4.158, 3.688, 3.826, 3.724, 3.867, 3.811, 3.550, 3.736, 3.591,
3.566, 3.566, 3.518, 3.581, 3.505, 3.454, 3.529, 3.444, 3.501, 3.493,
3.362, 3.504, 3.365, 3.348, 3.371, 3.389, 3.506, 3.310, 3.578, 3.497,
3.302, 3.530, 3.593, 3.630, 3.420, 3.467, 3.656, 3.644, 3.715, 3.698,
3.807, 3.836, 3.826, 4.017, 3.942, 4.208, 3.959, 3.856, 4.157, 4.312,
4.349, 4.286, 4.483, 4.599, 4.395, 4.811, 4.887, 4.885, 5.286, 5.422,
5.527, 5.467, 5.749, 5.980, 6.242, 6.314, 6.587, 6.790, 7.183, 7.450,
7.487, 8.566, 7.946, 9.078, 9.308, 10.267, 10.738, 11.922, 12.178, 13.243,
15.627, 16.308, 19.246, 22.022, 25.223, 29.752)
Use nls to fit a nonlinear model. Note that the model formula is not uniquely defined as displayed in the question since if we multiply all the coefficients by any number the result will still give the same predictions. To avoid this we need to fix one coefficient. A first try used the coefficients shown in the question as starting values (except fixing one) but that failed so dropping C was tried and the resulting coefficients fed into a second fit with C = 1.
st <- list(a = 43, b = -14, c = 25, B = 18)
fm <- nls(y ~ (a + b * x + c * x^2) / (9 + B * x), start = st)
fm2 <- nls(y ~ (a + b * x + c * x^2) / (9 + B * x + C * x^2), start = c(coef(fm), C = 1))
plot(y ~ x)
lines(fitted(fm2) ~ x, col = "red")
(continued after chart)
Note: Here is an example of using nls2 to get starting values with random search. We assume that the coefficients each lie between -50 and 50.
library(nls2)
set.seed(123) # for reproducibility
v <- c(a = 50, b = 50, c = 50, B = 50, C = 50)
st0 <- as.data.frame(rbind(-v, v))
fm0 <- nls2(y ~ (a + b * x + c * x^2) / (9 + B * x + C * x^2), start = st0,
alg = "random", control = list(maxiter = 1000))
fm3 <- nls(y ~ (a + b * x + c * x^2) / (9 + B * x + C * x^2), st = coef(fm0))
Since you already have a theoretic prediction, you don't seem in need of a new model, and it's really only a plotting task:
png(); plot(y~x)
lines(x,mod,col="blue")
dev.off()
You cannot expect lm to produce a good approximation to a non-linear problem. The denominator involving x in that theoretic expression makes this inherently nonlinear.

How to implement the bootstrap in R

So I posted a thread about this problem, but it got on hold. So I rephrased so it can be it a programming question. This is my code below. I am trying to find the stimulated confidence level of a sample using the bootstrap.
# Step One: Generating the data from lognormal distribution
MC <-1000; # Number of samples to simulate
xbar = c(1:MC);
mu = 1;
sigma= 1.5;
the_mean <- exp(mu+sigma^2/2);
n= 10;
for(i in 1:MC)
{
mySample <- rlnorm(n=n meanlog=mu, sdlog=sigma);
xbar [i] <- the_mean(mySample);
}
# Step Two: Compute 95% Bootstrap CI with B=1000
B = 1000
xbar_star = c(1:B)
for(b in 1:B)
{
x_star = sample(n,n, replace=TRUE)
xbar_star[b] = mean(x_star)
}
quantile(xbar, p=c(0.025, 0.975))
If you implement this code you can see that the output is 975.025 when it should actually be 0. 90.
I don't understand why my output is wrong.
We arent trying to find the Confidence Interval, but the stimulated Confidence Level. How does the actual coverage percentage (obtained through simulation) compare with the nominal confidence level (which is 95%)? This is my code when my samples were given in a practice problem...
library(boot)
x = c(0.22, 0.23, 0.26, 0.27, 0.28, 0.28, 0.29,
0.33, 0.34, 0.35, 0.38, 0.39, 0.39, 0.42, 0.42,
0.43, 0.45, 0.46, 0.48, 0.5, 0.5, 0.51, 0.52,
0.54, 0.56, 0.56, 0.57, 0.57, 0.6, 0.62, 0.63,
0.67, 0.69, 0.72, 0.74, 0.76, 0.79, 0.81, 0.82,
0.84, 0.89, 1.11, 1.13, 1.14, 1.14, 1.2, 1.33)
B = 10000
xbar = mean(x)
n = length(x)
xbar_star = c(1:B)
for(b in 1:B)
{
x_star = sample(x=x, size=n, replace=TRUE)
xbar_star[b] = mean(x_star)
}
# empirical percentile method
quantile(xbar_star, p=c(0.025, 0.975))
> quantile(xbar_star, p=c(0.025, 0.975))
2.5% 97.5%
0.5221277 0.6797926

Resources