Ratio of polynomials approximation - r

I am trying to fit a polynomial to my dataset, which looks like this (full dataset is at the end of the post):
The theory predicts that the formulation of the curve is:
which looks like this (for x between 0 and 1):
When I try to make a linear model in R by doing:
mod <- lm(y ~ poly(x, 2, raw=TRUE)/poly(x, 2))
I get the following curve:
This is very different from what I expected. Do you have any idea how to fit a curve to this data so that it resembles the one the theory predicts? It should also have only one minimum.
Full dataset:
Vector of x values:
x <- c(0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.11, 0.12,
0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.20, 0.21, 0.22, 0.23, 0.24, 0.25,
0.26, 0.27, 0.28, 0.29, 0.30, 0.31, 0.32, 0.33, 0.34, 0.35, 0.36, 0.37, 0.38,
0.39, 0.40, 0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.50, 0.51,
0.52, 0.53, 0.54, 0.55, 0.56, 0.57, 0.58, 0.59, 0.60, 0.61, 0.62, 0.63, 0.64,
0.65, 0.66, 0.67, 0.68, 0.69, 0.70, 0.71, 0.72, 0.73, 0.74, 0.75, 0.76, 0.77,
0.78, 0.79, 0.80, 0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87, 0.88, 0.89, 0.90,
0.91, 0.92, 0.93, 0.94, 0.95)
Vector of y values:
y <- c(4.104, 4.444, 4.432, 4.334, 4.285, 4.058, 3.901, 4.382,
4.258, 4.158, 3.688, 3.826, 3.724, 3.867, 3.811, 3.550, 3.736, 3.591,
3.566, 3.566, 3.518, 3.581, 3.505, 3.454, 3.529, 3.444, 3.501, 3.493,
3.362, 3.504, 3.365, 3.348, 3.371, 3.389, 3.506, 3.310, 3.578, 3.497,
3.302, 3.530, 3.593, 3.630, 3.420, 3.467, 3.656, 3.644, 3.715, 3.698,
3.807, 3.836, 3.826, 4.017, 3.942, 4.208, 3.959, 3.856, 4.157, 4.312,
4.349, 4.286, 4.483, 4.599, 4.395, 4.811, 4.887, 4.885, 5.286, 5.422,
5.527, 5.467, 5.749, 5.980, 6.242, 6.314, 6.587, 6.790, 7.183, 7.450,
7.487, 8.566, 7.946, 9.078, 9.308, 10.267, 10.738, 11.922, 12.178, 13.243,
15.627, 16.308, 19.246, 22.022, 25.223, 29.752)

Use nls to fit a nonlinear model. Note that the model formula as displayed in the question is not uniquely defined: multiplying all the coefficients by any nonzero number gives the same predictions. To avoid this we need to fix one coefficient (here the constant term of the denominator is fixed at 9). A first try used the coefficients shown in the question as starting values (except for the fixed one), but that failed. So a model without C was fitted first, and the resulting coefficients were fed into a second fit that starts from C = 1.
st <- list(a = 43, b = -14, c = 25, B = 18)
fm <- nls(y ~ (a + b * x + c * x^2) / (9 + B * x), start = st)
fm2 <- nls(y ~ (a + b * x + c * x^2) / (9 + B * x + C * x^2), start = c(coef(fm), C = 1))
plot(y ~ x)
lines(fitted(fm2) ~ x, col = "red")
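To see why one coefficient has to be fixed, here is a minimal sketch using only base R. The helper name `ratio` and the coefficient values are illustrative assumptions (they reuse the starting values above, with C = 1). Scaling every coefficient by the same nonzero factor leaves the predictions unchanged, so a fit could not tell the two parameter sets apart.

```r
# Hypothetical helper: ratio of two quadratics with coefficient vector
# p = (a, b, c, A, B, C), i.e. (a + b*x + c*x^2) / (A + B*x + C*x^2).
ratio <- function(x, p) {
  (p[1] + p[2] * x + p[3] * x^2) / (p[4] + p[5] * x + p[6] * x^2)
}

xs <- seq(0.02, 0.95, by = 0.01)
p1 <- c(43, -14, 25, 9, 18, 1)   # illustrative values only, not a fitted result
p2 <- 2 * p1                     # every coefficient doubled

# Both parameter sets give the same curve
same <- isTRUE(all.equal(ratio(xs, p1), ratio(xs, p2)))
print(same)
```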
Note: Here is an example of using nls2 to get starting values with random search. We assume that the coefficients each lie between -50 and 50.
library(nls2)
set.seed(123) # for reproducibility
v <- c(a = 50, b = 50, c = 50, B = 50, C = 50)
st0 <- as.data.frame(rbind(-v, v))
fm0 <- nls2(y ~ (a + b * x + c * x^2) / (9 + B * x + C * x^2), start = st0,
  alg = "random", control = list(maxiter = 1000))
fm3 <- nls(y ~ (a + b * x + c * x^2) / (9 + B * x + C * x^2), start = coef(fm0))

Since you already have a theoretical prediction, you don't seem to need a new model; it's really only a plotting task:
png(); plot(y~x)
lines(x,mod,col="blue")  # mod must be the theoretical curve evaluated at x (a numeric vector), not the lm fit from the question
dev.off()
You cannot expect lm to produce a good approximation to a nonlinear problem. The denominator involving x in that theoretical expression makes the model inherently nonlinear.
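One way to see what happened in the lm call: inside an R model formula, `/` is not arithmetic division but the nesting operator, so `a/b` expands to `a + a:b`. A small sketch that inspects the expanded terms (the names `u` and `v` are placeholders, not variables from the question):

```r
# In a model formula, u / v means "u, plus v nested within u", i.e. u + u:v.
tl <- attr(terms(y ~ u / v), "term.labels")
print(tl)

# Arithmetic inside a formula has to be protected with I(). Even then lm
# stays linear in its coefficients, so it cannot estimate a denominator.
tl_protected <- attr(terms(y ~ I(u / v)), "term.labels")
print(tl_protected)
```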

Related

Optimise function on two criteria in R

I'm trying to optimise an exponential model by minimising the sum of squares, but I can't work out how to optimise over two parameters at once.
I need to find the values of "a" and "b" that minimise the output of the function. I have entered estimates in the code below. The code already prints the sum of squares, but I also need it to return the values of "a" and "b" at the optimum.
c <- c(0.08, 0.17, 0.25, 0.33, 0.41, 0.49, 0.57, 0.65, 0.73, 0.81, 0.88, 0.96, 1.04, 1.11, 1.19, 1.26)
my_fun <- function(a, b, c){
  predVar1 <- a * (1-exp(-c/b))
  sum((predVar1 - c)^2)
}
a <- 9
b <- 1.4
my_fun(a, b, c)
Thanks
I rearranged the function into the form optim expects: the parameters to optimise are passed as a single vector in the first argument. I also changed c to d, because I don't like shadowing the name of the c() function.
d <- c(0.08, 0.17, 0.25, 0.33, 0.41, 0.49, 0.57, 0.65, 0.73, 0.81, 0.88, 0.96, 1.04, 1.11, 1.19, 1.26)
my_fun <- function(parameters, d){
  a <- parameters[1]
  b <- parameters[2]
  predVar1 <- a * (1-exp(-d/b))
  return(sum((predVar1 - d)^2))
}
a <- 9
b <- 1.4
results <- optim(c(a, b), my_fun, d = d)
results$par
#[1] 700.8850 700.4793
results$value
#[1] 4.37461e-07
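A note on reading this result, offered as a hedged sketch rather than a definitive diagnosis: the objective compares the prediction with d itself, and for large b the model a * (1 - exp(-d/b)) is approximately (a/b) * d. Any large pair with a/b close to 1 therefore fits almost perfectly, which would explain why both estimates land near 700. The code below reuses the answer's function and only adds two checks (the name `ratio_ab` is introduced here for illustration).

```r
d <- c(0.08, 0.17, 0.25, 0.33, 0.41, 0.49, 0.57, 0.65, 0.73, 0.81,
       0.88, 0.96, 1.04, 1.11, 1.19, 1.26)

# Sum of squared differences between the model prediction and d itself
my_fun <- function(parameters, d) {
  a <- parameters[1]
  b <- parameters[2]
  predVar1 <- a * (1 - exp(-d / b))
  sum((predVar1 - d)^2)
}

results <- optim(c(9, 1.4), my_fun, d = d)

# If the explanation above holds, a/b should be close to 1
ratio_ab <- results$par[1] / results$par[2]
print(ratio_ab)

# 0 means optim reports convergence; 1 means the iteration limit was reached
print(results$convergence)
```

If the data being fitted are meant to differ from d, the observed values would need to be passed in separately and used in place of d in the squared difference.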

How to plot truncated distributions (truncdist) with fitdistrplus?

I am attempting to plot goodness of fit curves to truncated distributions from the fitdistrplus package using its plot function.
library(fitdistrplus)
library(truncdist)
library(truncnorm)
dataNum <- c(433.6668, 413.0450, 435.9952, 449.7559, 457.3629, 498.6187, 598.0335, 637.5611, 644.9193, 634.4843, 620.8676, 590.6622, 581.6411, 572.5022, 594.0925, 587.7293, 608.4948, 626.7594, 599.0286, 611.2966, 572.1749, 545.0071, 490.0298, 478.8484, 458.8293, 437.4878, 467.7026, 477.4094, 467.4182, 519.3056, 599.0155, 648.8603, 623.0672, 606.3737, 552.3653, 558.7612, 553.1345, 549.5961, 546.0578, 565.4582, 562.6825, 606.6225, 578.1584, 572.6201, 546.4735, 514.8147, 479.4638, 462.7702, 430.3652, 452.9671)
If I use the library(truncnorm) to fit a truncated normal distribution, everything works fine.
fit.dataNormTrunc2 <- fitdist(dataNum, "truncnorm", fix.arg=list(a=min(dataNum)), start = list(mean = mean(dataNum), sd = sd(dataNum)))
plot(fit.dataNormTrunc2)
However, if I try to use the truncdist package, only the histogram comparison plot prints without any of the other plots (e.g. qq-plot). I also get an error:
Error in qtNorm(p = c(0.01, 0.03, 0.05, 0.07, 0.09, 0.11, 0.13, 0.15, :
unused argument (p = c(0.01, 0.03, 0.05, 0.07, 0.09, 0.11, 0.13, 0.15, 0.17, 0.19, 0.21, 0.23, 0.25, 0.27, 0.29, 0.31, 0.33, 0.35, 0.37, 0.39, 0.41, 0.43, 0.45, 0.47, 0.49, 0.51, 0.53, 0.55, 0.57, 0.59, 0.61, 0.63, 0.65, 0.67, 0.69, 0.71, 0.73, 0.75, 0.77, 0.79, 0.81, 0.83, 0.85, 0.87, 0.89, 0.91, 0.93, 0.95, 0.97, 0.99))
The code used is:
dtNorm <- function(x, mean, sd) {
  dtrunc(x, "norm", mean, sd, a=min(dataNum), b=Inf)
}
ptNorm <- function(x, mean, sd) {
  ptrunc(x, "norm", mean, sd, a=min(dataNum), b=Inf)
}
qtNorm <- function(x, mean, sd) {
  qtrunc(x, "norm", mean, sd, a=min(dataNum), b=Inf)
}
fit.dataNormTrunc <- fitdist(dataNum, "tNorm", start = c(mean=mean(dataNum), sd=sd(dataNum)))
plot(fit.dataNormTrunc)
I have also tried the truncdist approach with the lognormal function, and again the other 3 plots don't print out and I get the same error about the argument not being used.
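A likely cause, judging from the error message: the plotting code calls the quantile function with an argument named p, but qtNorm above names its first argument x. Below is a minimal sketch of wrappers that follow the usual d/p/q naming convention (x, q, p). To stay self-contained it uses base R formulas for a left-truncated normal instead of truncdist, and `lower` is a stand-in for min(dataNum). The same renaming should apply to wrappers built on dtrunc, ptrunc and qtrunc.

```r
lower <- 413.045  # stand-in for min(dataNum)

# Density: first argument named x
dtNorm <- function(x, mean, sd) {
  tail_mass <- 1 - pnorm(lower, mean, sd)
  ifelse(x >= lower, dnorm(x, mean, sd) / tail_mass, 0)
}

# Distribution function: first argument named q
ptNorm <- function(q, mean, sd) {
  p_low <- pnorm(lower, mean, sd)
  ifelse(q >= lower, (pnorm(q, mean, sd) - p_low) / (1 - p_low), 0)
}

# Quantile function: first argument named p
qtNorm <- function(p, mean, sd) {
  p_low <- pnorm(lower, mean, sd)
  qnorm(p_low + p * (1 - p_low), mean, sd)
}

# Round trip: taking the quantile of a probability recovers the original value
round_trip <- qtNorm(p = ptNorm(q = 500, mean = 540, sd = 70), mean = 540, sd = 70)
print(round_trip)

# With fitdistrplus loaded, the calls from the question would stay the same:
# fit.dataNormTrunc <- fitdist(dataNum, "tNorm",
#                              start = list(mean = mean(dataNum), sd = sd(dataNum)))
# plot(fit.dataNormTrunc)
```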

Bayes Factor values in the R package BayesFactor

I've followed the instructions on how to run a Bayesian 't-test' using default priors in the BayesFactor package in R.
Some of the returned values are astronomical.
Here is an example comparison with a huge Bayes factor:
#install.packages('BayesFactor')
library(BayesFactor)
condition1 <- c(0.94, 0.9, 0.96, 0.74, 1, 0.98, 0.86, 0.92, 0.918367346938776,
0.96, 0.4, 0.816326530612245, 0.8, 0.836734693877551, 0.56, 0.66,
0.605263157894737, 0.836734693877551, 0.84, 0.9, 0.92, 0.714285714285714,
0.82, 0.5, 0.565217391304348, 0.8, 0.62)
condition2 <- c(0.34, 0.16, 0.23, 0.19, 0.71, 0.36, 0.02, 0.83, 0.11, 0.06,
0.27, 0.347368421052632, 0.21, 0.13953488372093, 0.11340206185567,
0.14, 0.142857142857143, 0.257731958762887, 0.15, 0.29, 0.67,
0.0515463917525773, 0.272727272727273, 0.0895522388059701, 0.0204081632653061,
0.13, 0.0612244897959184)
bf = ttestBF(x = condition1, condition2, paired = TRUE)
bf
This returns:
Bayes factor analysis
--------------
[1] Alt., r=0.707 : 144035108289 ±0%
Against denominator:
Null, mu = 0
---
Bayes factor type: BFoneSample, JZS
For the most part the comparisons range from below 1 up to a few hundred. But I'm concerned that this value (144035108289!) is indicative of something erroneous on my part.
FYI: the p-value in the null-hypothesis test on the same data as above = 4.649279e-14.
Any assurances or insights into this returned BF would be much appreciated.
I calculated the BF using manual input of t-value and sample size like this using the same package:
exp(ttest.tstat(t=14.63, n1=27, rscale = 0.707)[['bf']])
It gives the same BF, so the returned value appears to be legitimate. The size of the BF seems to be largely due to the very large t-value (14.63) at this sample size (n = 27).
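As a cross-check in base R (a sketch, not part of the BayesFactor workflow), the paired t statistic behind this Bayes factor can be computed directly from the two vectors in the question. The names `tt`, `t_stat` and `df` are introduced here for illustration.

```r
condition1 <- c(0.94, 0.9, 0.96, 0.74, 1, 0.98, 0.86, 0.92, 0.918367346938776,
  0.96, 0.4, 0.816326530612245, 0.8, 0.836734693877551, 0.56, 0.66,
  0.605263157894737, 0.836734693877551, 0.84, 0.9, 0.92, 0.714285714285714,
  0.82, 0.5, 0.565217391304348, 0.8, 0.62)
condition2 <- c(0.34, 0.16, 0.23, 0.19, 0.71, 0.36, 0.02, 0.83, 0.11, 0.06,
  0.27, 0.347368421052632, 0.21, 0.13953488372093, 0.11340206185567,
  0.14, 0.142857142857143, 0.257731958762887, 0.15, 0.29, 0.67,
  0.0515463917525773, 0.272727272727273, 0.0895522388059701, 0.0204081632653061,
  0.13, 0.0612244897959184)

# Classical paired t-test on the same data
tt <- t.test(condition1, condition2, paired = TRUE)
t_stat <- unname(tt$statistic)   # t statistic of the paired differences
df <- unname(tt$parameter)       # degrees of freedom (number of pairs minus 1)
print(c(t = t_stat, df = df, p = tt$p.value))
```

The t statistic printed here is the value that would be passed as `t` to ttest.tstat, with n1 equal to the number of pairs.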

Interpolate within points in a vector

Vector V1 contains 56 observations for X, and vector BS contains a bootstrapped sample of V1 of length 100000. I would like to interpolate linearly within points in BS to fill in any missing values. For example, V1 contains no 0.27 values, and hence neither does BS. But BS would contain a few 0.28 and 0.26. I would like the interpolation to create a few 0.27 values and add those to BS. And so on for any missing values within the two extremes in the vector.
V1 <- c(0.18, 0.2, 0.24, 0.35, -0.22, -0.17, 0.28, -0.28, -0.14, 0.03, 0.87, -0.2, 0.06, -0.1, -0.72, 0.18, 0.01, 0.31, -0.36, 0.61, -0.16, -0.07, -0.13, 0.01, -0.09, 0.26, -0.14, 0.08, -0.62, -0.2, 0.3, -0.21, -0.11, 0.05, 0.06, -0.28, -0.27, 0.17, 0.42, -0.05, -0.15, 0.05, -0.07, -0.22, -0.34, 0.16, 0.34, 0.1, -0.12, 0.24, 0.45, 0.37, 0.61, 0.9, -0.25, 0.02)
BS <- sample(V1, 100000, replace=TRUE)
The approxfun functions do not seem to help, as they are for interpolating between data sets. I have found a few questions/answers covering interpolation between different data sets, but not within a single data set. Thank you for your help.
EDIT: please note I do not want to fit a normal distribution (or any other) to create those points.
You can use approx() (or approxfun()) to do this by treating BS as the y-coordinate and using sequential x-coordinates:
set.seed(1L); BS <- sample(V1,1e5L,T);
res <- approx(seq_along(BS),BS,n=length(BS)*2L-1L)$y;
The specification of n here is important. It ensures that exactly one interpolated value will be produced halfway between each adjacent pair of input values.
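A tiny worked example of that choice of n (the vector `v` is made up for illustration): with three inputs, n = 2 * 3 - 1 = 5 output points place exactly one interpolated value at the midpoint of each adjacent pair.

```r
v <- c(0.26, 0.28, 0.30)  # illustrative input, not taken from BS

# Sequential x-coordinates 1, 2, 3; output grid is 1, 1.5, 2, 2.5, 3
res_small <- approx(seq_along(v), v, n = length(v) * 2L - 1L)$y
print(res_small)  # 0.26 0.27 0.28 0.29 0.30
```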
Here's a plot of an excerpt of the result, centered around the first occurrence of an adjacent pair of 0.26 and 0.28:
i <- which(BS[-length(BS)]==0.26 & BS[-1L]==0.28)[1L];
j <- i*2L-1L;
xlim <- c(j-6L,j+8L);
ylim <- c(-1,1);
xticks <- seq(xlim[1L],xlim[2L]);
yticks <- seq(ylim[1L],ylim[2L],0.05);
plot(NA,xlim=xlim,ylim=ylim,xlab='res index',ylab='y',axes=F,xaxs='i',yaxs='i');
abline(v=xticks,col='lightgrey');
abline(h=yticks,col='lightgrey');
axis(1L,xticks,cex.axis=0.7);
axis(2L,yticks,sprintf('%.02f',round(yticks,2L)),las=1L,cex.axis=0.7);
x <- seq(xlim[1L],xlim[2L],2L); y <- BS[seq(i-3L,len=8L)];
points(x,y,pch=16L,col='red',xpd=NA);
x <- seq(xlim[1L],xlim[2L]); y <- res[x];
points(x,y,pch=4L,cex=1.2,col='blue',xpd=NA);
text(x+0.24,y+0.03,y,cex=0.7,xpd=NA);
legend(xlim[1L]+1.5,0.87,c('input value','interpolated'),col=c('red','blue'),pch=c(16L,4L));

How to implement the bootstrap in R

I posted a thread about this problem, but it was put on hold, so I have rephrased it as a programming question. My code is below. I am trying to find the simulated confidence level of a sample using the bootstrap.
# Step One: Generating the data from lognormal distribution
MC <-1000; # Number of samples to simulate
xbar = c(1:MC);
mu = 1;
sigma= 1.5;
the_mean <- exp(mu+sigma^2/2);
n= 10;
for(i in 1:MC)
{
mySample <- rlnorm(n=n meanlog=mu, sdlog=sigma);
xbar [i] <- the_mean(mySample);
}
# Step Two: Compute 95% Bootstrap CI with B=1000
B = 1000
xbar_star = c(1:B)
for(b in 1:B)
{
x_star = sample(n,n, replace=TRUE)
xbar_star[b] = mean(x_star)
}
quantile(xbar, p=c(0.025, 0.975))
If you run this code you can see that the output is 975.025 when it should actually be 0.90.
I don't understand why my output is wrong.
We aren't trying to find the confidence interval, but the simulated confidence level: how does the actual coverage percentage (obtained through simulation) compare with the nominal confidence level (which is 95%)? This is my code from a practice problem in which the sample was given:
library(boot)
x = c(0.22, 0.23, 0.26, 0.27, 0.28, 0.28, 0.29,
0.33, 0.34, 0.35, 0.38, 0.39, 0.39, 0.42, 0.42,
0.43, 0.45, 0.46, 0.48, 0.5, 0.5, 0.51, 0.52,
0.54, 0.56, 0.56, 0.57, 0.57, 0.6, 0.62, 0.63,
0.67, 0.69, 0.72, 0.74, 0.76, 0.79, 0.81, 0.82,
0.84, 0.89, 1.11, 1.13, 1.14, 1.14, 1.2, 1.33)
B = 10000
xbar = mean(x)
n = length(x)
xbar_star = c(1:B)
for(b in 1:B)
{
  x_star = sample(x=x, size=n, replace=TRUE)
  xbar_star[b] = mean(x_star)
}
# empirical percentile method
quantile(xbar_star, p=c(0.025, 0.975))
> quantile(xbar_star, p=c(0.025, 0.975))
2.5% 97.5%
0.5221277 0.6797926
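A possible explanation for the 975.025 in the first listing, followed by a hedged sketch of the coverage simulation. The value 975.025 is exactly the 97.5% quantile of 1:1000, which suggests xbar still held its initial values c(1:MC) because the first loop never ran: the rlnorm call is missing a comma, and the_mean is a number, not a function. The second loop also resamples the integer n rather than the data. The sketch below is one way to estimate the simulated confidence level, assuming the percentile bootstrap and the true mean exp(mu + sigma^2/2). The seed and the reduced MC and B are choices made here only to keep the run short.

```r
set.seed(1)          # arbitrary seed, for reproducibility only
MC    <- 500         # number of simulated samples (reduced for speed)
B     <- 500         # bootstrap resamples per sample (reduced for speed)
n     <- 10
mu    <- 1
sigma <- 1.5
true_mean <- exp(mu + sigma^2 / 2)   # mean of the lognormal distribution

covered <- logical(MC)
for (i in seq_len(MC)) {
  my_sample <- rlnorm(n = n, meanlog = mu, sdlog = sigma)
  # Bootstrap: resample the data (not 1:n) and record each resample mean
  xbar_star <- replicate(B, mean(sample(my_sample, size = n, replace = TRUE)))
  ci <- quantile(xbar_star, probs = c(0.025, 0.975))
  # Does this percentile interval contain the true mean?
  covered[i] <- ci[1] <= true_mean && true_mean <= ci[2]
}

coverage <- mean(covered)   # proportion of intervals containing the true mean
print(coverage)

# The value reported in the question is the 97.5% quantile of 1:1000
q_check <- unname(quantile(1:1000, 0.975))
print(q_check)
```

The printed coverage is the simulated confidence level, which can then be compared with the nominal 95%.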

Resources