how to perform nls fro three matrix? - r

If we have these matrix as example:
x=matrix(runif(50 ),5,5)
z=matrix(runif(60 ),5,5)
y=exp(0.5*x+0.2*z+0.3)^1
the values in these three matrix change form one day to another so I have two years (daily,x1,x2,x3,etc)of these matrix(speaking of my real data)
to fit this equation y=exp(s1x+s2x+s3)^s4
use normal Nls as:
fit <- nls(y ~ exp(s1*x+s2*z+s3)^s4,
data = data,
start = list(s1 = 0.5, s2 = 0.2, s3 = 0.3,s4=1))
but my problem is that I want to do this fit for all corresponding values in three matrix for example start taking values from:
> use all values of x1[1,1], x2[1,1],.....etc
[1] 0.3617776 .......etc
> and the corresponding form z1[1,1], z2[1,1].....etc
[1] 0.5544851 .......etc
> y1[1,1], y2[1,1],etc .....
[1] 1.807213 .......etc
Find the fit parameters then do the same for all other corresponding values. Finally I get a matrix with, for example, values of s1, another matrix for s2 etc….
Any help is appreciated(Note: I have to use it as it is and not to linearise it or use lm.fit or lm)

I think the x,y,z as matrices are throwing you off. As stated by MrFlick in the comment you need multiple observations to perform the regression. Here is an example of nlm applied to a simple 3d normal surface with some noise.
You need a n-by-3 matrix of observations in the 3d space. Here I make some noisy points along a Gaussian curve and rotate it to form a surface.
This example uses the rgl package.
library(rgl)
z_rotate <- function(mat,rads){
rot.mat <- matrix(c(cos(rads),-sin(rads),0,
sin(rads),cos(rads),0,
0, 0, 1),nrow=3,ncol=3)
mat %*% rot.mat
}
x <- seq(-2,2,0.1)
x <- x + rnorm(length(x),sd=0.5)
y <- seq(-2,2,0.1)
#y <- y + rnorm(length(y),sd=0.5)
z <- 0.5*exp(-x^2/2 + (-y^2/2)) + rnorm(1,sd=0.6)
m <- as.matrix(cbind(x,y,z))
now the matrix m is just the initial points about the normal curve.
points <- Reduce(rbind,lapply(1:8,function(n){z_rotate(m,n*(pi/8))}))
colnames(points) <- c("x","y","z")
The previous command just calls the rotate function then appends the results as new rows in the matrix.
Now the points matrix represents the observations to be approximated with our nlm function.
> dim(points)
[1] 328 3
Now we can view these points in 3d with RGL.
plot3d(points[,1],points[,2],points[,3],type='s',size=0.3)
Use nlm to fit a function that approximates this data.
fit <- nls(z ~ s1*exp(-s2*x^2 + (-s3*y^2)) +s4,
data = data.frame(points),
start = list(s1 = 0.3, s2 = 0.3, s3 = 0.3, s4=0.3),
#trace=TRUE,
control = list(warnOnly=TRUE))
It produces this fit:
> fit
Nonlinear regression model
model: z ~ s1 * exp(-s2 * x^2 + (-s3 * y^2)) + s4
data: data.frame(points)
s1 s2 s3 s4
0.5000 0.5000 0.5000 -0.1084
residual sum-of-squares: 1.493e-31
Number of iterations till stop: 50
Achieved convergence tolerance: 0.05046
Reason stopped: number of iterations exceeded maximum of 50
We can see that it recovers the values we had, namely 0.5 for the coefficients when we compare the formulas.
z <- 0.5*exp(-x^2/2 + (-y^2/2)) + rnorm(1,sd=0.6)
z ~ s1*exp(-s2*x^2 + (-s3*y^2)) +s4
I know its a toy example but it might give you some ideas as to how to set up your data. I think you will need to structure your data in an n-by-3 observations matrix to be able to do your regression with nlm.

Related

How can I use cubic splines for extrapolation?

I am looking to use natural cubic splines to interpolate between some data points using stats::splinefun(). The documentation states:
"These interpolation splines can also be used for extrapolation, that is prediction at points outside the range of ‘x’. Extrapolation makes little sense for ‘method = "fmm"’; for natural splines it is linear using the slope of the interpolating curve at the nearest data point."
I have attempted to replicate the spline function in Excel as a review, which is working fine except that I can't replicate the extrapolation approach. Example data and code below:
library(stats)
# Example data
x <- c(1,2,3,4,5,6,7,8,9,10,12,15,20,25,30,40,50)
y <- c(7.1119,5.862,5.4432,5.1458,4.97,4.8484,4.7726,4.6673,4.5477,4.437,4.3163,4.1755,4.0421,3.9031,3.808,3.6594,3.663)
df <- data.frame(x,y)
# Create spline functions
splinetest <- splinefun(x = df$x, y = df$y, method = "natural")
# Create dataframe of coefficients
splinetest_coef <- environment(splinetest)$z
splinetest_coefdf <- data.frame(i = 0:16, x = splinecoef_inf$x, a = splinecoef_inf$y, b = splinecoef_inf$b, c = splinecoef_inf$c, d = splinecoef_inf$d)
# Calculate extrapolated value at 51
splinetest(51)
# Result:
# [1] 3.667414
Question: How is this result calculated?
Expected result using linear extrapolation from x = 40 and x = 50 is 3.663 + (51 - 50) x (3.663 - 3.6594) / (50 - 40) = 3.66336
The spline coefficients are as follows at i = 50: a = 3.663 and b = 0.00441355...
Therefore splinetest(51) is calculated as 3.663 + 0.0441355
How is 0.0441355 calculated in this function?
Linear extrapolation is not done by computing the slope between a particular pair of points, but by using the estimated derivatives at the boundary ("closest point" in R's documentation). The derivatives at any point can be calculated directly from the spline function, e.g. to calculate the estimated first derivative at the upper boundary:
splinetest(max(df$x), deriv = 1)
[1] 0.004413552
This agrees with your manual back-calculation of the slope used to do the extrapolation.
As pointed out in the comments, plotting the end of the curve/data set with curve(splinetest, from = 30, to = 60); points(x,y) illustrates clearly the difference between the derivative at the boundary (x=50) and the line based on the last two data points (i.e. (y(x=50) - y(x=40))/10)

Ridge regression within a loop

I am new in coding, so I still struggle with simple things as loops, subsetting, and data frame vs. matrix.
I am trying to fit a ridge regression for a multivariable X (X1=Marker 1, X2= Marker, X3= Marker 3,..., X1333= Marker 1333), shown in the first image, as a predictor variable of Y, in the second image.
I want to compute the sum of the squared errors (SSE) for varying tuning parameter λ (between 1 and 20). My code is the following:
#install.packages("MASS")
library(MASS)
fitridge <- function(x,y){
fridge=lm.ridge (y ~ x, lambda = seq(0, 20, 2)) #Fitting a ridge regression for varying λ values
sum(residuals(fridge)^2) #This results in SSE
}
all_gcv= apply(as.matrix(genmark_new),2,fitridge,y=as.matrix(coleslev_new))
}
However, it returns this error, and I don't know what to do anymore. I have tried converting the data set into a matrix, a data frame, changing the order of rows and columns...
Error in colMeans(X[, -Inter]) : 'x' must be an array of at least two dimensions.
I just would like to take each marker value from a single row (first picture), pass them into my fitridge function that fits a ridge regression against the Y from the second data set (in the second picture).
And then subset the SSE and their corresponding lambda values
You cannot fit a ridge with only one independent variable. It is not meant for this. In your case, most likely you have to do:
genmark_new = data.frame(matrix(sample(0:1,1333*100,replace=TRUE),ncol=1333))
colnames(genmark_new) = paste0("Marker_",1:ncol(genmark_new))
coleslev_new = data.frame(NormalizedCholesterol=rnorm(100))
Y = coleslev_new$NormalizedCholesterol
library(MASS)
fit = lm.ridge (y ~ ., data=data.frame(genmark_new,y=Y),lambda = seq(0, 20, 2))
And calculate residuals for each lambda:
apply(fit$coef,2,function(i)sum((Y-as.matrix(genmark_new) %*% i)^2))
0 2 4 6 8 10 12 14
26.41866 27.88029 27.96360 28.04675 28.12975 28.21260 28.29530 28.37785
16 18 20
28.46025 28.54250 28.62459
If you need to fit each variable separately, you can consider using a linear model:
fitlm <- function(x,y){
fridge=lm(y ~ x)
sum(residuals(fridge)^2)
}
all_gcv= apply(genmark_new,2,fitlm,y=Y)
Suggestion, check out make notes or introductions to ridge, they are meant for multiple variate regressions, that is, more than 1 independent variable.

Extract approximate probability density function (pdf) in R from random sampling

I have got n>2 independent continuous Random Variables(RV). For example say I have 4 Uniform RVs with different set of Upper and lowers.
W~U[-1,5], X~U[0,1], Y~[0,2], Z~[0.5,2]
I am trying to find out the approximate PDF for the sum of these RVs i.e. for T=W+X+Y+Z. As I don't need any closed form solution, I have sampled 1 million points for each of them to get 1 million samples for T. Is it possible in R to get the approximate PDF function or a way to get approximate probability of P(t<T)from this samples I have drawn. For example is there a easy way I can calculate P(0.5<T) in R. My priority here is to get probability first even if getting the density function is not possible.
Thanks
Consider the ecdf function:
set.seed(123)
W <- runif(1e6, -1, 5)
X <- runif(1e6, 0, 1)
Y <- runif(1e6, 0, 2)
Z <- runif(1e6, 0.5, 2)
T <- Reduce(`+`, list(W, X, Y, Z))
cdfT <- ecdf(T)
1 - cdfT(0.5) # Pr(T > 0.5)
# [1] 0.997589
See How to calculate cumulative distribution in R? for more details.

Maximum Likelihood Estimation for three-parameter Weibull distribution in r

I want to estimate the scale, shape and threshold parameters of a 3p Weibull distribution.
What I've done so far is the following:
Refering to this post, Fitting a 3 parameter Weibull distribution in R
I've used the functions
EPS = sqrt(.Machine$double.eps) # "epsilon" for very small numbers
llik.weibull <- function(shape, scale, thres, x)
{
sum(dweibull(x - thres, shape, scale, log=T))
}
thetahat.weibull <- function(x)
{
if(any(x <= 0)) stop("x values must be positive")
toptim <- function(theta) -llik.weibull(theta[1], theta[2], theta[3], x)
mu = mean(log(x))
sigma2 = var(log(x))
shape.guess = 1.2 / sqrt(sigma2)
scale.guess = exp(mu + (0.572 / shape.guess))
thres.guess = 1
res = nlminb(c(shape.guess, scale.guess, thres.guess), toptim, lower=EPS)
c(shape=res$par[1], scale=res$par[2], thres=res$par[3])
}
to "pre-estimate" my Weibull parameters, such that I can use them as initial values for the argument "start" in the "fitdistr" function of the MASS-Package.
You might ask why I want to estimate the parameters twice... reason is that I need the variance-covariance-matrix of the estimates which is also estimated by the fitdistr function.
EXAMPLE:
set.seed(1)
thres <- 450
dat <- rweibull(1000, 2.78, 750) + thres
pre_mle <- thetahat.weibull(dat)
my_wb <- function(x, shape, scale, thres) {
dweibull(x - thres, shape, scale)
}
ml <- fitdistr(dat, densfun = my_wb, start = list(shape = round(pre_mle[1], digits = 0), scale = round(pre_mle[2], digits = 0),
thres = round(pre_mle[3], digits = 0)))
ml
> ml
shape scale thres
2.942548 779.997177 419.996196 ( 0.152129) ( 32.194294) ( 28.729323)
> ml$vcov
shape scale thres
shape 0.02314322 4.335239 -3.836873
scale 4.33523868 1036.472551 -889.497580
thres -3.83687258 -889.497580 825.374029
This works quite well for cases where the shape parameter is above 1. Unfortunately my approach should deal with the cases where the shape parameter could be smaller than 1.
The reason why this is not possible for shape parameters that are smaller than 1 is described here: http://www.weibull.com/hotwire/issue148/hottopics148.htm
in Case 1, All three parameters are unknown the following is said:
"Define the smallest failure time of ti to be tmin. Then when γ → tmin, ln(tmin - γ) → -∞. If β is less than 1, then (β - 1)ln(tmin - γ) goes to +∞ . For a given solution of β, η and γ, we can always find another set of solutions (for example, by making γ closer to tmin) that will give a larger likelihood value. Therefore, there is no MLE solution for β, η and γ."
This makes a lot of sense. For this very reason I want to do it the way they described it on this page.
"In Weibull++, a gradient-based algorithm is used to find the MLE solution for β, η and γ. The upper bound of the range for γ is arbitrarily set to be 0.99 of tmin. Depending on the data set, either a local optimal or 0.99tmin is returned as the MLE solution for γ."
I want to set a feasible interval for gamma (in my code called 'thres') such that the solution is between (0, .99 * tmin).
Does anyone have an idea how to solve this problem?
In the function fitdistr there seems to be no opportunity doing a constrained MLE, constraining one parameter.
Another way to go could be the estimation of the asymptotic variance via the outer product of the score vectors. The score vector could be taken from the above used function thetahat.weibul(x). But calculating the outer product manually (without function) seems to be very time consuming and does not solve the problem of the constrained ML estimation.
Best regards,
Tim
It's not too hard to set up a constrained MLE. I'm going to do this in bbmle::mle2; you could also do it in stats4::mle, but bbmle has some additional features.
The larger issue is that it's theoretically difficult to define the sampling variance of an estimate when it's on the boundary of the allowed space; the theory behind Wald variance estimates breaks down. You can still calculate confidence intervals by likelihood profiling ... or you could bootstrap. I ran into a variety of optimization issues when doing this ... I haven't really thought about wether there are specific reasons
Reformat three-parameter Weibull function for mle2 use (takes x as first argument, takes log as an argument):
dweib3 <- function(x, shape, scale, thres, log=TRUE) {
dweibull(x - thres, shape, scale, log=log)
}
Starting function (slightly reformatted):
weib3_start <- function(x) {
mu <- mean(log(x))
sigma2 <- var(log(x))
logshape <- log(1.2 / sqrt(sigma2))
logscale <- mu + (0.572 / logshape)
logthres <- log(0.5*min(x))
list(logshape = logshape, logsc = logscale, logthres = logthres)
}
Generate data:
set.seed(1)
dat <- data.frame(x=rweibull(1000, 2.78, 750) + 450)
Fit model: I'm fitting the parameters on the log scale for convenience and stability, but you could use boundaries at zero as well.
tmin <- log(0.99*min(dat$x))
library(bbmle)
m1 <- mle2(x~dweib3(exp(logshape),exp(logsc),exp(logthres)),
data=dat,
upper=c(logshape=Inf,logsc=Inf,
logthres=tmin),
start=weib3_start(dat$x),
method="L-BFGS-B")
vcov(m1), which should normally provide a variance-covariance estimate (unless the estimate is on the boundary, which is not the case here) gives NaN values ... not sure why without more digging.
library(emdbook)
tmpf <- function(x,y) m1#minuslogl(logshape=x,
logsc=coef(m1)["logsc"],
logthres=y)
tmpf(1.1,6)
s1 <- curve3d(tmpf,
xlim=c(1,1.2),ylim=c(5.9,tmin),sys3d="image")
with(s1,contour(x,y,z,add=TRUE))
h <- lme4:::hessian(function(x) do.call(m1#minuslogl,as.list(x)),coef(m1))
vv <- solve(h)
diag(vv) ## [1] 0.002672240 0.001703674 0.004674833
(se <- sqrt(diag(vv))) ## standard errors
## [1] 0.05169371 0.04127558 0.06837275
cov2cor(vv)
## [,1] [,2] [,3]
## [1,] 1.0000000 0.8852090 -0.8778424
## [2,] 0.8852090 1.0000000 -0.9616941
## [3,] -0.8778424 -0.9616941 1.0000000
This is the variance-covariance matrix of the log-scaled variables. If you want to convert to the variance-covariance matrix on the original scale, you need to scale by (x_i)*(x_j) (i.e. by the derivatives of the transformation exp(x)).
outer(exp(coef(m1)),exp(coef(m1))) * vv
## logshape logsc logthres
## logshape 0.02312803 4.332993 -3.834145
## logsc 4.33299307 1035.966372 -888.980794
## logthres -3.83414498 -888.980794 824.831463
I don't know why this doesn't work with numDeriv - would be very careful with variance estimates above. (Maybe too close to boundary for Richardson extrapolation to work?)
library(numDeriv)
hessian()
grad(function(x) do.call(m1#minuslogl,as.list(x)),coef(m1)) ## looks OK
vcov(m1)
The profiles look OK ... (we have to supply std.err because the Hessian isn't invertible)
pp <- profile(m1,std.err=c(0.01,0.01,0.01))
par(las=1,bty="l",mfcol=c(1,3))
plot(pp,show.points=TRUE)
confint(pp)
## 2.5 % 97.5 %
## logshape 0.9899645 1.193571
## logsc 6.5933070 6.755399
## logthres 5.8508827 6.134346
Alternately, we can do this on the original scale ... one possibility would be to use the log-scaling to fit, then refit starting from those parameters on the original scale.
wstart <- as.list(exp(unlist(weib3_start(dat$x))))
names(wstart) <- gsub("log","",names(wstart))
m2 <- mle2(x~dweib3(shape,sc,thres),
data=dat,
lower=c(shape=0.001,sc=0.001,thres=0.001),
upper=c(shape=Inf,sc=Inf,
thres=exp(tmin)),
start=wstart,
method="L-BFGS-B")
vcov(m2)
## shape sc thres
## shape 0.02312399 4.332057 -3.833264
## sc 4.33205658 1035.743511 -888.770787
## thres -3.83326390 -888.770787 824.633714
all.equal(unname(coef(m2)),unname(exp(coef(m1))),tol=1e-4)
About the same as the values above.
We can fit with a small shape, if we are a little more careful to bound the paraameters, but now we end up on the boundary for the threshold, which will cause lots of problems for the variance calculations.
set.seed(1)
dat <- data.frame(x = rweibull(1000, .53, 365) + 100)
tmin <- log(0.99 * min(dat$x))
m1 <- mle2(x ~ dweib3(exp(logshape), exp(logsc), exp(logthres)),
lower=c(logshape=-10,logscale=0,logthres=0),
upper = c(logshape = 20, logsc = 20, logthres = tmin),
data = dat,
start = weib3_start(dat$x), method = "L-BFGS-B")
For censored data, you need to replace dweibull with pweibull; see Errors running Maximum Likelihood Estimation on a three parameter Weibull cdf for some hints.
Another possible solution is to do Bayesian inference. Using scale priors on the shape and scale parameters and a uniform prior on the location parameter, you can easily run Metropolis-Hastings as follows. It might be adviceable to reparameterize in terms of log(shape), log(scale) and log(y_min - location) because the posterior for some of the parameters becomes strongly skewed, in particular for the location parameter. Note that the output below shows the posterior for the backtransformed parameters.
library(MCMCpack)
logposterior <- function(par,y) {
gamma <- min(y) - exp(par[3])
sum(dweibull(y-gamma,exp(par[1]),exp(par[2]),log=TRUE)) + par[3]
}
y <- rweibull(100,shape=.8,scale=10) + 1
chain0 <- MCMCmetrop1R(logposterior, rep(0,3), y=y, V=.01*diag(3))
chain <- MCMCmetrop1R(logposterior, rep(0,3), y=y, V=var(chain0))
plot(exp(chain))
summary(exp(chain))
This produces the following output
#########################################################
The Metropolis acceptance rate was 0.43717
#########################################################
Iterations = 501:20500
Thinning interval = 1
Number of chains = 1
Sample size per chain = 20000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
[1,] 0.81530 0.06767 0.0004785 0.001668
[2,] 10.59015 1.39636 0.0098738 0.034495
[3,] 0.04236 0.05642 0.0003990 0.001174
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
var1 0.6886083 0.768054 0.81236 0.8608 0.9498
var2 8.0756210 9.637392 10.50210 11.4631 13.5353
var3 0.0003397 0.007525 0.02221 0.0548 0.1939

ar(1) simulation with non-zero mean

I can't seem to find the correct way to simulate an AR(1) time series with a mean that is not zero.
I need 53 data points, rho = .8, mean = 300.
However, arima.sim(list(order=c(1,0,0), ar=.8), n=53, mean=300, sd=21)
gives me values in the 1500s. For example:
1480.099 1480.518 1501.794 1509.464 1499.965 1489.545 1482.367 1505.103 (and so on)
I have also tried arima.sim(n=52, model=list(ar=c(.8)), start.innov=300, n.start=1)
but then it just counts down like this:
238.81775870 190.19203239 151.91292491 122.09682547 96.27074057 [6] 77.17105923 63.15148491 50.04211711 39.68465916 32.46837830 24.78357345 21.27437183 15.93486092 13.40199333 10.99762449 8.70208879 5.62264196 3.15086491 2.13809323 1.30009732
and I have tried arima.sim(list(order=c(1,0,0), ar=.8), n=53,sd=21) + 300 which seems to give a correct answer. For example:
280.6420 247.3219 292.4309 289.8923 261.5347 279.6198 290.6622 295.0501
264.4233 273.8532 261.9590 278.0217 300.6825 291.4469 291.5964 293.5710
285.0330 274.5732 285.2396 298.0211 319.9195 324.0424 342.2192 353.8149
and so on..
However, I am in doubt that this is doing the correct thing? Is it still auto-correlating on the correct number then?
Your last option is okay to get the desired mean, "mu". It generates data from the model:
(y[t] - mu) = phi * (y[t-1] - mu) + \epsilon[t], epsilon[t] ~ N(0, sigma=21),
t=1,2,...,n.
Your first approach sets an intercept, "alpha", rather than a mean:
y[t] = alpha + phi * y[t-1] + epsilon[t].
Your second option sets the starting value y[0] equal to 300. As long as |phi|<1 the influence of this initial value will vanish after a few periods and will have no effect
on the level of the series.
Edit
The value of the standard deviation that you observe in the simulated data is correct. Be aware that the variance of the AR(1) process, y[t], is not equal the variance of the innovations, epsilon[t]. The variance of the AR(1) process, sigma^2_y, can be obtained obtained as follows:
Var(y[t]) = Var(alpha) + phi^2 * Var(y[t-1]) + Var(epsilon[t])
As the process is stationary Var(y[t]) = Var(t[t-1]) which we call sigma^2_y. Thus, we get:
sigma^2_y = 0 + phi^2 * sigma^2_y + sigma^2_epsilon
sigma^2_y = sigma^2_epsilon / (1 - phi^2)
For the values of the parameters that you are using you have:
sigma_y = sqrt(21^2 / (1 - 0.8^2)) = 35.
Use the rGARMA function in the ts.extend package
You can generate random vectors from any stationary Gaussian ARMA model using the ts.extend package. This package generates random vectors directly form the multivariate normal distribution using the computed autocorrelation matrix for the random vector, so it gives random vectors from the exact distribution and does not require "burn-in" iterations. Here is an example of generating multiple independent time-series vectors all from an AR(1) model.
#Load the package
library(ts.extend)
#Set parameters
MEAN <- 300
ERRORVAR <- 21^2
AR <- 0.8
m <- 53
#Generate n = 16 random vectors from this model
set.seed(1)
SERIES <- rGARMA(n = 16, m = m, mean = MEAN, ar = AR, errorvar = ERRORVAR)
#Plot the series using ggplot2 graphics
library(ggplot2)
plot(SERIES)
As you can see, the generated time-series vectors in this plot use the appropriate mean and error variance that were specified in the inputs.

Resources