Fit data with multiple functions in R

I am trying to fit my data with multiple fitting functions, each with several parameters; an example with two functions is below. For certain types the fit is poor and I get a "singular gradient" error. I would like to skip those failing cases and proceed anyway, and for the remaining types keep the better of the two fitting functions by comparing the deviance. In this example, for both type1 and type2 the sum of squared residuals is smaller with the first function, i.e.
sum(resid(myfitfun1)^2) < sum(resid(myfitfun2)^2), so the first function should be chosen for both types.
myfun1<-function(x,a,b){1/(1+exp(-(x/a)+b))}
myfun2<-function(x,a,b){1+b*exp(-(x)/a)}
mydata <- data.frame(v=c("type1","type1","type1","type1","type1","type1","type1","type1","type1","type1","type1",
"type1","type1","type1","type1","type1","type1","type1","type1","type1","type1","type1","type1","type1",
"type1","type1","type1","type1","type1","type1","type1","type1","type1","type1","type1","type1",
"type1","type1","type1","type1","type1","type1","type1","type1","type2","type2","type2","type2",
"type2","type2"),
m=c(1.116975672,1.38641493,1.423833959,1.482008121,1.513588607,1.527179033,
1.543512539,1.555874185,1.607579807,1.721182154,1.729059048,1.748226681,
1.774814055,1.815147988,1.835638421,1.854582642,1.861972,1.887704144,
1.915360975,1.948689331,1.97516491,1.985962227,2.011310496,2.043716548,
2.068918438,2.091184665,2.120366813,2.126865141,2.148241318,2.15871161,
2.193529738,2.256197915,2.302364722,2.316381935,2.31909683,2.325213451,
2.336299128,2.410419652,2.473160411,2.478302702,2.5238233,2.651124474,
2.70470831,2.927536062,-0.1736072,0.1235610,0.5848941,0.9016486,0.9744832,
1.2767238),
n=c(0.022727273,0.045454545,0.068181818,0.090909091,0.113636364,0.136363636,
0.159090909,0.181818182,0.204545455,0.227272727,0.25,0.272727273,0.295454545,
0.318181818,0.340909091,0.363636364,0.386363636,0.409090909,0.431818182,
0.454545455,0.477272727,0.5,0.522727273,0.545454545,0.568181818,0.590909091,
0.613636364,0.636363636,0.659090909,0.681818182,0.704545455,0.727272727,0.75,
0.772727273,0.795454545,0.818181818,0.840909091,0.863636364,0.886363636,
0.909090909,0.931818182,0.954545455,0.977272727,1,0.1666667,0.3333333,0.5000000,
0.6666667,0.8333333,1))
myfitfun1 <- nls(n~myfun1(m,a,b),mydata,start=list(a=1,b=1))
myfitfun2 <- nls(n~myfun2(m,a,b),mydata,start=list(a=1,b=1))
I would like to program this so that it automatically picks the better fit between the two functions for each type, ignoring a function when it errors. Any help is appreciated.

You could put both fits in a function and work with tryCatch. In the first pair of tryCatches, return NA to swallow the error. In the second tryCatch, set the residual sum of squares to Inf when an error occurs, which guarantees the non-failing function wins as the "better" fit. In normal cases the minimum is chosen. With `attr<-` we can attach the residual sum of squares (stored here under the name "MSE") as an attribute to the returned fit.
fun <- function(data) {
  myfitfun1 <- tryCatch(
    nls(n ~ myfun1(m, a, b), data, start = list(a = 1, b = 1)),
    error = function(e) NA)
  myfitfun2 <- tryCatch(
    nls(n ~ myfun2(m, a, b), data, start = list(a = 1, b = 1)),
    error = function(e) NA)
  L <- list(myfitfun1, myfitfun2)
  res <- sapply(L, function(x) {
    tryCatch(sum(resid(x)^2), error = function(e) Inf)
  })
  `attr<-`(L[[which.min(res)]], "MSE", min(res))
}
fun(mydata)
# Nonlinear regression model
# model: n ~ myfun1(m, a, b)
# data: data
# a b
# 0.3465 5.6461
# residual sum-of-squares: 2.323
#
# Number of iterations to convergence: 26
# Achieved convergence tolerance: 7.675e-06
To get the MSE attribute, use:
attr(fun(mydata), "MSE")
# [1] 2.322945
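To handle the different types automatically, as asked, one option (a sketch; splitting on the grouping column `v` is my assumption about the intended grouping) is to fit each type separately:

```r
# Fit both functions per type and keep the better one for each
fits <- lapply(split(mydata, mydata$v), fun)
# Residual sums of squares stored by fun() under the "MSE" attribute
sapply(fits, attr, "MSE")
```

If both fits fail for some type, `which.min` on two `Inf` values still returns the first model (an `NA`), so you may want to check the "MSE" attribute for `Inf` afterwards.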

Related

Standardization and inclusion of intercept in sparse lasso GLM

I found some problems while practicing the sparse group lasso method using the cvSGL function from the SGL package.
My questions are as follows:
Looking at the code for SGL:::center_scale, it doesn't seem to consider the sample size of the data.
SGL:::center_scale
#Output
function (X, standardize) {
  means <- apply(X, 2, mean)
  X <- t(t(X) - means)
  X.transform <- list(X.means = means)
  if (standardize == TRUE) {
    var <- apply(X, 2, function(x) (sqrt(sum(x^2))))
    X <- t(t(X)/var)
    X.transform$X.scale <- var
  }
  else {
    X.transform$X.scale <- 1
  }
  return(list(x = X, X.transform = X.transform))
}
Therefore, the scaling factor (the Euclidean norm of each centered column) is somewhat larger than the sample standard deviation.
Is my understanding correct that this may cause the coefficients to be excessively large?
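The relation between that scaling and the sample standard deviation can be checked directly (my own illustrative sketch, not from the SGL documentation):

```r
set.seed(1)
x <- rnorm(100)
xc <- x - mean(x)            # centered, as in center_scale
sqrt(sum(xc^2))              # the norm center_scale divides by
sd(x) * sqrt(length(x) - 1)  # identical: the norm is sd * sqrt(n - 1)
```

So the divisor grows with the sample size rather than staying on the scale of the standard deviation.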
My second question is whether the model can be estimated by the SGL package with an intercept (constant) term.
The SGL package does not seem to provide a way to include an intercept term in the estimation.
In cvFit[["fit"]] I can only see the betas of the predictor variables for each lambda, without a constant term. The value of cvFit[["fit"]][["intercept"]] is just the mean of the y variable.
An intercept could be emulated by adding a column of 1s as the first column of the predictor matrix X, but in that case I expect problems with centering and standardizing the predictors.
In addition, the SGL package seems to penalize all predictor variables, so even if the estimation is performed with a column of 1s added as described above, the constant term may be shrunk to 0.

extracting residuals from pixel by pixel regression

I am trying to extract the residuals from a regression run pixel by pixel on a raster stack of NDVI/precipitation. My script works when I run it with a small part of my data, but when I try to run it on the whole study area I get: "Error in setValues(out, x) : values must be numeric, integer, logical or factor".
The lm() works, since I can extract both slope and intercept; I just can't extract the residuals.
Any idea of how this could be fixed?
Here is my script:
setwd("F:/working folder/test")
gimms <- list.files(pattern="*ndvi.tif")
ndvi <- stack(gimms)
precip <- list.files(pattern="*pre.tif")
pre <- stack(precip)
s <- stack(ndvi,pre)
residualfun <- function(x) {
  if (is.na(x[1])) {
    NA
  } else {
    m <- lm(x[1:6] ~ x[7:12], na.action = na.exclude)
    r <- residuals.lm(m)
    return(r)
  }
}
res <- calc(s, residualfun)
And here is my data: https://1drv.ms/u/s!AhwCgWqhyyDclJRjhh6GtentxFOKwQ
Your function only tests whether the first layer is NA before fitting the model, but there may be NAs in other layers. You know that, because you added na.action = na.exclude to your lm fit.
The problem is that if the model drops some values because of NAs, the residuals will only have the length of the non-NA values. This means your resulting r vector will have a different length depending on the number of NA values in the layers, and calc is not able to combine results of different lengths into a stack with a fixed number of layers.
To avoid that, you need to fix the length of r in your function and assign residuals only to the non-NA positions.
I propose the following function, which now works on the dataset you provided. I added (1) a nlayers variable so you can compare more layers of each stack if you want to extend your exploration, (2) a check that skips the model unless there are more than two points to compare in each layer (otherwise the fit is trivially perfect), and (3) a try() so that if for any reason the model cannot be fitted, the output is the sentinel value -1e32, easy to find for further testing.
library(raster)
setwd("/mnt/Data/Stackoverflow/test")
gimms <- list.files(pattern="*ndvi.tif")
ndvi <- stack(gimms)
precip <- list.files(pattern="*pre.tif")
pre <- stack(precip)
s <- stack(ndvi,pre)
# Number of layers of each
nlayers <- 6
residualfun <- function(x) {
  r <- rep(NA, nlayers)
  obs <- x[1:nlayers]
  cov <- x[nlayers + 1:nlayers]
  # Remove NA values before fitting the model
  x.nona <- which(!is.na(obs) & !is.na(cov))
  # If more than 2 points, proceed to lm
  if (length(x.nona) > 2) {
    m <- NA
    try(m <- lm(obs[x.nona] ~ cov[x.nona]))
    # If the model worked, fill in residuals at the non-NA positions
    if (is(m)[1] == "lm") {
      r[x.nona] <- residuals.lm(m)
    } else {
      # sentinel value to find where the model did not work
      r[x.nona] <- -1e32
    }
  }
  return(r)
}
res <- calc(s, residualfun)

Using anova() on gamma distributions gives seemingly random p-values

I am trying to determine whether there is a significant difference between two Gamma distributions. One distribution has (shape, scale) = (shapeRef, scaleRef) while the other has (shape, scale) = (shapeTarget, scaleTarget). I try to do an analysis of variance with the following code:
n=10000
x=rgamma(n, shape=shapeRef, scale=scaleRef)
y=rgamma(n, shape=shapeTarget, scale=scaleTarget)
glmm1 <- gam(y~x,family=Gamma(link=log))
anova(glmm1)
The resulting p values keep changing and can be anywhere from <0.1 to >0.9.
Am I going about this the wrong way?
Edit: I use the following code instead
f <- gl(2, n)
x=rgamma(n, shape=shapeRef, scale=scaleRef)
y=rgamma(n, shape=shapeTarget, scale=scaleTarget)
xy <- c(x, y)
anova(glm(xy ~ f, family = Gamma(link = log)),test="F")
But, every time I run it I get a different p-value.
You will indeed get a different p-value every time you run this, if you pick different realizations every time. Just like your data values are random variables, which you'd expect to vary each time you ran an experiment, so is the p-value. If the null hypothesis is true (which was the case in your initial attempts), then the p-values will be uniformly distributed between 0 and 1.
Function to generate simulated data:
simfun <- function(n = 100, shapeRef = 2, shapeTarget = 2,
                   scaleRef = 1, scaleTarget = 2) {
  f <- gl(2, n)
  x <- rgamma(n, shape = shapeRef, scale = scaleRef)
  y <- rgamma(n, shape = shapeTarget, scale = scaleTarget)
  xy <- c(x, y)
  data.frame(xy, f)
}
Function to run anova() and extract the p-value:
sumfun <- function(d) {
  aa <- anova(glm(xy ~ f, family = Gamma(link = log), data = d), test = "F")
  aa["f", "Pr(>F)"]
}
Try it out, 500 times:
set.seed(101)
r <- replicate(500,sumfun(simfun()))
The p-values are always very small (the difference in scale parameters is easily distinguishable), but they do vary:
par(las=1,bty="l") ## cosmetic
hist(log10(r),col="gray",breaks=50)
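To see the uniform-under-null behaviour described above, the same machinery can be reused with identical parameters in both groups (a sketch building on simfun and sumfun):

```r
set.seed(102)
# Null case: both samples share shape = 2, scale = 1
r0 <- replicate(500, sumfun(simfun(scaleTarget = 1)))
hist(r0, col = "gray", breaks = 20)  # approximately flat on [0, 1]
```

Here the histogram of raw p-values (no log transform needed) should look roughly uniform, which is exactly why a single p-value bounces around between runs.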

Cross validation for glm() models

I'm trying to do a 10-fold cross validation for some glm models that I have built earlier in R. I'm a little confused about the cv.glm() function in the boot package, although I've read a lot of help files. When I provide the following formula:
library(boot)
cv.glm(data, glmfit, K=10)
Does the "data" argument here refer to the whole dataset or only to the test set?
The examples I have seen so far pass the test set as the "data" argument, but that did not really make sense to me: why do 10 folds on the same test set? They are all going to give exactly the same result (I assume!).
Unfortunately ?cv.glm explains it in a foggy way:
data: A matrix or data frame containing the data. The rows should be
cases and the columns correspond to variables, one of which is the
response
My other question would be about the $delta[1] result. Is this the average prediction error over the 10 trials? What if I want to get the error for each fold?
Here's what my script looks like:
##data partitioning
sub <- sample(nrow(data), floor(nrow(data) * 0.9))
training <- data[sub, ]
testing <- data[-sub, ]
##model building
model <- glm(formula = groupcol ~ var1 + var2 + var3,
family = "binomial", data = training)
##cross-validation
cv.glm(testing, model, K=10)
I am always a little cautious about using the 10-fold cross-validation methods of various packages. I have my own simple script to create the test and training partitions manually for any machine learning package:
# Randomly shuffle the data
yourData <- yourData[sample(nrow(yourData)), ]
# Create 10 equally sized folds
folds <- cut(seq(1, nrow(yourData)), breaks = 10, labels = FALSE)
# Perform 10-fold cross validation
for (i in 1:10) {
  # Segment your data by fold using the which() function
  testIndexes <- which(folds == i, arr.ind = TRUE)
  testData <- yourData[testIndexes, ]
  trainData <- yourData[-testIndexes, ]
  # Use the test and train data partitions however you desire...
}
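To make the loop body concrete, here is one way to use the partitions (a sketch assuming the binary response `groupcol` and predictors `var1`..`var3` from the question); with equally sized folds, the mean of the per-fold errors is analogous to cv.glm()'s delta[1], and it also answers the question of how to get the error for each fold:

```r
cvErrors <- numeric(10)
for (i in 1:10) {
  testIndexes <- which(folds == i, arr.ind = TRUE)
  testData <- yourData[testIndexes, ]
  trainData <- yourData[-testIndexes, ]
  # Fit on the training partition only
  fit <- glm(groupcol ~ var1 + var2 + var3, family = "binomial", data = trainData)
  # Predicted probabilities on the held-out fold
  pred <- predict(fit, newdata = testData, type = "response")
  cvErrors[i] <- mean((testData$groupcol - pred)^2)  # this fold's error
}
cvErrors        # error for each fold
mean(cvErrors)  # overall cross-validation estimate
```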
@Roman provided some answers in his comments; however, the answer to your questions comes from inspecting the code of cv.glm:
I believe this bit of code splits the data set up randomly into the K folds, rounding and adjusting K as necessary when it does not divide n:
if ((K > n) || (K <= 1))
  stop("'K' outside allowable range")
K.o <- K
K <- round(K)
kvals <- unique(round(n/(1L:floor(n/2))))
temp <- abs(kvals - K)
if (!any(temp == 0))
  K <- kvals[temp == min(temp)][1L]
if (K != K.o)
  warning(gettextf("'K' has been set to %f", K), domain = NA)
f <- ceiling(n/K)
s <- sample0(rep(1L:K, f), n)
This bit here shows that the delta value is NOT the root mean square error. It is, as the help file says, "the average squared error function". What does this mean? We can see it by inspecting the function declaration:
function (data, glmfit, cost = function(y, yhat) mean((y - yhat)^2),
K = n)
which shows that within each fold we calculate the average of the squared error, where the error is, in the usual sense, the difference between the predicted and the actual response.
delta[1] is simply the weighted average of these per-fold costs; see my inline comments in the code of cv.glm:
for (i in seq_len(ms)) {
  j.out <- seq_len(n)[(s == i)]
  j.in <- seq_len(n)[(s != i)]
  Call$data <- data[j.in, , drop = FALSE]
  d.glm <- eval.parent(Call)
  p.alpha <- n.s[i]/n  # weight for the weighted average
  cost.i <- cost(glm.y[j.out], predict(d.glm, data[j.out, , drop = FALSE],
                                       type = "response"))
  CV <- CV + p.alpha * cost.i  # add weighted fold error to running total
  cost.0 <- cost.0 - p.alpha * cost(glm.y, predict(d.glm, data,
                                                   type = "response"))
}
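As a quick numeric check of the default cost (a toy example, unrelated to any dataset here):

```r
cost <- function(y, yhat) mean((y - yhat)^2)
cost(c(1, 0, 1, 1), c(0.8, 0.2, 0.6, 0.9))
# (0.2^2 + 0.2^2 + 0.4^2 + 0.1^2) / 4 = 0.0625
```

For a binary response this is the Brier score; you can pass any other cost function (e.g. misclassification rate) via the cost argument.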

Obtain t-statistic for regression coefficients of an “mlm” object returned by `lm()`

I've used lm() to fit multiple regression models for multiple (~1 million) response variables in R, e.g.
allModels <- lm(t(responseVariablesMatrix) ~ modelMatrix)
This returns an object of class "mlm", essentially one large object containing all the models. I want the t-statistic for the first coefficient in each model, which I can get with summary(allModels), but that is very slow on data of this size and returns a lot of unwanted information too.
Is there a way of calculating the t-statistics manually that is faster than using summary()?
Thanks!
You can hack the summary.lm() function to get just the bits you need and leave the rest.
If you have
nVariables <- 5
nObs <- 15
y <- rnorm(nObs)
x <- matrix(rnorm(nVariables*nObs),nrow=nObs)
allModels <-lm(y~x)
Then this is the code from the summary.lm() function, but with all the excess baggage removed (note that all the error handling has been removed as well):
p <- allModels$rank
rdf <- allModels$df.residual
Qr <- allModels$qr
n <- NROW(Qr$qr)
p1 <- 1L:p
r <- allModels$residuals
f <- allModels$fitted.values
w <- allModels$weights
mss <- if (attr(allModels$terms, "intercept")) sum((f - mean(f))^2) else sum(f^2)
rss <- sum(r^2)
resvar <- rss/rdf
R <- chol2inv(Qr$qr[p1, p1, drop = FALSE])
se <- sqrt(diag(R) * resvar)
est <- allModels$coefficients[Qr$pivot[p1]]
tval <- est/se
tval is now a vector of the t-statistics, as is also given by
summary(allModels)$coefficients[,3]
If you have problems on the large model you might want to rewrite the code so that it keeps fewer objects by compounding multiple lines/assignments into fewer lines.
Hacky solution I know. But it will be about as fast as possible. I suppose it would be neater to put all the lines of code into a function as well.
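Following that suggestion, the lines above can be wrapped into a function (a sketch for the single-response, unweighted case; for an "mlm" fit the residual sum of squares would need to be computed per response column):

```r
fastTstats <- function(fit) {
  p <- fit$rank
  rdf <- fit$df.residual
  Qr <- fit$qr
  p1 <- 1L:p
  # Residual variance from the residual sum of squares
  resvar <- sum(fit$residuals^2) / rdf
  # Unscaled covariance matrix of the coefficients from the QR decomposition
  R <- chol2inv(Qr$qr[p1, p1, drop = FALSE])
  se <- sqrt(diag(R) * resvar)
  fit$coefficients[Qr$pivot[p1]] / se
}
# Should agree with summary(allModels)$coefficients[, 3]
fastTstats(allModels)
```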
