transforming variables to use for prediction in regression model - r

I have the following dataset and want to find the best variables among pred1, pred2, pred3, pred5, and pred6 to build a regression model that predicts resp1 and resp2.
So far, I have only found that pred2 seems to be the one to use for resp1.
library(ggplot2)
testdat <- read.csv("testdat.csv", header = T)
plot(testdat$pred2,testdat$resp1)
m1<-lm(resp1~pred2, data=testdat)
smooth <- smooth.spline(testdat$pred2,testdat$resp1, spar=1)
lines(smooth, col='red', lwd = 2, lty = 2)
I tried ^2, sqrt, and log transformations, but whatever I try, summary(m1) reports an R-squared no higher than 53%. I'm getting desperate.
The same happens when transforming variables to predict resp2: the R-squared is no higher than 66%.
plot(testdat$pred3,testdat$resp2)
m1<-lm(resp2~pred3, data=testdat)
smooth <- smooth.spline(testdat$pred3,testdat$resp2, spar=1)
lines(smooth, col='red', lwd = 2, lty = 2)
Sample dataset:
https://www.filehosting.org/file/details/846977/testdat.csv
Using more than one variable is definitely an option. If you were building a regression model to predict resp1 and then resp2 from one or more variables among pred1, pred2, pred3, pred5, and pred6, what would you do?

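If you are looking to use linear regression, you could perform a RESET test (resettest() from the lmtest package) on your data and work from there:
library(lmtest)
resettest(resp1 ~ pred2, data = testdat, power = 2:3, type = "regressor")
If the p-value is lower than 0.05, the test rejects the current functional form, so you should change the model (for example, by transforming the predictor or adding terms).
Also, use adjusted R2 instead of R2 to make comparisons between models, since plain R2 never decreases when you add predictors.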
Otherwise, more information regarding your data might be necessary (e.g. what is its nature?)
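As a rough illustration of both suggestions, here is a minimal base-R sketch on simulated data (the data frame sim and its columns x and y are made up for the example, not taken from testdat.csv). It mimics the idea behind a RESET test with type = "regressor" by using an F-test to check whether powers of the regressor add explanatory power, and then compares two candidate models by adjusted R-squared. It is not a substitute for lmtest::resettest().

```r
# Simulated data (hypothetical): the true relationship is logarithmic.
set.seed(1)
n <- 200
x <- runif(n, 1, 10)
y <- 2 + 3 * log(x) + rnorm(n, sd = 0.3)
sim <- data.frame(x, y)

# Straight-line model and the same model augmented with x^2 and x^3.
m_lin <- lm(y ~ x, data = sim)
m_aug <- lm(y ~ x + I(x^2) + I(x^3), data = sim)

# F-test of the added powers; a small p-value suggests the linear
# functional form is misspecified.
reset_p <- anova(m_lin, m_aug)[["Pr(>F)"]][2]

# Candidate transformation, compared by adjusted R-squared.
m_log <- lm(y ~ log(x), data = sim)
adj_r2 <- c(linear = summary(m_lin)$adj.r.squared,
            log    = summary(m_log)$adj.r.squared)
print(reset_p)
print(adj_r2)
```

The same pattern extends to several predictors: fit each candidate formula with lm() and compare summary(fit)$adj.r.squared across them.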

Related

Calculating piecewise quantile linear regression with segmented package R

I am looking for a way to obtain a piecewise quantile linear regression with R. I have been able to compute the quantile regression with the package quantreg. However, I don't want just one slope; I want to check for breakpoints in my dataset. I have seen that the segmented package can do so. While it works well if the fit is carried out with lm or glm (as shown below in an example), it doesn't work for quantile regression.
On the segmented package info I have read that there is a segmented.default which can be used for specific regression models, such as Quantiles. However, when I apply it for my quantile outcome it gives me the following errors:
Error in diag(vv) : invalid 'nrow' value (too large or NA)
In addition: Warning message:
cannot compute the covariance matrix
If I use psi instead of K=2, for example, I get a different error:
Error in rq.fit.br(x, y, tau = tau, ...) : Singular design matrix
I have created an example with the mtcars data so you can see the errors that I get.
library(quantreg)
library(segmented)
data(mtcars)
out.rq <- rq(mpg ~ wt, data= mtcars)
out.lm <- lm(mpg ~ wt, data= mtcars)
# Plotting the results
plot(mpg ~ wt, data = mtcars, pch = 1, main = "mpg ~ wt")
abline(out.lm, col = "red", lty = 2)
abline(out.rq, col = "blue", lty = 2)
legend("topright", legend = c("linear", "quantile"), col = c("red", "blue"), lty = 2)
#Generating segmented LM
o <- segmented(out.lm, seg.Z= ~wt, npsi=2, control=seg.control(display=FALSE))
plot(o, lwd=2, col=2:6, main="Segmented regression", res=FALSE) #lwd: line width #col: from 2 to 6 #RES: show datapoints
#Generating segmented Quantile
#using K=2
o.quantile <- segmented.default(out.rq, seg.Z= ~wt, control=seg.control(display=FALSE, K=2))
# using psi
o.quantile <- segmented.default(out.rq, seg.Z= ~wt, psi=list(wt=c(2,4)), control=seg.control(display=FALSE))
I came across this post after a long time because I have the same issue. Just in case others might be stuck with the problem in the future, I wanted to point out what the problem is.
I examined "segmented.default". There is a line in the source code as follows:
Cov <- try(vcov(objF), silent = TRUE)
vcov is used to calculate the covariance matrix, but it does not work for the quantile regression object objF. To get the covariance matrix for a quantile regression, you need:
summary(objF, se = "boot", covariance = TRUE)$cov
Here, I used the bootstrap method to compute the covariance matrix by selecting se="boot", but you should choose the method that is appropriate for you. See the "se" section of ?summary.rq for the different methods.
Additionally, you need to assign the row/column names as follows:
dimnames(Cov)[[1]] <- dimnames(Cov)[[2]] <- unlist(attributes(objF$coef))
After modifying the function, it worked for me.
Maybe the other answer isn't particularly clean, as you need to modify a package function.
Additionally, maybe boot isn't such a good idea for SEs, according to this answer.
To get it working a bit easier, add a function to your workspace:
vcov.rq <- function(object, ...) {
result = summary(object, se = "nid", covariance = TRUE)$cov
rownames(result) = colnames(result) = names(coef(object))
return(result)
}
Caveats from the Cross-Validated link apply.
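The reason a function in the workspace is enough is S3 dispatch: vcov() is a generic, so defining vcov.<class> makes vcov(obj) work for objects of that class. The sketch below shows the mechanism with a made-up class "toyfit" (hypothetical, with invented numbers, so that it runs without quantreg). The vcov.rq function above relies on the same mechanism for "rq" objects.

```r
# A made-up fitted-model object (hypothetical class "toyfit") that stores
# its coefficients and an unnamed covariance matrix.
fit <- structure(
  list(coefficients = c("(Intercept)" = 37.3, wt = -5.3),
       cov = matrix(c(3.5, -1.0, -1.0, 0.3), nrow = 2)),
  class = "toyfit"
)

# S3 method: vcov() dispatches on class(fit) and finds vcov.toyfit.
vcov.toyfit <- function(object, ...) {
  result <- object$cov
  # label rows and columns with the coefficient names
  rownames(result) <- colnames(result) <- names(object$coefficients)
  result
}

V  <- vcov(fit)       # calls vcov.toyfit() through the generic
se <- sqrt(diag(V))   # standard errors from the diagonal
print(V)
```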

Obtaining Standardized coefficients from "rstanarm" package in R?

I was wondering if it might be possible (and perhaps recommended) to obtain standardized coefficients from stan_glm() in the rstanarm package? (did not find anything specific in the documentation)
Can I just standardize all variables as in normal regression? (see below)
Example:
library("rstanarm")
fit <- stan_glm(wt ~ vs*gear, data = mtcars)
Standardization:
design <- wt ~ vs*gear
vars <- all.vars(design)
# scale() every variable in the formula and keep the result as a data frame
stand.vars <- as.data.frame(scale(mtcars[, vars]))
fit <- stan_glm(design, data = stand.vars)
I would not say that it is affirmatively recommended, but I would recommend that you not subtract the sample mean and divide by the sample standard deviation of the outcome because the estimation uncertainty in those two statistics will not be propagated to the posterior distribution.
Standardizing the predictors is more debatable. You can do it, but it makes doing posterior prediction with new data harder because you have to remember to subtract the old means from the new data and divide by the old standard deviations.
The most computationally efficient approach is to leave the variables as they are but specify the non-default argument QR = TRUE, especially if you are not going to modify the default (normal) priors on the coefficients anyway.
You can then standardize the posterior coefficients after-the-fact if standardized coefficients are of interest. To do so, you can do
X <- model.matrix(fit)
sd_X <- apply(X, MARGIN = 2, FUN = sd)[-1]
sd_Y <- apply(posterior_predict(fit), MARGIN = 1, FUN = sd)
beta <- as.matrix(fit)[ , 2:ncol(X), drop = FALSE]
b <- sweep(sweep(beta, MARGIN = 2, STATS = sd_X, FUN = `*`),
MARGIN = 1, STATS = sd_Y, FUN = `/`)
summary(b)
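To see what this rescaling does, here is a minimal sketch of the classical (non-Bayesian) analogue using lm() on mtcars. It is an illustration under a simplifying assumption: it uses the single sample standard deviation of the outcome, whereas the code above uses one standard deviation per posterior draw. The model mpg ~ wt + hp is chosen only for the example.

```r
# Fit on the raw scale.
fit_raw <- lm(mpg ~ wt + hp, data = mtcars)

# Standardize after the fact: multiply each slope by sd(x) and divide by sd(y).
X     <- model.matrix(fit_raw)
sd_X  <- apply(X, MARGIN = 2, FUN = sd)[-1]   # drop the intercept column
sd_Y  <- sd(mtcars$mpg)
b_std <- coef(fit_raw)[-1] * sd_X / sd_Y

# Reference: refit after scaling every variable beforehand.
z     <- as.data.frame(scale(mtcars[, c("mpg", "wt", "hp")]))
fit_z <- lm(mpg ~ wt + hp, data = z)

# The two sets of slopes agree up to floating-point error.
print(cbind(after_the_fact = b_std, prescaled = coef(fit_z)[-1]))
```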
However, standardizing regression coefficients just gives the illusion of comparability across variables and says nothing about how germane a one standard deviation difference is, particularly for dummy variables. If your question is really whether manipulating this predictor or that predictor is going to make a bigger difference on the outcome variable, then simply simulate those manipulations like
PPD_0 <- posterior_predict(fit)
nd <- model.frame(fit)
nd[ , 2] <- nd[ , 2] + 1 # for example
PPD_1 <- posterior_predict(fit, newdata = nd)
summary(c(PPD_1 - PPD_0))
and repeat that process for other manipulations of interest.

Linear Mixed-Effects Models for a big spatial auto-correlated dataset

So, I am working with a big dataset (55965 points). I am trying to fit an LME model that accounts for spatial correlation, but R returns this error:
Error: 'sumLenSq := sum(table(groups)^2)' = 3.13208e+09 is too large.
Too large or no groups in your correlation structure?
I can not subset it since I need all the points. My questions are:
Is there some setting I can change in the function?
If not, is there any other package with similar function that would run such a big dataset?
Here is a reproducible example:
require(nlme)
my.data<- matrix(data = 0, nrow = 55965, ncol = 3)
my.data<- as.data.frame(my.data)
dummy <- rep(1, 55965)
my.data$dummy<- dummy
my.data$V1<- seq(780, 56744)
my.data$V2<- seq(1:55965)
my.data$X<- seq(49.708, 56013.708)
my.data$Y<-seq(-12.74094, -55977.7409)
null.model <- lme(fixed = V1~ V2, data = my.data, random = ~ 1 | dummy, method = "ML")
spatial_model <- update(null.model, correlation = corGaus(1, form = ~ X + Y), method = "ML")
Since you have assigned a grouping factor with only one level, there are no groups in the data, which is what the error message reports. If you just want to account for spatial autocorrelation, with no other random effects, use gls from the same package.
Edit: A further note on two different approaches to modelling spatial autocorrelation: corGaus (and the other corSpatial-type functions) implement spatial correlation models for the regression residuals, which is different from, say, a spatial random effect added to the model based on county/district/grid identity.
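The number in the error message can be reproduced with plain arithmetic: with a single group, sum(table(groups)^2) is just the squared number of rows. My reading (an assumption, not checked against the nlme source) is that this is rejected because it exceeds R's integer limit. The sketch below also shows how much smaller the quantity becomes with a grouping into 100 blocks. The blocks variable is hypothetical, and such a grouping would change the model, because observations in different blocks would then be treated as uncorrelated.

```r
n <- 55965

# One grouping level for all rows, as with the 'dummy' variable above.
one_group    <- rep(1, n)
sumLenSq_one <- sum(table(one_group)^2)
print(sumLenSq_one)              # 3132081225, i.e. 3.13208e+09
print(.Machine$integer.max)      # 2147483647

# Hypothetical alternative: 100 blocks of roughly equal size.
blocks          <- rep_len(1:100, n)
sumLenSq_blocks <- sum(table(blocks)^2)
print(sumLenSq_blocks)
```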

How can I perform a linear regression on my group variances in R?

Here is my data
Comparing the means seems to yield some interesting results, and it indeed does, as revealed by the linear model:
lm(y ~ factor(x), data = data)
Now, it also looks like the variances are not equal in all groups. Here is a plot of the variance in y for each group in x.
I'd like to test different regressions on these variances, to see whether a quadratic regression is a better model than a linear one (compared by AIC or BIC).
I could estimate the sampling distribution for the variance and get a confidence interval for the variance in each group, so it should be feasible to perform a regression on the variance of my groups. However, I don't know how to perform a regression on my sample variances in R.
How can I perform a linear regression on my group variances in R?
I wouldn't use a polynomial to model variances. Among the variance functions offered by package nlme is varConstPower. Let's try this:
n <- c(1, 2, 4, 8, 16, 32)
v <- c(5.85, 6.35, 6.55, 6.85, 7.02, 7.15)
plot(v ~ n)
fit_ConstPower <- nls(v ~ n^(2*theta) + c,
start = list(theta = 2, c = 4),
data = data.frame(n, v))
summary(fit_ConstPower)
lines(npred <- seq(1, 32, length.out = 100),
predict(fit_ConstPower, newdata = data.frame(n = npred)),
col = "blue")
However, that doesn't seem to be a good fit. To me it looks more like there might be saturation:
fit_hyper <- nls(v ~ k * n^(theta) + c,
start = list(k = -1, theta = -1, c = 7.2),
data = data.frame(n, v))
summary(fit_hyper)
lines(npred,
predict(fit_hyper, newdata = data.frame(n = npred)),
col = "green")
PS: I'm having trouble understanding your question. The literal answer would be so simple (just use lm) that I must miss something. Furthermore, you don't tell us whether x is a covariate or something else (e.g., sample size?). Thus, I can't even guess if there might be some fundamental relationship behind what you observe.
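For completeness, the literal "just use lm" route could look like the sketch below. The raw data are simulated (x, y, and the group sizes are made up), the group variances are computed with tapply(), and a linear and a quadratic fit are compared by AIC. With only six group variances, the comparison should be read with caution.

```r
# Simulated raw data (hypothetical): 6 groups whose spread grows with x.
set.seed(42)
x <- rep(1:6, each = 200)
y <- rnorm(length(x), mean = 0, sd = x)

# One sample variance per group.
group_var <- tapply(y, x, var)
vdat <- data.frame(g = as.numeric(names(group_var)),
                   v = as.numeric(group_var))

# Linear vs. quadratic regression on the group variances.
fit_lin  <- lm(v ~ g, data = vdat)
fit_quad <- lm(v ~ g + I(g^2), data = vdat)

# AIC table; BIC(fit_lin, fit_quad) works the same way.
aic_tab <- AIC(fit_lin, fit_quad)
print(aic_tab)
```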
There's a package called 'glmnet' in R for fitting penalized generalized linear models, including multivariate linear regression. If you are familiar with lasso or ridge penalization, the glmnet() function performs a regression with a combination of both penalties (the elastic net), and the cv.glmnet() function uses cross-validation to choose the best penalization parameter(s).
So, you can do the following:
library('glmnet')
# shuffle the rows of the predictor data (cv.glmnet expects x as a numeric matrix and y as the response)
x <- yourdata[sample(1:nrow(yourdata)), ]
cv.model1 <- cv.glmnet(x, y, alpha = your_parameter)
Or, if you want the simplest linear regression model, you can use the lm() function, which doesn't perform any penalization.
simple_model <- lm(y ~ x, data = data.frame(x, y))
For more info on both models, see the documentation for the corresponding functions. Hope this helps.

Plot learning curves with caret package and R

I would like to study the optimal tradeoff between bias/variance for model tuning. I'm using caret for R which allows me to plot the performance metric (AUC, accuracy...) against the hyperparameters of the model (mtry, lambda, etc.) and automatically chooses the max. This typically returns a good model, but if I want to dig further and choose a different bias/variance tradeoff I need a learning curve, not a performance curve.
For the sake of simplicity, let's say my model is a random forest, which has just one hyperparameter 'mtry'
I would like to plot the learning curves of both training and test sets. Something like this:
(red curve is the test set)
On the y axis I put an error metric (number of misclassified examples or something like that); on the x axis 'mtry' or alternatively the training set size.
Questions:
Does caret have functionality to iteratively train models on training sets of different sizes? If I have to code it by hand, how can I do that?
If I want to put the hyperparameter on the x axis, I need all the models trained by caret::train, not just the final model (the one with the maximum performance after CV). Are these "discarded" models still available after train?
Caret will iteratively test lots of CV models for you if you set up the resampling with trainControl() and supply the candidate parameter values (e.g. mtry) through tuneGrid. Both of these are then passed to the train() function. The specifics of the tuneGrid parameters (e.g. mtry, ntree) will be different for each model type.
Yes, the final trainFit model will contain the error rate (however you specified it) for all folds of your CV.
So you could specify e.g. a 10-fold CV times a grid with 10 values of mtry, which would be 100 iterations. You might want to go get a cup of tea or possibly lunch.
If this sounds complicated ... there is a very good example here - caret being one of the best documented packages about.
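If you would rather code the training-set-size loop by hand, a minimal base-R sketch could look like this. It uses simulated data and a plain lm() instead of caret and a random forest, so the data frame dat, the sizes, and the rmse() helper are all assumptions made for the example. The point is only the loop structure: fit on the first m training rows, then record the error on those rows and on a fixed test set.

```r
set.seed(7)

# Simulated regression data (hypothetical).
n   <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)

# Fixed hold-out test set; the rest is available for training.
test_idx <- sample(n, 100)
test     <- dat[test_idx, ]
train    <- dat[-test_idx, ]

# Helper: root mean squared error.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Fit on growing subsets of the training data.
sizes <- seq(10, nrow(train), by = 30)
curve <- as.data.frame(t(sapply(sizes, function(m) {
  sub <- train[seq_len(m), ]
  fit <- lm(y ~ x1 + x2, data = sub)
  c(size  = m,
    train = rmse(sub$y, fitted(fit)),
    test  = rmse(test$y, predict(fit, newdata = test)))
})))
print(head(curve))
```

The result can be plotted with, for example, matplot(curve$size, curve[, c("train", "test")], type = "o").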
Here's how I approached plotting a learning curve in R while using the caret package to train the model. I use the Motor Trend Car Road Tests data (mtcars) for illustrative purposes. To begin, I randomize and split the mtcars dataset into training and test sets: 21 records for training and the remaining 11 for the test set (mtcars has 32 rows). The response feature is mpg in this example.
library(caret)
# set seed for reproducibility
set.seed(7)
# randomize mtcars
mtcars <- mtcars[sample(nrow(mtcars)),]
# split mtcars data into training and test sets
mtcarsIndex <- createDataPartition(mtcars$mpg, p = .625, list = F)
mtcarsTrain <- mtcars[mtcarsIndex,]
mtcarsTest <- mtcars[-mtcarsIndex,]
# create empty data frame to hold the learning curve (RMSE values are doubles)
learnCurve <- data.frame(m = integer(21),
trainRMSE = numeric(21),
cvRMSE = numeric(21))
# test data response feature
testY <- mtcarsTest$mpg
# Run algorithms using 10-fold cross validation with 3 repeats
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "RMSE"
# loop over training examples
for (i in 3:21) {
learnCurve$m[i] <- i
# train learning algorithm with size i
fit.lm <- train(mpg~., data=mtcarsTrain[1:i,], method="lm", metric=metric,
preProc=c("center", "scale"), trControl=trainControl)
learnCurve$trainRMSE[i] <- fit.lm$results$RMSE
# use trained parameters to predict on test data
prediction <- predict(fit.lm, newdata = mtcarsTest[,-1])
rmse <- postResample(prediction, testY)
learnCurve$cvRMSE[i] <- rmse[1]
}
pdf("LinearRegressionLearningCurve.pdf", width = 7, height = 7, pointsize=12)
# plot learning curves of training set size vs. error measure
# for training set and test set
plot(log(learnCurve$trainRMSE),type = "o",col = "red", xlab = "Training set size",
ylab = "Error (log RMSE)", main = "Linear Model Learning Curve")
lines(log(learnCurve$cvRMSE), type = "o", col = "blue")
legend('topright', c("Train error", "Test error"), lty = c(1,1), lwd = c(2.5, 2.5),
col = c("red", "blue"))
dev.off()
The output plot is as shown below:
At some point, probably after this question was asked, the caret package added the learning_curve_dat function which helps assess model performance across a range of training set sizes.
Here is the example from the function documentation:
library(caret)
set.seed(1412)
class_dat <- twoClassSim(1000)
set.seed(29510)
lda_data <- learning_curve_dat(dat = class_dat,
outcome = "Class",
test_prop = 1/4,
## `train` arguments:
method = "lda",
metric = "ROC",
trControl = trainControl(classProbs = TRUE,
summaryFunction = twoClassSummary))
ggplot(lda_data, aes(x = Training_Size, y = ROC, color = Data)) +
geom_smooth(method = loess, span = .8)
The performance metric(s) are found for each Training_Size and saved in lda_data along with the Data variable ("Resampling", "Training", and optionally "Testing").
Here is a link to the function documentation: https://rdrr.io/cran/caret/man/learning_curve_dat.html
To be clear, this answers the first part of the question but not the second part.
Note: Until at least August 2020 there was a typo in the caret package code and documentation: the function was named learing_curve_dat before it was corrected to learning_curve_dat. I've updated my answer to reflect this change. Make sure you are using a recent version of the caret package.
