I'm trying the following code to see whether predict() can find the values of the dependent variable for a second-order polynomial; in this case it is obviously y = x^2:
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1, 4, 9, 16, 25, 36)
mypol <- lm(y ~ poly(x, 2, raw=TRUE))
> mypol

Call:
lm(formula = y ~ poly(x, 2, raw = TRUE))

Coefficients:
            (Intercept)  poly(x, 2, raw = TRUE)1  poly(x, 2, raw = TRUE)2
                      0                        0                        1
If I try to predict the value at x = 7, I get this:
> predict(mypol, 7)
Error in eval(predvars, data, env) : not that many frames on the stack
What am I doing wrong?
If you read the help for predict.lm, you will see that it takes a number of arguments, including newdata:
newdata -- An optional data frame in which to look for variables with
which to predict. If omitted, the fitted values are used.
predict(mypol, newdata = data.frame(x=7))
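Since the quadratic fit reproduces y = x^2 exactly, this should return 7^2 = 49:

> predict(mypol, newdata = data.frame(x = 7))
 1 
49 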
We're performing an exploratory logistic regression and trying to determine the importance of the variables in predicting the outcome. We are using the train() and varImp() functions from the caret package. Ultimately, we would like to create a table/dataframe output that has 3 columns: Variable Name, Importance, and Coefficient. An output like this:
Desired format of output.
Here's some sample code to illustrate:
library(caret)
# Create a sample dataframe
my_DV <- c(0, 1, 0, 1, 1)
IV1 <- c(10, 40, 15, 35, 38)
IV2 <- c(1, 0, 1, 0, 1)
IV3 <- c(5, 4, 3, 2, 1)
IV4 <- c(5, 7, 3, 8, 9)
IV5 <- c(1, 2, 1, 2, 1)
df <- data.frame(my_DV, IV1, IV2, IV3, IV4, IV5)
df$my_DV <- as.factor(df$my_DV)
df$IV1 <- as.numeric(df$IV1)
df$IV2 <- as.factor(df$IV2)
df$IV3 <- as.numeric(df$IV3)
df$IV4 <- as.numeric(df$IV4)
df$IV5 <- as.factor(df$IV5)
# train model/perform logistic regression
model_one <- train(form = my_DV ~ ., data = df, trControl = trainControl(method = "cv", number = 5),
                   method = "glm", family = "binomial", na.action = na.omit)
summary(model_one)
# get the variable importance
imp <- varImp(model_one)
imp
I would like to take the importance values in imp and merge them with the coefficients from model_one but I'm fairly new to R and I can't figure out how to do it.
Any suggestions are greatly appreciated!
Here is one of many ways to get the desired output:
You assign the summary of the model to an object, extract the coefficients with the coef() function, and bind them together with the variable names and the corresponding importances into a data frame. You then sort the rows by importance using order().
sum_mod <- summary(model_one)
dat <- data.frame(VariableName = rownames(imp$importance),
                  Importance = unname(imp$importance),
                  Coefficient = coef(sum_mod)[rownames(imp$importance), ][, 1],
                  row.names = NULL)
dat <- dat[order(dat$Importance, decreasing = TRUE), ]
The result:
  VariableName Importance Coefficient
1          IV1  100.00000   1.0999732
4          IV4   74.48458   3.6665775
2         IV21   34.43803  -7.8831404
3          IV3    0.00000  -0.9166444
I am trying to run ANOVAs on regression models (LMERs) with 200-400 observations, so I don't want to drop observations because of missing data.
Here is the problem I am facing, simplified and reproducible:
dats <- data.frame(y = c(5, 3, 7, 4, 8, 4, 7, 3, 6, 3),
                   x = c(1, 2, 1, 2, 1, 1, 2, 1, 2, 1),
                   z = c(NA, 5, 6, 7, 8, 5, 4, 3, 2, 2))
fit1 <- lm(y ~ x, data = dats, na.action = "na.omit")
fit2 <- lm(y ~ x + z, data = dats, na.action = "na.omit")
anova(fit1, fit2)
And the error I encounter:
Error in anova.lmlist(object, ...) : models were not all fitted to the same size of dataset
Mainly, I need to run these ANOVAs to know whether the changes in the marginal R^2 in the LMERs are statistically significant. Is there any way of running these regressions and ANOVAs without dropping observations with missing data?
Disclaimer: I'm more of a Bayes person and don't really like the NHST world view.
Like @denis, I'm pretty sure the answer is "no". ANOVA is designed to be used on the same data, since it is notionally based on comparing sums of squared errors.
The obvious options would be to exclude the same rows:
d2 <- na.omit(dats[,c('x','y','z')])
f1 <- lm(y ~ x, data=d2)
f2 <- lm(y ~ x + z, data=d2)
anova(f1, f2)
which obviously throws away data, or you could impute the missing data:
i <- na.action(d2)  # indices of the rows na.omit() dropped
d3 <- dats
d3$z[i] <- predict(lm(z ~ x*y, dats), dats[i,])  # impute z from x and y
f1 <- lm(y ~ x, data = d3)
f2 <- lm(y ~ x + z, data = d3)
anova(f1, f2)
but this will underestimate the variance.
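If that underestimate is a concern, multiple imputation is the usual remedy. A minimal sketch with the mice package (assuming mice is acceptable here; D1() is its pooled Wald comparison of nested models, and m = 20 is an arbitrary choice):

library(mice)

imps <- mice(dats, m = 20, printFlag = FALSE)  # 20 multiply-imputed data sets
f1 <- with(imps, lm(y ~ x))                    # refit both models in each one
f2 <- with(imps, lm(y ~ x + z))
D1(f2, f1)                                     # pooled test of adding z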
I want to bootstrap the ratio of the outputs of two regression models, to get confidence intervals for the mean of the result of that operation.
library(boot)

# creating sample data
ldose <- rep(0:5, 2)
numdead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
sex <- factor(rep(c("M", "F"), c(6, 6)))
SF <- cbind(numdead, numalive = 20 - numdead)
dat <- data.frame(ldose, numdead, sex, SF)
dat <- tibble::rowid_to_column(dat, "indices")
# creating the function to be bootstrapped
out <- function(dat) {
  d <- data[indices, ]  # allows boot to select sample
  fit1 <- glm(SF ~ sex*ldose, family = binomial(link = log), start = c(-1, 0, 0, 0))
  fit2 <- glm(SF ~ sex*ldose, family = binomial(link = log), start = c(-1, 0, 0, 0))
  coef1 <- coef(fit1)
  numer <- exp(coef1[2])
  coef2 <- coef(fit2)
  denom <- exp(coef2[2])
  resultX <- numer/denom
  return(mean(resultX))
}
# doing bootstrap
results <- boot(dat, out, 1000)

# error message
Error in statistic(data, original, ...) : unused argument (original)
Thanks in advance for any help.
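For what it's worth, boot() calls the statistic as statistic(data, indices), so the function must accept both arguments and fit the models on the resampled rows. A minimal sketch along those lines (the two fits are left identical, as in the question, so the ratio here is trivially 1; presumably they were meant to differ):

library(boot)

out <- function(data, indices) {
  d <- data[indices, ]                      # resampled rows
  resp <- cbind(d$numdead, 20 - d$numdead)  # successes / failures
  fit1 <- glm(resp ~ sex * ldose, data = d,
              family = binomial(link = log), start = c(-1, 0, 0, 0))
  fit2 <- glm(resp ~ sex * ldose, data = d,
              family = binomial(link = log), start = c(-1, 0, 0, 0))
  # note: the log link may fail to converge on some resamples
  exp(coef(fit1)[2]) / exp(coef(fit2)[2])   # ratio of the two effects
}

results <- boot(data = dat, statistic = out, R = 1000)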
Does it make sense to use a linear booster to predict a categorical outcome?
I thought it could work like multinomial logistic regression.
An example in R is as follows,
library(xgboost)

y <- c(0, 1, 2, 0, 1, 2)  # target variable with numeric encoding
x1 <- c(1, 3, 5, 3, 5, 7)
x2 <- rnorm(n = 6, sd = 1) + x1
df <- data.matrix(data.frame(x1, x2, y))

xgb <- xgboost(data = df[, c("x1", "x2")], label = df[, "y"],
               params = list(booster = "gblinear", objective = "multi:softmax",
                             num_class = 3),
               save_period = NULL, nrounds = 1)
xgb.importance(model = xgb)
I don't get an error but the importance has 6 features instead of the expected 2. Is there any interpretation of the 6 importances in terms of the 2 input variables? Or does this not make any sense and only gbtree is sensible?
Thanks
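One hedged reading (an assumption based on the observed 2 x 3 = 6 rows, not verified against the xgboost internals): with booster = "gblinear" and num_class = 3, the model fits one linear function per class, so each of the two features carries one weight per class, and the six importances are those per-class weights. The raw coefficients can be inspected directly:

# dump the fitted linear booster; for gblinear this lists the bias and
# weight values that the importance table is derived from
xgb.dump(model = xgb)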
I have a simple problem but can't seem to figure out what I'm doing wrong. I am using predict to estimate values from a fitted linear model. The following code works correctly:
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1, 4, 9, 16, 25, 36)
model <- lm(y ~ x)
predict(model, newdata = data.frame(x=7))
However, when I use the same data for x and y, but in a dataframe, it does not work. For example,
df2 <- structure(list(x = c(1, 2, 3, 4, 5, 6), y = c(1, 4, 9, 16, 25, 36)),
                 .Names = c("x", "y"), row.names = c(NA, -6L), class = "data.frame")
model <- lm(df2$y ~ df2$x)
predict(model, newdata = data.frame(x=7))
produces this warning:
Warning message:
'newdata' had 1 row but variables found have 6 rows
I am using the same exact data and am expecting the same answer. Can anyone tell me what I am doing wrong? Thanks!
Try
model <- lm(y ~ x, data = df2)
predict(model, newdata = data.frame(x = 7))
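This now returns a single prediction for x = 7 (about 39.67 here, since a straight line is being fitted to quadratic data):

> predict(model, newdata = data.frame(x = 7))
       1 
39.66667 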
The problem is the way you specified the formula. With lm(df2$y ~ df2$x), the predictor in the model is literally named df2$x, so predict() cannot match it to the x column of newdata and falls back to the fitted values, which is why the warning mentions 6 rows. Passing data = df2 keeps the variable named x, and newdata = data.frame(x = 7) then matches it.