Trouble using predict with linear model in R [duplicate] - r

This question already has answers here:
Predict() - Maybe I'm not understanding it
(4 answers)
Closed 6 years ago.
I have a simple problem but can't seem to figure out what I'm doing wrong. I am using predict to estimate values from a fitted linear model. The following code works correctly:
x <- c(1, 2, 3, 4, 5 , 6)
y <- c(1, 4, 9, 16, 25, 36)
model <- lm(y ~ x)
predict(model, newdata = data.frame(x=7))
However, when I use the same data for x and y, but in a dataframe, it does not work. For example,
df2 <- structure(list(x = c(1, 2, 3, 4, 5, 6), y = c(1, 4, 9, 16, 25,36)),
.Names = c("x", "y"), row.names = c(NA, -6L), class = "data.frame")
model <- lm(df2$y ~ df2$x)
predict(model, newdata = data.frame(x=7))
returns six fitted values instead of one prediction, with the warning:
Warning message:
'newdata' had 1 row but variables found have 6 rows
I am using the same exact data and am expecting the same answer. Can anyone tell me what I am doing wrong? Thanks!

Try
model = lm(y ~ x, data = df2)
predict(model, newdata = data.frame(x = 7))
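The problem is the way you specified the formula. With lm(df2$y ~ df2$x), the predictor is stored under the name df2$x, so predict() finds no variable of that name in newdata, falls back to the original six values in df2, and returns the fitted values for those rows. With lm(y ~ x, data = df2), the predictor is simply x, which matches the column in newdata.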
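A minimal sketch of the difference, using the data from the question. The object names bad, good, pred_bad and pred_good are made up for illustration; the warning is suppressed only to keep the output short.

```r
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1, 4, 9, 16, 25, 36)
df2 <- data.frame(x = x, y = y)

bad  <- lm(df2$y ~ df2$x)        # predictor is stored under the name "df2$x"
good <- lm(y ~ x, data = df2)    # predictor is stored under the name "x"

names(coef(bad))    # "(Intercept)" "df2$x"
names(coef(good))   # "(Intercept)" "x"

# newdata has a column called x, which only matches the second model
pred_bad  <- suppressWarnings(predict(bad,  newdata = data.frame(x = 7)))
pred_good <- predict(good, newdata = data.frame(x = 7))

length(pred_bad)    # 6: fitted values for the original rows
length(pred_good)   # 1: the prediction for x = 7
```

Comparing the coefficient names is a quick way to check which column names predict() will look for in newdata.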

Related

Using LmFuncs (Linear Regression) in Caret for Recursive Feature Elimination: How do I fix "same number of samples in x and y" error?

I'm new to R and trying to isolate the best performing features from a data set of 247 columns (246 variables + 1 outcome), and 800 or so rows (where each row is one person's data) to create a predictive model.
I'm using caret to do RFE with lmFuncs; I need linear regression because the target variable is continuous.
I use the following to split into test/training data (which hasn't evoked errors)
inTrain <- createDataPartition(data$targetVar, p = .8, list = F)
train <- data[inTrain, ]
test <- data[-inTrain, ]
The resulting train and test sets look consistent: the predictors and the outcome contain the same number of samples, and all columns are the same length.
My control parameters are as follows (also runs without error)
control = rfeControl(functions = lmFuncs, method = "repeatedcv", repeats = 5, verbose = F, returnResamp = "all")
But when I run RFE I get an error message saying
Error in rfe.default(train[, -1], train[, 1], sizes = c(10, 15, 20, 25, 30), rfeControl = control) :
there should be the same number of samples in x and y
My code for RFE is as follows, with the target variable in first column
rfe_lm_profile <- rfe(train[, -1], train[, 1], sizes = c(10, 15, 20, 25, 30), rfeControl = control)
I've looked through various forums, but nothing seems to work.
This Google Groups thread suggests using an older version of caret, which I tried, but I got the same x/y error: https://groups.google.com/g/rregrs/c/qwcP0VGn4ag?pli=1
Others suggest converting the target variable to a factor or matrix. This hasn't helped, and evokes
Warning message:
In createDataPartition(data$EBI_SUM, p = 0.8, list = F) :
Some classes have a single record
when partitioning the data into test/train, and the same X/Y sample error if you try to carry out RFE.
Mega thanks in advance :)
p.s
Here's the dput for the target variable (EBI_SUM) and a couple of variables
data <- structure(list(TargetVar = c(243, 243, 243, 243, 355, 355), Dosing = c(2,
2, 2, 2, 2, 2), `QIDS_1 ` = c(1, 1, 3, 1, 1, 1), `QIDS_2 ` = c(3,
3, 2, 3, 3, 3), `QIDS_3 ` = c(1, 2, 1, 1, 1, 2)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
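The column names in your data object should not contain spaces (the dput above has trailing spaces, e.g. `QIDS_1 `). Rebuilding it as a plain data.frame with clean names works: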
library(caret)
data <- data.frame(
TargetVar = c(243, 243, 243, 243, 355, 355),
Dosing = c(2, 2, 2, 2, 2, 2),
QIDS_1 = c(1, 1, 3, 1, 1, 1),
QIDS_2 = c(3, 3, 2, 3, 3, 3),
QIDS_3 = c(1, 2, 1, 1, 1, 2)
)
inTrain <- createDataPartition(data$TargetVar, p = .8, list = F)
train <- data[inTrain, ]
test <- data[-inTrain, ]
control <- rfeControl(functions = lmFuncs, method = "repeatedcv", repeats = 5, verbose = F, returnResamp = "all")
rfe_lm_profile <- rfe(train[, -1], train[, 1], sizes = c(10, 15, 20, 25, 30), rfeControl = control)
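If the data cannot be retyped by hand, the same clean-up can be sketched in base R. This is an illustration under assumptions, not part of the answer above: raw and clean are hypothetical names, and only two of the predictor columns are reproduced. It may also matter that the dput in the question is a tibble; with the tibble package loaded, train[, 1] stays a one-column table rather than a vector, which is one plausible (unconfirmed here) source of the length mismatch. Extracting the outcome with [[ sidesteps that either way.

```r
# Toy data with trailing spaces in the column names, as in the question's dput
raw <- data.frame(
  TargetVar = c(243, 243, 243, 243, 355, 355),
  `QIDS_1 ` = c(1, 1, 3, 1, 1, 1),
  `QIDS_2 ` = c(3, 3, 2, 3, 3, 3),
  check.names = FALSE            # keep the names exactly as given
)

clean <- as.data.frame(raw)            # plain data.frame
names(clean) <- trimws(names(clean))   # strip leading/trailing whitespace

x <- clean[, -1]              # predictors
y <- clean[["TargetVar"]]     # [[ always returns a plain vector

names(clean)
# sanity check before calling rfe(): same number of samples in x and y
stopifnot(is.numeric(y), length(y) == nrow(x))
```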

Two-way repeated measure Anova R with ezANOVA error, One or more cells is missing data

I've created this minimal dataset for the example :
data_long <- data.frame(Subject = factor(c(1, 2, 3, 1, 2, 3)),
Trt = factor(c("T1","T2","T3","T1","T2","T3")),
Day = factor(c(7, 7, 7, 14, 14, 14)),
Value = c(7.6, 5.3, 8.6, 12.4, 11.2, 11))
But when I try to make a two way repeated measure ANOVA with ezANOVA, I have this error :
m2 <- ezANOVA(data = data_long, dv = Value, wid = Subject, within = c(Day,Trt))
Error in ezANOVA_main(data = data, dv = dv, wid = wid, within = within, :
One or more cells is missing data. Try using ezDesign() to check your data.
I definitely don't have missing data, but this error still occurs. Is there a way to fix that ?
Thank you in advance,
Yemoloh
I think the problem is that each level of the Trt factor contains only one participant, so most Subject x Trt cells are empty.
You can see this by adding the same participants to each condition (so that each participant is present for each Trt condition):
data_long <- data.frame(Subject = factor(rep(1:3, each = 6)),
Trt = factor(rep(c("T1", "T2", "T3"), times = 6)),
Day = factor(rep(c(7, 14), times = 3, each = 3)),
Value = rnorm(n = 18, mean = 6))
With this data structure you would be able to run the ANOVA as you specified it.
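The empty cells can be counted directly in base R before calling ezANOVA. A minimal sketch, where orig is the data from the question, full is the balanced data from the answer, and set.seed(1) is added only to make the random values reproducible:

```r
# Data from the question: each subject appears under one treatment only
orig <- data.frame(Subject = factor(c(1, 2, 3, 1, 2, 3)),
                   Trt = factor(c("T1", "T2", "T3", "T1", "T2", "T3")),
                   Day = factor(c(7, 7, 7, 14, 14, 14)),
                   Value = c(7.6, 5.3, 8.6, 12.4, 11.2, 11))

# Count observations per Subject x Trt x Day cell
cells_orig <- with(orig, table(Subject, Trt, Day))
any(cells_orig == 0)    # TRUE: some cells have no data

# Balanced data from the answer: every subject under every Trt and Day
set.seed(1)
full <- data.frame(Subject = factor(rep(1:3, each = 6)),
                   Trt = factor(rep(c("T1", "T2", "T3"), times = 6)),
                   Day = factor(rep(c(7, 14), times = 3, each = 3)),
                   Value = rnorm(n = 18, mean = 6))

cells_full <- with(full, table(Subject, Trt, Day))
all(cells_full == 1)    # TRUE: exactly one observation per cell
```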

How to use anova() on regression models with missing data?

I am trying to run anovas on regression models (LMERs) with 200-400 observations, so I don't want to drop observations based on any missing data.
Here is the problem I am facing, simplified and reproducible:
dats <- data.frame(y = c(5, 3, 7, 4, 8, 4, 7, 3, 6, 3),
x = c(1, 2, 1, 2, 1, 1, 2, 1, 2, 1),
z = c(NA, 5, 6, 7, 8, 5, 4, 3, 2, 2))
fit1 <- lm(y ~ x, data = dats, na.action = "na.omit")
fit2 <- lm(y ~ x + z, data = dats, na.action = "na.omit")
anova(fit1, fit2)
And the error I encounter:
Error in anova.lmlist(object, ...) : models were not all fitted to the same size of dataset
Mainly, I need to run these ANOVAs to know whether the changes in the marginal R^2 in the LMERs are statistically significant. Is there any way of running these regressions and ANOVAs without dropping observations with missing data?
Disclaimer: I'm more of a Bayes person and don't really like the NHST world view.
Like @denis, I'm pretty sure the answer is "no". ANOVA is designed to be used on the same data, as it's notionally based on comparing sum-of-squares errors.
The obvious options would be to exclude the same rows:
d2 <- na.omit(dats[,c('x','y','z')])
f1 <- lm(y ~ x, data=d2)
f2 <- lm(y ~ x + z, data=d2)
anova(f1, f2)
which obviously throws away data, or you could impute the missing data:
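i <- na.action(d2)   # indices of the rows that na.omit() dropped
d3 <- dats
# fill in the missing z values with predictions from a regression on x and y
d3$z[i] <- predict(lm(z ~ x*y, dats), dats[i,])
# fit both models to the imputed data (d3), not the reduced data (d2)
f1 <- lm(y ~ x, data = d3)
f2 <- lm(y ~ x + z, data = d3)
anova(f1, f2)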
but this will underestimate the variance.
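To see where the original error comes from, the number of rows each fit actually used can be compared with nobs(). A minimal sketch with the data from the question; cc and tab are illustrative names, and the fits rely on R's default na.action (na.omit):

```r
dats <- data.frame(y = c(5, 3, 7, 4, 8, 4, 7, 3, 6, 3),
                   x = c(1, 2, 1, 2, 1, 1, 2, 1, 2, 1),
                   z = c(NA, 5, 6, 7, 8, 5, 4, 3, 2, 2))

fit1 <- lm(y ~ x, data = dats)
fit2 <- lm(y ~ x + z, data = dats)

nobs(fit1)   # 10: no missing values in y or x
nobs(fit2)   # 9: the row with z = NA is dropped

# Restrict both models to rows that are complete for every variable used
cc <- dats[complete.cases(dats[, c("x", "y", "z")]), ]
f1 <- lm(y ~ x, data = cc)
f2 <- lm(y ~ x + z, data = cc)

tab <- anova(f1, f2)   # now both fits use the same 9 rows
tab
```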

Xgboost multiclass prediction with linear booster

Does it make sense to use a linear booster to predict a categorical outcome?
I thought it could work like multinomial logistic regression.
An example in R is as follows,
y <- c(0, 1, 2, 0, 1, 2) # target variable with numeric encoding
x1 <- c(1, 3, 5, 3, 5, 7)
x2 <- rnorm(n = 6, sd = 1) + x1
df <- data.matrix(data.frame(x1, x2, y))
xgb <- xgboost(data = df[, c("x1", "x2")], label = df[, "y"],
params = list(booster = "gblinear", objective = "multi:softmax",
num_class = 3),
save_period = NULL, nrounds = 1)
xgb.importance(model = xgb)
I don't get an error but the importance has 6 features instead of the expected 2. Is there any interpretation of the 6 importances in terms of the 2 input variables? Or does this not make any sense and only gbtree is sensible?
Thanks

Choosing the X variables and plot estimated fitted values [duplicate]

I'm trying the following code to see whether predict can give me values of the dependent variable for a polynomial of order 2; in this case it is obviously y = x^2:
x <- c(1, 2, 3, 4, 5 , 6)
y <- c(1, 4, 9, 16, 25, 36)
mypol <- lm(y ~ poly(x, 2, raw=TRUE))
> mypol
Call:
lm(formula = y ~ poly(x, 2, raw = TRUE))
Coefficients:
            (Intercept)  poly(x, 2, raw = TRUE)1  poly(x, 2, raw = TRUE)2
                      0                        0                        1
If I try to find the value of x=7, I get this:
> predict(mypol, 7)
Error in eval(predvars, data, env) : not that many frames on the stack
What am I doing wrong?
If you read the help for predict.lm, you will see that it takes a number of arguments including newdata
newdata -- An optional data frame in which to look for variables with
which to predict. If omitted, the fitted values are used.
predict(mypol, newdata = data.frame(x=7))
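A short runnable version of this, using the data from the question. The name new_x is illustrative; the one requirement is that the column in newdata is called x, as in the model formula. Several values can be predicted at once:

```r
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1, 4, 9, 16, 25, 36)
mypol <- lm(y ~ poly(x, 2, raw = TRUE))

# newdata must be a data frame whose column name matches the formula variable
new_x <- data.frame(x = c(7, 8))
pred <- predict(mypol, newdata = new_x)

round(unname(pred), 6)   # 49 64
```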
