How to solve "cannot coerce class to data.frame"? - r

The problem occurs at line 20, x3 <- lm(Salary ~ ...:
Error in as.data.frame.default(data) : cannot coerce class ‘c("train", "train.formula")’ to a data.frame
How can I solve this?
attach(Hitters)
Hitters
library(caret)
set.seed(123)
# Define training control
set.seed(123)
train.control <- trainControl(method = "cv", number = 10)
# Train the model
x2 <- train(Salary ~., data = x, method = "lm",
trControl = train.control)
# Summarize the results
print(x)
x3 <- lm(Salary ~ poly(AtBat,3) + poly(Hits,3) + poly(Walks,3) + poly(CRuns,3) + poly(CWalks,3) + poly(PutOuts,3), data = x2)
summary(x3)
MSE = mean(x3$residuals^2)
print("Mean Squared Error: ")
print(MSE)

First, as @dcarlson already mentioned, you should define x.
Second, x2 is not a data frame: train() returns an object of class "train", which lm() cannot use as its data argument.
If you run
str(x2)
you'll see that all the variables you're using in the lm function are part of a data frame called trainingData.
So if you intend to use the lm function, use that data frame as the data source, NOT x2.
I've rewritten your code below.
PS I'm far from an R expert, so if someone wants to shoot at this answer, go ahead, I'm always willing to learn ;)
attach(Hitters)
Hitters
library(caret)
set.seed(123)
# Define training control
set.seed(123)
train.control <- trainControl(method = "cv", number = 10)
# Train the model
# x must be defined before this call (see the first point above); the question does not define it
x2 <- train(Salary ~., data = x, method = "lm", trControl = train.control)
# Summarize the results
print(x2)
# str(x2) # $trainingData data.frame
x2$trainingData[["AtBat"]]
m <- x2$trainingData
x3 <- lm(Salary ~ poly(AtBat,3) + poly(Hits,3) + poly(Walks,3) + poly(CRuns,3) + poly(CWalks,3) + poly(PutOuts,3), data = m)
summary(x3)
MSE = mean(x3$residuals^2)
cat("Mean Squared Error: ", MSE) # use cat to concatenate text and variable value in one line
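The error itself can be reproduced without caret. The sketch below is only an illustration under stated assumptions: fake_fit is a hypothetical stand-in (a plain list given the same class vector as in the error message), not a real train() result, and the toy columns are made up. It shows that lm() tries to coerce a classed non-data-frame object and fails, while the data frame stored inside the object works.

```r
# Made-up toy data; the column names are illustrative only
toy <- data.frame(Salary = c(10, 12, 15, 19, 24, 30),
                  AtBat  = c(1, 2, 3, 4, 5, 6))

# Hypothetical stand-in for a caret fit: a list carrying the "train" classes
fake_fit <- structure(list(trainingData = toy),
                      class = c("train", "train.formula"))

# Passing the classed list as `data` fails inside as.data.frame()
err <- tryCatch(lm(Salary ~ AtBat, data = fake_fit),
                error = function(e) conditionMessage(e))
print(err)

# Passing the data frame stored inside the object works
ok <- lm(Salary ~ AtBat, data = fake_fit$trainingData)
print(mean(residuals(ok)^2))  # mean squared error on the toy data
```

The same pattern applies to any fitted-model object: look for the data frame it stores (str() helps) and hand that to lm(), not the model object.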

Related

predict() function in R is not providing prediction in R console

My training data has 87620 rows and 5 columns. My test data has the same number of rows and columns. When I use a CART model to predict the "Defaults" (that is the target variable), my model works and provides me with predictions.
When I use a validation data set that has 6 columns and only 19561 rows, and does not have the Defaults variable, and then proceed to use the
View(validationsetpreds.CART3.3x)
I get the following result:
[Image: Validationsetpreds picture]
When I run the same command on the test data set, I get this:
[Image: Testsetpreds picture]
set.seed(123)
loans_training$Default <- as.factor(loans_training$Default)#Make the default variable categorical
loans_test$Default <- as.factor(loans_test$Default)#Make the default variable categorical
loans_training$term <- as.factor(loans_training$term)
loans_test$term <- as.factor(loans_test$term)
#Standardize datasets
library(psych)
library(caret)
preprocess.train.z <- preProcess(loans_training[1:5], method = c("center", "scale"))
preprocess.train.z
loans_train.z <- predict(preprocess.train.z,loans_training[1:5])
describe(loans_train.z)
View(loans_train.z)
summary(loans_train.z$Default)
preprocess.test.z <- preProcess(loans_test[1:5], method = c("center", "scale"))
preprocess.test.z
loans_test.z <- predict(preprocess.test.z,loans_test[1:5])
describe(loans_test.z)
View(loans_test.z)
summary(loans_train.z$Default)
(22417 * 2.3) + 22417
#Resampling subroutine
rare.record.indices <- which(loans_train.z$Default == "1")
rare.indices.resampled <- sample(x = rare.record.indices,size = 51559, replace = TRUE)
rare.records.resampled <- loans_train.z[rare.indices.resampled,]
loans_train.3.3x <- rbind(loans_train.z, rare.records.resampled)
table(loans_train.3.3x$Default)
#Develop 3.3x CART model
TC <- trainControl(method = "CV", number = 10)
fit.CART.3.3x <- train(Default ~ ., data = loans_train.3.3x, method = "rpart", trControl = TC)
fit.CART.3.3x$resample
testsetpreds.CART3.3x <- predict(fit.CART.3.3x,loans_test.z)
table(loans_test.z$Default, testsetpreds.CART3.3x)
testsetpreds.CART3.3x
#Predictions
set.seed(123)
loans_validation$grade <- as.character(loans_validation$grade)#Make the grade variable categorical
loans_validation$term <- as.factor(loans_validation$term)#Make the term variable categorical
loans_validation$Index <- as.factor(loans_validation$Index)#Make the Index variable categorical
#Standardize dataset
library(psych)
library(caret)
preprocess.validation.z <- preProcess(loans_validation[1:6], method = c("center", "scale"))
preprocess.validation.z
loans_validation.z <- predict(preprocess.validation.z,loans_validation[1:6])
#Predict Defaults using Cart
validationsetpreds.CART3.3x <- predict(fit.CART.3.3x,loans_validation.z)
View(validationsetpreds.CART3.3x)
Any help would be greatly appreciated :)
How would I apply this to the validation data set?
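One point worth checking, offered as an assumption rather than a diagnosis (the screenshots are not available): the code fits a separate preProcess object on each data set, so each set is centered and scaled with its own statistics. A common convention is to reuse the training-set statistics for new data; with caret that would mean calling predict() with the preProcess object fitted on the training data. The base R sketch below shows the idea on made-up numbers; all names in it are hypothetical.

```r
# Made-up numeric predictors standing in for the training and validation sets
train_num <- data.frame(amount = c(1000, 2000, 3000), rate = c(5, 10, 15))
valid_num <- data.frame(amount = c(2000, 4000), rate = c(10, 20))

# Centering and scaling statistics are learned from the training data only
centers <- colMeans(train_num)
sds     <- sapply(train_num, sd)

# The validation data is transformed with the training statistics
valid_z <- as.data.frame(scale(valid_num, center = centers, scale = sds))
print(valid_z)
```

This keeps the validation predictors on the same scale the model saw during training.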

Receiving message "Error in model.frame.default(form = lost_client ~ SRC + Commission_Rate + : variable lengths differ ()" when there are no NAs

Good evening, I am currently running a classification algorithm using the caret package. I'm using the upSample and downSample functions to take care of data imbalance. I've taken care of all the NA values, but I keep getting this message: "Error in model.frame.default(form = lost_client ~ SRC + Commission_Rate + :
variable lengths differ (found for 'SRC')"
The code for the dataset
clients4 <- clients[,-c(1:6,8,14,15,16,18,19,20,21,22,23,26,27,28,29,32,33,42,44,50,51,52,53,57, 60:62, 63:66,71, 73:75)]
clients4$lost_client <- as.factor(clients4$lost_client)
clients4$New_Client <- as.factor(clients4$New_Client)
clients4 <- clients4[complete.cases(clients4),]
set.seed(101)
Training <- createDataPartition(clients4$lost_client, p=.80)$Resample1
fitControl <- trainControl(method = "cv", number = 10, allowParallel = TRUE)
glmgrid <- expand.grid(lambda=seq(0,1,.05), alpha=seq(0,1,.1))
rpartgrid <- expand.grid(maxdepth=1:20)
rfgrid <- expand.grid(mtry=1:14)
gbmgrid <- expand.grid(interaction.depth=1:5, n.trees=c(50,100,150,200,250), shrinkage=.1, n.minobsinnode=10)
svmgrid <- expand.grid(cost=seq(0,10, 0.05))
Training <- clients4[Training,]
clients5 <- clients4
clients5$lost_client[which(clients4$lost_client == 0)] = -1
TrainUp <- upSample(x=Training[,-2],
y=Training$lost_client)
TrainDown <- downSample(x=Training[,-2],
y=Training$lost_client)
This is the code for the model itself.
set.seed(3)
m2 <- train(lost_client~SRC+Commission_Rate+Line_of_Business+Pro_Rate+Pro_Increase+Premium+PrevWrittenPremium+PrevWrittenAgencyComm+Office_State+Non_Parent+Policy_Count+Cross_Sell_Prdcr+Provider_Type+num_months+Revenue+SIC_Industry_Code, data = TrainUp, method="rpart2",trControl=fitControl, tuneGrid=rpartgrid, num.threads = 6)
pred3 <- predict(m2, newdata=clients4[-Training,])
confusionMatrix(pred3, clients4[-Training,]$lost_client)
m2$bestTune
rpart.plot(m2$finalModel)
Any idea of what is causing this error?
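One possible cause, stated as an assumption since the data is not available: upSample() and downSample() name the outcome column "Class" by default (argument yname), so TrainUp would have no column called lost_client. The formula would then pick up a different lost_client object of another length from elsewhere on the search path. The base R sketch below reproduces that mechanism with made-up data; train_up is a hypothetical stand-in for the upSample() result.

```r
# Stand-in for an upsampled data set whose outcome column is named "Class"
train_up <- data.frame(SRC = c(1, 2, 3, 4, 5, 6),
                       Class = factor(c(0, 1, 0, 1, 0, 1)))

# A same-named object outside the data frame, with a different length
lost_client <- factor(c(0, 1, 0, 1))

# The formula finds lost_client outside `data`, so the lengths disagree
msg <- tryCatch(model.frame(lost_client ~ SRC, data = train_up),
                error = function(e) conditionMessage(e))
print(msg)

# Renaming the outcome column makes the formula resolve inside the data frame
# (with caret, upSample(..., yname = "lost_client") has the same effect)
names(train_up)[names(train_up) == "Class"] <- "lost_client"
mf <- model.frame(lost_client ~ SRC, data = train_up)
print(nrow(mf))
```

Checking names(TrainUp) before calling train() would confirm or rule out this explanation.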

R: Predicting with lmer, y ~ . formula error

Predicting values in new data from an lmer model throws an error when a period is used to represent predictors. Is there any way around this?
The answer to this similar question offers a way to automatically write out the full formula instead of using the period, but I'm curious if there's a way to get predictions from new data just using the period.
Here's a reproducible example:
library(lme4)
mydata <- data.frame(
groups = rep(1:3, each = 100),
x = rnorm(300),
dv = rnorm(300)
)
train_subset <- sample(1:300, 300 * .8)
train <- mydata[train_subset,]
test <- mydata[-train_subset,]
# Returns an error
mod <- lmer(dv ~ . - groups + (1 | groups), data = train)
predict(mod, newdata = test)
predict(mod) # getting predictions for the original data works
# Writing the full formula without the period does not return an error, even though it's the exact same model
mod <- lmer(dv ~ x + (1 | groups), data = train)
predict(mod, newdata = test)
This should be fixed in the development branch of lme4 now. You can install from GitHub (see first line below) or wait a few weeks (early April-ish) for a new version to hit CRAN.
remotes::install_github("lme4/lme4") ## you will need compilers etc.
mydata <- data.frame(
groups = rep(1:3, each = 100),
x = rnorm(300),
dv = rnorm(300)
)
train_subset <- sample(1:300, 300 * .8)
train <- mydata[train_subset,]
test <- mydata[-train_subset,]
# With the development version, this no longer returns an error
mod <- lmer(dv ~ . - groups + (1 | groups), data = train)
p1 <- predict(mod, newdata = test)
mod2 <- lmer(dv ~ x + (1 | groups), data = train)
p2 <- predict(mod2, newdata = test)
identical(p1, p2) ## TRUE
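For lme4 versions without the fix, the workaround mentioned in the question (writing out the full formula automatically) can be sketched in base R with reformulate(). This is a minimal sketch on made-up data; the lmer() call is left as a comment because it assumes lme4 is installed.

```r
# Made-up data with the same column names as the example above
train <- data.frame(groups = rep(1:3, each = 4),
                    x  = seq(0.1, 1.2, by = 0.1),
                    dv = rev(seq(0.1, 1.2, by = 0.1)))

# Fixed-effect predictors: every column except the response and the grouping variable
fixed <- setdiff(names(train), c("dv", "groups"))

# Build dv ~ x + (1 | groups) without typing the predictors by hand
full_form <- reformulate(c(fixed, "(1 | groups)"), response = "dv")
print(full_form)

# With lme4 installed, the formula could then be used as:
# mod <- lme4::lmer(full_form, data = train)
# predict(mod, newdata = test)   # test being the held-out rows
```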

Dummies not included in summary

I want to create a function which will perform panel regression with 3-level dummies included.
Let's consider within model with time effects :
library(plm)
fit_panel_lr <- function(y, x) {
x[, length(x) + 1] <- y
#adding dummies
mtx <- matrix(0, nrow = nrow(x), ncol = 3)
mtx[cbind(seq_len(nrow(mtx)), 1 + (as.integer(unlist(x[, 2])) - min(as.integer(unlist(x[, 2])))) %% 3)] <- 1
colnames(mtx) <- paste0("dummy_", 1:3)
#converting to pdataframe and adding dummy variables
x <- pdata.frame(x)
x <- cbind(x, mtx)
#performing panel regression
varnames <- names(x)[3:(length(x))]
varnames <- varnames[!(varnames == names(y))]
form <- paste0(varnames, collapse = "+")
x_copy <- data.frame(x)
form <- as.formula(paste0(names(y), "~", form,'-1'))
params <- list(
formula = form, data = x_copy, model = "within",
effect = "time"
)
pglm_env <- list2env(params, envir = new.env())
model_plm <- do.call("plm", params, envir = pglm_env)
model_plm
}
However, if I use data :
data("EmplUK", package="plm")
dep_var<-EmplUK['capital']
df1<-EmplUK[-6]
In output I will get :
> fit_panel_lr(dep_var, df1)
Model Formula: capital ~ sector + emp + wage + output + dummy_1 + dummy_2 +
dummy_3 - 1
<environment: 0x000001ff7d92a3c8>
Coefficients:
sector emp wage output
-0.055179 0.328922 0.102250 -0.002912
Why are the dummies included in the formula but not in the coefficients? Is there a rational explanation, or did I do something wrong?
One reason you do not see the dummies in the output is that they are linearly dependent on the other data after the fixed-effects time transformation. They are dropped, so only what is estimable is estimated and reported.
Find below some (not readily executable) code picking up your example from above:
dat <- cbind(EmplUK, mtx) # mtx being the dummy matrix constructed in your question's code for this data set
pdat <- pdata.frame(dat)
rhs <- paste(c("emp", "wage", "output", "dummy_1", "dummy_2", "dummy_3"), collapse = "+")
form <- paste("capital ~" , rhs)
form <- formula(form)
mod <- plm(form, data = pdat, model = "within", effect = "time")
detect.lindep(mod$model) # before FE time transformation (original data) -> nothing offending
detect.lindep(model.matrix(mod)) # after FE time transformation -> dummies are offending
The help page for detect.lindep (?detect.lindep, included in package plm) has some more nice examples of linear dependence before and after the FE transformation.
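The effect can also be seen without plm. The following is a minimal sketch on a made-up balanced panel, assuming the time-effects within transformation amounts to subtracting each year's mean from every variable. A dummy that depends only on the year is constant within a year, so the transformation turns it into a column of zeros.

```r
# Made-up balanced panel: 4 firms observed over 6 years
panel <- expand.grid(firm = 1:4, year = 1976:1981)
set.seed(42)
panel$emp <- rnorm(nrow(panel))

# A dummy that depends on the year only, like dummy_1 in the question
panel$dummy_1 <- as.numeric((panel$year - min(panel$year)) %% 3 == 0)

# Time-effects within transformation: subtract the per-year mean
demean_by_year <- function(v) v - ave(v, panel$year)
emp_within   <- demean_by_year(panel$emp)
dummy_within <- demean_by_year(panel$dummy_1)

print(all(dummy_within == 0))  # TRUE: nothing is left of the dummy
print(any(emp_within != 0))    # TRUE: emp still varies within a year
```

A column of zeros carries no information, which is why the corresponding coefficient cannot be estimated.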
A suggestion:
As for constructing dummy variables, I suggest using R's factor with three levels rather than constructing the dummy matrix yourself. Using a factor is typically more convenient and less error-prone. It is converted to binary dummies (treatment-style) by your typical estimation function using the model.frame/model.matrix framework.
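As a small illustration of the factor suggestion, the sketch below builds a three-level factor from a made-up year vector and lets model.matrix() create the treatment-style dummies. The label names mirror the question's dummy_1 to dummy_3 and are otherwise arbitrary.

```r
# Made-up years; the grouping rule mirrors the modulo-3 rule in the question
year  <- 1976:1984
group <- factor((year - min(year)) %% 3 + 1,
                labels = paste0("dummy_", 1:3))

# model.matrix() expands the factor; the first level becomes the baseline
mm <- model.matrix(~ group)
print(colnames(mm))
print(head(mm, 3))
```

With treatment contrasts, only two of the three levels get their own column, which avoids building in a redundant dummy by hand.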

predict() R function caret package errors: "newdata" rows different, "type" not accepted

I am running a logistic regression analysis using the caret package.
The data is input as an 18x6 matrix.
Everything is fine so far except the predict() function.
R is telling me the type parameter is supposed to be raw or prob, but raw just spits out an exact copy of the last column (the values of the binomial variable), and prob gives me the following error:
"Error in dimnames(out)[[2]] <- modelFit$obsLevels :
length of 'dimnames' [2] not equal to array extent
In addition: Warning message:
'newdata' had 7 rows but variables found have 18 rows"
install.packages("pbkrtest")
install.packages("caret")
install.packages('e1071', dependencies=TRUE)
#install.packages('caret', dependencies = TRUE)
require(caret)
library(caret)
A=matrix(
c(
64830,18213,4677,24761,9845,17504,22137,12531,5842,28827,51840,4079,1000,2069,969,9173,11646,946,66161,18852,5581,27219,10159,17527,23402,11409,8115,31425,55993,0,0,1890,1430,7873,12779,627,68426,18274,5513,25687,10971,14104,19604,13438,6011,30055,57242,0,0,2190,1509,8434,10492,755,69716,18366,5735,26556,11733,16605,20644,15516,5750,31116,64330,0,0,1850,1679,9233,12000,500,73128,18906,5759,28555,11951,19810,22086,17425,6152,28469,72020,0,0,1400,1750,8599,12000,500,1,1,1,0,1,0,0,0,0,1,0,1,1,1,1,1,1,1
),
nrow = 18,
ncol = 6,
byrow = FALSE) #"bycol" does NOT exist
################### data set as vectors
a<-c(64830,18213,4677,24761,9845,17504,22137,12531,5842,28827,51840,4079,1000,2069,969,9173,11646,946)
b<-c(66161,18852,5581,27219,10159,17527,23402,11409,8115,31425,55993,0,0,1890,1430,7873,12779,627)
c<-c(68426,18274,5513,25687,10971,14104,19604,13438,6011,30055,57242,0,0,2190,1509,8434,10492,755)
d<-c(69716,18366,5735,26556,11733,16605,20644,15516,5750,31116,64330,0,0,1850,1679,9233,12000,500)
e<-c(73128,18906,5759,28555,11951,19810,22086,17425,6152,28469,72020,0,0,1400,1750,8599,12000,500)
f<-c(1,1,1,0,1,0,0,0,0,1,0,1,1,1,1,1,1,1)
######################
n<-nrow(A);
K<-ncol(A)-1;
Train <- createDataPartition(f, p=0.6, list=FALSE) #60% of data set is used as training.
training <- A[ Train, ]
testing <- A[ -Train, ]
nrow(training)
#this is the logistic formula:
#estimates from logistic regression characterize the relationship between the predictor and response variable on a log-odds scale
mod_fit <- train(f ~ a + b + c + d +e, data=training, method="glm", family="binomial")
mod_fit
#this is the exponential function to calculate the odds ratios for each predictor:
exp(coef(mod_fit$finalModel))
predict(mod_fit, newdata=training)
predict(mod_fit, newdata=testing, type="prob")
I'm not sure I fully understand, but A is already a matrix of (a, b, c, d, e, f), so you don't need to create two sets of objects.
install.packages("pbkrtest")
install.packages("caret")
install.packages('e1071', dependencies=TRUE)
#install.packages('caret', dependencies = TRUE)
require(caret)
library(caret)
A=matrix(
c(
64830,18213,4677,24761,9845,17504,22137,12531,5842,28827,51840,4079,1000,2069,969,9173,11646,946,66161,18852,5581,27219,10159,17527,23402,11409,8115,31425,55993,0,0,1890,1430,7873,12779,627,68426,18274,5513,25687,10971,14104,19604,13438,6011,30055,57242,0,0,2190,1509,8434,10492,755,69716,18366,5735,26556,11733,16605,20644,15516,5750,31116,64330,0,0,1850,1679,9233,12000,500,73128,18906,5759,28555,11951,19810,22086,17425,6152,28469,72020,0,0,1400,1750,8599,12000,500,1,1,1,0,1,0,0,0,0,1,0,1,1,1,1,1,1,1
),
nrow = 18,
ncol = 6,
byrow = FALSE) #"bycol" does NOT exist
A <- data.frame(A)
colnames(A) <- c('a','b','c','d','e','f')
A$f <- as.factor(A$f)
Train <- createDataPartition(A$f, p=0.6, list=FALSE) #60% of data set is used as training.
training <- A[ Train, ]
testing <- A[ -Train, ]
nrow(training)
And to predict a variable, you must pass the explanatory variables, not the variable to predict:
mod_fit <- train(f ~ a + b + c + d +e, data=training, method="glm", family="binomial")
mod_fit
#this is the exponential function to calculate the odds ratios for each predictor:
exp(coef(mod_fit$finalModel))
predict(mod_fit, newdata=training[,-which(colnames(training)=="f")])
predict(mod_fit, newdata=testing[,-which(colnames(testing)=="f")])
Short answer: you should not include the explained variable, which is f, in the data you pass to predict. So you should do:
predict(mod_fit, newdata=training[, -ncol(training)])
predict(mod_fit, newdata=testing[, -ncol(testing)])
The warning message 'newdata' had 11 rows but variables found have 18 rows appears because you ran the regression using the whole data set (18 observations), but predict using just part of it (either 11 or 7).
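The warning can be reproduced in base R. This is a sketch of one mechanism that produces it, assuming (as in the question) that the variables named in the formula exist as free-standing vectors of length 18 and that newdata has no matching columns. The data and object names are made up.

```r
# Free-standing vectors of length 18, like a..f in the question
a <- as.numeric(1:18)
f <- rep(c(0, 1), times = 9)

# Fitted without `data`: the formula's variables are the free-standing vectors
fit_loose <- glm(f ~ a, family = binomial)

# newdata has 7 rows but no column called `a`, so predict() falls back to
# the 18-element vector and warns
new_rows <- data.frame(other = 1:7)
warn <- tryCatch(predict(fit_loose, newdata = new_rows),
                 warning = function(w) conditionMessage(w))
print(warn)

# Keeping everything in one data frame avoids the mismatch
dat    <- data.frame(a = a, f = f)
fit_df <- glm(f ~ a, data = dat[1:11, ], family = binomial)
p      <- predict(fit_df, newdata = dat[12:18, ], type = "response")
print(length(p))  # one prediction per row of newdata
```

When the model and newdata both refer to columns of the same data frame, predict() returns exactly one value per row of newdata.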
EDIT: To simplify the data creation and glm processes we can do:
library(caret)
A <- data.frame(a = c(64830,18213,4677,24761,9845,17504,22137,12531,5842,28827,51840,4079,1000,2069,969,9173,11646,946),
b = c(66161,18852,5581,27219,10159,17527,23402,11409,8115,31425,55993,0,0,1890,1430,7873,12779,627),
c = c(68426,18274,5513,25687,10971,14104,19604,13438,6011,30055,57242,0,0,2190,1509,8434,10492,755),
d = c(69716,18366,5735,26556,11733,16605,20644,15516,5750,31116,64330,0,0,1850,1679,9233,12000,500),
e = c(73128,18906,5759,28555,11951,19810,22086,17425,6152,28469,72020,0,0,1400,1750,8599,12000,500),
f = c(1,1,1,0,1,0,0,0,0,1,0,1,1,1,1,1,1,1))
A$f <- as.factor(A$f) # classification needs a factor outcome
Train <- createDataPartition(A$f, p=0.6, list=FALSE) #60% of data set is used as training.
training <- A[ Train, ]
testing <- A[ -Train, ]
mod_fit <- train(f ~ a + b + c + d + e, data=training, method="glm", family="binomial")
I am trying to run a logistic regression model. I wrote this code:
install.packages('caret')
library(caret)
setwd('C:\\Users\\BAHOZ\\Documents\\')
D<-read.csv(file = "D.csv",header = T)
D<-read.csv(file = 'DataSet.csv',header=T)
names(D)
set.seed(111134)
Train<-createDataPartition(D$X, p=0.7,list = FALSE)
training<-D[Train,]
length(training$age)
testing<-D[-Train,]
length(testing$age)
mod_fit<-train(X~age + gender + total.Bilirubin + direct.Bilirubin + total.proteins + albumin + A.G.ratio+SGPT + SGOT + Alkphos,data=training,method="glm", family="binomial")
summary(mod_fit)
exp(coef(mod_fit$finalModel))
And I received this output for the last command:
(Intercept) age gender total.Bilirubin direct.Bilirubin total.proteins albumin A.G.ratio
0.01475027 1.01596886 1.03857883 1.00022899 1.78188072 1.00065332 1.01380334 1.00115742
SGPT SGOT Alkphos
3.93498241 0.05616662 38.29760014
By running this command I could predict my data,
predict(mod_fit , newdata=testing)
But if I set type="prob" or type="raw"
predict(mod_fit , newdata=testing, type = "prob")
it fails with this error:
Error in dimnames(out) <- *vtmp* :
length of 'dimnames' [2] not equal to array extent

Resources