linear regression in R

I'm trying to predict automobile prices from a number of independent variables using linear regression. The only attributes in my data set that are chr are Fuel and Colour; the rest are either num or int. I omitted Fuel because it only has one level.
Here is my code:
# Loading Data
car_data <- read.csv("Car_Data (1).csv", header = TRUE)
car_data$Fuel <- NULL
car_data$Colour <- as.factor(car_data$Colour)
str(car_data)
set.seed(123)
indx <- sample(2, nrow(car_data), replace = T, prob = c(0.8, 0.2))
train <- car_data[indx == 1, ]
test <- car_data[indx == 2, ]
lmModel <- lm(Price ~ ., data = train)
summary(lmModel)
When I run summary(lmModel), it shows all NAs for the Std. Error, t value, and Pr(>|t|) columns.
Can someone help...

It's possible that your dataset has too few observations for the number of features you are trying to fit; with more model parameters than rows, lm() drops some coefficients as NA and the summary has nothing left to estimate standard errors from. It would help reproducibility if you could supply your dataset (or a minimal working example of a similar dataset). You could also try running a simpler regression specification to see if that teases out the problem (see the sketch after the code below).
lmModelSimple <- lm(Price ~ Colour, data = train)
summary(lmModelSimple)
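For intuition, here is a minimal sketch with made-up data (not your Car_Data file) of how this symptom arises: once the model matrix has at least as many columns as rows, the fit is saturated, some coefficients are dropped as NA, and the summary has no residual degrees of freedom left for standard errors, t values, or p-values.
set.seed(1)
# 5 rows, but intercept + 3 numeric predictors + a 5-level factor (4 dummies) = 8 parameters
toy <- data.frame(Price  = rnorm(5),
                  x1     = rnorm(5),
                  x2     = rnorm(5),
                  x3     = rnorm(5),
                  Colour = factor(c("red", "blue", "green", "black", "white")))
summary(lm(Price ~ ., data = toy))
# With zero residual degrees of freedom, the Std. Error, t value and Pr(>|t|)
# columns come back as NA/NaN, much like the output described above.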

Related

How do I utilize imputed data, with categorical levels, in a prediction in R?

I'll illustrate my problem with the iris data set in R. My objective here is to create 5 imputed data sets, fit a regression to each imputed data set, then pool together the results of these regressions into one final model. This is the preferred order of operations for a proper execution of multiple imputation.
library(mice)
df <- iris
# Inject some missingness into the data:
df$Sepal.Width[c(20,40,70,121)] <- NA
df$Species[c(15,80,99,136)] <- NA
# Perform the standard steps of multiple imputation with MICE:
imputed_data <- mice(df, method = c(rep("pmm", 5)), m = 5, maxit = 5)
model <- with(imputed_data, lm(Sepal.Length ~ Sepal.Width + Species))
pooled_model <- pool(model)
This leaves me with a pooled_model object that I am hoping to use as the fitted model in the predict command. However, that does not work. When I run:
predict(pooled_model, newdata = iris)
I get this error:
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "c('mipo', 'data.frame')"
Disregard the reasoning for why I am using the original iris data set with my newly fitted model; I simply want to be able to get predictions for this data, or a subset of it, from the model I created with my imputation.
I specifically chose a data set with multiple levels of a categorical variable to highlight my problem. I thought about using some matrix multiplication with which I could do this manually, but the presence of a categorical variable makes that tough. In my actual data set, I have over a hundred variables, many of which have multiple categorical levels. I say this because I realize one possible solution would be to re-code my categorical variables into dummy variables, and then I can apply some matrix multiplication to get my answer. But that would be an EXTREME amount of work for me. If there's a way I can somehow get a model object I can use in the predict function, that would make my life 100x easier.
Any suggestions?
You have two issues: 1) how to use stats::predict with pooled data and 2) what to do about your categorical variables.
Your first issue has already been documented on the mice GitHub page, and it seems there has been a desire for a predict.mira function for a while. The author of the mice package posted some code showing how to simulate a predict.mira-like function. Unfortunately, it only works with lm models, but that seems okay given your reprex. If you have a GitHub account, you can comment on that GitHub issue to register your interest in a predict.mira function.
Your question has also been posted on StackOverflow before; although the answer was never accepted, the SO user suggested this reading by Miles (2015).
For your second question, have you considered leaving out your current method argument when using mice()? As long as your variables have been classed as factors, then mice will default to the polyreg method for categorical variables and pmm for continuous variables. You can read more about the method argument here.
library(mice)
set.seed(123)
# make missing data
df <- iris
df$Sepal.Width[c(20,40,70,121)] <- NA
df$Species[c(15,80,99,136)] <- NA
# specify method
meth <- mice(df, maxit = 0, printFlag = FALSE)$meth
print(meth)
# this is how you would change your methods, if you wanted
# but pmm and polyreg are defaults
meth["Species"] <- "polr"
meth["Sepal.Width"] <- "midastouch"
print(meth)
# impute
imputed_data <- mice(df,
                     m = 5,
                     maxit = 5,
                     method = meth, # new method
                     printFlag = FALSE)
# make model
model <- with(imputed_data, lm(Sepal.Length ~ Sepal.Width + Species))
summary(pool(model))
# obtain predictions Q and prediction variance U
predm <- lapply(getfit(model), predict, se.fit = TRUE)
Q <- sapply(predm, `[[`, "fit")
U <- sapply(predm, `[[`, "se.fit")^2
dfcom <- predm[[1]]$df
# pool predictions
pred <- matrix(NA, nrow = nrow(Q), ncol = 3,
               dimnames = list(NULL, c("fit", "se.fit", "df")))
for (i in 1:nrow(Q)) {
  pi <- pool.scalar(Q[i, ], U[i, ], n = dfcom + 1)
  pred[i, 1] <- pi[["qbar"]]
  pred[i, 2] <- sqrt(pi[["t"]])
  pred[i, 3] <- pi[["df"]]
}
head(pred)
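If you also want interval estimates, a short follow-up sketch (reusing the pooled fit, se.fit and df columns computed above) would be:
# 95% confidence intervals from the pooled predictions
ci <- cbind(pred,
            lwr = pred[, "fit"] - qt(0.975, pred[, "df"]) * pred[, "se.fit"],
            upr = pred[, "fit"] + qt(0.975, pred[, "df"]) * pred[, "se.fit"])
head(ci)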

How can I calculate the mean square error in R of a regression tree?

I am working with the wine quality data set.
I am studying regression trees that depend on different variables, for example:
library(rpart)
library(rpart.plot)
library(rattle)
library(naniar)
library(dplyr)
library(ggplot2)
vinos <- read.csv(file = 'Wine.csv', header = T)
arbol0<-rpart(formula=quality~chlorides, data=vinos, method="anova")
fancyRpartPlot(arbol0)
arbol1<-rpart(formula=quality~chlorides+density, data=vinos, method="anova")
fancyRpartPlot(arbol1)
I want to calculate the mean squared error to see whether arbol1 is better than arbol0. I will use my own dataset, since no more data is available. I have tried to do it as
aaa<-predict(object=arbol0, newdata=data.frame(chlorides=vinos$chlorides), type="anova")
bbb<-predict(object=arbol1, newdata=data.frame(chlorides=vinos$chlorides, density=vinos$density), type="anova")
and then manually subtract the last column of the data frame from aaa and bbb. However, I am getting an error. Can someone please help me?
This website could be useful for you. It's very important to split your dataset into train and test subsets before training your models. In the following code I've done it with base functions, but there's another function called sample.split from the caTools package that does the same procedure (a quick sketch of it is below). I've also linked a website where you can see all the ways to split data in R.
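For reference, a quick sketch of the caTools alternative mentioned above, assuming the same vinos data frame and quality column used in the code further down; it replaces the base-R split shown there:
library(caTools)
set.seed(123)
split <- sample.split(vinos$quality, SplitRatio = 0.75)  # TRUE marks training rows
train <- subset(vinos, split == TRUE)
test  <- subset(vinos, split == FALSE)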
Remember that the Mean Squared Error (MSE) is defined as MSE = (1/n) * Σ (y_i - ŷ_i)^2.
So it's very simple to compute in R: you just take the mean of the squared differences between the observed values (i.e., the response variable from your test subset) and the predicted values (i.e., the values you obtain from the model with the predict function).
A solution for your wine dataset could be this one, based on the previous website.
library(rpart)
library(dplyr)
library(data.table)
vinos <- fread(file = 'Winequality-red.csv', header = TRUE)
# Split data into train and test subsets
sample_index <- sample(nrow(vinos), size = nrow(vinos)*0.75)
train <- vinos[sample_index, ]
test <- vinos[-sample_index, ]
# Train regression trees models
arbol0 <- rpart(formula = quality ~ chlorides, data = train, method = "anova")
arbol1 <- rpart(formula = quality ~ chlorides + density, data = train, method = "anova")
# Make predictions for each model
pred0 <- predict(arbol0, newdata = test)
pred1 <- predict(arbol1, newdata = test)
# Calculate MSE for each model
mean((pred0 - test$quality)^2)
mean((pred1 - test$quality)^2)

Cannot generate predictions in mgcv when using discretization (discrete=T)

I am fitting a model with a random site-level effect using a generalized additive model, implemented in the mgcv package for R. I had been doing this with the function gam(); however, to speed things up I need to shift to the bam() framework, which is basically the same as gam(), but faster. I further sped up fitting by passing the options bam(nthreads = N, discrete = T), where nthreads is the number of cores on my machine. However, when I use the discretization option and then try to make predictions with my model on new data, while ignoring the random effect, I consistently get an error.
Here is code to generate example data and reproduce the error.
library(mgcv)
#generate data.
N <- 10000
x <- runif(N,0,1)
y <- (0.5*x / (x + 0.2)) + rnorm(N)*0.1 #non-linear relationship between x and y.
#uninformative random effect.
random.x <- as.factor(do.call(paste0, replicate(2, sample(LETTERS, N, TRUE), FALSE)))
#fit models.
fit1 <- gam(y ~ s(x) + s(random.x, bs = 're')) #this one takes ~1 minute to fit, rest faster.
fit2 <- bam(y ~ s(x) + s(random.x, bs = 're'))
fit3 <- bam(y ~ s(x) + s(random.x, bs = 're'), discrete = T, nthreads = 2)
#make predictions on new data.
newdat <- data.frame(runif(200, 0, 1))
colnames(newdat) <- 'x'
test1 <- predict(fit1, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
test2 <- predict(fit2, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
test3 <- predict(fit3, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
Making predictions with the third model, which uses discretization, throws this error (the other two do not):
Error in model.frame.default(object$dinfo$gp$fake.formula[-2], newdata) :
variable lengths differ (found for 'random.x')
In addition: Warning message:
'newdata' had 200 rows but variables found have 10000 rows
How can I go about making predictions for a new dataset using the model fit with discretization?
newdata.guaranteed doesn't seem to be working for bam() models with discrete = TRUE. You could email the author and maintainer of mgcv and send him the reproducible example so he can take a look. See ?bug.reports.mgcv.
You probably want
names(newdat) <- "x"
as data frames have names.
But the workaround is just to pass in something for random.x
newdat <- data.frame(x = runif(200, 0, 1), random.x = random.x[[1]])
and then do your call to generate test3 and it will work.
The warning message and error are the result of you not specifying random.x in the newdata, and mgcv then looking for random.x and finding it in the global environment. You should really gather those variables into a data frame and use the data argument when fitting your models, and try not to leave similarly named objects lying around in your global environment.
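For completeness, a small sketch of that tidier workflow, using the simulated x, y and random.x from the question (fit3b, newdat and test3b are just illustrative names); fitting through a data argument means prediction never has to look in the global environment:
dat <- data.frame(x = x, y = y, random.x = random.x)
fit3b <- bam(y ~ s(x) + s(random.x, bs = "re"),
             data = dat, discrete = TRUE, nthreads = 2)
newdat <- data.frame(x = runif(200, 0, 1),
                     random.x = dat$random.x[1])  # any existing level; excluded below anyway
test3b <- predict(fit3b, newdata = newdat,
                  exclude = "s(random.x)", newdata.guaranteed = TRUE)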

R Random Forest: OOB error rate changes when sequence of columns is changed in feature table

I am puzzled by the following "behaviour" of randomForest and wonder whether other users have experienced it, and what I can do to avoid it:
All else being equal (and using the same set.seed value), the results of a randomForest model (e.g. its OOB estimate of error rate) change merely by changing the order of the features (= columns) of the data table. In the following code, I
1) run randomForest() once: OOB = 23.06%
2) randomly change the column order of the data table
3) run randomForest() again with the reordered data table: OOB = 22.53%
R.version.string
library(randomForest)
library(dplyr)
df <- readRDS("df_feature_list.rds")
head(df)
set.seed(1)
RF <- randomForest(Class ~ . , data = df)
RF # OOB error: 23.06%
# randomly swap field order in feature table
df <- df[, sample(names(df))]
head(df)
set.seed(1)
RF <- randomForest(Class ~ . , data = df)
RF # OOB error: 22.53%
This is because of the way randomForest uses the random seed. At each split it uses the random draws to select candidate predictors by column position, not by name. Since you use the . operator to select all columns in your data frame, the ordering of the predictors in your formula call follows the data frame ordering, so while the model draws the same positions on each run, those positions now refer to different columns.
I only show OOB error output below to reduce length of the post, but here is a simple example.
library(randomForest)
set.seed(123)
df <- data.frame(class = sample(c("a", "b", "c"), size = 100, replace = TRUE),
                 x = runif(100),
                 y = runif(100))
set.seed(1)
randomForest(class ~ ., data = df)
#> OOB estimate of error rate: 75%
# Changing order of columns changes results
set.seed(1)
randomForest(class ~ ., data = df[,c(1,3,2)])
#> OOB estimate of error rate: 70%
# But if we specify the formula, get the same result as original
set.seed(1)
randomForest(class ~ x + y, data = df[,c(1,3,2)])
#> OOB estimate of error rate: 75%
# Keeping ordering of data frame but renaming columns doesn't change results
names(df) <- c("class", "y", "x")
set.seed(1)
randomForest(class ~ ., data = df)
#> OOB estimate of error rate: 75%
To get the same behavior every time, you need to either specify the formula explicitly or maintain the same ordering of data frame columns.
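If you want to keep the . shorthand, one small hedged alternative (continuing from the renamed df above) is to normalise the column order before fitting, so repeated runs always see the same model matrix:
# fix the column order (response first, predictors alphabetical) before fitting
df <- df[, c("class", sort(setdiff(names(df), "class")))]
set.seed(1)
randomForest(class ~ ., data = df)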

predict dropping svm observations

I'm using a support vector machine on the Titanic dataset, and some of the observations are not being predicted when using the predict function with my model.
library(e1071)
library(data.table)
library(ISLR)
# dat holds the Titanic passenger data loaded earlier
titanic.index <- sample(891, 600)
titanic.train <- dat[titanic.index]
titanic.test <- dat[-titanic.index]
titanic.fit <- svm(Survived ~ Pclass + Sex + SibSp, data = titanic.train, kernel = "polynomial")
titanic.preds <- predict(titanic.fit, newdata = titanic.test)
titanic.preds
length(titanic.preds)
Whenever I run this on my computer I get anywhere from 220 to 240 predictions, but there are clearly 291 observations in the test data. There aren't any missing observations for these predictors. To make matters even weirder, when I build an SVM using the Auto dataset in the ISLR package, this same problem doesn't occur.
data("Auto")
auto <- as.data.table(Auto)
auto[, mileage := ifelse(auto[, mpg] > median(auto[, mpg]), 1, 0)]
auto[, mileage := factor(mileage)]
auto.index <- sample(392, 200)
auto.train <- auto[auto.index]
auto.test <- auto[-auto.index]
auto.fit <- svm(mileage ~ ., data = auto.train)
auto.preds <- predict(auto.fit, newdata = auto.test)
auto.preds
length(auto.preds)
I have no idea why this is happening. Any insight you can provide is greatly appreciated!
