Unable to get R-squared for test dataset - r

I am trying to learn a bit about different types of regression and I am hacking my way through the code sample below.
library(magrittr)
library(dplyr)
# Polynomial degree 1
df <- read.csv("C:\\path_here\\auto_mpg.csv", stringsAsFactors = FALSE) # Data from UCI
df1 <- as.data.frame(sapply(df, as.numeric))
# Select key columns
df2 <- df1 %>% select(cylinder, displacement, horsepower, weight, acceleration, year, mpg)
df3 <- df2[complete.cases(df2), ]
smp_size <- floor(0.75 * nrow(df3))
# Split as train and test sets
train_ind <- sample(seq_len(nrow(df3)), size = smp_size)
train <- df3[train_ind, ]
test <- df3[-train_ind, ]
Rsquared <- function (x, y) cor(x, y) ^ 2
# Fit a model of degree 1
fit <- lm(mpg ~ ., data = train)
rsquared1 <- Rsquared(fit, test$mpg)
sprintf("R-squared for Polynomial regression of degree 1 (auto_mpg.csv) is : %f", rsquared1)
I am getting this error:
'Error in cor(x, y) : 'x' must be numeric'
I got the code samples from here (1.2b & 1.3a).
https://gigadom.wordpress.com/2017/10/06/practical-machine-learning-with-r-and-python-part-1/
The raw data is available here.
https://raw.githubusercontent.com/tvganesh/MachineLearning-RandPython/master/auto_mpg.csv

Just a few minutes ago I got an upvote on Function to calculate R2 (R-squared) in R; now I guess it was from you, thanks.
The Rsquared function expects two vectors, but you've passed it a model object, fit (which is a list), and the vector test$mpg. I guess you want predict(fit, newdata = test) as its first argument here.
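Put together, the corrected call would look something like this (a sketch reusing the objects from the question):
pred <- predict(fit, newdata = test)   # predictions for the held-out rows
rsquared1 <- Rsquared(pred, test$mpg)  # now both arguments are numeric vectors
sprintf("R-squared for Polynomial regression of degree 1 (auto_mpg.csv) is : %f", rsquared1)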

Related

fitting linear regression models with different predictors using loops

I want to fit regression models using a single predictor variable at a time. In total I have 7 predictors and 1 response variable. I want to write a chunk of code that picks a predictor variable from the data frame and fits a model. I would further want to extract the regression coefficient (not the intercept) and its sign, and store them in two vectors. Here's my code:
for (x in (1:7)) {
  fit <- lm(distance ~ FAA_unique_with_duration_filtered[x], data = FAA_unique_with_duration_filtered)
  coeff_values <- summary(fit)$coefficients[, 1]
  coeff_value <- coeff_values[2]
  append(coeff_value_vector, coeff_value, after = length(coeff_value_vector))
  append(RCs_sign_vector, sign(coeff_values[2]), after = length(RCs_sign_vector))
}
Here x will select the first column, then the second, and so on. However, I am getting the following error:
Error in model.frame.default(formula = distance ~ FAA_unique_with_duration_filtered[x], :
invalid type (list) for variable 'FAA_unique_with_duration_filtered[x]'
Is there a way to do this using loops?
You don't really need loops for this.
Suppose we want to regress y1, the 5th column of the built-in anscombe dataset, separately on each of the first 4 columns.
Then:
a <- anscombe
reg <- function(i) coef(lm(y1 ~ ., a[c(5, i)]))[[2]] # use lm
coefs <- sapply(1:4, reg)
signs <- sign(coefs)
# or
a <- anscombe
reg <- function(i) cov(a$y1, a[[i]]) / var(a[[i]]) # use formula for slope
coefs <- sapply(1:4, reg)
signs <- sign(coefs)
Alternatively, the following works, where reg is either of the reg definitions above:
a <- anscombe
coefs <- numeric(4)
for(i in 1:4) coefs[i] <- reg(i)
signs <- sign(coefs)

How to get X & Y rows to match?

I'm working on a new piece of code and need a little help with ridge-regularized regression. I'm trying to build a predictive model, but first I need the x and y matrix rows to match.
I found something similar with a Google search, but their data is randomly generated rather than provided like mine. My data is a large dataset with over 500,000 observations and 670 variables.
library(rsample)
library(glmnet)
library(dplyr)
library(ggplot2)
# Create training (70%) and test (30%) sets
# Use set.seed for reproducibility
set.seed(123)
alumni_split <- initial_split(alumni, prop = .7, strata = "Id.Number")
alumni_train <- training(alumni_split)
alumni_test <- testing(alumni_split)
#----
# Create training and testing feature model matrices and response vectors.
# We use model.matrix(...)[, -1] to discard the intercept
alumni_train_x <- model.matrix(Id.Number ~ ., alumni_train)[, -1]
alumni_test_x <- model.matrix(Id.Number ~ ., alumni_test)[, -1]
alumni_train_y <- log(alumni_train$Id.Number)
alumni_test_y <- log(alumni_test$Id.Number)
# What is the dimension of your feature matrix?
dim(alumni_train_x)
#---- [HERE]
# Apply Ridge regression to alumni data
alumni_ridge <- glmnet(alumni_train_x, alumni_train_y, alpha = 0)
The error message (with code):
alumni_ridge <- glmnet(alumni_train_x, alumni_train_y, alpha = 0)
Error in glmnet(alumni_train_x, alumni_train_y, alpha = 0) :
number of observations in y (329870) not equal to the number of rows of
x (294648)
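No answer is shown for this one, but the mismatched counts point at a common cause: model.matrix() silently drops rows containing NA (via the default na.action), while log(alumni_train$Id.Number) keeps every row. A hedged sketch of one possible fix, filtering to complete cases before building both objects (alumni_train is the asker's data; the rest is an assumption):
# Keep only complete rows so the model matrix and the response stay aligned
alumni_train_cc <- alumni_train[complete.cases(alumni_train), ]
alumni_train_x <- model.matrix(Id.Number ~ ., alumni_train_cc)[, -1]
alumni_train_y <- log(alumni_train_cc$Id.Number)
stopifnot(nrow(alumni_train_x) == length(alumni_train_y)) # rows now match
The same filtering would apply to the test set before building alumni_test_x and alumni_test_y.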

Loop linear regression different predictor and outcome variables

I'm new to R but am slowly learning it to analyse a data set.
Let's say I have a data frame which contains 8 variables and 20 observations. Of the 8 variables, V1 - V3 are predictors and V4 - V8 are outcomes.
B <- matrix(1:160,
            nrow = 20,
            ncol = 8)
df <- as.data.frame(B)
Using the car package, the code to perform a simple linear regression and display the summary and confidence intervals is:
fit <- lm(V4 ~ V1, data = df)
summary(fit)
confint(fit)
How can I write code (a loop or apply) so that R regresses each outcome on each predictor individually and extracts the coefficients and confidence intervals? I realise I'm probably trying to run before I can walk, but any help would be really appreciated.
You could wrap your lines in a lapply call and train a linear model for each of your predictors (excluding the target, of course).
my.target <- 4
my.predictors <- (1:8)[-my.target]
lapply(my.predictors, function(i) {
  fit <- lm(df[, my.target] ~ df[, i])
  list(summary = summary(fit), confint = confint(fit))
})
You obtain a list of lists.
So, the code with my own data that returns the error is:
my.target <- metabdata[c(34)]
my.predictors <- metabdata[c(18:23)]
lapply(my.predictors, function(i) {
  fit <- lm(metabdata[, my.target] ~ metabdata[, i])
  list(summary = summary(fit), confint = confint(fit))
})
Returns:
Error: Unsupported index type: tbl_df
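No answer to the follow-up is shown, but the error message points at the indices: metabdata[c(34)] returns a one-column tibble, not a column position, so it cannot be used inside [, ]. A sketch using plain integer positions instead (column numbers taken from the question; metabdata is the asker's data):
my.target <- 34        # a column position, not a one-column tibble
my.predictors <- 18:23 # predictor column positions
lapply(my.predictors, function(i) {
  fit <- lm(metabdata[[my.target]] ~ metabdata[[i]])
  list(summary = summary(fit), confint = confint(fit))
})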

five-fold cross-validation with the use of linear regression

I would like to perform five-fold cross-validation for a regression model of degree 1:
lm(y ~ poly(x, degree = 1), data)
I generated 100 observations with the following code
set.seed(1)
GenData <- function(n) {
  x <- seq(-2, 2, length.out = n)
  y <- -4 - 3*x + 1.5*x^2 + 2*x^3 + rnorm(n, 0, 0.5)
  return(cbind(x, y))
}
D <- GenData(100)
and my code for this goal is
ind <- sample(1:100)
re <- NULL
k <- 20
teams <- 5
t <- NULL
for (i in 1:teams) {
  te <- ind[((i - 1) * k + 1):(i * k)]
  train <- D[-te, 1:2]
  test <- D[te, 1:2]
  cl <- D[-te, 2]
  lm1 <- lm(cl ~ train[, 1], data = train)
  pred <- predict(lm1, test)
  t <- c(t, sum(D[te, 2] == pred) / dim(test)[1])
}
re <- c(re, mean(t))
where I split my data into training and test sets. With the training data I run a regression, with the aim of making a prediction and comparing it with my test data. But I get the following error:
"Error in predict(mult, test)$class :
$ operator is invalid for atomic vectors
In addition: Warning message:
'newdata' had 20 rows but variables found have 80 rows "
So I understand that I have to change something in the line
pred <- predict(lm1, test)
but I don't know what.
Thanks in advance!
lm requires a data frame as its input data. Also, trying to validate the model by checking whether predictions exactly match the expected values will not work: you are simulating the irreducible error with normally distributed noise, so exact matches are essentially impossible.
Here is the updated code:
ind <- sample(1:100)
re <- NULL
k <- 20
teams <- 5
t <- NULL
for (i in 1:teams) {
  te <- ind[((i - 1) * k + 1):(i * k)]
  train <- data.frame(D[-te, 1:2])
  test <- data.frame(D[te, 1:2])
  lm1 <- lm(y ~ x, data = train)
  pred <- predict(lm1, test)
  t <- c(t, sum(abs(D[te, 2] - pred)) / dim(test)[1])
}
re <- c(re, mean(t))
In the lm() function, your y variable is cl, a vector not included in the data = argument:
cl <- D[-te, 2]
lm1 <- lm(cl ~ train[, 1], data = train)
There is no need for cl at all. Rather, simply refer to the variables by their names in the dataset train; in this case the names are x and y:
names(train)
[1] "x" "y"
So your for loop would then look like:
for (i in 1:teams) {
  te <- ind[((i - 1) * k + 1):(i * k)]
  train <- D[-te, 1:2]
  test <- D[te, 1:2]
  lm1 <- lm(y ~ x, data = train)
  pred <- predict(lm1, test)
  t[i] <- sum(D[te, 2] == pred) / dim(test)[1]
}
Also, note that I have added the for loop index i so that values are assigned into the object. Lastly, I had to make the D object a data frame in order for the code to work:
D <- as.data.frame(GenData(100))
Your re object ends up being 0 because the model never predicts an observed value exactly. I would suggest using RMSE as a performance measure for continuous data.
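For instance, the exact-match line inside the loop could be replaced with a per-fold RMSE (a one-line sketch reusing the loop's variables):
t[i] <- sqrt(mean((D[te, 2] - pred)^2)) # root mean squared error for fold i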

How to extract arguments used in a fitted nls model for use in a second fitting with do.call

I am trying to use the same original arguments from a fitted nls model in the fitting of a second model using a subset of the data (for a cross validation exercise). I can retrieve the arguments (e.g. fit$call), but am having a hard time passing these arguments to do.call.
# original model ----------------------------------------------------------
# generate data
set.seed(1)
n <- 100
x <- sort(rlnorm(n, 1, 0.2))
y <- 0.1*x^3 * rlnorm(n, 0, 0.1)
df <- data.frame(x, y)
plot(y ~ x, df)
# fit model
fit <- nls(y ~ a*x^b, data = df, start = list(a = 1, b = 2), lower = list(a = 0, b = 0), algorithm = "port")
summary(fit)
plot(y ~ x, df)
lines(df$x, predict(fit), col = 4)
# perform model fit on subset with same starting arguments ----------------
# df sampled subset
dfsub <- df[sample(nrow(df), nrow(df)*0.5),]
dim(dfsub)
plot(y~x, dfsub)
ARGS <- fit$call # original call information
ARGS$data <- dfsub # substitute dfsub as data
ARGS <- as.list(ARGS) # change class from "call", to "list"
fitsub <- do.call(nls, args = ARGS )
# Error in xj[i] : invalid subscript type 'closure'
Also, as a side note, fit$data just returns the name of the data object. Is the data actually contained within the fitted nls object as well (as lm and other model fits sometimes do)?
Use update to add a subset argument:
nr <- nrow(df)
update(fit, subset = sample(nr, nr * 0.5) )
You can use the update function to refit the model with a different data set:
fitsub <- update(fit, data = dfsub)
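For completeness, the original do.call approach can also be rescued. The error arises because as.list(fit$call) keeps the function name as its first, unnamed element, which do.call then matches to the formula argument of nls. A hedged sketch of a fix is to drop that element before the call:
ARGS <- as.list(fit$call)[-1] # drop the function name, keep only the arguments
ARGS$data <- dfsub            # substitute dfsub as data
fitsub <- do.call(nls, ARGS)
As for the side note: fit$data stores only the name of the data object, not a copy; the variables used in the fit are typically accessible through the environment returned by fit$m$getEnv().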
