How to get X & Y rows to match? - r

I'm working on a new kind of code and need a little help with ridge-regularized regression. I'm trying to build a predictive model, but first I need the rows of my x and y matrices to match.
I found something similar through a Google search, but that data was randomly generated rather than provided like mine is. My data is a large dataset with over 500,000 observations and 670 variables.
library(rsample)
library(glmnet)
library(dplyr)
library(ggplot2)
# Create training (70%) and test (30%) sets
# Use set.seed for reproducibility
set.seed(123)
alumni_split<-initial_split(alumni, prop=.7, strata = "Id.Number")
alumni_train<-training(alumni_split)
alumni_test<-testing(alumni_split)
#----
# Create training and testing feature model matrices and response vectors.
# we use model.matrix(...)[, -1] to discard the intercept
alumni_train_x <- model.matrix(Id.Number ~ ., alumni_train)[, -1]
alumni_test_x <- model.matrix(Id.Number ~ ., alumni_test)[, -1]
alumni_train_y <- log(alumni_train$Id.Number)
alumni_test_y <- log(alumni_test$Id.Number)
# What is the dimension of your feature matrix?
dim(alumni_train_x)
#---- [HERE]
# Apply Ridge regression to alumni data
alumni_ridge <- glmnet(alumni_train_x, alumni_train_y, alpha = 0)
The error message (with code):
alumni_ridge <- glmnet(alumni_train_x, alumni_train_y, alpha = 0)
Error in glmnet(alumni_train_x, alumni_train_y, alpha = 0) :
number of observations in y (329870) not equal to the number of rows of
x (294648)
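One likely cause, assuming `alumni` contains missing values (the post does not confirm this): `model.matrix()` builds a model frame first, and the default `na.action` (`na.omit`) drops every row that has an `NA`, while `alumni_train$Id.Number` keeps all rows. The sketch below reproduces the mismatch on a small made-up data frame (`df` and its columns are hypothetical stand-ins) and shows one way to keep x and y aligned.

```r
# Toy data frame standing in for `alumni` (hypothetical values)
df <- data.frame(
  y  = c(1.2, 2.4, 3.1, 4.8, 5.0),
  x1 = c(10, NA, 30, 40, 50),
  x2 = c("a", "b", NA, "a", "b")
)

# model.matrix() silently drops rows 2 and 3 because they contain an NA
x_dropped <- model.matrix(y ~ ., df)[, -1]
nrow(x_dropped)   # fewer rows than nrow(df)
length(df$y)      # the response still has every row

# One possible fix: remove incomplete rows once, then build x and y
# from the same filtered data so the row counts agree
df_complete <- df[complete.cases(df), ]
x <- model.matrix(y ~ ., df_complete)[, -1]
y <- df_complete$y
nrow(x) == length(y)
```

If this is what is happening, filtering `alumni_train` and `alumni_test` with `complete.cases()` before the `model.matrix()` calls should make the two counts agree. Whether dropping those rows is acceptable for the analysis is a separate decision.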

Related

How can I calculate the mean square error in R of a regression tree?

I am working with the wine quality database.
I am studying regression trees depending on different variables as:
library(rpart)
library(rpart.plot)
library(rattle)
library(naniar)
library(dplyr)
library(ggplot2)
vinos <- read.csv(file = 'Wine.csv', header = T)
arbol0<-rpart(formula=quality~chlorides, data=vinos, method="anova")
fancyRpartPlot(arbol0)
arbol1<-rpart(formula=quality~chlorides+density, data=vinos, method="anova")
fancyRpartPlot(arbol1)
I want to calculate the mean square error to see if arbol1 is better than arbol0. I will use my own dataset since no more data is available. I have tried to do it as
aaa<-predict(object=arbol0, newdata=data.frame(chlorides=vinos$chlorides), type="anova")
bbb<-predict(object=arbol1, newdata=data.frame(chlorides=vinos$chlorides, density=vinos$density), type="anova")
and then manually subtract the last column of the data frame from aaa and bbb. However, I am getting an error. Can someone please help me?
This website could be useful for you. It's very important to split your dataset into train and test subsets before training your models. In the following code I've done it with base functions, but the sample.split function from the caTools package can be used for the same purpose. I'm also attaching this website, where you can see all the ways to split data in R.
Remember that the Mean Squared Error (MSE) is defined as:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
where $y_i$ are the observed values and $\hat{y}_i$ are the predicted values.
So it's very simple to apply in R. You just compute the mean of the squared differences between the observed values (i.e., the response variable from your test subset) and the predicted values (i.e., the values returned by the predict function for your model).
A solution for your wine dataset could be this one, based on the previous website.
library(rpart)
library(dplyr)
library(data.table)
vinos <- fread(file = 'Winequality-red.csv', header = TRUE)
# Split data into train and test subsets
sample_index <- sample(nrow(vinos), size = nrow(vinos)*0.75)
train <- vinos[sample_index, ]
test <- vinos[-sample_index, ]
# Train regression trees models
arbol0 <- rpart(formula = quality ~ chlorides, data = train, method = "anova")
arbol1 <- rpart(formula = quality ~ chlorides + density, data = train, method = "anova")
# Make predictions for each model
pred0 <- predict(arbol0, newdata = test)
pred1 <- predict(arbol1, newdata = test)
# Calculate MSE for each model
mean((pred0 - test$quality)^2)
mean((pred1 - test$quality)^2)
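As a small sketch of the same calculation, the two `mean(...)` lines can be wrapped in a helper. The function name `mse` and the toy vectors are illustrative assumptions, not part of the original answer.

```r
# Hypothetical helper: mean of squared differences between
# observed and predicted values
mse <- function(observed, predicted) {
  stopifnot(length(observed) == length(predicted))
  mean((observed - predicted)^2)
}

# Toy values: squared errors are 0.25, 0 and 1, so the MSE is 1.25 / 3
observed  <- c(5, 6, 7)
predicted <- c(5.5, 6, 6)
toy_mse <- mse(observed, predicted)
toy_mse

# The root mean squared error is on the same scale as the response
toy_rmse <- sqrt(toy_mse)
toy_rmse
```

With the objects from the answer, this would be called as `mse(test$quality, pred0)` and `mse(test$quality, pred1)`; the model with the lower test MSE predicts better on the held-out data.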

kNN algorithm not working while using caret

I am trying to run LOOCV kNN on this dataset (104x182 where the first 62 samples are B and the following 42 are C). I first conducted a PCA on the standardized version of this dataset (giving me 104 PCs). I then try to perform LOOCV kNN for i = 3:98 where i refers to the number of PCs I will use for my kNN model. For each i I pull out the highest accuracy, which k it occurs at and store it within a data frame.
# required packages
library(MASS)
library(class)
library(tidyverse)
library(caret)
# reading in and cleaning data
data <- read.csv("chowdary.csv")
og_data <- data[, -1]
st_data <- as.data.frame(cbind(og_data[, 1], scale(og_data[, -1])))
colnames(st_data)[1] <- "tumour"
# PCA for dimension reduction
# on standardized data
pca_all <- prcomp(og_data[, -1], center=TRUE, scale=TRUE)
# creating data frame to store best k value for each number of PCs
kdf_pca_all_cc <- tibble(i=as.numeric(), # this is for storing number of PCs used,
pca_all_k=as.numeric(), # k value,
pca_all_acc=as.numeric(), # accuracy value,
pca_all_kapp=as.numeric()) # and kappa value
# kNN
k_kNN <- 3:97 # number of PCs to use in each iteration of the model
train_control <- trainControl(method="LOOCV")
kNN_data <- as.data.frame(cbind(as.factor(st_data[, 1]), pca_all$x)) # data used in kNN model below
for (i in k_kNN){
a111 <- train(V1~ .,
method="knn",
tuneGrid=expand.grid(k=1:25),
trControl=train_control,
metric="Accuracy",
data=kNN_data[, 1:i])
b111 <- a111$results[as.integer(a111$bestTune), ] # this is to store the best accuracy rate, along with its k and kappa value
kdf_pca_all_cc <- kdf_pca_all_cc %>%
add_row(i=i-1,
pca_all_k=b111[, 1],
pca_all_acc=b111[, 2],
pca_all_kapp=b111[, 3])
}
For example, for i = 5, the kNN model would be using the following data:
head(kNN_data[, 1:5])
V1 PC1 PC2 PC3 PC4
1 1 3.299844 0.2587487 -1.00501632 2.0273727
2 1 1.427856 -1.0455044 -1.79970790 2.5244021
3 1 3.087657 1.2563404 1.67591441 -1.4270431
4 1 3.107778 1.5893396 2.65871270 -2.8217264
5 1 3.244306 0.5982652 0.37011029 0.3642425
6 1 3.000098 0.5471276 -0.01178315 1.0857886
However, whenever I try to run the for-loop, I get the following error and warning:
Error: Metric Accuracy not applicable for regression models
In addition: Warning message:
In train.default(x, y, weights = w, ...) :
You are trying to do regression and your outcome only has two possible values Are you trying to do classification? If so, use a 2 level factor as your outcome column.
I have no idea how to fix this. Any help would be much appreciated.
Also, as a side note, is there a faster way to run this for-loop? It takes quite a while but I have no idea how to make it more efficient. Thank you.
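The warning suggests the outcome column is not a factor by the time it reaches `train()`. One probable reason, sketched below with made-up labels and scores (all object names here are hypothetical): `cbind()` on a factor and a numeric matrix returns a numeric matrix, so the factor is reduced to its integer codes. Building the data with `data.frame()` keeps the factor.

```r
set.seed(1)

# Hypothetical stand-ins for the tumour labels and the PCA scores
labels <- factor(rep(c("B", "C"), times = c(6, 4)))
scores <- matrix(rnorm(30), nrow = 10,
                 dimnames = list(NULL, paste0("PC", 1:3)))

# cbind() coerces everything to one numeric matrix,
# so the first column holds the codes 1 and 2 as plain numbers
via_cbind <- as.data.frame(cbind(as.factor(labels), scores))
class(via_cbind$V1)

# data.frame() keeps each column's own type
via_df <- data.frame(tumour = labels, scores)
class(via_df$tumour)

# With a factor outcome, a call such as
#   train(tumour ~ ., data = via_df[, 1:i], method = "knn", ...)
# would be treated as classification rather than regression.
```

Under that assumption, replacing the `cbind()` call that creates `kNN_data` with a `data.frame()` call should let `metric = "Accuracy"` be used.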

`table` not showing in matrix format

I'm trying to generate a confusion table using the HMDA data from the AER package. I fit a probit model, predicted on the testing set, and used the table() function to generate a 2 by 2 table, but R just returns a long list instead of the 2 by 2 matrix that I wanted.
Could anyone tell me what's going on?
# load required packages and data (HMDA)
library(e1071)
library(caret)
library(AER)
library(plotROC)
data(HMDA)
# again, check variable columns
names(HMDA)
# convert dependent variables to numeric
HMDA$deny <- ifelse(HMDA$deny == "yes", 1, 0)
# subset needed columns
subset <- c("deny", "hirat", "lvrat", "mhist", "unemp")
# subset data
data <- HMDA[complete.cases(HMDA), subset]
# do a 75-25 train-test split
train_row_numbers <- createDataPartition(data$deny, p=0.75, list=FALSE)
training <- data[train_row_numbers, ]
testing <- data[-train_row_numbers, ]
# fit a probit model and predict on testing data
probit.fit <- glm(deny ~ ., family = binomial(link = "probit"), data = training)
probit.pred <- predict(probit.fit, testing)
confmat_probit <- table(Predicted = probit.pred,
Actual = testing$deny)
confmat_probit
You need to specify the threshold or cut-point for predicting a dichotomous outcome. Predict returns the predicted values, not 0 / 1.
And be careful with the predict function as the default type is "link", which in your case is the "probit". If you want predict to return the probabilities, specify type="response".
probit.pred <- predict(probit.fit, testing, type="response")
Then choose a cut-point; any prediction above this value will be TRUE:
confmat_probit <- table(`Predicted>0.1` = probit.pred > 0.1 , Actual = testing$deny)
confmat_probit
             Actual
Predicted>0.1   0   1
        FALSE 248  21
        TRUE  273  53
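For a version that runs without the AER package, here is a minimal sketch of the same idea on the built-in `mtcars` data. The model `am ~ wt` and the 0.5 cut-point are arbitrary choices for illustration, not recommendations.

```r
# Probit model on a built-in dataset (illustrative choice of variables)
fit <- glm(am ~ wt, family = binomial(link = "probit"), data = mtcars)

# Default type = "link" returns the linear predictor (probit scale)
eta <- predict(fit, mtcars)

# type = "response" returns probabilities between 0 and 1
p <- predict(fit, mtcars, type = "response")

# For a probit link the two are related through the normal CDF
all.equal(pnorm(eta), p)

# Apply a cut-point; fixing the factor levels guarantees a 2 x 2 table
# even if one class is never predicted
pred_class <- factor(p > 0.5, levels = c(FALSE, TRUE))
conf <- table(Predicted = pred_class, Actual = mtcars$am)
conf
```

Tabulating the raw predictions instead gives one row per distinct predicted value, which is why the original output looked like a long list.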

Unable to get R-squared for test dataset

I am trying to learn a bit about different types of regression and I am hacking my way through the code sample below.
library(magrittr)
library(dplyr)
# Polynomial degree 1
df=read.csv("C:\\path_here\\auto_mpg.csv",stringsAsFactors = FALSE) # Data from UCI
df1 <- as.data.frame(sapply(df,as.numeric))
# Select key columns
df2 <- df1 %>% select(cylinder,displacement,horsepower,weight,acceleration,year,mpg)
df3 <- df2[complete.cases(df2),]
smp_size <- floor(0.75 * nrow(df3))
# Split as train and test sets
train_ind <- sample(seq_len(nrow(df3)), size = smp_size)
train <- df3[train_ind, ]
test <- df3[-train_ind, ]
Rsquared <- function (x, y) cor(x, y) ^ 2
# Fit a model of degree 1
fit <- lm(mpg~. ,data=train)
rsquared1 <-Rsquared(fit,test$mpg)
sprintf("R-squared for Polynomial regression of degree 1 (auto_mpg.csv) is : %f", rsquared1)
I am getting this error:
'Error in cor(x, y) : 'x' must be numeric'
I got the code samples from here (1.2b & 1.3a).
https://gigadom.wordpress.com/2017/10/06/practical-machine-learning-with-r-and-python-part-1/
The raw data is available here.
https://raw.githubusercontent.com/tvganesh/MachineLearning-RandPython/master/auto_mpg.csv
Just a few minutes ago I got an upvote for Function to calculate R2 (R-squared) in R. Now I guess it is from you, thanks.
The Rsquared function expects two vectors, but you've passed in a model object fit (which is a list) and a vector test$mpg. I guess you want predict(fit, newdata = test) as its first argument here.
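A minimal runnable sketch of that fix, using the built-in `mtcars` data instead of auto_mpg.csv so that it needs no file. The predictors `wt` and `hp` and the seed are arbitrary choices for illustration.

```r
set.seed(42)

# 75/25 train-test split on a built-in dataset
train_ind <- sample(seq_len(nrow(mtcars)), size = floor(0.75 * nrow(mtcars)))
train <- mtcars[train_ind, ]
test  <- mtcars[-train_ind, ]

# Same helper as in the question: squared correlation of two numeric vectors
Rsquared <- function(x, y) cor(x, y)^2

fit <- lm(mpg ~ wt + hp, data = train)

# Pass predictions (a numeric vector), not the model object itself
pred <- predict(fit, newdata = test)
rsquared1 <- Rsquared(pred, test$mpg)
sprintf("R-squared on the test set is : %f", rsquared1)
```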

caret dummy-vars exclude target

How can I use dummy vars in caret without destroying my target variable?
set.seed(5)
data <- ISLR::OJ
data<-na.omit(data)
dummies <- dummyVars( Purchase ~ ., data = data)
data2 <- predict(dummies, newdata = data)
split_factor = 0.5
n_samples = nrow(data2)
train_idx <- sample(seq_len(n_samples), size = floor(split_factor * n_samples))
train <- data2[train_idx, ]
test <- data2[-train_idx, ]
modelFit<- train(Purchase~ ., method='lda',preProcess=c('scale', 'center'), data=train)
will fail, as the Purchase variable is missing. If I replace it with data$Purchase <- ifelse(data$Purchase == "CH",1,0) beforehand, caret complains that this is no longer a classification problem but a regression problem.
The example code seems to have a few issues, which are indicated in the comments below. To answer your questions:
The result of ifelse is a numeric vector, not a factor, so the train function defaults to regression.
To pass the dummyVars output directly to the function, use train(x = , y = , ...) instead of a formula.
To avoid these problems, check the class of your objects carefully.
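As a quick illustration of that class check, here is a sketch on a made-up vector standing in for the Purchase column (the values are hypothetical):

```r
# Hypothetical stand-in for the Purchase column
purchase <- factor(c("CH", "MM", "CH", "CH", "MM"))

# ifelse() with numeric outcomes returns a plain numeric vector,
# which a modelling function would read as a regression target
as_num <- ifelse(purchase == "CH", 1, 0)
class(as_num)

# Converting back to a factor restores a classification target
as_fac <- factor(as_num, levels = c(0, 1), labels = c("MM", "CH"))
class(as_fac)
levels(as_fac)
```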
Be aware that the preProcess option in train() applies the preprocessing to all numeric variables, including the dummies. Option 2 below avoids this by standardizing the data before calling train().
set.seed(5)
data <- ISLR::OJ
data<-na.omit(data)
# Make sure that all variables that should be a factor are defined as such
newFactorIndex <- c("StoreID","SpecialCH","SpecialMM","STORE")
data[, newFactorIndex] <- lapply(data[,newFactorIndex], factor)
library(caret)
# See help for dummyVars. The function does not take a dependent variable and predict will give an error
# I don't include the target variable here, so predicting dummies on new data will drop unknown columns
# including the target variable
dummies <- dummyVars(~., data = data[,-1])
# I don't change the data yet to apply standardization to the numeric variables,
# before turning the categorical variables into dummies
split_factor = 0.5
n_samples = nrow(data)
train_idx <- sample(seq_len(n_samples), size = floor(split_factor * n_samples))
# Option 1 (as asked): Specify independent and dependent variables separately
# Note that dummy variables will be standardized by preProcess as per the original code
# Turn the categorical variables into (unstandardized) dummies
# The output of predict is a matrix, change it to data frame
data2 <- data.frame(predict(dummies, newdata = data))
modelFit<- train(y = data[train_idx, "Purchase"], x = data2[train_idx,], method='lda',preProcess=c('scale', 'center'))
# Option 2: Append dependent variable to the independent variables (needs to be a data frame to allow factor and numeric)
# Note that I also shift the preprocessing away from train() to
# avoid standardizing the dummy variables
train <- data[train_idx, ]
test <- data[-train_idx, ]
preprocessor <- preProcess(train[!sapply(train, is.factor)], method = c('center',"scale"))
train <- predict(preprocessor, train)
test <- predict(preprocessor, test)
# Turn the categorical variables into (unstandardized) dummies
# The output of predict is a matrix, change it to data frame
train <- data.frame(predict(dummies, newdata = train))
test <- data.frame(predict(dummies, newdata = test))
# Reattach the target variable to the training data that has been
# dropped by predict(dummies,...)
train$Purchase <- data$Purchase[train_idx]
modelFit<- train(Purchase ~., data = train, method='lda')
