Error : 'data' must be a data.frame, environment, or list - r

#define training and testing sets
set.seed(555)
train <- df2[1:800, c("charges")]
y_test <- df2[801:nrow(df2), c("charges")]
test <- df2[801:nrow(df2), c("age","bmi","children","smoker")]
#use model to make predictions on a test set
model <- pcr(charges~age+bmi+children+smoker, data = train, scale=TRUE, validation="CV")
pcr_pred <- predict(model, test, ncomp = 4)
#calculate RMSE
sqrt(mean((pcr_pred - y_test)^2))
I dont know why i get this error... already tried number of things but still stuck here

When you executed:
train <- df2[1:800, c("charges")]
You created an R atomic character vector. The class of the result would not be a list unless you also added the drop=FALSE parameter:
train <- df2[1:800, c("charges"), drop=FALSE]
That should fix that error although the lack of any data prevents any of us from determining whether further errors might arise. Actually, I'm pretty sure you did not want that train object to be just a single column since your model obviously expected other columns. Try this instead:
set.seed(555)
train <- df2[1:800, ]
test <- df2[801:nrow(df2), ]
#use model to make predictions on a test set
model <- pcr(charges~age+bmi+children+smoker, data = train, scale=TRUE, validation="CV")
pcr_pred <- predict(model, test, ncomp = 4)
#calculate RMSE
sqrt(mean((pcr_pred - y_test)^2))

Related

How to predict in kknn function? library(kknn)

I try to use kknn + loop to create a leave-out-one cross validation for a model, and compare that with train.kknn.
I have split the data into two parts: training (80% data), and test (20% data). In the training data, I exclude one point in the loop to manually create LOOCV.
I think something gets wrong in predict(knn.fit, data.test). I have tried to find how to predict in kknn through the kknn package instruction and online but all the examples are "summary(model)" and "table(validation...)" rather than the prediction on a separate test data. The code predict(model, dataset) works successfully in train.kknn function, so I thought I could use the similar arguments in kknn.
I am not sure if there is such a prediction function in kknn. If yes, what arguments should I give?
Look forward to your suggestion. Thank you.
library(kknn)
for (i in 1:nrow(data.train)) {
train.data <- data.train[-i,]
validation.data <- data.train[i,]
knn.fit <- kknn(as.factor(R1)~., train.data, validation.data, k = 40,
kernel = "rectangular", scale = TRUE)
# train.data + validation.data is the 80% data I split.
}
pred.knn <- predict(knn.fit, data.test) # data.test is 20% data.
Here is the error message:
Error in switch(type, raw = object$fit, prob = object$prob,
stop("invalid type for prediction")) : EXPR must be a length 1
vector
Actually I try to compare train.kknn and kknn+loop to compare the results of the leave-out-one CV. I have two more questions:
1) in kknn: is it possible to use another set of data as test data to see the knn.fit prediction?
2) in train.kknn: I split the data and use 80% of the whole data and intend to use the rest 20% for prediction. Is it an correct common practice?
2) Or should I just use the original data (the whole data set) for train.kknn, and create a loop: data[-i,] for training, data[i,] for validation in kknn? So they will be the counterparts?
I find that if I use the training data in the train.kknn function and use prediction on test data set, the best k and kernel are selected and directly used in generating the predicted value based on the test dataset.
In contrast, if I use kknn function and build a loop of different k values, the model generates the corresponding prediction results based on
the test data set each time the k value is changed. Finally, in kknn + loop, the best k is selected based on the best actual prediction accuracy rate of test data. In short, the best k train.kknn selected may not work best on test data.
Thank you.
For objects returned by kknn, predict gives the predicted value or the predicted probabilities of R1 for the single row contained in validation.data:
predict(knn.fit)
predict(knn.fit, type="prob")
The predict command also works on objects returned by train.knn.
For example:
train.kknn.fit <- train.kknn(as.factor(R1)~., data.train, ks = 10,
kernel = "rectangular", scale = TRUE)
class(train.kknn.fit)
# [1] "train.kknn" "kknn"
pred.train.kknn <- predict(train.kknn.fit, data.test)
table(pred.train.kknn, as.factor(data.test$R1))
The train.kknn command implements a leave-one-out method very close to the loop developed by #vcai01. See the following example:
set.seed(43210)
n <- 500
data.train <- data.frame(R1=rbinom(n,1,0.5), matrix(rnorm(n*10), ncol=10))
library(kknn)
pred.kknn <- array(0, nrow(data.train))
for (i in 1:nrow(data.train)) {
train.data <- data.train[-i,]
validation.data <- data.train[i,]
knn.fit <- kknn(as.factor(R1)~., train.data, validation.data, k = 40,
kernel = "rectangular", scale = TRUE)
pred.kknn[i] <- predict(knn.fit)
}
knn.fit <- train.kknn(as.factor(R1)~., data.train, ks = 40,
kernel = "rectangular", scale = TRUE)
pred.train.kknn <- predict(knn.fit, data.train)
table(pred.train.kknn, pred.kknn)
# pred.kknn
# pred.train.kknn 1 2
# 0 374 14
# 1 9 103

QDA | lengths of training and test data sets | How to split data in training and test data?

In QDA (Quadratic Discriminant Analysis), do i need to keep length of training and test data exactly same? If not, how do you find a Confusion Matrix in such cases?
Here's psuedo data.
Because if I keep training-data and test data sets of different lengths, it gives an error (Using R Studio):
"Error in table(pred, true) : all arguments must have the same length".
Tried to remove NAs using na.omit() on both data sets as well as pred and true; and using na.action = na.exclude for qda(), but it didn't work.
After dividing the data set in exactly half; half of it as training and half as test; it worked perfectly after na.omit() on pred and true.
Following is the code used for either of approaches. In approach 2, with data split into equal halves, it worked perfectly fine.
#Approach 1: divide data age-wise
train <- vif_data$Age < 30
# there are around 400 values passing (TRUE) above condition and around 50 failing (FALSE)
train_vif <- vif_data[train,]
test_vif <- vif_data[!train,]
#taking QDA
zone_qda <- qda(train_vif$Awareness~train_vif$Zone, na.action = na.exclude)
#compare QDA against test data
zone_pred <- predict(zone_qda, test_vif)
#omitting nulls
pred <- na.omit(zone_pred$class)
true <- na.omit(test_vif$Awareness)
length(pred) # result: 399
length(true) # result: 47
#that's where it throws error: "Error in table(zone_pred$class, train_vif) : all arguments must have the same length"
zone_aware <- table(zone_pred$class, train_vif)
# OR
zone_aware <- table(pred, true)
accur <- mean(zone_pred$class==test_vif$Awareness)
###############################
#Approach 2: divide data into random halves
train <- splitSample(dataset = vif_data, div = 2, path = "./", type = "csv")
train_data <- read.csv("splitSample_s1.csv")
test_data <- read.csv("splitSample_s2.csv")
#taking QDA
zone_qda <- qda(train_vif$Awareness~train_vif$Zone, na.action = na.exclude)
#compare QDA against test data
zone_pred <- predict(zone_qda, test_vif)
#omitting nulls
pred <- na.omit(zone_pred$class)
true <- na.omit(test_vif$Awareness)
length(train_vif)
# this works fine
zone_aware <- table(zone_pred$class, train_vif)
# OR
zone_aware <- table(pred, true)
accur <- mean(zone_pred$class==test_vif$Awareness)
Want to know if there is any method by which we can have a confusion matrix with data set unequally divided into training and test data set.
Thanks!
Are you plugging in your training inputs instead of your test set input data to predict? Notice how this yields the same error message:
table(c(1,2),c(1,2,3))
If pred isn't the right length, then you're probably predicting incorrectly. At the moment, you haven't shared any of your code, so I cannot say anything more. But there is no reason that you shouldn't be able to get a confusion matrix using test data of different size than your training data.

Error in predict.randomForest

I was hoping someone would be able to help me out with an issue I am having with the prediction function of the randomForest package in R. I keep getting the same error when I try to predict my test data:
Here's my code so far:
extractFeatures <- function(RCdata) {
features <- c(4, 9:13, 17:20)
fea <- RCdata[, features]
fea$Week <- as.factor(fea$Week)
fea$Age_Range <- as.factor(fea$Age_Range)
fea$Race <- as.factor(fea$Race)
fea$Referral_Source <- as.factor(fea$Referral_Source)
fea$Referral_Source_Category <- as.factor(fea$Referral_Source_Category)
fea$Rehire <- as.factor(fea$Rehire)
fea$CLFPR_.HS <- as.factor(fea$CLFPR_.HS)
fea$CLFPR_HS <- as.factor(fea$CLFPR_HS)
fea$Job_Openings <- as.factor(fea$Job_Openings)
fea$Turnover <- as.factor(fea$Turnover)
return(fea)
}
gp <- runif(nrow(RCdata))
RCdata <- RCdata[order(gp), ]
train <- RCdata[1:4600, ]
test <- RCdata[4601:6149, ]
rf <- randomForest(extractFeatures(train), suppressWarnings(as.factor(train$disposition_category)), ntree=100, importance=TRUE)
testpredict <- predict(rf, extractFeatures(test))
"Error in predict.randomForest(rf, extractFeatures(test)) :
Type of predictors in new data do not match that of the training data."
I have tried adding in the following line to the code, and still receive the same error:
testpredict <- predict(rf, extractFeatures(test), type="prob")
I found the source of the error being the fact that the training data has a level or two that is not found in the test data. So when I tried another suggestion I found online to adjust the levels of the test data to that of the training data, I keep getting NULL values in the fields I am using in both the training and test sets.
levels(test$Referral)
NULL
I can see the levels when I use the function, however.
levels(as.factor(test$Referral))
So then I tried the same suggestion I found online with adjusting the levels of the test to equal that of the training data using the following function and received an error:
levels(as.factor(test$Referral)) -> levels(as.factor(train$Referral))
Error in `levels<-.factor`(`*tmp*`, value = c(... :
number of levels differs
I am sure there is something simple I am missing (I am still very new to R), so any insight you can provide would be unbelievably helpful. Thanks!

randomForest Predict error from test set

I am running into a an error with the R package of randomForest where after I split the data using Caret into training and testing, when I go to predict I run into error:
Error in predict.randomForest(randomForestFit, type = "response", newdata =testing$GEN)
:number of variables in newdata does not match that in the training data
I split the file between train and test from the exact same file. There are no N/A or missing values in any of the data. Below is my full code, but I do not think there is an error there. I am at a loss as to why this error is occurring. Any ideas would be greatly appreciated!
library(caret)
require(foreign)
set.seed(825)
data <- read.spss("C:/MODEL_SAMPLE.sav",use.value.labels=TRUE, to.data.frame = TRUE)
inTraining <- createDataPartition(data$GEN, p = 0.75, list = FALSE)
training <- data[inTraining, ]
testing <- data[-inTraining, ]
library(randomForest)
library(foreach)
start.time <- Sys.time()
randomForestFit <- foreach(ntree=rep(63, 8), .combine=combine, .packages='randomForest')
%dopar% randomForest(training[-201],
training$GEN,
mtry = 40,
ntree=ntree,
verbose = TRUE,
importance = TRUE,
keep.forest=TRUE,
do.trace = TRUE)
randomForestFit
predict = predict(randomForestFit, type="response", newdata=testing$GEN)
stopCluster(cl)
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
Without the data, its hard for anyone to say what the problem is exactly.
Three suggestions:
First, check the SPSS file for stray characters in the data.
Second, check the options from read.spss are set correctly especially: reencode = NA, use.missings =to.data.frame. You can use the latter option to specify non numeric characters to be turned into NA.
Third, use str(df), summary(df,useNA="if any") and make sure your factor variables including the response are actually factors. Apply as.numeric(as.character()) to numeric data in the data frame, this will generate NA values if there are expressions like VALUE!, #NA in the data frame.
You could also export to csv from SPSS and do the above again.
The key is the following
:number of variables in newdata does not match that in the training data
I therefore guess that the training and test data are different, in particular the column names. Maybe it breaks at this line?
inTraining <- createDataPartition(data$GEN, p = 0.75, list = FALSE)
To better understand the problem, you might have to post 3 rows of the training and test data set (with column names!).
I hope this helps!

How to perform 10 fold cross validation with LibSVM in R?

I know that in MatLab this is really easy ('-v 10').
But I need to do it in R. I did find one comment about adding cross = 10 as parameter would do it. But this is not confirmed in the help file so I am sceptical about it.
svm(Outcome ~. , data= source, cost = 100, gamma =1, cross=10)
Any examples of a successful SVM script for R would also be appreciated as I am still running into some dead ends?
Edit: I forgot to mention outside of the tags that I use the libsvm package for this.
I am also trying to perform a 10 fold cross validation. I think that using tune is not the right way in order to perform it, since this function is used to optimize the parameters, but not to train and test the model.
I have the following code to perform a Leave-One-Out cross validation. Suppose that dataset is a data.frame with your data stored. In each LOO step, the observed vs. predicted matrix is added, so that at the end, result contains the global observed vs. predicted matrix.
#LOOValidation
for (i in 1:length(dataset)){
fit = svm(classes ~ ., data=dataset[-i,], type='C-classification', kernel='linear')
pred = predict(fit, dataset[i,])
result <- result + table(true=dataset[i,]$classes, pred=pred);
}
classAgreement(result)
So in order to perform a 10-fold cross validation, I guess we should manually partition the dataset, and use the folds to train and test the model.
for (i in 1:10)
train <- getFoldTrainSet(dataset, i)
test <- getFoldTestSet(dataset,i)
fit = svm(classes ~ ., train, type='C-classification', kernel='linear')
pred = predict(fit, test)
results <- c(results,table(true=test$classes, pred=pred));
}
# compute mean accuracies and kappas ussing results, which store the result of each fold
I hope this help you.
Here is a simple way to create 10 test and training folds using no packages:
#Randomly shuffle the data
yourData<-yourData[sample(nrow(yourData)),]
#Create 10 equally size folds
folds <- cut(seq(1,nrow(yourData)),breaks=10,labels=FALSE)
#Perform 10 fold cross validation
for(i in 1:10){
#Segement your data by fold using the which() function
testIndexes <- which(folds==i,arr.ind=TRUE)
testData <- yourData[testIndexes, ]
trainData <- yourData[-testIndexes, ]
#Use test and train data howeever you desire...
}
Here is my generic code to run a k-fold cross validation aided by cvsegments to generate the index folds.
# k fold-cross validation
set.seed(1)
k <- 80;
result <- 0;
library('pls');
folds <- cvsegments(nrow(imDF), k);
for (fold in 1:k){
currentFold <- folds[fold][[1]];
fit = svm(classes ~ ., data=imDF[-currentFold,], type='C-classification', kernel='linear')
pred = predict(fit, imDF[currentFold,])
result <- result + table(true=imDF[currentFold,]$classes, pred=pred);
}
classAgreement(result)

Resources