Undoing createDataPartition ordering - r

I'm new to data science and trying to finish up this project. I have a data frame (from https://www.kaggle.com/c/house-prices-advanced-regression-techniques) with assigned train and test sets (rows 1:1460 and 1461:2919). I was advised to use createDataPartition() because of an error I was getting when trying to predict:
> predSale <- as.data.frame(predict(model, newdata = test))
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev =
object$xlevels) : factor MSSubClass has new levels 150
But createDataPartition() is mixing up my original train and test sets, which I need in a specific order for the Kaggle submission. I've read the vignette, and it looks like there is a returnTrain argument. I'm not sure whether it could be used here (I don't fully understand it), but ultimately I'd like to know if there is a way to undo the reordering so I can submit my project with the original ordered set.
test$SalePrice <- NA
combined <- rbind(train, test)
train <- combined[1:1460, ]
test <- combined[1461:2919, ]
#____________Models____________
set.seed(666)
index <- createDataPartition(paste(combined$MSSubClass, combined$Neighborhood,
                                   combined$Condition2, combined$LotConfig,
                                   combined$Exterior1st, combined$Exterior2nd,
                                   combined$RoofMatl, combined$MiscFeature,
                                   combined$SaleType))$Resample1
train <- combined[index,]
test <- combined[-index,]
model <- lm(SalePrice ~., train)
predSale <- as.data.frame(predict(model, newdata = test))
SampleSubmission <- round(predSale, digits = 1)
write.csv(SampleSubmission, "SampleSubmission.csv")
Thanks!! If there is anything else you need, please let me know. I think I've provided everything, but I'm happy to edit in the full code or whatever more is needed.

You should not use createDataPartition() on a combined Kaggle dataset. You need to keep those data sets separate; that's why Kaggle provides them. If you want to combine them for cleaning, you have to split them back exactly as they were once the data cleaning is done.
But the issue you have is that there are factor levels in the test dataset that were never seen by the model. There are multiple posts on Kaggle about this issue, although I must say Kaggle's search engine is poor.
In Kaggle competitions some people use the following code on character columns to turn them into numerics (e.g. for using the data with xgboost). This code assumes that you loaded the data sets with stringsAsFactors = FALSE.
# feature.names holds the columns to convert, e.g. feature.names <- names(train)
for (f in feature.names) {
  if (class(train[[f]]) == "character") {
    levels <- unique(c(train[[f]], test[[f]]))
    test[[f]]  <- as.integer(factor(test[[f]],  levels = levels))
    train[[f]] <- as.integer(factor(train[[f]], levels = levels))
  }
}
Others use a version of the following to create all the level names in the training data set.
levels(xtrain[,x]) <- union(levels(xtest[,x]),levels(xtrain[,x]))
There are more ways of dealing with this.
Of course these solutions are fine for Kaggle, as they might give you a better score. But strictly speaking this is a form of data leakage, and using it in a production setting is asking for trouble. There are many situations in which you cannot know all of the possible values in advance, and when encountering a new value, returning a missing value instead of a prediction is the more sensible choice. But that discussion fills entire research articles.
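As a concrete illustration of the missing-value approach (a minimal sketch; the level values here are made up):

```r
# Coerce a test column to the training levels; values unseen in training
# become NA instead of crashing predict()
train_col <- factor(c("20", "60", "70"))
test_col  <- c("60", "150")                # "150" was never seen in training
aligned   <- factor(test_col, levels = levels(train_col))
is.na(aligned)                             # the unseen level maps to NA
# predict(model, newdata = ..., na.action = na.pass) would then return NA
# for that row rather than erroring out
```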

Related

How can I use a machine learning model to predict on data whose features differ slightly?

I have a randomForest model trained on a bunch of NLP data (tf-idf values for each word). I want to use it to predict on a new dataset. The features in the model overlap with but don't quite match the features in the new data, such that when I predict on the new data I get:
Error in predict.randomForest(object = model, newdata = new_data) :
variables in the training data missing in newdata
I thought to get around this error by excluding all the features from the model which do not appear in the new data, and all the features in the new data which do not appear in the model. Putting aside for the moment the impact on model accuracy (this would significantly pare down the number of features, but there would still be plenty to predict with), I did something like this:
model$forest$xlevels <- model$forest$xlevels[colnames(new_data)]
# and vice versa
new_data <- new_data[names(model$forest$xlevels)]
This worked, insofar as names(model$forest$xlevels) == colnames(new_data) returned TRUE for each feature name.
However, when I try to predict on the resulting new_data I still get the variables in the training data missing in newdata error. I am fairly certain that I'm amending the correct part of the model (model$forest$xlevels), so why isn't it working?
I think you should go the other way around, that is, add the missing columns to new_data.
When you are working with bags of words, it is common to have words that are not present in some batch of new data. These missing words should just be encoded as columns of zeros.
# do something like this (also exclude the target variable, obviously)
names_missing <- names(traindata)[!names(traindata) %in% names(new_data)]
new_data[,names_missing] <- 0L
and then you should be able to predict

Why is predict in R taking Train data instead of Test data? [duplicate]

Working in R to develop regression models, I have something akin to this:
c_lm = lm(trainingset$dependent ~ trainingset$independent)
c_pred = predict(c_lm, testset$independent)
and every single time, I get a mysterious error from R:
Warning message:
'newdata' had 34 rows but variables found have 142 rows
which essentially translates into R not being able to find the independent column of the testset data.frame. This is simply because the exact name from the right-hand side of the formula in lm must be there in predict. To fix it, I can do this:
tempset = trainingset
c_lm = lm(trainingset$dependent ~ tempset$independent)
tempset = testset
c_pred = predict(c_lm, tempset$independent)
or some similar variation, but this is really sloppy, in my opinion.
Is there another way to clean up the translation between the two so that the independent variables' data frame does not have to have the exact same name in predict as it does in lm?
No, No, No, No, No, No! Do not use the formula interface that way if you want all the other sugar that comes with model formulas. You wrote:
c_lm = lm(trainingset$dependent ~ trainingset$independent)
You repeat trainingset twice, which is redundant, a waste of typing, and, not least, the cause of the problem you are hitting. When you now call predict(), it will look for a variable in testset named trainingset$independent, which of course doesn't exist. Instead, use the data argument in your call to lm(). For example, this fits the same model as your formula but is efficient and also works properly with predict():
c_lm = lm(dependent ~ independent, data = trainingset)
Now when you call predict(c_lm, newdata = testset), you only need to have a data frame with a variable whose name is independent (or whatever you have in the model formula).
An additional reason to write formulas as I show them is legibility: getting the object name out of the formula allows you to see more easily what the model is.
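To make the difference concrete, here is a self-contained sketch (with made-up data) of the data-argument style that predict() expects:

```r
set.seed(1)
# Toy training and test sets; only the column names need to match the formula
trainingset <- data.frame(independent = 1:10,
                          dependent   = 2 * (1:10) + rnorm(10))
testset <- data.frame(independent = 11:15)

c_lm   <- lm(dependent ~ independent, data = trainingset)
c_pred <- predict(c_lm, newdata = testset)  # one prediction per test row
length(c_pred)                              # 5, not the 10 training rows
```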

Predict is not usable for mira/mipo?

I'm a student with no prior coding knowledge, but I'm taking a module that requires RStudio, and now I'm struggling.
I have an assignment that asks us to explore methods of dealing with missing data in a training data set and a test data set (multiple rows and multiple variables), then create a linear model with lm() on the training set and use predict() on that model with newdata = the test data to observe the results. I am supposed to learn how to use MICE for this assignment, but I am at a dead end.
In my attempt, I tried to fill up missing data of the training data set via MICE with my approach as follows:
train = read.csv("Train_Data.csv", na.strings=c("","NA"))
missingtraindata = mice(train, m=5, maxit = 5, method = 'pmm')
model = with(missingtraindata, lm(LOS~.- PatientID, data = train))
miceresults = pool(model)
summary(miceresults)
Then I tried to use predict(), but it doesn't work because it says predict() doesn't work with mira/mipo objects. I don't know what that means at all.
Honestly, I have no idea what any of this code does; I just tried to apply whatever information was in my notes about MICE. I don't know if that's how you correctly use MICE to fill in missing data, but I literally spent the entire day researching and trying, and it's been no help. Please helppppp!
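For context, predict() indeed has no method for mira/mipo objects; one common workaround is to fit the model on each dataset completed by mice() and pool the per-imputation predictions manually. A hedged sketch, with toy data standing in for the CSVs (all names here are illustrative, not from the assignment):

```r
library(mice)

set.seed(1)
# Toy stand-ins for Train_Data.csv and the test set
train <- data.frame(LOS = rnorm(50, mean = 5), Age = rnorm(50, mean = 40))
train$Age[sample(50, 10)] <- NA            # inject some missing values
test  <- data.frame(Age = rnorm(10, mean = 40))

imp <- mice(train, m = 5, maxit = 5, method = "pmm", printFlag = FALSE)

# Fit on each completed dataset, predict on the test set, then average
preds <- sapply(1:5, function(i) {
  fit <- lm(LOS ~ Age, data = complete(imp, i))
  predict(fit, newdata = test)
})
pooled_pred <- rowMeans(preds)             # one pooled prediction per test row
```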


Rpart Variables were specified with different types from the fit?

I make a classification tree using rpart. The data has 10 columns, all properly labeled. Five of these columns contain information such as the day of the week in the form of "Wed" and the other five contain numeric values.
I can successfully make a tree using Rpart, but when I try to run a test set of the data, or even the training set that made the tree, I get a bunch of warnings saying that the variables containing characters were changed to a factor, and then an error that says those same variables were specified with a different type from the fit.
Anyone know how to fix this?
My relevant code is:
library(rpart)
#read data into info
info <- data.frame(info)
set.seed(30198)
train_ind <- sample(1:2000, 1500)
training_data_info <- info[train_ind, ]
test_data_info <- info[-train_ind, ]
training_data_info <- data.frame(training_data_info)
test_data_info <- data.frame(test_data_info)
tree <- rpart(info ~ ., data = training_data_info, method = "class")
info.test.fit <- predict(tree, newdata=test_data_info) #this is where it goes wrong
You can't use character vectors in an rpart fit. You have to code them as factors. The code does this for you, but then you hit the problem that it is entirely possible for the test data to have a different set of levels from the training data used to fit the tree.
The error arises from the use of these two lines:
training_data_info <- data.frame(training_data_info)
test_data_info <- data.frame(test_data_info)
These are redundant; the objects are already data frames. All the calls achieve is to drop those levels from the whole dataset that are missing in either the training or the test subset, and that is where the error comes from. Try it without those two lines and you should be good to go.
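Another option is to give the train and test factor columns one shared level set drawn from the full data before fitting. A sketch with made-up data (the question's actual column names aren't shown):

```r
library(rpart)

set.seed(30198)
# Toy data: one character column, one numeric column, a factor outcome
info <- data.frame(day = sample(c("Mon", "Wed", "Fri"), 200, replace = TRUE),
                   x   = rnorm(200),
                   y   = factor(sample(c("yes", "no"), 200, replace = TRUE)),
                   stringsAsFactors = FALSE)

train_ind <- sample(1:200, 150)
tr <- info[train_ind, ]
te <- info[-train_ind, ]

lv     <- unique(info$day)                 # level set from ALL the data
tr$day <- factor(tr$day, levels = lv)
te$day <- factor(te$day, levels = lv)      # same levels, so no type mismatch

fit  <- rpart(y ~ ., data = tr, method = "class")
pred <- predict(fit, newdata = te, type = "class")
```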
