predict() is not usable for mira/mipo? - r

I'm a student with no prior coding experience, but I'm taking a module that requires RStudio and now I'm struggling.
My assignment requires us to explore methods of dealing with missing data in a training data set and a test data set (multiple rows and multiple variables), then fit a linear model with lm() on the training set and call predict() on that model with newdata = the test data to observe the results. I'm supposed to use MICE for this assignment, but I'm at a dead end.
I tried to fill in the missing data of the training set via MICE as follows:
train = read.csv("Train_Data.csv", na.strings=c("","NA"))
missingtraindata = mice(train, m=5, maxit = 5, method = 'pmm')
model = with(missingtraindata, lm(LOS~.- PatientID, data = train))
miceresults = pool(model)
summary(miceresults)
Then I tried to use predict(), but it fails with a message saying predict() doesn't work on mira/mipo objects. I don't know what that means at all.
Honestly, I have no idea what any of this code does; I just tried to apply whatever information was in my notes on MICE. I don't know if that's how you correctly use MICE to fill in missing data, but I literally spent the entire day researching and trying, to no avail. Please help!
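For reference, predict() genuinely has no method for mira/mipo objects, so one common workaround is to refit the model on each of the m completed datasets and average the predictions. A hedged sketch, assuming the test file is named Test_Data.csv and its own missing values have been handled; note also that passing data = train inside with() makes lm() use the original data rather than the imputed copies, which is why the refit below goes through complete():

```r
library(mice)

train <- read.csv("Train_Data.csv", na.strings = c("", "NA"))
test  <- read.csv("Test_Data.csv",  na.strings = c("", "NA"))  # assumed filename

imp <- mice(train, m = 5, maxit = 5, method = "pmm", seed = 1)

# predict() has no mira/mipo method, so fit an lm on each of the m
# completed training sets and average the test-set predictions
preds <- sapply(seq_len(imp$m), function(i) {
  completed <- complete(imp, i)                     # i-th completed training set
  fit <- lm(LOS ~ . - PatientID, data = completed)
  predict(fit, newdata = test)
})
avg_pred <- rowMeans(preds)
```

This is a sketch of one pragmatic approach, not the only one; pooling predictions this way ignores between-imputation variance in the standard errors.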

Related

How can I use a machine learning model to predict on data whose features differ slightly?

I have a randomForest model trained on a bunch of NLP data (tf-idf values for each word). I want to use it to predict on a new dataset. The features in the model overlap with but don't quite match the features in the new data, such that when I predict on the new data I get:
Error in predict.randomForest(object = model, newdata = new_data) :
variables in the training data missing in newdata
I thought to get around this error by excluding all the features from the model which do not appear in the new data, and all the features in the new data which do not appear in the model. Putting aside for the moment the impact on model accuracy (this would significantly pare down the number of features, but there would still be plenty to predict with), I did something like this:
model$forest$xlevels <- model$forest$xlevels[colnames(new_data)]
# and vice versa
new_data <- new_data[names(model$forest$xlevels)]
This worked, insofar as names(model$forest$xlevels) == colnames(new_data) returned TRUE for each feature name.
However, when I try to predict on the resulting new_data I still get the variables in the training data missing in newdata error. I am fairly certain that I'm amending the correct part of the model (model$forest$xlevels), so why isn't it working?
I think you should go the other way around, that is, add the missing columns to the newdata.
When you are working with bags of words, it is common to have words that are not present in some batch of new data. These missing words should just be encoded as columns of zeros.
# do something like this (also exclude the target variable, obviously)
names_missing <- names(traindata)[!names(traindata) %in% names(new_data)]
new_data[,names_missing] <- 0L
and then you should be able to predict
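A minimal self-contained sketch of that fix, using made-up data in place of the tf-idf matrix (column names w1..w3 are hypothetical):

```r
library(randomForest)

set.seed(1)
train <- data.frame(y  = factor(rep(c("a", "b"), 50)),
                    w1 = runif(100), w2 = runif(100), w3 = runif(100))
model <- randomForest(y ~ ., data = train)

new_data <- data.frame(w1 = runif(10), w2 = runif(10))   # w3 absent here

# add the training features missing from the new data as zero columns
names_missing <- setdiff(names(train)[-1], names(new_data))
new_data[, names_missing] <- 0

predict(model, newdata = new_data)   # no longer errors on missing variables
```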

Why is predict in R taking Train data instead of Test data? [duplicate]

Working in R to develop regression models, I have something akin to this:
c_lm = lm(trainingset$dependent ~ trainingset$independent)
c_pred = predict(c_lm, testset$independent)
and every single time, I get a mysterious error from R:
Warning message:
'newdata' had 34 rows but variables found have 142 rows
which essentially translates into R not being able to find the independent column of the testset data.frame. This is simply because the exact name from the right-hand side of the formula in lm must be there in predict. To fix it, I can do this:
tempset = trainingset
c_lm = lm(trainingset$dependent ~ tempset$independent)
tempset = testset
c_pred = predict(c_lm, tempset$independent)
or some similar variation, but this is really sloppy, in my opinion.
Is there another way to clean up the translation between the two so that the independent variables' data frame does not have to have the exact same name in predict as it does in lm?
No, No, No, No, No, No! Do not use the formula interface in the way you are doing if you want all the other sugar that comes with model formulas. You wrote:
c_lm = lm(trainingset$dependent ~ trainingset$independent)
You repeat trainingset twice, which is redundant, a waste of typing, and, not least, the cause of the problem you are hitting. When you now call predict(), it will look for a variable in testset named trainingset$independent, which of course doesn't exist. Instead, use the data argument in your call to lm(). For example, this fits the same model as your formula but is cleaner and also works properly with predict():
c_lm = lm(dependent ~ independent, data = trainingset)
Now when you call predict(c_lm, newdata = testset), you only need to have a data frame with a variable whose name is independent (or whatever you have in the model formula).
An additional reason to write formulas as I show them is legibility: getting the object name out of the formula makes it easier to see what the model is.
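A minimal sketch with toy data showing the pattern in full: fit with the data argument, and predict() then only needs a column named independent in newdata.

```r
set.seed(1)
trainingset <- data.frame(independent = 1:10)
trainingset$dependent <- 2 * trainingset$independent + rnorm(10)
testset <- data.frame(independent = c(2.5, 7.5))

# the formula names bare columns; data= says where to find them
c_lm <- lm(dependent ~ independent, data = trainingset)
predict(c_lm, newdata = testset)   # one prediction per row of testset
```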

R Cross Validation lm predict function [duplicate]

I am trying to convert Absorbance (Abs) values to Concentration (ng/mL), based on an established linear model & standard curve. I planned to do this by using the predict() function. I am having trouble getting predict() to return the desired results. Here is a sample of my code:
Standards<-data.frame(ng_mL=c(0,0.4,1,4),
Abs550nm=c(1.7535,1.5896,1.4285,0.9362))
LM.2<-lm(log(Standards[['Abs550nm']])~Standards[['ng_mL']])
Abs<-c(1.7812,1.7309,1.3537,1.6757,1.7409,1.7875,1.7533,1.8169,1.753,1.6721,1.7036,1.6707,
0.3903,0.3362,0.2886,0.281,0.3596,0.4122,0.218,0.2331,1.3292,1.2734)
predict(object=LM.2,
newdata=data.frame(Concentration=Abs[1]))#using Abs[1] as an example, but I eventually want predictions for all values in Abs
Running those last lines gives this output:
> predict(object=LM.2,
+ newdata=data.frame(Concentration=Abs[1]))
1 2 3 4
0.5338437 0.4731341 0.3820697 -0.0732525
Warning message:
'newdata' had 1 row but variables found have 4 rows
This does not seem to be the output I want. I am trying to get a single predicted Concentration value for each Absorbance (Abs) entry. It would be nice to be able to predict all of the entries at once and add them to an existing data frame, but I can't even get it to give me a single value correctly. I've read many threads on here, webpages found on Google, and all of the help files, and for the life of me I cannot understand what is going on with this function. Any help would be appreciated, thanks.
You must have a variable in newdata that has the same name as that used in the model formula used to fit the model initially.
You have two errors:
You don't use a variable in newdata with the same name as the covariate used to fit the model, and
You make the problem much more difficult to resolve because you abuse the formula interface.
Don't fit your model like this:
mod <- lm(log(Standards[['Abs550nm']])~Standards[['ng_mL']])
fit your model like this
mod <- lm(log(Abs550nm) ~ ng_mL, data = Standards)
Isn't that so much more readable?
To predict you would need a data frame with a variable ng_mL:
predict(mod, newdata = data.frame(ng_mL = c(0.5, 1.2)))
Now you may have a third error. You appear to be trying to predict with new values of Absorbance, but the way you fitted the model, Absorbance is the response variable. You would need to supply new values for ng_mL.
The behaviour you are seeing is what happens when R can't find a correctly-named variable in newdata; it returns the fitted values from the model or the predictions at the observed data.
This makes me think you have the formula back to front. Did you mean:
mod2 <- lm(ng_mL ~ log(Abs550nm), data = Standards)
?? In which case, you'd need
predict(mod2, newdata = data.frame(Abs550nm = c(1.7812,1.7309)))
say. Note you don't need to include the log() bit in the name. R recognises that as a function and applies it to the variable Abs550nm for you.
If the model really is log(Abs550nm) ~ ng_mL and you want to find values of ng_mL for new values of Abs550nm you'll need to invert the fitted model in some way.
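For a simple linear fit, that inversion can be done by hand; a sketch using the standards data from the question, solving log(Abs550nm) = b0 + b1 * ng_mL for ng_mL:

```r
Standards <- data.frame(ng_mL    = c(0, 0.4, 1, 4),
                        Abs550nm = c(1.7535, 1.5896, 1.4285, 0.9362))
mod <- lm(log(Abs550nm) ~ ng_mL, data = Standards)

b   <- coef(mod)                        # b[1] = intercept, b[2] = slope
Abs <- c(1.7812, 1.7309, 1.3537)        # a few new absorbance readings
ng_mL_hat <- (log(Abs) - b[1]) / b[2]   # invert: ng_mL = (log(Abs) - b0) / b1
```

This gives point estimates only; proper confidence intervals for inverse prediction (calibration) need more care than simply inverting the fitted line.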

Undoing createDataPartition ordering

I'm new to data science and trying to finish up this project. I have a data frame (from https://www.kaggle.com/c/house-prices-advanced-regression-techniques) with assigned train and test sets (rows 1:1460 and 1461:2919), on which I was advised to use createDataPartition() because of an error I was getting when trying to predict:
> predSale <- as.data.frame(predict(model, newdata = test))
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev =
object$xlevels) : factor MSSubClass has new levels 150
But now createDataPartition() is mixing up my original train and test sets, which I need in a specific order for the Kaggle submission. I've read the vignette, and it looks like there is a returnTrain argument. I'm not sure whether it could be used here (I don't fully understand it), but ultimately I'd like to know if there is a way to undo the reordering so I can submit my project with the originally ordered set.
test$SalePrice <- NA
combined <- rbind(train, test)
train <- combined[1:1460, ]
test <- combined[1461:2919, ]
#____________Models____________
set.seed(666)
index <- createDataPartition(paste(combined$MSSubClass,combined$Neighborhood,
combined$Condition2,combined$LotConfig,
combined$Exterior1st,combined$Exterior2nd,
combined$RoofMatl,combined$MiscFeature,combined$SaleType))$Resample
train <- combined[index,]
test <- combined[-index,]
model <- lm(SalePrice ~., train)
predSale <- as.data.frame(predict(model, newdata = test))
SampleSubmission <- round(predSale, digits = 1)
write.csv(SampleSubmission, "SampleSubmission.csv")
Thanks! If there is anything else you need, please let me know; I think I've provided everything (I'm not sure if you need the full code, but I'll be happy to edit in whatever more is needed).
You should not use createDataPartition() on a combined Kaggle dataset; you need to keep those data sets separate. That's why Kaggle provides them. If you want to combine them for data cleaning, you have to split them back as they were once the cleaning is done.
But the issue you have is that there are factor levels in the test dataset that are not seen by the model. There are multiple posts on Kaggle about this issue, but I must say that Kaggle's search engine is crappy.
In Kaggle competitions, some people use the following code on character columns to turn them into numeric (e.g. for using the data with xgboost). This code assumes that you loaded the data sets with stringsAsFactors = FALSE.
for (f in feature.names) {
  if (class(train[[f]]) == "character") {
    levels <- unique(c(train[[f]], test[[f]]))
    test[[f]]  <- as.integer(factor(test[[f]],  levels = levels))
    train[[f]] <- as.integer(factor(train[[f]], levels = levels))
  }
}
Others use a version of the following to create all the level names in the training data set.
levels(xtrain[,x]) <- union(levels(xtest[,x]),levels(xtrain[,x]))
There are more ways of dealing with this.
Of course these solutions are fine for Kaggle, as they might give you a better score. But in a strict sense this is a form of data leakage, and using it in a production setting is asking for trouble. There are many situations in which you might not know all of the possible values in advance, and when encountering a new value, returning a missing value instead of a prediction is the more sensible choice. But that discussion fills complete research articles.
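That production-style alternative can be sketched as follows: coerce each test factor onto the training levels, so unseen categories become NA rather than crashing predict(). This assumes train and test share column names.

```r
# map every factor column of test onto the levels seen in train;
# values with unseen levels become NA instead of erroring at predict() time
for (f in names(test)) {
  if (is.factor(train[[f]])) {
    test[[f]] <- factor(test[[f]], levels = levels(train[[f]]))
  }
}
```

Rows that end up with NA then need an explicit policy (e.g. na.action in predict(), or a fallback prediction).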

Pooling glmers of imputed datasets

The problem:
I have a dataset with some missing predictor values. I'd like to pool glmer models fitted to each of the imputed datasets. I'm using the mice package to create the imputations (I've also tried amelia and mi, with no success). I'd primarily like to extract the fixed effects.
Using the pool() function within the mice package returns the error:
Error in qhat[i, ] : incorrect number of dimensions
I've tried to use and adapt a previous rewrite of the pool() function here:
https://github.com/stefvanbuuren/mice/pull/5
There's probably an obvious solution I'm overlooking!
Here's an example:
# 1. create data (that can be replicated and converge later)
data = data.frame(x1=c(rep("1",0.1*1000), rep("0",0.5*1000),
rep("1",0.3*1000), rep("0",0.1*1000)),
x2=c(rep("fact1",0.55*1000), rep("fact2",0.1*1000),
rep(NA,0.05*1000), rep("fact3",0.3*1000)),
centre=c(rep("city1",0.1*1000), rep("city2",0.2*1000),
rep("city3",0.15*1000), rep("city1",0.25*1000),
rep("city2",0.3*1000) ))
# 2. set factors
data[] = lapply(data, as.factor)  # sapply() here would return a character matrix, not factors
# 3. mice imputation
library(mice)
imp.data = mice(data, m=5, maxit=20, seed=1234, printFlag=FALSE)
# 4. apply the glmer function
library(lme4)
mice.fit = with(imp.data, glmer(x1~x2+(1|centre), family='binomial'))
# 5. pool imputations together
pooled.mi = pool(mice.fit)
The other function I've applied at step 4 is below, in the hope it'd create an object amenable to pool().
mice.fit = lapply(imp.data$imp, function(d){ glmer(x1~x2+(1|centre), data=d,
family='binomial') })
I've got a workaround that involves using a meta-analysis model to pool the results for each of the fixed effects of the glmer models. That works, but it'd be much better to have Rubin's rules working.
This Just Works for me after making my own fork of mice, pulling the extended version you referenced above into it, and cleaning it up a little bit: try
devtools::install_github("bbolker/mice")
and see how your process goes after that. (If it works, someone should submit a reminder/new pull request ...)
Is there a difference between an object of class "glmerMod" and one of class "lmerMod"? I am unfamiliar with the lme4 package, but if there is no difference, you can change the class of the mice.fit analyses to "lmerMod" and then it should run fine.
