Reusing saved models for sentiment prediction in R

I have R code that imports text data, removes stop words, stems the words, and then creates a matrix. Below is the code that does the following:
Uses the create_container function to split the matrix into training and test data sets.
Uses the train_models function to create a model based on SVM.
Runs the model on the test set.
Saves the model.
library("RTextTools")
container = create_container(matrix, as.numeric(as.factor(data[, 2])),
trainSize = 1:2800,testSize = 2801:3162, virgin = FALSE)
models = train_models(container,"SVM", kernel = "linear",cost =1)
results = classify_models(container, models)
save(models, file = "my_model1.rda")
I am not able to use the saved model to predict on new data (matrix_new) with the predict function.
p <- predict(models,matrix_new)
#Error in predict.svm(X[[1L]], ...) : test data does not match model !
My question is: is it feasible to use saved models on new data to predict sentiment? From the error, it looks like there is a mismatch between the words that were used when creating the model and the words in the new data. Please clarify whether my understanding is correct.
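If the mismatch really is between the training vocabulary and the columns of matrix_new, one way to test that idea is to reshape the new matrix so that it has exactly the training columns, in the same order. The sketch below uses plain base R matrices with made-up terms; align_to_training_terms is a hypothetical helper, not part of RTextTools. RTextTools' create_matrix also has an originalMatrix argument that appears to be intended for this purpose, but that is not shown here.

```r
# Hypothetical helper: give a new document-term matrix the same columns
# (vocabulary) as the training matrix. Terms unseen at training time are
# dropped; training terms absent from the new data are filled with 0.
align_to_training_terms <- function(new_dtm, training_terms) {
  aligned <- matrix(0,
                    nrow = nrow(new_dtm),
                    ncol = length(training_terms),
                    dimnames = list(rownames(new_dtm), training_terms))
  shared <- intersect(colnames(new_dtm), training_terms)
  aligned[, shared] <- new_dtm[, shared, drop = FALSE]
  aligned
}

# Toy matrices with made-up terms (assumption: columns are named by term)
train_dtm <- matrix(c(1, 0, 2,
                      0, 1, 1),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(NULL, c("good", "bad", "movi")))
new_dtm <- matrix(c(1, 3,
                    0, 1),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(NULL, c("bad", "great")))

aligned_dtm <- align_to_training_terms(new_dtm, colnames(train_dtm))
colnames(aligned_dtm)  # same terms, same order as train_dtm
```

A matrix aligned this way has the dimensions the fitted model expects; whether predict then accepts it depends on the matrix class the underlying SVM was trained on.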

Related

Writing a prediction equation from plsr model

Greetings to everyone.
I successfully computed a PLS-R model in R using the code below:
pls_modB_Kexch_2 <- plsr(Av.K_exc~., data = trainKexch.sar.veg, scale=TRUE,method= "s",validation='CV')
The regression coefficients for 11 components were:
(Intercept) = -4.692966e+05
Easting = 6.068582e+03
Northings = 7.929767e+02
sigma_vv = 8.024741e+05
sigma_vh = -6.375260e+05
gamma_vv = -7.120684e+05
gamma_vh = 4.330279e+05
beta_vv = -8.949598e+04
beta_vh = 2.045924e+05
c11_db = 2.305016e+01
c22_db = -4.706773e+01
c12_real = -1.877267e+00
It predicts new data sets well when applied within the R environment.
My challenge is presenting this model as an equation of the form y = sum(A*X) + B0, where A are the coefficients of the respective variables X,
or in any other mathematical form that can be presented academically.
I tried the direct approach of multiplying each coefficient by its variable and summing the results, but a quick manual trial gave me strange predictions. Am I missing something here? Please help.
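One possible cause, given scale=TRUE in the call, is that the coefficients refer to predictors that were divided by their standard deviations before fitting, so multiplying them by raw values gives wrong results. Whether coef() in the pls package reports coefficients on the scaled or the original scale is an assumption to check for your installed version. The sketch below uses ordinary least squares (lm) as a stand-in for plsr, on simulated data, only to show how coefficients fitted on standardized predictors convert into a y = sum(A*X) + B0 equation on the raw variables.

```r
set.seed(1)

# Simulated predictors on very different scales (made-up data)
X <- data.frame(x1 = rnorm(50, mean = 100, sd = 20),
                x2 = rnorm(50, mean = 5, sd = 0.5))
y <- 3 + 0.2 * X$x1 - 4 * X$x2 + rnorm(50, sd = 0.1)

# Standardize the predictors, as a scaled fit does internally
x_means <- colMeans(X)
x_sds   <- apply(X, 2, sd)
X_std   <- as.data.frame(scale(X, center = x_means, scale = x_sds))

# Stand-in model (lm, not plsr) fitted on the standardized predictors
fit_std <- lm(y ~ x1 + x2, data = X_std)
b0_std  <- coef(fit_std)[1]
a_std   <- coef(fit_std)[-1]

# Convert to coefficients that apply directly to the raw variables:
#   y = b0_raw + sum(a_raw * x)
a_raw  <- a_std / x_sds
b0_raw <- b0_std - sum(a_std * x_means / x_sds)

# The manual equation on raw data reproduces the model's fitted values
manual <- b0_raw + as.matrix(X) %*% a_raw
all.equal(as.numeric(manual), as.numeric(fitted(fit_std)))
```

If the manual predictions only match predict() after dividing each variable by its standard deviation, then the reported coefficients are on the scaled variables and need the conversion above before being written as an equation.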

randomForest in R includes the class label as a feature, which prevents the classifier from predicting on a new dataset

I have two datasets, og.data and newdata.df. I have matched their features, and I want to use a class label from og.data to train a model so that I can identify cases of this class in newdata.df. I am using the randomForest package in R; its documentation is here: https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
split <- sample.split(og.data$class_label, SplitRatio = 0.7)
training_set = subset(og.data$class_label, split == TRUE)
test_set = subset(og.data$class_label, split == FALSE)
rf.classifier.object = randomForest(x = training_set[-1],
y = training_set$Engramcell,
ntree = 500)
I then use the test set to calculate the AUC and to visualize the ROC curve, precision, recall, and so on.
I do that using prediction probabilities generated like so:
predictions.df <- as.data.frame(predict(rf.classifier.object,
test_set,
type = "prob")
)
All is good so far. I then try to use the trained classifier on new data, and now I run into a problem because the new data does not contain the class label feature. This is annoying, as the entire purpose of training the classifier is to label this new data.
predictions.df <- as.data.frame(predict(rf.classifier.object,
newdata.df,
type = "prob")
)
Please note the error has different variable names simply because I changed the code to make it more general for readability.
Error in predict.randomForest(rf.classifier.object, newdata.df, :
variables in the training data missing in newdata
As per this Stack Overflow post, predict.randomForest(), called here as predict(), uses the row names of the feature importance table to make its predictions. When I searched the feature names, I found that it is in fact the class label that prevents me from making the prediction, as I show below.
# > rownames(rf.classifier.object$importance)[!(rownames(rf.classifier.object$importance) %in% colnames(newdata) )]
# [1] "class_label"
It is not clear to me what I should change in my script so that the classifier can be used on data other than the test set. I have followed the instructions exactly, and this seems like a bad design choice for the function. The class label should not be used for calculating feature importance at all and, in my opinion, should not even be considered a feature.
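If the posted code mirrors the real script, two things may explain why class_label ends up among the features: subset() is applied to og.data$class_label rather than to og.data, and training_set[-1] drops a column by position, which only removes the label when it happens to be the first column. A minimal sketch of selecting the features by name instead is shown below. It uses iris as stand-in data, with its Species column renamed to class_label; the data and variable names are assumptions for illustration only.

```r
library(randomForest)

set.seed(42)

# Stand-in data (assumption): iris, with the label column renamed
og.data <- iris
names(og.data)[names(og.data) == "Species"] <- "class_label"

# New data without any label column
newdata.df <- iris[, setdiff(names(iris), "Species")]

# Select the features by name so the label can never slip into x
feature_cols <- setdiff(names(og.data), "class_label")

rf.classifier.object <- randomForest(x = og.data[, feature_cols],
                                     y = og.data$class_label,
                                     ntree = 500)

# The label is no longer listed among the features
rownames(rf.classifier.object$importance)

# Prediction on unlabeled data now only needs the feature columns
predictions.df <- as.data.frame(predict(rf.classifier.object,
                                        newdata.df[, feature_cols],
                                        type = "prob"))
head(predictions.df)
```

Selecting newdata.df[, feature_cols] at prediction time also guards against extra or reordered columns in the new data.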

How do I save my model to use in another project in mlr3?

I would like to divide my working pipeline into two parts:
One place (internal) where I benchmark and auto-tune the algorithms to select the final model.
Another place (external) where I apply the selected models to new datasets.
For the second part, I need to save the resulting model object so that I can later call
model$predict_newdata() on it, and I need to transport it without re-training the algorithm or carrying the original training data along.
The idea is summarized in the following example:
library("mlr3")
task = tsk("iris")
learner = lrn("classif.rpart")
learner$train(task, row_ids = 1:120)
predictions = learner$predict(task, row_ids = 121:150)
predictions
So far so good, but now I have to save this model to a file outside the R session. Of course, this won't work:
store_model = learner$model
save(store_model, 'model_rpart.RData')
The solution is to save the whole learner object as an .rds file, as Brian suggested.
saveRDS(learner, 'learner_rpart.rds')
model <- readRDS('learner_rpart.rds')
predictions = model$predict(task, row_ids = 121:150)
predictions$confusion
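For the external part of the pipeline, where no task is available, the restored learner can be given a plain data frame through predict_newdata(). The sketch below is a minimal example under assumptions: the file is written to a temporary path, and the first rows of iris without the Species column stand in for the external data.

```r
library(mlr3)

# Train and save the whole learner (internal step)
learner <- lrn("classif.rpart")
learner$train(tsk("iris"))
model_path <- tempfile(fileext = ".rds")  # hypothetical location
saveRDS(learner, model_path)

# Restore the learner and predict on unlabeled data (external step)
restored <- readRDS(model_path)
external_data <- iris[1:5, c("Sepal.Length", "Sepal.Width",
                             "Petal.Length", "Petal.Width")]
external_predictions <- restored$predict_newdata(external_data)
external_predictions$response
```

The new data frame needs the same feature column names and types that the learner saw during training.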

How to predict the labels for the test set when using a custom Iterator in MXnet?

I have a big dataset (around 20GB for training and 2GB for testing) and I want to use MXNet with R. Because of the lack of memory, I searched for a way to load the training and test sets with a custom iterator, and I found this solution.
Now I can train the model using the code on this page, but the problem is that if I read the test set with the same iterator as follows:
test.iter <- CustomCSVIter$new(iter = NULL, data.csv = "test.csv", data.shape = 480, batch.size = batch.size)
then the prediction command does not work, and there is no prediction template on the page:
preds <- predict(model, test.iter)
So my specific problem is: if I build my model using the code on the page, how can I read my test set and predict its labels for the evaluation process? My test set and training set are in this format.
Thank you for your help
It actually works exactly as you explained. You just call predict with model and iterator:
preds = predict(model, test.iter)
The only trick here is that the predictions are displayed column-wise. By that I mean, if you take the whole sample you are referring to, execute it and add the following lines:
test.iter <- CustomCSVIter$new(iter = NULL, data.csv = "mnist_train.csv", data.shape = 28, batch.size = batch.size)
preds = predict(model, test.iter)
preds[,1] # index of the sample to see in the column position
You receive:
[1] 5.882561e-11 2.826923e-11 7.873914e-11 2.760162e-04 1.221306e-12 9.997239e-01 4.567645e-11 3.177564e-08 1.763889e-07 3.578671e-09
This shows the softmax output for the first element of the training set. If you try to print everything by just writing preds, you will see only empty values because of the RStudio print limit of 1000, so the real data will have no chance to appear.
Notice that I reuse the training data for prediction. I do so because I don't want to adjust the iterator's code, which would need to consume data both with and without a label in front (training and test sets). In a real-world scenario you would need to adjust the iterator so that it works with and without a label.
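Because each column of preds holds the class probabilities for one sample, the predicted label is the row position of the largest value in that column. The sketch below uses a small stand-in matrix: the first column is the softmax output shown above, and the second column is made up. Subtracting 1 assumes the classes are the digits 0 to 9, listed in that order.

```r
# Stand-in for the matrix returned by predict(model, test.iter):
# one column per sample, one row per class
preds <- cbind(
  c(5.882561e-11, 2.826923e-11, 7.873914e-11, 2.760162e-04, 1.221306e-12,
    9.997239e-01, 4.567645e-11, 3.177564e-08, 1.763889e-07, 3.578671e-09),
  c(0.91, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01)  # made up
)

# Row index of the largest probability in each column, shifted to 0-based
pred_labels <- apply(preds, 2, which.max) - 1
pred_labels

# Inspect the shape or a slice instead of printing the whole matrix
dim(preds)
preds[, 1:2]
```

Comparing pred_labels with the true labels of the test set then gives the accuracy for the evaluation step.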

How to make a list of lmer model objects to use in a for loop in R?

I'm trying to write a for loop in R (my first!) in order to produce and save diagnostic plots of several mixed effects models fitted using the function lmer in the package lme4. This is what I've done so far exemplified with the sleepstudy data:
require(lme4)
mod1<-lmer(Reaction ~ Days + (1|Subject),sleepstudy)
mod2<-lmer(Reaction ~ 1 + (1|Subject),sleepstudy)
List<-c(mod1,mod2)
names<-c("mod1","mod2")
i=1
for (i in 1:length(List)) {
jpeg(file = paste("modelval_", names[i], ".jpg", sep=""))
par(mfrow=c(2,2))
plot(resid(List[i]) ~ fitted(List[i]),main="residual plot")
abline(h=0)
qqnorm(resid(List[i]), main="Q-Q plot of residuals")
qqnorm(ranef(List[i])$Subject$"(Intercept)", main="Q-Q plot of random effect" )
dev.off()
}
I get the following error message when typing this into the R console:
Error in function (formula, data = NULL, subset = NULL, na.action = na.fail, :
invalid type (NULL) for variable 'resid(list[i])'
I've got a feeling the problem is related to the list of models I've created and not the for loop itself and I think it might be related to the model objects being of class S4. Is it possible to make such a list?
I've also tried to make the list as shown below, with no improvement (I still get the same error message):
List<-list(mod1,mod2)
First, using c risks losing the class structure of the objects you've created. To make a list containing your models, use list(mod1, mod2).
Second, List[i] is a list of length 1 containing the i-th element of List. Use List[[i]] to extract the element itself (your model).
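Putting both points together, the loop from the question could look roughly like the sketch below. It is one possible arrangement, not the only one: it uses a named list so that the file names come from the list itself, and it writes to a temporary directory, which is an assumption made for the example.

```r
library(lme4)

mod1 <- lmer(Reaction ~ Days + (1 | Subject), sleepstudy)
mod2 <- lmer(Reaction ~ 1 + (1 | Subject), sleepstudy)

# list() keeps each model intact; the names are reused for the file names
models  <- list(mod1 = mod1, mod2 = mod2)
out_dir <- tempdir()  # assumption: replace with your own output folder

for (name in names(models)) {
  mod <- models[[name]]  # [[ ]] extracts the model itself, not a sub-list
  jpeg(file.path(out_dir, paste0("modelval_", name, ".jpg")))
  par(mfrow = c(2, 2))
  plot(fitted(mod), resid(mod), main = "residual plot")
  abline(h = 0)
  qqnorm(resid(mod), main = "Q-Q plot of residuals")
  qqnorm(ranef(mod)$Subject[["(Intercept)"]],
         main = "Q-Q plot of random effect")
  dev.off()
}
```

Iterating over names(models) avoids keeping a separate vector of names in sync with the list.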
