I am able to build a model with bigrf() package, but is there a way to predict probabilities instead of classes? For class prediction I use
predictions <- predict(forest, test, testset$y)
forest is a model. I tried type = "prob" but does not do anything. Is there a way to do this?
I have big data, so I need to use this package in order to be able to process it.
UPD:
library(bigrf)
library(randomForest)
data("iris")
iris <- iris[iris$Species != "virginica",]
x <- iris[,1:4]
y <- iris$Species
vars <- c(1:4)
s = sample(1:nrow(x), 60)
registerDoParallel(cores=detectCores(all.tests=TRUE))
forest <- bigrfc(x[s, ], y[s], ntree=5L, varselect=vars)
predictions <- predict(forest, x[-s, ])
So, the question is how to get probabilities in predictions instead of classes from object class bigrfc?
According to this post, it should be possible to obtain the class probabilities with
predictions_probs <- predictions#testvotes/rowSums(predictions#testvotes)
I haven't tested it though. HTH.
Related
I'll illustrate my problem with the iris data set in R. My objective here is to create 5 imputed data sets, fit a regression to each imputed data set, then pool together the results of these regressions into one final model. This is the preferred order of operations for a proper execution of multiple imputation.
library(mice)
df <- iris
# Inject some missingness into the data:
df$Sepal.Width[c(20,40,70,121)] <- NA
df$Species[c(15,80,99,136)] <- NA
# Perform the standard steps of multiple imputation with MICE:
imputed_data <- mice(df, method = c(rep("pmm", 5)), m = 5, maxit = 5)
model <- with(imputed_data, lm(Sepal.Length ~ Sepal.Width + Species))
pooled_model <- pool(model)
This leaves me with this pooled_model object which I am hoping to use as a fitted model in the predict command. However, that does not work. When I run:
predict(pooled_model, newdata = iris)
I get this error:
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "c('mipo', 'data.frame')"
Disregard the reasoning why I am using the original iris data set in my newly fitted model; I simply want to be able to fit this data, or a subset of it, onto the model I created with my imputation.
I specifically chose a data set with multiple levels of a categorical variable to highlight my problem. I thought about using some matrix multiplication with which I could do this manually, but the presence of a categorical variable makes that tough. In my actual data set, I have over a hundred variables, many of which have multiple categorical levels. I say this because I realize one possible solution would be to re-code my categorical variables into dummy variables, and then I can apply some matrix multiplication to get my answer. But that would be an EXTREME amount of work for me. If there's a way I can somehow get a model object I can use in the predict function, that would make my life 100x easier.
Any suggestions?
You have two issues: 1) how to use stats::predict with pooled data and 2) what to do about your categorical variables.
Your first issue has already been documented on the mice Github page and it seems like there's been a desire to have a predict.mira function for a while. The author of the mice package posted some code on how to simulate a predict.mira-like function. Unfortunately, it only works with lm models, but it seems like that's okay considering your reprex. If you have a Github account, you can comment on that Github issue to demonstrate your interest in the predict.mira function.
Your question also has been posted on StackOverflow before; although the answer was never accepted, the SO user suggested this reading by Miles (2015).
For your second question, have you considered leaving out your current method argument when using mice()? As long as your variables have been classed as factors, then mice will default to the polyreg method for categorical variables and pmm for continuous variables. You can read more about the method argument here.
library(mice)
set.seed(123)
# make missing data
df <- iris
df$Sepal.Width[c(20,40,70,121)] <- NA
df$Species[c(15,80,99,136)] <- NA
# specify method
meth <- mice(df, maxit = 0, printFlag = FALSE)$meth
print(meth)
# this is how you would change your methods, if you wanted
# but pmm and polyreg are defaults
meth["Species"] <- "polr"
meth["Sepal.Width"] <- "midastouch"
print(meth)
# impute
imputed_data <- mice(df,
m = 5,
maxit = 5,
method = meth, # new method
printFlag = FALSE)
# make model
model <- with(imputed_data, lm(Sepal.Length ~ Sepal.Width + Species))
summary(pool(model))
# obtain predictions Q and prediction variance U
predm <- lapply(getfit(model), predict, se.fit = TRUE)
Q <- sapply(predm, `[[`, "fit")
U <- sapply(predm, `[[`, "se.fit")^2
dfcom <- predm[[1]]$df
# pool predictions
pred <- matrix(NA, nrow = nrow(Q), ncol = 3,
dimnames = list(NULL, c("fit", "se.fit", "df")))
for(i in 1:nrow(Q)) {
pi <- pool.scalar(Q[i, ], U[i, ], n = dfcom + 1)
pred[i, 1] <- pi[["qbar"]]
pred[i, 2] <- sqrt(pi[["t"]])
pred[i, 3] <- pi[["df"]]
}
head(pred)
For a ML course, I am supposed to build a model based on the training set to predict the variable "classe" on a validation set. I removed all unnecessary variables in the training set, used cross validation to prevent over-fitting, and made sure the validation set matched the training set in terms of which columns are removed. When I predict classe in the validation set, it yields all classe A, and I know this is incorrect.
I included the entire script below.
Where did I go wrong?
library(caret)
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", "train.csv")
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", "test.csv")
train <- read.csv("./train.csv")
val <- read.csv("./test.csv")
#getting rid of columns with NAs
nas <- sapply(train, function(x) sum(is.na(x)))
train <- train[, nas<1900]
#removing near zero variance columns
remove <- nearZeroVar(train)
train <- train[, -remove]
#create partition in our training set
set.seed(8675309)
inTrain <- createDataPartition(train$classe, p = .7, list = FALSE)
training <- train[inTrain,]
testing <- train[-inTrain,]
model <- train(classe ~ ., method = "rf", data = training)
confusionMatrix(predict(model, testing), testing$classe)
#make sure validation set has same features as training set
trainforvalid <- subset(training, select = -classe)
val <- val[, colnames(trainforvalid)]
predict(model, val)
#the above step yields all predictions as classe A
This might be happening because the data is unbalanced. If the data have a lot more data points for Class A then Class B, the model will simply learn to predict always Class A.
Try to use a better metric in this case like F1 score.
I also recommend using techniques like oversampling or undersampling to avoid the unbalanced data issue.
I try to run a randomForest model on iris data without the variable Petal.Length. The code give me errors on prédiction. How can I code properly? Thanks for help.
Richard
data (iris)
attach (iris)
iris$id <- 1:nrow(iris)
library (dplyr)
train <- iris %>%
sample_frac (0.8)
test <- iris %>%
anti_join(train, by = "id")
library (randomForest)
library (caret)
fit <- randomForest(Species ~
Sepal.Length +Sepal.Width +Petal.Width, data = train,)
prediction <- predict (fit, test [1:2 , 4])
confusionMatrix (test$Species,prediction)
You subsetting for test dataset is wrong. Just use
prediction <- predict (fit, newdata = test)
in place of
predict (fit, test [1:2 , 4])
It will automatically take the required independent variables.
Or you can use like
prediction <- predict (fit, subset(test, select = -c(Petal.Length)))
In the prediction function, you have to supply all the numeric data used for training. Try this instead
prediction <- predict (fit, test[ , c(1:4)])
I'm trying to do an example classification prediction with rpart() but for whatever reason it doesn't seem to be giving me the right predictions when I pass test data into a fitted tree.
library(rpart)
data.samples <- sample(1:nrow(cu.summary), nrow(cu.summary) * 0.7, replace = FALSE)
training.data <- cu.summary[data.samples, ]
test.data <- cu.summary[-data.samples, ]
fit <- rpart(
Type~Price + Country + Reliability + Mileage,
method="class",
data=training.data
)
fit.pruned<- prune(fit, cp=fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"])
prediction <- predict(fit.pruned, test.data)
prediction
#table(prediction, test.data$Type)
This seems to give me everything except the classes I was trying to predict in the first place. Am I using a particular syntax wrong somewhere?
You must specify the type of prediction
predict(fit.pruned, test.data, type="class")
I am looking for some guidance on a homework assignment I am working on for a class. We are given a dataset with 14K observations and we are asked to build a prediction model. I subset the dataset into training and testing (4909 observations), here I am using the caret package, which predicts the last variable "classe". I pulled out the near zero variables and built the model but when I tried to do predictions I only get 97 predictions back. I reviewed the help files but still can't figure out where I am going wrong. Any hints would be appreciated.
Here is the Code:
set.seed(1234)
pml.training <- read.csv("./data/pml-training.csv")
#
library(caret)
inTrain <- createDataPartition(y=pml.training$classe, p=0.75, list=FALSE)
training <- pml.training[inTrain,]
testing <- pml.training[-inTrain,]
# Pull out the Near Zero Value (NZV)
nzv <- nearZeroVar(training, saveMetrics=TRUE)
omit <- which(nzv$nzv==TRUE)
training <- training[,-omit]
testing <- testing[,-omit]
# Fit the model
modFit <- train(classe ~., method="rf", data=training)
modFit
print(modFit$finalModel)
plot(modFit)
# Try and predict on the testing model
pred <- predict(modFit, newdata=testing)
testing$predRight <- pred==testing$classe
print(table(pred, testing$classe))
Thanks, Pat C.
Have you checked
sum(complete.cases(subset(testing, select = -classe)))
?