On this website:
https://medium.com/@ODSC/build-a-multi-class-support-vector-machine-in-r-abcdd4b7dab6
it says that we can use the fitted model for prediction:
> prediction <- predict(svm1, test_iris)
> xtab <- table(test_iris$Species, prediction)
> xtab
            prediction
             setosa versicolor virginica
  setosa         20          0         0
  versicolor      0         20         1
  virginica       0          0        19
and use this for finding accuracy
(20+20+19)/nrow(test_iris) # Compute prediction accuracy
But when I have a very large data set I cannot even see the table, so how can I find the numbers (20+20+19) needed to compute the accuracy?
You can get the correctly classified counts with diag():
library(e1071)
svm1 <- svm(Species~., data=iris)
prediction <- predict(svm1, iris)
xtab <- table(iris$Species, prediction)
sum(diag(xtab))/sum(xtab) #Overall
#[1] 0.9733333
diag(xtab)/rowSums(xtab) # Per observed class (recall)
# setosa versicolor virginica
# 1.00 0.96 0.96
diag(xtab)/colSums(xtab) # Per predicted class (precision)
# setosa versicolor virginica
# 1.00 0.96 0.96
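When the data set is too large to eyeball, you can also skip the table entirely and compare the prediction vector with the truth directly. A sketch using the same e1071 model as above:

```r
library(e1071)

data(iris)
svm1 <- svm(Species ~ ., data = iris)
prediction <- predict(svm1, iris)

# Overall accuracy: fraction of rows where prediction matches the truth.
# This never materializes the confusion matrix, so it scales to any size.
accuracy <- mean(prediction == iris$Species)
accuracy
#[1] 0.9733333

# Per-class accuracy (recall), again without printing the matrix:
tapply(prediction == iris$Species, iris$Species, mean)
```

This gives the same overall number as sum(diag(xtab))/sum(xtab) without ever building xtab.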
I have a model, predictive_fit <- fit(workflow, training), that classifies the iris dataset species using xgboost. The data are pivoted wide so that each species is a dummied column represented by a 0 or 1. Here, I am trying to predict Virginica based on the Sepal and Petal columns.
Currently, I have the following code, which takes the dataset after the model has been fit and tests whether it can accurately predict the Virginica species of iris. (Snippet below)
testing_data <-
test %>%
bind_cols(
predict(predictive_fit, test)
)
I cannot, however, figure out how to scale this up with simulation. If I have another dataset with exactly the same structure, I would like to predict whether it is Virginica 100 times. (Snippet below)
new_iris_data <-
new_iris_data %>%
bind_cols(
replicate(n = 100, predict(predictive_fit, new_iris_data))
)
However, it looks as if, when I run the new data, the same predictions are just copied 100 times. What is the appropriate way to repeatedly predict the classification? I wouldn't expect the model to predict exactly the same thing all 100 times; I'd like the predictions run n times so that each row of new data can have its own proportion calculated.
I have already tried the replicate() function, but it appears to copy the same exact results 100 times. I considered a for loop that iterates over different seeds and then runs the predictions, but I was hoping for a more performant solution.
You are replicating the prediction of your model, not the data.frame you call new_iris_data, and the result is exactly that. In order to replicate a (random) part of the iris dataset, try this:
> data("iris")
>
> sample <- sample(nrow(iris), floor(nrow(iris) * 0.5))
>
> train <- iris[sample,]
> test <- iris[-sample,]
>
> new_test <- replicate(100, test, simplify = FALSE)
> new_test <- Reduce(rbind.data.frame, new_test)
>
> head(new_test)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
> nrow(new_test)
[1] 7500
Then you can use new_test in any prediction, independent of the model.
If you want 100 different random parts of the data set, you need to drop the replicate() call and do something like:
> new_test <- lapply(1:100, function(x) {
+ sample <- sample(nrow(iris), floor(nrow(iris) * 0.5))
+ iris[-sample,]
+ })
>
> new_test <- Reduce(rbind.data.frame, new_test)
>
> head(new_test)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
7 4.6 3.4 1.4 0.3 setosa
10 4.9 3.1 1.5 0.1 setosa
11 5.4 3.7 1.5 0.2 setosa
13 4.8 3.0 1.4 0.1 setosa
18 5.1 3.5 1.4 0.3 setosa
> nrow(new_test)
[1] 7500
>
Hope it helps.
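One point worth adding for the original question: predictions from a single fitted model are deterministic, so replicating predict() (or predicting on stacked copies of the same rows) will always return identical answers. To get a per-row proportion you need variation across the runs, for example by refitting the model on a bootstrap resample of the training data each time. A sketch using svm from e1071 as a stand-in for the questioner's xgboost workflow (the model choice here is an assumption, not the original code):

```r
library(e1071)
set.seed(42)

idx   <- sample(nrow(iris), floor(nrow(iris) * 0.5))
train <- iris[idx, ]
test  <- iris[-idx, ]

# Refit on a bootstrap resample of the training data in each of the 100 runs;
# each refit can classify a given test row differently.
preds <- replicate(100, {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  as.character(predict(svm(Species ~ ., data = boot), test))
})

# preds is an nrow(test) x 100 character matrix (one column per refit).
# Per-row proportion of runs that predicted "virginica":
prop_virginica <- rowMeans(preds == "virginica")
head(prop_virginica)
```

With tidymodels, the same idea would mean refitting the workflow on each resample (e.g. via rsample::bootstraps) rather than calling predict() repeatedly on one fit.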
I estimate a randomForest, then run predict() on some hold-out data.
What I would like to do is (preferably) append the prediction for each row to the dataframe containing the holdout data as a new column, or (second choice) save the (row number in test data, prediction for that row) as a .csv file.
What I can't do is access the internals of the results object in a way that lets me do that. I'm new to R so I appreciate your help.
I have:
res <- predict(forest_tst1, test_d, type = "response")
which successfully gives me a bunch of predictions.
The following is not valid R, but ideally I would do something like:
test_d$predicted_value <- results[some_field_of_the_results]
or,
for i = 1:nrow(test_d)
test_d[i, new_column] = results[prediction_for_row_i]
end
Basically I just want a column of predicted 1's or 0's corresponding to rows in test_d. I've been trying to use the following commands to get at the internals of the res object, but I've not found anything that's helped me.
attributes(res)
names(res)
Finally - I'm a bit confused by the following if anyone can explain!
typeof(res) = "integer"
Edit: I can do
res != test_d$gold_label
which is, if anything, a little confusing, because I'm comparing a column with a non-column object (??), and
length(res) = 2053
and res appears to be indexable
attributes(res[1])
$names
[1] "6836"
$levels
[1] "0" "1"
$class
[1] "factor"
but I can't select out the sub-parts in a sensible way
> res[1][1]
6836
0
Levels: 0 1
> res[1]["levels"]
<NA>
<NA>
Levels: 0 1
If I understand right, all you are trying to do is add predictions to your test data?
library(randomForest)

ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
TestData <- iris[ind == 2, ]                                   ## Generate test data
iris.rf <- randomForest(Species ~ ., data = iris[ind == 1, ])  ## Build model
iris.pred <- predict(iris.rf, iris[ind == 2, ])                ## Get predictions
TestData$Predictions <- iris.pred                              ## Append the predictions column
Output:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Predictions
9 4.4 2.9 1.4 0.2 setosa setosa
16 5.7 4.4 1.5 0.4 setosa setosa
17 5.4 3.9 1.3 0.4 setosa setosa
32 5.4 3.4 1.5 0.4 setosa setosa
42 4.5 2.3 1.3 0.3 setosa setosa
46 4.8 3.0 1.4 0.3 setosa setosa
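For the asker's second choice (saving row number and prediction as a .csv), the names of the prediction vector already carry the original row numbers of the hold-out rows, so the file can be written directly. A sketch, with the seed and file name chosen arbitrarily:

```r
library(randomForest)
set.seed(1)

ind    <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
test_d <- iris[ind == 2, ]
forest <- randomForest(Species ~ ., data = iris[ind == 1, ])
res    <- predict(forest, test_d, type = "response")

# names(res) holds the row numbers of the hold-out rows,
# so (row number, prediction) pairs fall out directly:
out <- data.frame(row = names(res), prediction = as.character(res))
write.csv(out, "predictions.csv", row.names = FALSE)
```

The first choice (a new column on the hold-out data) is simply test_d$predicted_value <- res, since predict() returns one value per row of newdata in row order.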
I am using randomForest for classification and I am unable to understand two things:
1- How can we obtain (preferably in a data frame of 3 columns) the real classification in testData (the Species column in the example below), the prediction made by the random forest, and the probability score of that prediction? For example, consider a case where in testData the Species (blinded from the classifier) was versicolor but it was wrongly predicted as virginica with a probability score of 0.67. I want this kind of information but don't know how to obtain it.
2- How can we get a confusion matrix for testData and trainData that also gives us the class.error, as shown when we print the model?
library(randomForest)
data(iris)
set.seed(111)
ind <- sample(2, nrow(iris), replace = TRUE, prob=c(0.8, 0.2))
trainData <- iris[ind==1,]
testData <- iris[ind==2,]
#grow forest
iris.rf <- randomForest(Species ~ ., data=trainData)
print(iris.rf)
Call:
randomForest(formula = Species ~ ., data = trainData)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2
OOB estimate of error rate: 3.33%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         45          0         0  0.00000000
versicolor      0         39         1  0.02500000
virginica       0          3        32  0.08571429
# Predict using the training data again
iris.pred <- predict(iris.rf, trainData)
table(observed = trainData$Species, predicted = iris.pred)
predicted
observed setosa versicolor virginica
setosa 45 0 0
versicolor 0 40 0
virginica 0 0 35
# Testing on testData
irisPred<-predict(iris.rf, newdata = testData)
table(irisPred, testData$Species)
irisPred setosa versicolor virginica
setosa 5 0 0
versicolor 0 8 1
virginica 0 2 14
I used the caret package to run random forest with trainControl:
library(caret)
library(PerformanceAnalytics)
model <- train(Species ~ ., data = trainData,
               method = 'rf', tuneLength = 3,
               trControl = trainControl(
                 method = 'cv', number = 10,
                 classProbs = TRUE))
model$results
irisPred_species<-predict(iris.rf, newdata = testData)
irisPred_prob<-predict(iris.rf, newdata = testData, "prob")
out.table <- data.frame(actual.species = testData$Species, pred.species = irisPred_species, irisPred_prob)
You can get the error rate by:
iris.rf$err.rate
And the confusion matrix:
iris.rf$confusion
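For the second part of the question (a confusion matrix for testData that also shows class.error), the class.error column is just one minus the per-class accuracy, so it can be appended to an ordinary table(). A self-contained sketch reusing the setup from the question:

```r
library(randomForest)
data(iris)
set.seed(111)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
trainData <- iris[ind == 1, ]
testData  <- iris[ind == 2, ]
iris.rf <- randomForest(Species ~ ., data = trainData)

# Confusion matrix on testData with a class.error column,
# mirroring the layout print(iris.rf) shows for the OOB estimate:
pred <- predict(iris.rf, newdata = testData)
cm   <- table(observed = testData$Species, predicted = pred)
cbind(cm, class.error = 1 - diag(cm) / rowSums(cm))
```

The same cbind() works on the trainData table, though note that predictions on the training data are optimistic compared with the OOB confusion matrix stored in iris.rf$confusion.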
After imputation in R with the MICE package, I want to generate contingency tables. The fit shows the tables in a list, but if I pool() them, the following error is thrown: Error in pool(fit) : Object has no coef() method. What am I doing wrong?
This basic example reproduces the error:
library("mice")
imp <- mice(nhanes)
fit <- with(imp, table(bmi, hyp))
est <- pool(fit)
The function mice::pool(object) simply calculates estimates and standard errors for scalar estimands using "Rubin's rules". It relies on the fact that the estimates can usually be extracted with coef(object) and that their standard errors sit on the diagonal of vcov(object). It is intended for objects of classes like lm, which have coef and vcov methods neatly defined.
In your example, Rubin's rules do not apply. What are the "estimates" and "standard errors" of the entries in a contingency table? For this reason, pool complains that there is no method available for extracting the coefficients from your fit.
So if your "estimate" is simply supposed to be the "average" contingency table, try this:
library("mice")
imp <- mice(nhanes)
fit <- with(imp, table(bmi, hyp))
# dimensions
nl <- length(fit$analyses)
nr <- nrow(fit$analyses[[1]])
nc <- ncol(fit$analyses[[1]])
# names
rnames <- rownames(fit$analyses[[1]])
cnames <- colnames(fit$analyses[[1]])
# cast list to array
fit.arr <- array(unlist(fit$analyses), dim = c(nr, nc, nl),
                 dimnames = list(rnames, cnames, NULL))
# get "mean" contingency table
apply(fit.arr, 1:2, mean)
#          1   2
# 20.4   1.8 0.0
# 21.7   1.4 0.0
# 22     1.4 0.2
# 22.5   1.8 0.4
# 22.7   1.2 0.4
# 24.9   1.2 0.0
# 25.5   1.0 1.6
# 26.3   0.0 1.0
# 27.2   0.4 1.0
# 27.4   1.4 0.4
# 27.5   1.6 0.2
# 28.7   0.0 1.0
# 29.6   1.0 0.2
# 30.1   1.8 0.2
# 33.2   1.0 0.0
# 35.3   1.2 0.2
Whether or not the "average" table is of any use, however, is probably debatable.
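As an aside, when all m tables share the same dimensions (which the array construction above already assumes), the element-wise average can be computed more directly with Reduce(). Shown here on a plain list of tables; with mice, the list would be fit$analyses:

```r
# Element-wise average of a list of same-shaped tables:
# sum them pairwise with `+`, then divide by the number of tables.
tabs <- list(table(c("a", "a", "b")),
             table(c("a", "b", "b")))
Reduce(`+`, tabs) / length(tabs)
# a and b each average to 1.5 here
```

If the imputed tables can differ in their row names (e.g. different bmi values appearing across imputations), neither this nor the array approach applies without first aligning the tables on a common set of levels.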
I am very new to R and I want to know how I can store the classification error value that results from the confusion matrix:
Example:
confusion(predict(irisfit, iris), iris$Species)
## Setosa Versicolor Virginica
## Setosa 50 0 0
## Versicolor 0 48 1
## Virginica 0 2 49
## attr(, "error"):
## [1] 0.02
I want to fetch the classification error value 0.02 and store it somewhere. How can I do that?
Assuming that your code works, you should be able to do the following:
myconf <- confusion(predict(irisfit, iris), iris$Species)
myerr <- attr(myconf, "error")
which will put the value 0.02 in the variable myerr.
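If confusion() isn't available (by the look of the output it comes from the mda package, though that is an inference from the printed attr), the same error value can be computed from any confusion table built with base table(): one minus the overall accuracy.

```r
# Base-R equivalent: misclassification error from a plain confusion table.
# Shared factor levels keep the table square so diag() lines up correctly.
lv     <- c("Setosa", "Versicolor", "Virginica")
pred   <- factor(c("Setosa", "Setosa", "Virginica"), levels = lv)
actual <- factor(c("Setosa", "Versicolor", "Virginica"), levels = lv)

cm <- table(predicted = pred, actual = actual)
myerr <- 1 - sum(diag(cm)) / sum(cm)
myerr
#[1] 0.3333333
```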