I can't for the life of me figure out how to compute a confusion matrix on rpart.
Here is what I have done:
set.seed(12345)
UBank_rand <- UBank[order(runif(1000)), ]
UBank_train <- UBank_rand[1:900, ]
UBank_test <- UBank_rand[901:1000, ]
dim(UBank_train)
dim(UBank_test)
#Build the formula for the Decision Tree
UB_tree <- Personal.Loan ~ Experience + Age + Income + ZIP.Code + Family + CCAvg + Education
#Build the Decision Tree from the training data
UB_rpart <- rpart(UB_tree, data=UBank_train)
Now, I would think I would do something like
table(predict(UB_rpart, UBank_test, UBank_Test$Default))
But that is not giving me a confusion matrix.
You didn't provide a reproducible example, so I'll create a synthetic dataset:
set.seed(144)
df = data.frame(outcome = as.factor(sample(c(0, 1), 100, replace=T)),
                x = rnorm(100))
The predict function for an rpart model with type="class" will return the predicted class for each observation.
library(rpart)
mod = rpart(outcome ~ x, data=df)
pred = predict(mod, type="class")
table(pred)
# pred
# 0 1
# 51 49
Lastly, you can build the confusion matrix by calling table() on the prediction and the true outcome:
table(pred, df$outcome)
# pred 0 1
# 0 36 15
# 1 14 35
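If you also want overall accuracy, a small follow-up sketch (not part of the original answer) is to sum the diagonal of that table:
tab <- table(pred, df$outcome)  # the confusion matrix shown above
sum(diag(tab)) / sum(tab)       # proportion of correct predictions
# [1] 0.71                      # (36 + 35) / 100 with the counts shown above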
You can also use confusionMatrix() from the caret package. Assuming Personal.Loan is a factor, ask predict() for class labels with type="class":
library(caret)
pred <- predict(UB_rpart, UBank_test, type="class")
confusionMatrix(pred, UBank_test$Personal.Loan)
I have been trying to apply logistic regression (or any other ML algorithm) to this simple data set, but I have failed and got many errors.
dim(data)
[1] 11580 12
head(data)
ReturnJan ReturnFeb ReturnMar ReturnApr ReturnMay ReturnJune
1 0.08067797 0.06625000 0.03294118 0.18309859 0.130333952 -0.01764234
2 -0.01067989 0.10211539 0.14549595 -0.08442804 -0.327300392 -0.35926605
3 0.04774193 0.03598972 0.03970223 -0.16235294 -0.147426982 0.04858934
4 -0.07404022 -0.04816956 0.01821862 -0.02467917 -0.006036217 -0.02530364
5 -0.03104575 -0.21267723 0.09147609 0.18933823 -0.153846154 -0.10611511
6 0.57980016 0.33225225 -0.40546095 -0.06000000 0.060732113 -0.21536106
And the 12th column, the one I am trying to predict, looks like this:
PositiveDec
0
0
0
1
1
1
Here is my attempt:
new.data <- data[,-12] #Remove labels' column
index <- sample(1:nrow(new.data), size = 0.8*nrow(new.data))#Split data
train.data <- new.data[index,]
test.data <- new.data[-index,]
fit.glm <- glm(data[,12]~.,data = data, family = "binomial")
You are getting there, but you have several syntax errors and, as pointed out in the comments, you need to leave your outcome variable in. This should work:
index <- sample(1:nrow(data), size = 0.8 * nrow(data))
train.data <- data[index, ]
fit.glm <- glm(PositiveDec ~ ., data = train.data, family = "binomial")
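If you also want to evaluate on the held-out rows, a minimal sketch (not part of the original answer; the 0.5 threshold is an arbitrary choice) would be:
test.data <- data[-index, ]
pred.prob <- predict(fit.glm, newdata = test.data, type = "response")  # predicted probabilities
pred.class <- ifelse(pred.prob > 0.5, 1, 0)                            # hard labels at a 0.5 cutoff
table(pred.class, test.data$PositiveDec)                               # confusion matrix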
I am fitting the following model
fit <- lmer(y ~ a + b + (1|c) + (1|a:d), data = inputdata)
to real observations collected in "inputdata".
Now I want to generate a number of (say 1000) simulated datasets based on the model parameters and the estimated errors. I can use
pred <- predict(fit, newdata=list(a=val_a1, b=val_b1, c=val_c1, d = val_d1),
allow.new.levels = TRUE)
but this always returns the same value (the mean, most likely prediction). Is there a way to get a distribution of values, i.e., to draw from the predicted distribution?
As requested by @Adam Quek, a reproducible example:
#creating dataset
a <- as.factor(sort(rep(1:4, 5)))
b <- rep(1:2, 10) + 0.5
c <- as.factor(c(sort(rep(1:2, 5)), sort(rep(1:2, 5))))
d <- as.factor(rep(1:5, 4))
a <- c(a,a,a)
b <- c(b,b,b)
c <- c(c,c,c)
d <- c(d,d,d)
y <- rnorm(60)
inputdata = data.frame(y,a,b,c,d)
# fitting the model
library(lme4)
fit <- lmer(y ~ a + b + (1|c) + (1|a:d), data = inputdata)
# making specific predictions for a parameter set
val_a1 = 1
val_b1 = 2
val_c1 = 1
val_d1 = 4
pred <- predict(fit, newdata=list(a=val_a1, b=val_b1, c=val_c1, d = val_d1),
allow.new.levels = TRUE)
pred
what I obtain is:
0.2394255
If I do it again
pred <- predict(fit, newdata=list(a=val_a1, b=val_b1, c=val_c1, d = val_d1),
allow.new.levels = TRUE)
pred
I get of course:
0.2394255
but what I am looking for is an R function or routine that easily provides a suite of predictions following the distribution of my input values. Something like
for (i in 1:1000){
  pred[i] <- predict(fit, newdata=list(a=val_a1, b=val_b1, c=val_c1, d=val_d1),
                     allow.new.levels = TRUE)
}
and mean(pred) = 0.2394255 but sd(pred) != 0
Thanks to @Alex W! bootMer does the job. Below, for those who are interested, is the solution for the example:
m1 <- function(.) {
  predict(., newdata=inputdata, re.form=NULL)
}
boot1 <- lme4::bootMer(fit, m1, nsim=1000, use.u=FALSE, type="parametric")
boot1$t[,1]
where boot1$t[,1] now contains the 1000 predictions for the parameter values defined in inputdata[1, ].
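To check that this gives the spread asked for above, you can summarize the draws for that observation (a small follow-up, not part of the original answer):
mean(boot1$t[, 1])  # close to the single value returned by predict()
sd(boot1$t[, 1])    # now non-zero, reflecting the simulated uncertainty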
https://cran.r-project.org/web/packages/merTools/vignettes/Using_predictInterval.html
was a helpful link.
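For reference, the merTools approach from that vignette looks roughly like this (a sketch, assuming the merTools package is installed; predictInterval() simulates from the fitted model rather than refitting it):
library(merTools)
# 1000 simulation draws for the first row of inputdata; returns fit, upr and lwr columns
predictInterval(fit, newdata = inputdata[1, ], n.sims = 1000)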
I have created a decision tree model in R. The target variable is INCOME (salary): we are trying to predict whether a person's salary is above or below 50k based on the other input variables.
df<-salary.data
train = sample(1:nrow(df), nrow(df)/2)
train = sample(1:nrow(df), size=0.2*nrow(df))
test = - train
training_data = df[train, ]
testing_data = df[test, ]
fit <- rpart(training_data$INCOME ~ ., method="class", data=training_data)##generate tree
testing_data$predictionsOutput = predict(fit, newdata=testing_data, type="class")##make prediction
After that I tried to create a Gain chart by doing the following
# Gain Chart
pred <- prediction(testing_data$predictionsOutput, testing_data$INCOME)
gain <- performance(pred,"tpr","fpr")
plot(gain, col="orange", lwd=2)
Looking at the documentation, I am unable to understand how to use the ROCR package's prediction() function to build the chart. Is it only for binary target variables? I get the error 'format of predictions is invalid'.
Any help with this would be much appreciated to help me build a Gain chart for the above model. Thanks!!
AGE EMPLOYER DEGREE MSTATUS JOBTYPE SEX C.GAIN C.LOSS HOURS
1 39 State-gov Bachelors Never-married Adm-clerical Male 2174 0 40
2 50 Self-emp-not-inc Bachelors Married-civ-spouse Exec-managerial Male 0 0 13
3 38 Private HS-grad Divorced Handlers-cleaners Male 0 0 40
COUNTRY INCOME
1 United-States <=50K
2 United-States <=50K
3 United-States <=50K
Convert the prediction to a vector, using c()
library('rpart')
library('ROCR')
setwd('C:\\Users\\John\\Google Drive\\working\\R\\questions')
df<-read.csv(file='salary-class.csv',header=TRUE)
train = sample(1:nrow(df), nrow(df)/2)
train = sample(1:nrow(df), size=0.2*nrow(df))
test = - train
training_data = df[train, ]
testing_data = df[test, ]
fit <- rpart(training_data$INCOME ~ ., method="class", data=training_data)##generate tree
testing_data$predictionsOutput = predict(fit, newdata=testing_data, type="class")##make prediction
# Doesn't work
# pred <- prediction(testing_data$predictionsOutput, testing_data$INCOME)
v <- c(pred = testing_data$predictionsOutput)
pred <- prediction(v, testing_data$INCOME)
gain <- performance(pred,"tpr","fpr")
plot(gain, col="orange", lwd=2)
This should work if you change
predict(fit, newdata=testing_data, type="class")
to
predict(fit, newdata=testing_data, type="prob")
The gains chart wants to rank-order by model probability.
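Putting the two answers together, a sketch of the corrected ROCR call (assuming the second level of INCOME, e.g. ">50K", is the class of interest; pass label.ordering to prediction() if your levels differ) would be:
prob <- predict(fit, newdata = testing_data, type = "prob")  # class probabilities, one column per level
pred <- prediction(prob[, 2], testing_data$INCOME)           # score = probability of the positive class
gain <- performance(pred, "tpr", "fpr")
plot(gain, col = "orange", lwd = 2)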
consider the following example
rm(list = ls(all=T))
library(ISLR)
library(glmnet)
Hitters=na.omit(Hitters)
# Binary problem - logistic regression
Hitters$Salary <- ifelse(Hitters$Salary > 1000, 1, 0)
Hitters$Salary <- as.factor(Hitters$Salary)
# the class is unbalanced
# > table(Hitters$Salary)
# 0 1
# 233 30
# cls <- sapply(Hitters, class)
# for(j in names(cls[cls == 'integer'])) Hitters[,j] <- as.double(Hitters[,j])
x = model.matrix(~ . -1, Hitters[,names(Hitters)[!names(Hitters) %in% c('Salary')]] )
inx_train <- 1:200
inx_test <- 201:dim(Hitters)[1]
x_train <- x[inx_train, ]
x_test <- x[inx_test, ]
y_train <- Hitters[inx_train, c('Salary')]
y_test <- Hitters[inx_test, 'Salary']
fit = cv.glmnet(x=x_train, y=y_train, alpha=1, type.measure='auc', family = "binomial")
plot(fit)
pred = predict(fit, s='lambda.min', newx=x_test)
quantile(pred)
# 0% 25% 50% 75% 100%
# -5.200853 -3.704760 -2.883836 -1.937052 1.386215
Given the above predicted values, which function or parameter in predict should I use/modify to transform them into probabilities between 0 and 1?
In your predict call you need to set the type="response" argument. As per the documentation, it returns the fitted probabilities.
pred = predict(fit, s='lambda.min', newx=x_test, type="response")
Also, if you just want the classification labels, you can use type="class".
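For reference, the default output is on the link (log-odds) scale, so you can also recover the response-scale values by applying the inverse logit yourself (a small illustration, not from the original answer):
pred_link <- predict(fit, s='lambda.min', newx=x_test)                    # log-odds
pred_prob <- predict(fit, s='lambda.min', newx=x_test, type="response")   # probabilities
all.equal(as.numeric(plogis(pred_link)), as.numeric(pred_prob))           # TRUE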
Unfortunately, I have problems using predict() in the following simple example:
library(e1071)
x <- c(1:10)
y <- c(0,0,0,0,1,0,1,1,1,1)
test <- c(11:15)
mod <- svm(y ~ x, kernel = "linear", gamma = 1, cost = 2, type="C-classification")
predict(mod, newdata = test)
The result is as follows:
> predict(mod, newdata = test)
1 2 3 4 <NA> <NA> <NA> <NA> <NA> <NA>
0 0 0 0 0 1 1 1 1 1
Can anybody explain why predict() only gives the fitted values of the training sample (x,y) and does not care about the test-data?
Thank you very much for your help!
Richard
It looks like this is because you misuse the formula interface to svm(). Normally, one supplies a data frame or similar object within which the variables in the formula are searched for. It usually doesn't matter if you don't do this, even if it is not best practice, but when you want to predict, not putting variables in a data frame gets you in a right mess. The reason it returns the training data is that you don't provide, via newdata, an object containing a component named x. Hence it can't find the new data x, so it returns the fitted values. This is common for most R predict methods I know.
The solution then is to i) put your training data in a data frame and pass svm this as the data argument, and ii) supply a new data frame containing x (from test) to predict(). E.g.:
> DF <- data.frame(x = x, y = y)
> mod <- svm(y ~ x, data = DF, kernel = "linear", gamma = 1, cost = 2,
+ type="C-classification")
> predict(mod, newdata = data.frame(x = test))
1 2 3 4 5
1 1 1 1 1
Levels: 0 1
You need newdata to be of the same form, i.e., using a data.frame helps:
R> library(e1071)
Loading required package: class
R> df <- data.frame(x=1:10, y=sample(c(0,1), 10, rep=TRUE))
R> mod <- svm(y ~ x, kernel = "linear", gamma = 1,
+ cost = 2, type="C-classification", data=df)
R> newdf <- data.frame(x=11:15)
R> predict(mod, newdata=newdf)
1 2 3 4 5
0 0 0 0 0
Levels: 0 1
R>
By the way, this is also shown on the help page for svm():
## density-estimation
# create 2-dim. normal with rho=0:
X <- data.frame(a = rnorm(1000), b = rnorm(1000))
attach(X)
# traditional way:
m <- svm(X, gamma = 0.1)
# formula interface:
m <- svm(~., data = X, gamma = 0.1)
# or:
m <- svm(~ a + b, gamma = 0.1)
# test:
newdata <- data.frame(a = c(0, 4), b = c(0, 4))
predict (m, newdata)
So, in sum, use the formula interface and supply a data.frame; that is how essentially all modeling functions in R work.