Please help me with plotting this model. I tried just using the plot function, but I'm not sure how to incorporate the testing dataset. Thank you.
TravelInsurance <- read.csv(file="TravelInsurancePrediction.csv",header=TRUE)
set.seed(2022)
Training <- sample(c(1:1987),1500,replace=FALSE)
Test <- c(1:1987)[-Training]
TrainData <- TravelInsurance[Training,]
TestData <- TravelInsurance[Test,]
TravIns=as.factor(TravelInsurance$TravelInsurance)
years= TravelInsurance$Age
EMPTY=as.factor(TravelInsurance$Employment.Type)
Grad=as.factor(TravelInsurance$GraduateOrNot)
Income=TravelInsurance$AnnualIncome
Fam=TravelInsurance$FamilyMembers
CD=as.factor(TravelInsurance$ChronicDiseases)
FF=as.factor(TravelInsurance$FrequentFlyer)
logreg = glm(TravIns~ EMPTY+years+Grad+Income+Fam+CD+FF,family = binomial)
Too long for a comment.
Couple of things here:
You divide your dataset into train and test but then build the model using the full dataset?
Passing vectors is not a good way to use glm(...), or any of the R modeling functions. Better to pass the data frame and reference the columns in the formula.
So, with your dataset,
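# reference the data frame's own columns so that data=TrainData is actually used
logreg <- glm(factor(TravelInsurance) ~ Employment.Type + Age + GraduateOrNot + AnnualIncome + FamilyMembers + ChronicDiseases + FrequentFlyer, family = binomial, data=TrainData)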
pred <- predict(logreg, newdata=TestData, type='response')
As this is a logistic regression, the responses are probabilities (that someone buys travel insurance?). There are several ways to assess goodness-of-fit. One visualization uses receiver operating characteristic (ROC) curves.
library(pROC)
roc(TestData$TravelInsurance, pred, plot=TRUE)
The area under the ROC curve (the "AUC") is a measure of goodness of fit; 1.0 is perfect, 0.5 is no better than random chance. See the docs: ?roc and ?auc
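For intuition about what the AUC measures, here is a minimal base-R sketch on simulated data. The data frame `d`, its columns, and the helper `auc_rank` are made up for illustration and are not part of pROC. It fits on a training subset, predicts on held-out rows, and computes the AUC as the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative one.

```r
set.seed(1)
n <- 1000
# Simulated predictors and a binary outcome (hypothetical data, not the insurance set)
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(-0.5 + 1.5 * d$x1 - d$x2))

# Hold out 30% of the rows for testing
train_idx <- sample(n, 700)
train <- d[train_idx, ]
test  <- d[-train_idx, ]

# Fit on the training rows only, predict probabilities for the test rows
fit  <- glm(y ~ x1 + x2, family = binomial, data = train)
pred <- predict(fit, newdata = test, type = "response")

# AUC from the Mann-Whitney rank statistic:
# the share of (positive, negative) pairs in which the positive is ranked higher
auc_rank <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc_test <- auc_rank(pred, test$y)
auc_test
```

Because the score is evaluated on rows the model never saw, this is an out-of-sample AUC, which is the number you would want to report.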
My prof decided that our first experience with coding was going to be trying to fit the function z(t) = A(1-e^(-t/T)) into a given data-set from class using R. I'm completely lost. I keep using lm and nls functions, without quite knowing how they work. So far, I have the data graphed but I have no clue how to get any sort of line more complicated than
mod3<-lm(y~I(x^1/5))
pre3<-predict(mod3)
lines(pre3)
To sum up: how do I find the A and T parameters? Do I use nls for the formula? Anything helps. I'll include a picture of the graph and the data. Please ignore the random lines on the plot.
[Image: graph depicting my dataset]
[Image: dataset I have to use]
One could attempt to transform your expression into a linear relationship, but sometimes it is easier to just let the computer do the work. As mentioned in the comments, R has the nls function to perform nonlinear regression.
Here is an example using some dummy data. Supply the nls function with your equation, the data frame containing the data, and initial estimates of the parameters.
See comments for additional details.
#create dummy data
A= 0.8
T1 = 13
t <- seq(2, 50, 3)
z <- A*(1-exp(-t/T1))
z<- z +rnorm(length(z), 0, 0.005) #add noise
#starting data frame
df <-data.frame(t, z)
#solve non-linear model
model <- nls(z ~ A*(1-exp(-t/Tc)), data=df, start = list(A=1, Tc=1))
print(summary(model))
#predict
pred_y <-predict(model, data.frame(t))
#plot
plot(x=t, y=z)
lines(y=pred_y, x= t, col="blue")
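If nls has trouble converging from a rough guess, the linear transformation mentioned earlier can at least supply starting values. Rearranging z = A(1 - exp(-t/T)) gives log(1 - z/A) = -t/T, which is a straight line through the origin once A is fixed. The sketch below assumes the same kind of dummy data; the plateau guess `A0` (5% above the largest observed z) is an arbitrary choice made for illustration, not a rule.

```r
set.seed(1)
# Dummy data of the same form as above
t <- seq(2, 50, 3)
z <- 0.8 * (1 - exp(-t / 13)) + rnorm(length(t), 0, 0.005)
df <- data.frame(t, z)

# Rough plateau guess: slightly above the largest observation (assumption)
A0 <- 1.05 * max(df$z)

# Linearised model: log(1 - z/A0) = -(1/Tc) * t, no intercept
lin <- lm(log(1 - z / A0) ~ 0 + t, data = df)
Tc0 <- -1 / coef(lin)[["t"]]

# Refine both parameters with nls, starting from the linearised estimates
model <- nls(z ~ A * (1 - exp(-t / Tc)), data = df,
             start = list(A = A0, Tc = Tc0))
est <- coef(model)
est

# Draw the fitted curve on a fine grid so it looks smooth
t_fine <- seq(min(df$t), max(df$t), length.out = 200)
plot(df$t, df$z, xlab = "t", ylab = "z")
lines(t_fine, predict(model, newdata = data.frame(t = t_fine)), col = "blue")
```

The fitted A and T are then available as est[["A"]] and est[["Tc"]].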
While using the predict function in R to get the predictions from a Random Forest model, I misspecified the training data as newdata as follows:
RF1pred <- predict(RF1, newdata=TrainS1, type = "class")
Used like this, I get extremely high accuracy and AUC, which I am sure is not right, but I couldn't find a good explanation for it. This thread is the closest I got, but I can't say I fully understand the explanation there.
If someone could elaborate, I will be grateful.
Thank you!
EDIT: Important to note: I get sensible accuracy and AUC if I run the prediction without specifying a dataset altogether, like so:
RF1pred <- predict(RF1, type = "class")
If a new dataset is not explicitly specified, isn't the training data used for prediction? Hence, shouldn't I get the same results from both lines of code?
EDIT2: Here is a sample code with random data that illustrates the point. When predicting without specifying newdata, the AUC is 0.4893. When newdata=train is explicitly specified, the AUC is 0.7125.
# Generate sample data
set.seed(15)
train <- data.frame(x1=sample(0:1, 100, replace=T), x2=rpois(100,10), y=sample(0:1, 100, replace=T))
# Build random forest
library(randomForest)
model <- randomForest(x1 ~ x2, data=train)
pred1 <- predict(model)
pred2 <- predict(model, newdata = train)
# Calculate AUC
library(ROCR)
ROCRpred1 <- prediction(pred1, train$x1)
AUC <- as.numeric(performance(ROCRpred1, "auc")@y.values)
AUC # 0.4893
ROCRpred2 <- prediction(pred2, train$x1)
AUC <- as.numeric(performance(ROCRpred2, "auc")@y.values)
AUC # 0.7125
If you look at the documentation for predict.randomForest you will see that if you do not supply a new data set you will get the out-of-bag (OOB) performance of the model. Since the OOB performance is theoretically related to the performance of your model on a different data set, the results will be much more realistic (although still not a substitute for a real, independently collected, validation set).
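The difference is easiest to see when there is nothing to learn at all. The following is a minimal sketch, assuming the randomForest package is installed; the data frame `noise` and the accuracy variables are hypothetical names. The labels are independent of the predictors, so an honest accuracy estimate should be near 0.5, yet predicting on the training rows as if they were new data still looks very good, because most trees have already seen each row.

```r
library(randomForest)

set.seed(15)
n <- 200
# Pure noise: the class label is independent of both predictors
noise <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                    y  = factor(sample(c("a", "b"), n, replace = TRUE)))

rf <- randomForest(y ~ x1 + x2, data = noise)

# No newdata: each row is predicted only by trees that did not train on it (OOB)
oob_pred <- predict(rf)
# Training rows passed as newdata: every tree votes, including those that memorised the row
resub_pred <- predict(rf, newdata = noise)

acc_oob   <- mean(oob_pred == noise$y)
acc_resub <- mean(resub_pred == noise$y)
c(oob = acc_oob, resubstitution = acc_resub)
```

The gap between the two numbers is the optimism introduced by reusing the training data, which is why the two predict calls in the question disagree.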
Please give me a simple example. I'm stuck. I tried the errorest function, following the example it gives for 10-fold CV of LDA, but when I used my own data it complained that the prediction is not numeric. I don't know why. Thank you!
The R code is like this. I want to do the binary LDA so I generate the data:
library(MASS)
n=500
#generate x1 and x2.
Sigma=matrix(c(2,0,0,1),nrow=2,ncol=2)
#Logistic model with parameter{1,4,-2}
beta.star=c(1,4,-2)
Xtilde=mvrnorm(n=n,mu=c(0.5,2),Sigma=Sigma)
X=cbind(1,Xtilde)
z=X%*%beta.star
#pass through an inv-logit function
pr=exp(z)/(1+exp(z))
#Simulate binary response
# The "probability of response" is a vector
y=rbinom(n,1,pr)
Then I use the LDA to get the model:
library(MASS)
df.cv=data.frame(V1=Xtilde[,1],V2=Xtilde[,2])
exper1<-lda(y~V1+V2,data=df.d)
plda<-predict(exper1,newdata=df.cv)
Finally, I want to use CV with the original data and see the error. I do this, which is wrong:
mypredict.lda <- function(object, newdata)
predict(object, newdata = newdata)$class
errorest(y ~ ., data=data.frame(da), model=lda,estimator ="cv", predict= as.numeric(mypredict.lda))
What should I do to get the error with CV?
So we start with all your previous code setting up fake data
library(MASS)
n=500
#generate x1 and x2.
Sigma=matrix(c(2,0,0,1),nrow=2,ncol=2)
#Logistic model with parameter{1,4,-2}
beta.star=c(1,4,-2)
Xtilde=mvrnorm(n=n,mu=c(0.5,2),Sigma=Sigma)
X=cbind(1,Xtilde)
z=X%*%beta.star
#pass through an inv-logit function
pr=exp(z)/(1+exp(z))
#Simulate binary response
y=rbinom(n,1,pr)
#Now we do the LDA
df.cv=data.frame(V1=Xtilde[,1],V2=Xtilde[,2])
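Below, we divide the data into two parts: a training set and a test set. Note that a single 80/20 split like this is a hold-out estimate rather than full cross-validation; it corresponds to one fold of five-fold cross-validation (a 90/10 split would correspond to one fold of 10-fold). For a true k-fold estimate, you would repeat the split so that every observation lands in the test set exactly once.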
library(ROCR)
inds=sample(1:nrow(df.cv),0.8*nrow(df.cv))
df.train=df.cv[inds,]
df.test=df.cv[-inds,]
train.model = lda(y[inds] ~ V1+V2, data=df.train)
From the trained model, we predict on the test set. Below, I determine the predicted values, and then assess the accuracy of the predictions. Here, I use a ROC curve, but you can use whatever metric you'd like, I guess. I didn't understand what you meant by error.
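# use the posterior probability of class "1" as a continuous score, so the ROC curve has more than one point
preds=predict(train.model, df.test)$posterior[,"1"]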
actual=y[-inds]
aucCurve=performance(prediction(preds,actual), "tpr", "fpr")
plot(aucCurve)
The area under this ROC curve is a measure of predictive accuracy. Values closer to 1 mean you have good predictive capability.
auc=performance(prediction(preds,actual), "auc")
auc@y.values
Hopefully this helped, and isn't horribly incorrect. Other folks please chime in with corrections or clarifications.
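For the cross-validated error the question asks about, one option is to write the k-fold loop by hand. This is a minimal sketch under that assumption, not the ipred::errorest approach from the question; the names `dat`, `fold` and `fold_err` are made up for illustration. Each row is assigned to one of k folds, the model is fitted on the other k-1 folds, and the misclassification rate is measured on the held-out fold.

```r
library(MASS)

set.seed(1)
n <- 500
# Same simulation design as in the question
Xtilde <- mvrnorm(n = n, mu = c(0.5, 2), Sigma = matrix(c(2, 0, 0, 1), 2, 2))
pr <- plogis(cbind(1, Xtilde) %*% c(1, 4, -2))
dat <- data.frame(y  = factor(rbinom(n, 1, pr)),
                  V1 = Xtilde[, 1], V2 = Xtilde[, 2])

k <- 10
# Randomly assign each row to one of k folds of (nearly) equal size
fold <- sample(rep(1:k, length.out = n))

# For each fold: fit on the other folds, predict the held-out fold
fold_err <- sapply(1:k, function(i) {
  fit  <- lda(y ~ V1 + V2, data = dat[fold != i, ])
  pred <- predict(fit, newdata = dat[fold == i, ])$class
  mean(pred != dat$y[fold == i])   # misclassification rate on fold i
})

cv_error <- mean(fold_err)
cv_error
```

Keeping the response inside the data frame (rather than as a separate vector) is what makes the subsetting by fold straightforward.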
I split the data set into train and test as follows:
splitdata<-split(sb[1:nrow(sb),], sample(rep(1:2, as.integer(nrow(sb)/2))))
test<-splitdata[[1]]
train<-rbind(splitdata[[2]])
sb is the name of the original data set, so it is a 50/50 split into train and test.
Then I fitted a glm using the training set.
fitglm<- glm(num_claims~year+vt+va+public+pri_bil+persist+penalty_pts+num_veh+num_drivers+married+gender+driver_age+credit+col_ded+car_den, family=poisson, train)
Now I want to predict using this glm, say for the next 10 observations.
I have trouble specifying the newdata in predict().
I tried:
pred<-predict(fitglm,newdata=data.frame(train),type="response", se.fit=T)
This gives a number of predictions equal to the number of samples in the training set.
And finally, how do I plot these predictions with confidence intervals?
Thank you for the help
If you are asking how to construct predictions on the next 10 in the test set then:
pred10<-predict(fitglm,newdata=data.frame(test)[1:10, ], type="response", se.fit=T)
Edit 9 years later:
@carsten's comment is correct regarding how to construct a confidence interval. If one has a glm object, fitglm, with a non-linear link function, then this is a reasonably general method to recover the inverse of the link function and construct a two-sided 95% CI on the response scale:
# predictions and standard errors on the link scale (the default type = "link")
pred.fit <- predict(fitglm, newdata=newdata, se.fit=TRUE)
CI.pred.upper <- family(fitglm)$linkinv( # that information is in the model
pred.fit$fit + 1.96*pred.fit$se.fit )
CI.pred.lower <- family(fitglm)$linkinv( # that information is in the model
pred.fit$fit - 1.96*pred.fit$se.fit )
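To plot the predictions with their intervals, something along these lines should work. It is a minimal sketch on simulated Poisson data, since the original sb data set is not available; the data frames `d` and `new` and their columns are hypothetical. The interval is built on the link scale and then mapped back with the inverse link, as described above.

```r
set.seed(1)
# Simulated count data with a log-linear mean (stand-in for the real data)
d <- data.frame(x = runif(200, 0, 3))
d$y <- rpois(200, exp(0.3 + 0.6 * d$x))

fit <- glm(y ~ x, family = poisson, data = d)

# Ten "new" observations to predict
new <- data.frame(x = seq(0, 3, length.out = 10))
p <- predict(fit, newdata = new, se.fit = TRUE)   # link scale by default

# Map the estimate and the interval limits back to the response scale
inv <- family(fit)$linkinv
new$fit   <- inv(p$fit)
new$lower <- inv(p$fit - 1.96 * p$se.fit)
new$upper <- inv(p$fit + 1.96 * p$se.fit)

# Points for the predictions, vertical bars for the 95% intervals
plot(new$x, new$fit, pch = 19, ylim = range(new$lower, new$upper),
     xlab = "x", ylab = "predicted count")
arrows(new$x, new$lower, new$x, new$upper, angle = 90, code = 3, length = 0.05)
```

Note that these are confidence intervals for the expected count, not prediction intervals for individual future observations, which would be wider.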
I'm trying to plot ROC curve of a random forest classification. Plotting works, but I think I'm plotting the wrong data since the resulting plot only has one point (the accuracy).
This is the code I use:
set.seed(55)
data.controls <- cforest_unbiased(ntree=100, mtry=3)
data.rf <- cforest(type ~ ., data = dataset ,controls=data.controls)
pred <- predict(data.rf, type="response")
preds <- prediction(as.numeric(pred), dataset$type)
perf <- performance(preds,"tpr","fpr")
performance(preds,"auc")@y.values
confusionMatrix(pred, dataset$type)
plot(perf,col='red',lwd=3)
abline(a=0,b=1,lwd=2,lty=2,col="gray")
To plot a receiver operating characteristic (ROC) curve you need to hand over the continuous output of the classifier, e.g. posterior probabilities. That is, you need to call predict(data.rf, newdata, type = "prob").
Predicting with type = "response" already gives you the "hardened" factor as output. Thus, your working point is implicitly fixed already. With respect to that, your plot is correct.
Side note: in-bag predictions of random forests will be highly overoptimistic!
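To see why hardened labels collapse the curve, it helps to count the thresholds available in each case. The sketch below uses a logistic regression on simulated data as a stand-in for the forest (an assumption made to keep it in base R); `roc_points` is a hypothetical helper, not a ROCR function. Every distinct score value gives one (FPR, TPR) point, so 0/1 labels can give at most two points while probabilities give many.

```r
set.seed(55)
n <- 300
# Simulated data; a logistic regression stands in for the random forest
d <- data.frame(x = rnorm(n))
d$type <- rbinom(n, 1, plogis(1.2 * d$x))

train <- d[1:200, ]
test  <- d[201:300, ]   # evaluate on rows the model has not seen

fit  <- glm(type ~ x, family = binomial, data = train)
prob <- predict(fit, newdata = test, type = "response")  # continuous scores
hard <- as.numeric(prob > 0.5)                            # "hardened" 0/1 labels

# One (FPR, TPR) pair per distinct threshold in the scores
roc_points <- function(score, label) {
  thr <- sort(unique(score), decreasing = TRUE)
  t(sapply(thr, function(th) c(fpr = mean(score[label == 0] >= th),
                               tpr = mean(score[label == 1] >= th))))
}

n_hard <- nrow(roc_points(hard, test$type))  # at most two thresholds (0 and 1)
n_prob <- nrow(roc_points(prob, test$type))  # one per distinct probability
c(hard = n_hard, prob = n_prob)
```

With the hard labels, one of the points is the trivial (1, 1) corner, which leaves the single working point seen in the question's plot.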