Issue with calculating precision and recall for SVM model in R

I use census data to build a logistic regression model and an SVM model. First, I convert <=50K to 0 and >50K to 1 to make the response binary. I am trying to calculate precision and recall for both models to compare which model performs better. But table(test$salary,pred1 >0.5) for the SVM model in R returns < table of extent 2 x 0 >
Warning message:
In Ops.factor(pred1, 0.5) : ‘>’ not meaningful for factors
but similar code works for the logistic regression model. I am new to R, so I hope I can get help here. Thanks a lot; any help is welcome.
I hope the question is clear enough.
#setwd("C:/Users/)
Censusdata <- read.csv(file="census-data.csv", header=TRUE, sep=",")
library("dplyr", lib.loc="~/R/win-library/3.4")
df <- Censusdata[,]
# convert <=50K to 0, >50K to 1
df$salary <- as.numeric(factor(df$salary))-1
head(df,10)
library(lattice)
library(ggplot2)
library(caret)
data <- Censusdata
indexes <- sample(1:nrow(data),size=0.7*nrow(data))
test <- data[indexes,]
train <- data[-indexes,]
#logistic regression model fit
model <- glm(salary ~ education.num + hours.per.week,family = binomial,data = test)
pred <- predict(model,data=train)
summary(model)
# calculate precision and recall
table(test$salary,pred >0.5)
# for SVM model (svm() is assumed to come from the e1071 package)
library(e1071)
model1 <- svm(salary ~ education.num + hours.per.week,family = binomial, data=test)
pred1 <- predict(model1,data=train)
table(test$salary,pred1 >0.5)
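One possible explanation, offered as a sketch rather than a confirmed diagnosis: data <- Censusdata is assigned after the conversion, so salary may still be a factor when the models are fitted. With a factor response, svm() from e1071 does classification and predict() returns a factor, which cannot be compared with 0.5. The base-R example below uses simulated data (the data frame census and every value in it are made up for illustration) to show predict() with newdata, a 2 x 2 confusion table, precision and recall, and what happens when a factor is compared with `>`.

```r
set.seed(42)

# Simulated stand-in for the census data (column names follow the question).
n <- 500
census <- data.frame(
  education.num  = sample(1:16, n, replace = TRUE),
  hours.per.week = sample(20:60, n, replace = TRUE)
)
p <- plogis(-6 + 0.35 * census$education.num + 0.04 * census$hours.per.week)
census$salary <- rbinom(n, 1, p)

# 70/30 split: fit on the training rows, evaluate on the held-out rows.
idx   <- sample(seq_len(n), size = floor(0.7 * n))
train <- census[idx, ]
test  <- census[-idx, ]

fit <- glm(salary ~ education.num + hours.per.week, family = binomial, data = train)
# newdata (not data) selects the rows to score; type = "response" gives probabilities.
prob <- predict(fit, newdata = test, type = "response")

# Fixing the factor levels keeps the table 2 x 2 even if a class is never predicted.
actual    <- factor(test$salary, levels = c(0, 1))
predicted <- factor(as.integer(prob > 0.5), levels = c(0, 1))
cm <- table(actual = actual, predicted = predicted)

precision <- cm["1", "1"] / sum(cm[, "1"])  # TP / (TP + FP)
recall    <- cm["1", "1"] / sum(cm["1", ])  # TP / (TP + FN)

# A factor cannot be compared with `>`: the result is NA (with a warning),
# so class predictions returned as a factor should be tabulated directly.
pred_class <- factor(c("0", "1", "1"), levels = c("0", "1"))
cmp <- suppressWarnings(pred_class > 0.5)
```

If the SVM predictions are already a factor of classes, table(test$salary, pred1) without the comparison would give the confusion table directly.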

Related

Confusion matrix for multinomial logistic regression & ordered logit

I would like to create confusion matrices for a multinomial logistic regression as well as a proportional odds model but I am stuck with the implementation in R. My attempt below does not seem to give the desired output.
This is my code so far:
library(nnet)  # provides multinom()
library(MASS)  # provides polr()
CH <- read.table("http://data.princeton.edu/wws509/datasets/copen.dat", header=TRUE)
CH$housing <- factor(CH$housing)
CH$influence <- factor(CH$influence)
CH$satisfaction <- factor(CH$satisfaction)
CH$contact <- factor(CH$contact)
CH$satisfaction <- factor(CH$satisfaction,levels=c("low","medium","high"))
CH$housing <- factor(CH$housing,levels=c("tower","apartments","atrium","terraced"))
CH$influence <- factor(CH$influence,levels=c("low","medium","high"))
CH$contact <- relevel(CH$contact,ref=2)
model <- multinom(satisfaction ~ housing + influence + contact, weights=n, data=CH)
summary(model)
preds <- predict(model)
table(preds,CH$satisfaction)
omodel <- polr(satisfaction ~ housing + influence + contact, weights=n, data=CH, Hess=TRUE)
preds2 <- predict(omodel)
table(preds2,CH$satisfaction)
I would really appreciate some advice on how to correctly produce confusion matrices for my 2 models!
You can refer to:
Predict() - Maybe I'm not understanding it
In predict() you need to pass unseen data (via the newdata argument) for prediction.
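A separate point that may matter here, stated as an assumption about the data: the models are fitted with weights=n, which suggests each row of CH is a group with a count n rather than a single observation. In that case table() counts rows, not individuals. The sketch below uses a small made-up data frame (grouped, with hypothetical predictions and counts) to contrast a row-count table with a count-weighted confusion matrix built with xtabs().

```r
# Hypothetical grouped data: each row is a covariate pattern with a count n.
grouped <- data.frame(
  actual    = factor(c("low", "medium", "high", "low", "high", "medium"),
                     levels = c("low", "medium", "high")),
  predicted = factor(c("low", "low", "high", "high", "high", "medium"),
                     levels = c("low", "medium", "high")),
  n         = c(21, 14, 30, 5, 12, 8)
)

# table() counts rows; xtabs() with n on the left-hand side sums the counts.
cm_rows     <- table(predicted = grouped$predicted, actual = grouped$actual)
cm_weighted <- xtabs(n ~ predicted + actual, data = grouped)

# Share of individuals on the diagonal (correctly classified).
accuracy <- sum(diag(cm_weighted)) / sum(cm_weighted)
```

For the question's data, the analogous call would be something like xtabs(n ~ preds + satisfaction, data = CH), assuming preds has one entry per row of CH.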

Cross Validation K-Fold with Forecast Package in R

I have created a model in R using the forecast package.
My source of learning this is from here:
https://robjhyndman.com/hyndsight/dailydata/
I am using the last section which includes fourier series as such:
y <- ts(x, frequency=7)
z <- fourier(ts(x, frequency=365.25), K=5)
zf <- fourier(ts(x, frequency=365.25), K=5, h=100)
fit <- auto.arima(y, xreg=cbind(z,holiday), seasonal=FALSE)
fc <- forecast(fit, xreg=cbind(zf,holidayf), h=100)
After I create this model, is there a way to run k-fold cross-validation to determine the error and adjusted error?
I know how to do it with a generalized linear model as such:
library(boot)
lm1 <- glm(ValuePerSqFt ~ Units + SqFt + Boro, data = housing)
lm1cv <- cv.glm(housing, lm1, K=5)
lm1cv$delta
[1] 1870.31 1869.352
This shows the error and adjusted error.
Is there a function in the forecast package that can do this, so that I can compare the accuracy of this model with the glm model?
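For time series, random k-fold splits break the time ordering, so a common substitute is rolling-origin evaluation: fit on the first i observations, forecast observation i + 1, and repeat. The forecast package has a tsCV() function for this; the sketch below shows the same idea with base R only, on a simulated AR(1) series. The series, the AR(1) order, and the starting origin of 40 are illustrative assumptions, not taken from the question.

```r
set.seed(1)

# Simulated stationary AR(1) series standing in for the real data.
y <- as.numeric(arima.sim(model = list(ar = 0.6), n = 80))

# Rolling origin: train on y[1:i], forecast one step ahead, record the error.
first_origin <- 40
errors <- sapply(first_origin:(length(y) - 1), function(i) {
  fit <- arima(y[1:i], order = c(1, 0, 0), method = "ML")
  fc  <- predict(fit, n.ahead = 1)$pred[1]
  y[i + 1] - fc
})

# Out-of-sample one-step-ahead root mean squared error.
rmse <- sqrt(mean(errors^2))
```

The resulting error is an out-of-sample measure, so it is closer in spirit to the cv.glm delta than an in-sample fit statistic would be, although the two are not computed the same way.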

calculate precision and recall for SVM model in R

I use census data to build a logistic regression model and an SVM model. First, I convert <=50K to 0 and >50K to 1 to make the response binary. I am trying to calculate precision and recall for both models to compare which model performs better. But table(test$salary,pred1 >0.5) for the SVM model gives only FALSE values and no TRUE values:
  FALSE
0    26
1     8
Does anyone know what the problem is?
I am new to R, so I hope I can get help here. Thanks a lot; any help is welcome. I hope the question is clear enough.
#setwd("C:/Users/)
Censusdata <- read.csv(file="census-data.csv", header=TRUE, sep=",")
library("dplyr", lib.loc="~/R/win-library/3.4")
# convert <=50K to 0, >50K to 1
data = Censusdata
data$salary<-as.numeric(factor(data$salary))-1
library(lattice)
library(ggplot2)
library(caret)
data <- Censusdata
indexes <- sample(1:nrow(data),size=0.7*nrow(data))
test <- data[indexes,]
train <- data[-indexes,]
#logistic regression model fit
model <- glm(salary ~ education.num + hours.per.week,family = binomial,data = test)
pred <- predict(model,data=train)
summary(model)
# calculate precision and recall
table(test$salary,pred >0.5)
# I get
#   FALSE TRUE
# 0    26    0
# 1     6    2
# for SVM model (svm() is assumed to come from the e1071 package)
library(e1071)
model1 <- svm(salary ~ education.num + hours.per.week,family = binomial, data=test)
pred1 <- predict(model1,data=train)
table(test$salary,pred1 >0.5)
# I get the following
#   FALSE
# 0    26
# 1     8

how to carry out logistic regression and random forest to predict churn rate

I am using following dataset: http://www.sgi.com/tech/mlc/db/churn.data
And the variable description: http://www.sgi.com/tech/mlc/db/churn.names
I did some preliminary coding, but I cannot work out how to apply logistic regression and random forest techniques to this data to predict variable importance and the churn rate.
nm <- read.csv("http://www.sgi.com/tech/mlc/db/churn.names",
skip=4, colClasses=c("character", "NULL"), header=FALSE, sep=":")[[1]]
nm
dat <- read.csv("http://www.sgi.com/tech/mlc/db/churn.data", header=FALSE, col.names=c(nm, "Churn"))
dat
View(dat)
library(survival)
s <- with(dat, Surv(account.length, as.numeric(Churn)))
model <- coxph(s ~ total.day.charge + number.customer.service.calls, data=dat[, -4])
summary(model)
plot(survfit(model))
Also, I cannot figure out how to use the model that I built in my further analysis.
Please help me.
Do you have any example code of what you're trying to do? What further analysis do you have planned? If you're just trying to run a logistic regression on the data, the general format is:
lr <- glm(Churn ~ international.plan + voice.mail.plan + number.vmail.messages
+ account.length, family = "binomial", data = dat)
Try help(glm) and help(randomForest)
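As a rough illustration of how a fitted logistic regression can then be used, the sketch below fits glm() to simulated churn-like data and pulls out the coefficient table, odds ratios, and predicted churn probabilities. The data frame churn_demo, its column names, and the coefficients used to simulate it are all made up for this example and are not the columns of the real churn data.

```r
set.seed(7)

# Simulated churn-like data (hypothetical columns, not the real data set).
n <- 400
churn_demo <- data.frame(
  service_calls = rpois(n, 1.5),
  day_charge    = rnorm(n, mean = 30, sd = 9)
)
lin <- -4 + 0.5 * churn_demo$service_calls + 0.06 * churn_demo$day_charge
churn_demo$churn <- rbinom(n, 1, plogis(lin))

fit <- glm(churn ~ service_calls + day_charge, family = binomial, data = churn_demo)

# Coefficient table: estimate, standard error, z value, and p-value per predictor.
coefs <- summary(fit)$coefficients

# Odds ratios: multiplicative change in the odds of churn per unit of each predictor.
odds_ratios <- exp(coef(fit))

# Predicted churn probability per customer and the implied overall churn rate.
churn_prob     <- predict(fit, type = "response")
predicted_rate <- mean(churn_prob)
```

The z values and p-values give one rough view of which predictors matter in the logistic model; a random forest would report variable importance differently.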

Example of Time Series Prediction using Neural Networks in R

Does anyone have a quick, short educational example of how to use neural networks (nnet in R) for prediction?
Here is an example, in R, of a time series
T = seq(0,20,length=200)
Y = 1 + 3*cos(4*T+2) +.2*T^2 + rnorm(200)
plot(T,Y,type="l")
Many thanks
David
I think you can use the caret package, especially its train function.
This function sets up a grid of tuning parameters for a number
of classification and regression routines.
require(quantmod)
require(nnet)
require(caret)
T = seq(0,20,length=200)
y = 1 + 3*cos(4*T+2) +.2*T^2 + rnorm(200)
dat <- data.frame( y, x1=Lag(y,1), x2=Lag(y,2))
names(dat) <- c('y','x1','x2')
dat <- dat[c(3:200),] #delete first 2 observations
#Fit model
model <- train(y ~ x1+x2 ,
dat,
method='nnet',
linout=TRUE,
trace = FALSE)
ps <- predict(model, dat)
#Examine results
plot(T,y,type="l",col = 2)
lines(T[-c(1:2)],ps, col=3)
legend(5, 70, c("y", "pred"), cex=1.5, fill=2:3)
The solution proposed by @agstudy is useful, but in-sample fits are not a reliable guide to out-of-sample forecasting accuracy. The gold standard in forecasting accuracy measurement is to use a holdout sample. Remove the last 5, 10, or 20 observations (depending on the length of the time series) from the training sample, fit your models to the rest of the data, use the fitted models to forecast the holdout sample, and compare accuracies on the holdout using Mean Absolute Deviations (MAD) or weighted Mean Absolute Percentage Errors (wMAPEs).
So to do this you can change the code above in this way:
require(quantmod)
require(nnet)
require(caret)
t = seq(0,20,length=200)
y = 1 + 3*cos(4*t+2) +.2*t^2 + rnorm(200)
dat <- data.frame( y, x1=Lag(y,1), x2=Lag(y,2))
names(dat) <- c('y','x1','x2')
train_set <- dat[c(3:185),]
test_set <- dat[c(186:200),]
#Fit model
model <- train(y ~ x1+x2 ,
train_set,
method='nnet',
linout=TRUE,
trace = FALSE)
ps <- predict(model, test_set)
#Examine results
plot(t,y,type="l",col = 2)
lines(t[c(186:200)],ps, col=3)
legend(5, 70, c("y", "pred"), cex=1.5, fill=2:3)
This last line outputs the wMAPE of the forecasts from the model:
sum(abs(ps-test_set$y))/sum(abs(test_set$y))
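The two holdout metrics mentioned above can be written as small helper functions. This is a minimal sketch with made-up numbers; the names mad_error and wmape are hypothetical (mad_error is used because stats::mad already means median absolute deviation in R).

```r
# Mean absolute deviation of the forecasts from the actual values.
mad_error <- function(actual, forecast) {
  mean(abs(actual - forecast))
}

# Weighted MAPE: total absolute error divided by total absolute actuals.
wmape <- function(actual, forecast) {
  sum(abs(actual - forecast)) / sum(abs(actual))
}

# Made-up holdout values and forecasts for illustration.
actual   <- c(10, 12, 14, 16)
forecast <- c(11, 11, 15, 18)

mad_value   <- mad_error(actual, forecast)  # absolute errors 1, 1, 1, 2 -> 1.25
wmape_value <- wmape(actual, forecast)      # 5 / 52
```

With the holdout example, the calls would be mad_error(test_set$y, ps) and wmape(test_set$y, ps).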
