Applying logistic regression to the Titanic dataset in R

I have the famous Titanic data set from Kaggle's website. I want to predict the survival of the passengers using logistic regression. I am using the glm() function in R. I first divide my data frame (total rows = 891) into two data frames, i.e. train (rows 1 to 800) and test (rows 801 to 891).
The code is as follows:
data <- read.csv("train.csv", stringsAsFactors = FALSE)
names(data)
 [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"         "Age"         "SibSp"
 [8] "Parch"       "Ticket"      "Fare"        "Cabin"       "Embarked"
#Replacing NA values in the Age column with the mean of the non-NA values of Age
data$Age[is.na(data$Age)] <- mean(data$Age, na.rm = TRUE)
#Converting Sex into binary values: 1 for males and 0 for females
sexcode <- ifelse(data$Sex == "male", 1, 0)
#Dividing the data into train and test data frames
train <- data[1:800,]
test <- data[801:891,]
#Setting up the model using glm()
model <- glm(Survived ~ sexcode[1:800] + Age + Pclass + Fare,
             family = binomial(link = 'logit'), data = train, control = list(maxit = 50))
#Creating a data frame
newtest <- data.frame(sexcode[801:891], test$Age, test$Pclass, test$Fare)
prediction <- predict(model, newdata = newtest, type = 'response')
And as I run the last line of code,
prediction <- predict(model, newdata = newtest, type = 'response')
I get the following error:
Error in eval(expr, envir, enclos) : object 'Age' not found
Can anyone please explain what the problem is? I have checked the newtest variable and there doesn't seem to be any problem with it.
Here is the link to titanic data set https://www.kaggle.com/c/titanic/download/train.csv

First, you should add sexcode directly to the data frame:
data$sexcode <- ifelse(data$Sex == "male", 1, 0)
Then, as I commented, you have a problem with the column names in the newtest data frame because you create it manually. You can use the test data frame directly.
So here is your full working code:
data <- read.csv("train.csv", stringsAsFactors = FALSE)
data$Age[is.na(data$Age)] <- mean(data$Age, na.rm = TRUE)
data$sexcode <- ifelse(data$Sex == "male",1,0)
train <- data[1:800,]
test <- data[801:891,]
model <- glm(Survived ~ sexcode + Age + Pclass + Fare,
             family = binomial(link = 'logit'), data = train, control = list(maxit = 50))
prediction <- predict(model, newdata = test, type = 'response')
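If you want a quick sanity check of the fit, here is a minimal sketch that turns the predicted probabilities into class labels and compares them with the held-out Survived column (the 0.5 cutoff is an assumption on my part, not part of the original answer):
# classify at an assumed 0.5 probability threshold and check accuracy
predicted_class <- ifelse(prediction > 0.5, 1, 0)
mean(predicted_class == test$Survived)                     # proportion correct
table(predicted = predicted_class, actual = test$Survived) # confusion table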

Related

Multiple imputation and mlogit for a multinomial regression

I am trying to run a multinomial regression with imputed data. I can do this with the nnet package; however, I want to use mlogit. Using the mlogit package I keep getting the following error: "Error in 1:nrow(data) : argument of length 0".
So, making the data:
library(mlogit)
library(nnet)
library(tidyverse)
library(mice)
df <- data.frame(vax = sample(1:6, 500, replace = T),
                 age = runif(500, 12, 18),
                 var1 = sample(1:2, 500, replace = T),
                 var2 = sample(1:5, 500, replace = T))
# Create missing data using the mice package:
df2 <- ampute(df, prop = 0.15)
df3 <- df2$amp
df3$vax <- as.factor(df3$vax)
df3$var1 <- as.factor(df3$var1)
df3$var2 <- as.factor(df3$var2)
# Impute missing data:
df4 <- mice(df3, m = 5, print = T, seed = 123)
It works using nnet's multinom:
multinomtest <- with(df4, multinom(vax ~ age + var1 + var2, data = df, model = T))
summary(pool(multinomtest))
But it throws an error when I try to reshape the data into mlogit format:
test <- with(df4, dfidx(data = df4, choice = "vax", shape = "wide"))
Does anyone have any idea how I can get the imputed data into mlogit format, or even whether mlogit has compatibility with mice or any other imputation package?
Answer
You are using with.mids incorrectly, and thus both lines of code are wrong; the multinom line just doesn't give an error. If you want to apply multiple functions to the imputed datasets, you're better off using something like lapply:
analyses <- lapply(seq_len(df4$m), function(i) {
  data.i <- complete(df4, i)
  data.idx <- dfidx(data = data.i, choice = "vax", shape = "wide")
  mlogit(vax ~ 1 | age + var1 + var2,
         data = data.idx,
         reflevel = "1",
         nests = list(type1 = c("1", "2"), type2 = c("3", "4"), type3 = c("5", "6")))
})
test <- list(call = "", call1 = df4$call, nmis = df4$nmis, analyses = analyses)
oldClass(test) <- c("mira", "matrix")
summary(pool(test))
How with.mids works
When you apply with to a mids object (AKA the output of mice::mice), then you are actually calling with.mids.
If you use getAnywhere(with.mids) (or just type mice:::with.mids), you'll find that it does a couple of things:
1. It loops over all imputed datasets.
2. It uses complete to get one dataset.
3. It runs the expression with the dataset as the environment.
The third step is the problem. For functions that use formulas (like lm, glm and multinom), you can use that formula within a given environment. If the variables are not in the current environment (but rather in e.g. a data frame), you can specify a new environment by setting the data variable.
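To make step 3 concrete, here is a rough sketch of what with.mids effectively does for a single imputation (illustrative only; the real mice:::with.mids source differs in detail):
# roughly what with(df4, multinom(...)) does for imputation i: the completed
# data frame itself becomes the evaluation environment, so the formula's
# variables resolve to the imputed columns
i <- 1
data.i <- complete(df4, i)
fit.i <- eval(quote(multinom(vax ~ age + var1 + var2, model = TRUE)),
              envir = data.i)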
The problems
This is where both your problems derive from:
In your multinom call, you set the data variable to be df. Hence, you are actually running your multinom on the original df, NOT the imputed dataset!
In your dfidx call, you are again filling in data directly. This is also wrong. However, leaving it empty also gives an error. This is because with.mids doesn't fill in the data argument, but only the environment. That isn't sufficient for you.
Fixing multinom
The solution for your multinom line is simple: just don't specify data:
multinomtest <- with(df4, multinom(vax ~ age + var1 + var2, model = T))
summary(pool(multinomtest))
As you will see, this will yield very different results! But it is important to realise that this is what you are trying to obtain.
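If you want to see just how different, a quick sketch (reusing the objects above; untested) is to pool both versions and put the estimates side by side:
# the first call fits the original df every time; the second fits each imputed dataset
wrong <- with(df4, multinom(vax ~ age + var1 + var2, data = df, model = T))
right <- with(df4, multinom(vax ~ age + var1 + var2, model = T))
cbind(wrong = summary(pool(wrong))$estimate,
      right = summary(pool(right))$estimate)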
Fixing dfidx (and mlogit)
We cannot do this with with.mids, since it uses the imputed dataset as the environment, but you want to use the modified dataset (after dfidx) as your environment. So, we have to write our own code. You could just do this with any looping function, e.g. lapply:
analyses <- lapply(seq_len(df4$m), function(i) {
  data.i <- complete(df4, i)
  data.idx <- dfidx(data = data.i, choice = "vax", shape = "wide")
  mlogit(vax ~ 1 | age + var1 + var2, data = data.idx, reflevel = "1",
         nests = list(type1 = c("1", "2"), type2 = c("3", "4"), type3 = c("5", "6")))
})
From there, all we have to do is make something that looks like a mira object, so that we can still use pool:
test <- list(call = "", call1 = df4$call, nmis = df4$nmis, analyses = analyses)
oldClass(test) <- c("mira", "matrix")
summary(pool(test))
Offering this as a way forward to circumvent the error with dfidx():
df5 <- df4$imp %>%
  # work with a list, where each top-level element is a different imputation run (imp_n)
  map(~as.list(.x)) %>%
  transpose() %>%
  # for each run, impute and return the full (imputed) data set
  map(function(imp_n.x) {
    df_out <- df4$data
    df_out$vax[is.na(df_out$vax)] <- imp_n.x$vax
    df_out$age[is.na(df_out$age)] <- imp_n.x$age
    df_out$var1[is.na(df_out$var1)] <- imp_n.x$var1
    df_out$var2[is.na(df_out$var2)] <- imp_n.x$var2
    return(df_out)
  }) %>%
  # no errors with dfidx() now
  map(function(imp_n.x) {
    dfidx(data = imp_n.x, choice = "vax", shape = "wide")
  })
However, I'm not too familiar with mlogit(), so I can't help beyond this.
Update 8/2/21
As @slamballais mentioned in their answer, the issue is with the dataset you refer to when fitting the model. I assume that mldata (from your code in the comments section) is a data.frame? That is probably why you are seeing the same coefficients - you are not referring to the imputed data sets (which I've identified as imp_n.x in the functions). The function purrr::map() is very similar to lapply(), applying a function to each element of a list. So to get the code working properly, you would want to change mldata to imp_n.x:
# To fit mlogit() for each imputed data set
df5 %>%
  map(function(imp_n.x) {
    # formula as specified in the comments
    mlogit(vax ~ 1 | age + var1 + var2,
           data = imp_n.x,
           reflevel = "1",
           nests = list(type1 = c('1', '2'),
                        type2 = c('3', '4'),
                        type3 = c('5', '6')))
  })
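A possible follow-up (an untested sketch that reuses the mira trick from the answer above) is to pool these per-imputation fits as well:
fits <- df5 %>%
  map(function(imp_n.x) {
    mlogit(vax ~ 1 | age + var1 + var2, data = imp_n.x, reflevel = "1",
           nests = list(type1 = c('1', '2'), type2 = c('3', '4'), type3 = c('5', '6')))
  })
pooled <- list(call = "", call1 = df4$call, nmis = df4$nmis, analyses = fits)
oldClass(pooled) <- c("mira", "matrix")
summary(pool(pooled))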

I am trying to run XGBoost in R but am facing some issues

I have a dataset of 25 variables and 248 rows.
There are 8 factor variables; the rest are integers and numerics.
I am trying to run XGBoost.
I have used the following code:
library(xgboost)  # for xgb.DMatrix
library(Matrix)   # for sparse.model.matrix
# Partition data
set.seed(1234)
ind <- sample(2, nrow(mission), replace = T, prob = c(0.7, 0.3))
train <- mission[ind == 1,]
test <- mission[ind == 2,]
# Create matrix - one-hot encoding for factor variables
trainm <- sparse.model.matrix(GRL ~ .-1, data = train)
head(trainm)
train_label <- train[,"GRL"]
train_matrix <- xgb.DMatrix(data = as.matrix(trainm), label = train_label)
testm <- sparse.model.matrix(GRL ~ .-1, data = test)
test_label <- test[,"GRL"]
test_matrix <- xgb.DMatrix(data = as.matrix(testm), label = test_label)
The response variable here is "GRL", and I am running test_label <- test[,"GRL"].
That line executes, but when I try to use the label in xgb.DMatrix, I encounter the following error:
Error in setinfo.xgb.DMatrix(dmat, names(p), p[[1]]) :
The length of labels must equal to the number of rows in the input data
I have partitioned the data into 70:30.
test[,"GRL"] returns a data.frame, and XGBoost needs the label to be a vector.
Just use teste$GRL or test[["GRL"]] instead. You also need to do the same for the training dataset
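Putting that together, a corrected sketch of the relevant lines (assuming mission, train and test as in the question) would be:
# extract the labels as plain vectors rather than one-column data.frames
train_label <- train$GRL
test_label  <- test$GRL
# sparse.model.matrix already returns a sparse dgCMatrix, which xgb.DMatrix
# accepts directly, so the as.matrix() calls are not needed either
train_matrix <- xgb.DMatrix(data = trainm, label = train_label)
test_matrix  <- xgb.DMatrix(data = testm,  label = test_label)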

SVM with rolling window

I have code which predicts the change in the sign of future returns.
library(quantmod)
library(PerformanceAnalytics)
library(forecast)
library(e1071)
library(caret)
library(kernlab)
library(dplyr)
library(roll)
# get data yahoo finance
getSymbols("^GSPC", from = "1990-01-01", to = "2017-12-01")
# take log returns
rnull <- CalculateReturns(prices = GSPC$GSPC.Adjusted, method = "log")
# lags 1, 2, 3, 4, 5 as features
feat <- merge(na.trim(lag(rnull, 1)), na.trim(lag(rnull, 2)), na.trim(lag(rnull, 3)),
              na.trim(lag(rnull, 4)), na.trim(lag(rnull, 5)), all = FALSE)
# create dataset: the 6th column is the actual return, the previous five are its lags
dataset <- merge(feat, rnull, all = FALSE)
# set columns' names
colnames(dataset) = c("lag.1", "lag.2", "lag.3","lag.4","lag.5","TARGET")
# get signs and make a data.frame
x <- sign(dataset) %>% as.data.frame
# exclude the 0 sign and assume that these values are positive
x[x == 0] <- 1
# for svm purposes we need to make the dependent variable a factor with interpretable levels
x$TARGET <- as.factor(as.character(x$TARGET))
levels(x$TARGET) <- list(positive = "1", negative = "-1")
# divide the sample into training and test subsamples
trainindex <- x[1:5792,]
testindex <- x[5792:7030,]
# run svm
svmFit <- ksvm(TARGET~.,data=trainindex,type="C-svc",kernel= "rbfdot")
# prediction
predsvm <- predict(svmFit, newdata=testindex)
# results
confusionMatrix(predsvm, testindex$TARGET)
The next thing I am going to do is add a rolling window (one-step forecast) to my model.
However, basic methods such as rollapply do not work with data frames, and common one-step forecasting methods for time series are also not valid for the data.frame used by the e1071 package.
I wrote the following function:
svm_next_day_prediction <- function(x){
  svmFit <- svm(TARGET ~ ., data = x)
  prediction <- predict(object = svmFit, newdata = tail(x, 1))
  return(prediction)
}
apl = rollapplyr(data = x, width = 180, FUN = svm_next_day_prediction, by.column = TRUE)
but received an error because rollapply does not understand data frames:
Error in terms.formula(formula, data = data) : '.' in formula and
no 'data' argument
Can you please explain how to apply a rolling window to an svm classification model with a data frame?
A few points
rollapply works with data frames that can be coerced to a matrix, so be sure that your input is entirely numeric, not a mix of numeric and factor. For example, this works using the built-in data frame BOD, which has two numeric columns. Note that the x passed to pred is a matrix here.
pred <- function(x) predict(svm(demand ~ Time, x))
rollapplyr(BOD, 3, FUN = pred, by.column = FALSE)
giving
## 1 2 3
## [1,] 8.868888 10.86889 17.25474
## [2,] 11.661666 17.24870 16.00000
## [3,] 18.328435 16.18583 15.78583
## [4,] 16.230474 15.83247 19.56886
I can't reproduce the error you get; I get a different error. The code in the question has by.column = TRUE (which is the default anyway), but that has the result of passing only a single column vector to the function, which is not what you want. You want by.column = FALSE.
Try this:
x0 <- data.matrix(x)
rollapplyr(data = x0, width = 180, FUN = svm_next_day_prediction, by.column = FALSE)
You can create a list with the individual data frames and then apply your function. I rename x to df to avoid confusion:
df <- x
rowwindow <- 179
dfList <- lapply(1:(nrow(df) - rowwindow), function(x) df[x:(rowwindow + x),])
result <- sapply(dfList, svm_next_day_prediction)
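As a quick sanity check on this list-based approach, a short sketch using the objects defined above:
length(dfList)   # number of rolling windows: nrow(df) - rowwindow
dim(dfList[[1]]) # each window spans rowwindow + 1 = 180 rows
head(result)     # the first few one-step predictions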

R glm generating different p-values for same categorical variables of different type

I am generating a model fit using glm. My data has a mix of integer variables and categorical variables. The categorical variables are in the form of codes and hence of integer type in the data. Initially, when I tried to generate the model, I passed the categorical variables in integer format as-is and got the model. I was looking at the p-values to check the ones that are significant and noticed a few variables were significant that I was not expecting.
This is when I realized that the categorical variables in integer form might be creating an issue. For example, code 3 might get a higher weight than code 1 (I am not sure about this and it would be great if someone can confirm). On doing some research I found that we can convert a categorical integer variable to a factor. I did the same and re-generated the model.
I also saw some posts where it was mentioned to convert to binary, so I did that as well. So now I have 3 results -
r1 >> with categorical integer variables
r2 >> with categorical factor variables
r3 >> with categorical variable converted to binary
I feel that output 1 with categorical integer variables is incorrect (please confirm). But between outputs 2 and 3 I am confused which one to consider, as:
the p-values are different,
which one would be more accurate?
can I relate the p-values of output 3 to those of output 2?
How does glm handle such variables?
I hope glm inside a for loop is not an issue.
My database is big; can we do glm using data.table?
I am pasting below my code with some sample data so it can be reproduced:
library("plyr")
library("foreign")
library("data.table")
#####Generating sample data
set.seed(1200)
id <- 1:100
bill <- sample(1:3,100,replace = T)
nos <- sample(1:40,100,replace = T)
stru <- sample(1:4,100,replace = T)
type <- sample(1:7,100,replace = T)
value <- sample(100:1000,100,replace = T)
df1 <- data.frame(id,bill,nos,stru,type,value)
var1 <- c("bill","nos","stru")
options(scipen = 999)
r1 <- data.frame()
for(type1 in unique(df1$type)){
  for(var in var1){
    # dynamically generate formula
    fmla <- as.formula(paste0("value ~ ", var))
    # fit glm model
    fit <- glm(fmla, data = df1[df1$type == type1,], family = 'quasipoisson')
    p.value <- coef(summary(fit))[8]
    cfit <- coef(summary(fit))
    # create data frame
    df2 <- data.frame(var = var, type = type1, basket = "value", p.value = cfit[8], stringsAsFactors = F)
    r1 <- rbind(r1, df2)
  }
}
##### converting the categorical numeric variables to factor variables
df1$bill_f <- as.factor(bill)
df1$stru_f <- as.factor(stru)
var1 <- c("bill_f","nos","stru_f")
r2 <- data.frame()
for(type1 in unique(df1$type)){
  for(var in var1){
    # dynamically generate formula
    fmla <- as.formula(paste0("value ~ ", var))
    # fit glm model
    fit <- glm(fmla, data = df1[df1$type == type1,], family = 'quasipoisson')
    p.value <- coef(summary(fit))[8]
    cfit <- coef(summary(fit))
    # create data frame
    df2 <- data.frame(var = var, type = type1, basket = "value", p.value = cfit[8], stringsAsFactors = F)
    r2 <- rbind(r2, df2)
  }
}
#####converting the categorical numeric variables to binary format (1/0)
df1$bill_1 <- ifelse(df1$bill == 1,1,0)
df1$bill_2 <- ifelse(df1$bill == 2,1,0)
df1$bill_3 <- ifelse(df1$bill == 3,1,0)
df1$stru_1 <- ifelse(df1$stru == 1,1,0)
df1$stru_2 <- ifelse(df1$stru == 2,1,0)
df1$stru_3 <- ifelse(df1$stru == 3,1,0)
df1$stru_4 <- ifelse(df1$stru == 4,1,0)
var1 <- c("bill_1","bill_2","bill_3","nos","stru_1","stru_2","stru_3")
r3 <- data.frame()
for(type1 in unique(df1$type)){
  for(var in var1){
    # dynamically generate formula
    fmla <- as.formula(paste0("value ~ ", var))
    # fit glm model
    fit <- glm(fmla, data = df1[df1$type == type1,], family = 'quasipoisson')
    p.value <- coef(summary(fit))[8]
    cfit <- coef(summary(fit))
    # create data frame
    df2 <- data.frame(var = var, type = type1, basket = "value", p.value = cfit[8], stringsAsFactors = F)
    r3 <- rbind(r3, df2)
  }
}
Your feeling is mostly correct. For a GLM you should make the distinction between continuous variables and discrete (categorical) variables.
Binary variables are variables which contain only 2 levels, for example 0 and 1.
Since your categorical variables have more than 2 levels, you should use the factor() function.
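To make that concrete, here is a minimal sketch (using the sample df1 from the question); glm expands each factor into treatment-contrast dummy columns itself, which is what the manual 1/0 columns approximate, minus a reference level:
df1$bill_f <- factor(df1$bill)
df1$stru_f <- factor(df1$stru)
fit <- glm(value ~ bill_f + nos + stru_f, data = df1, family = 'quasipoisson')
summary(fit)             # one coefficient per non-reference factor level
head(model.matrix(fit))  # the dummy columns glm actually used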

SVM a dataframe based on the last column

I have been trying for hours and hours to run an SVM on a data frame, using the last column as the class.
I have this data frame
# Fill the data frame
df = read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data",
                sep = ",",
                col.names = c("buying", "maint", "doors", "persons", "lug_boot", "safety", ""),
                fill = TRUE,
                strip.white = TRUE)
lastColName <- colnames(df)[ncol(df)]
...
model <- svm(lastColName ~ .,
             data = df,
             kernel = "polynomial",
             degree = degree,
             type = "C-classification",
             cost = cost)
I'm getting either NULL or Error in model.frame.default(formula = str(lastColName) ~ ., data = df1, : invalid type (NULL) for variable 'str(lastColName)'. I understand that NULL appears when the column has no name. I don't understand the other error, since it is the last column's name.
Any ideas?
You have to use as.formula when you are trying to use a dynamic variable in the formula. For details, see ?as.formula.
The following code works fine:
library(e1071)
df_1 = read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data",
                  sep = ",",
                  col.names = c("buying", "maint", "doors", "persons", "lug_boot", "safety", ""),
                  fill = TRUE,
                  strip.white = TRUE)
lastColName <- colnames(df_1)[ncol(df_1)]
model <- svm(as.formula(paste(lastColName, "~ .", sep = " ")),
             data = df_1,
             kernel = "polynomial",
             degree = 3,
             type = "C-classification",
             cost = 1)
# to predict on the data remove the last column
prediction <- predict(model, df_1[,-ncol(df_1)])
# The output
table(prediction)
# The output is:
prediction
acc good unacc vgood
0 0 1728 0
# Since this is a highly unbalanced classification the model is not doing a very good job
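Since the model collapses to the majority class, one possible (untested) tweak is e1071's class.weights argument, weighting each class inversely to its frequency; this sketch is my own addition, not part of the original answer:
# weight classes inversely to their frequency (assumes df_1 and lastColName above)
tab <- table(df_1[[lastColName]])
w <- setNames(as.numeric(max(tab) / tab), names(tab))
model_w <- svm(as.formula(paste(lastColName, "~ .")),
               data = df_1,
               kernel = "polynomial", degree = 3,
               type = "C-classification", cost = 1,
               class.weights = w)
table(predict(model_w, df_1[,-ncol(df_1)]))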
