Logistic Regression in R - using package "logistf"

I used the package "logistf" to perform a logistic regression in R.
library(rsample)   # initial_split(), training(), testing()
library(assertr)   # verify()
library(magrittr)  # %>%
library(logistf)
df <- read.csv("data.csv", header = TRUE, row.names = 1)
df <- as.data.frame(sapply(df, as.numeric))
df_split <- initial_split(df, prop = 0.9)
df_train <-
  training(df_split) %>%
  verify(expr = nrow(.) == 14355L)
df_test <-
  testing(df_split) %>%
  verify(expr = nrow(.) == 1596L)
x_train <- as.matrix(df_train[, 1:259]) # predictors only (drops the class column)
y_train <- as.double(as.matrix(df_train[, 260]))
mle <- logistf(y_train ~ x_train, firth = TRUE, family = binomial)
When I run the above code, I get the following error:
Error in logistf.fit(x = x, y = y, weight = weight, offset = offset, firth, :
In iteration 0: Determinant of Fisher information matrix was numerically 0
How can I fix this error?
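One thing worth checking: at iteration 0 the Fisher information is built directly from the design matrix, so a determinant that is numerically zero often points to a rank-deficient design (constant columns, or columns that are exact linear combinations of others). Below is a minimal base-R sketch of such a check, assuming that is the cause here. The helper name find_rank_problems and the toy matrix are hypothetical and are not part of logistf.

```r
# Hypothetical helper: report constant columns and columns that are
# linearly dependent on earlier ones (checked together with an intercept).
find_rank_problems <- function(x) {
  x <- as.matrix(x)
  if (is.null(colnames(x))) colnames(x) <- paste0("V", seq_len(ncol(x)))
  # Columns with a single distinct value
  constant <- colnames(x)[apply(x, 2, function(col) length(unique(col)) == 1)]
  # QR decomposition of the design matrix including the intercept
  design <- cbind(`(Intercept)` = 1, x)
  decomp <- qr(design)
  dependent <- character(0)
  if (decomp$rank < ncol(design)) {
    # qr() moves columns it finds dependent to the end of the pivot order
    dependent <- colnames(design)[decomp$pivot[(decomp$rank + 1):ncol(design)]]
  }
  list(rank = decomp$rank, n_cols = ncol(design),
       constant = constant, dependent = dependent)
}

# Toy data (not the asker's): x3 = x1 + x2, and x4 is constant
set.seed(1)
toy <- cbind(x1 = rnorm(20), x2 = rnorm(20))
toy <- cbind(toy, x3 = toy[, "x1"] + toy[, "x2"], x4 = 5)
problems <- find_rank_problems(toy)
print(problems)
```

If the check reports columns for x_train, dropping them before calling logistf() is one thing to try. With 259 predictors the determinant can also become tiny for scaling reasons alone, so this is a starting point rather than a guaranteed fix.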

Related

How to obtain the confusion matrix below?

Loading the data set in R
data(titanic_train, package = "titanic")
titanicTib <- as_tibble(titanic_train)
imp <- impute(titanicClean, cols = list(Age = imputeMean()))
This is the task
titanicTask <- makeClassifTask(data =
imp$data, target = "Survived")
This is the learner
logReg <- makeLearner("classif.logreg",
predict.type = "prob")
logRegModel <- train(logReg, titanicTask)
Prediction
data(titanic_test, package = "titanic")
titanicNew <- as_tibble(titanic_test)
titanicNewClean_unseen <- titanicNew %>%
mutate_at(.vars = c("Sex", "Pclass"), .funs = factor) %>%
mutate(FamSize = SibSp + Parch) %>%
select(Pclass, Sex, Age, Fare, FamSize)
titanic_prediction <- predict(logRegModel,
newdata = titanicNewClean_unseen)
I am trying to calculate the confusion matrix for these predictions to see how well the model performs. I tried the call below, but I can't get the matrix.
calculateConfusionMatrix(titanic_prediction$data$response)
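As far as I know, mlr's calculateConfusionMatrix() expects the whole Prediction object rather than the response column, and it needs the true labels as well. The Kaggle-style titanic_test set has no Survived column, so there is nothing to compare the predictions against. A confusion matrix is just a cross-tabulation of truth against prediction, which this base-R sketch shows on made-up labels (the vectors below are hypothetical, not Titanic data):

```r
# Toy labels (hypothetical): true classes and predicted classes
truth    <- factor(c(1, 0, 1, 1, 0, 0, 1, 0), levels = c(0, 1))
response <- factor(c(1, 0, 0, 1, 0, 1, 1, 0), levels = c(0, 1))

# Rows are the true class, columns the predicted class
conf_mat <- table(truth = truth, response = response)
print(conf_mat)

# Accuracy is the share of observations on the diagonal
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(accuracy)
```

To get the same thing for the model above, one option is to predict on data where Survived is known, for example a held-out part of titanic_train, and cross-tabulate the truth against the response there.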

How to find the predicted values with Keras

I'm learning Keras and would like to see the predicted values. The fitted model exposes a number of items, but none of them seems to hold the predictions.
library(keras)
library(MASS)  # provides the Boston data set used below
df <- MASS::Boston
index <- sample(c(TRUE, FALSE), nrow(df), replace=TRUE, prob=c(0.7,0.3))
train_features <- Boston[index,]
test_features <- Boston[!index,]
train_labels <- Boston$medv[index]
test_labels <- Boston$medv[!index]
train_features <- scale(train_features)
train_features <- train_features[,1:ncol(train_features)]
test_features <- scale(test_features)
test_features <- test_features[,1:ncol(test_features)]
mean <- apply(train_features, 2, mean)
sd <- apply(train_features, 2, sd)
train_data <- scale(train_features, center = mean, scale = sd)
test_data <- scale(test_features, center = mean, scale = sd)
train_targets <- Boston$medv[index]
test_targets <- Boston$medv[!index]
Here is where the model is built:
build_model <- function() {
model <- keras_model_sequential() %>%
layer_dense(64, activation = "relu") %>%
layer_dense(64, activation = "relu") %>%
layer_dense(1)
model %>% compile(optimizer = "rmsprop",
loss = "mse",
metrics = "mse")
model
}
Next we set up five folds, and track all_scores:
k <- 5
fold_id <- sample(rep(1:k, length.out = nrow(train_data)))
num_epochs <- 100
all_scores <- numeric()
for (i in 1:k) {
cat("Processing fold #", i, "\n")
val_indices <- which(fold_id == i)
val_data <- train_data[val_indices, ]
val_targets <- train_targets[val_indices]
partial_train_data <- train_data[-val_indices, ]
partial_train_targets <- train_targets[-val_indices]
model <- build_model()
model %>% fit (
partial_train_data,
partial_train_targets,
epochs = num_epochs,
batch_size = 16,
verbose = 0
)
results <- model %>%
evaluate(val_data, val_targets, verbose = 0)
all_scores[[i]] <- results[['mse']]
}
keras.RMSE <- sqrt(mean(all_scores))
However, none of the variables seem to have the predicted values. A few examples:
all_scores is a set of per-fold MSE scores (which I also want, since I take the square root of their mean to get the RMSE)
val_targets appears to be the wrong dimensions
model$fit does not return a value or set of values
model$predict generates predicted values, but those have already been generated, and I can't locate them.
How are the predicted values returned in a keras model?
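In the loop above, evaluate() only returns the loss and metrics, so the predictions are never stored anywhere. The usual pattern is to call predict() on the validation rows inside the loop and keep the result. Below is a minimal sketch of that pattern in base R, with lm() standing in for the Keras model so that it runs without Keras installed. The toy data and the names oof_pred and fold_mse are hypothetical. With Keras, the marked line would be a call such as model %>% predict(val_data).

```r
set.seed(42)
n <- 100
toy <- data.frame(x = rnorm(n))
toy$y <- 2 * toy$x + rnorm(n, sd = 0.5)

k <- 5
fold_id <- sample(rep(1:k, length.out = n))
oof_pred <- numeric(n)   # out-of-fold predictions, one slot per row
fold_mse <- numeric(k)   # validation MSE per fold

for (i in 1:k) {
  val_idx <- which(fold_id == i)
  # Stand-in for build_model() followed by fit() on the other folds
  fit <- lm(y ~ x, data = toy[-val_idx, ])
  # With Keras this would be: model %>% predict(val_data)
  pred <- predict(fit, newdata = toy[val_idx, ])
  oof_pred[val_idx] <- pred
  fold_mse[i] <- mean((pred - toy$y[val_idx])^2)
}

cv_rmse <- sqrt(mean(fold_mse))
head(data.frame(actual = toy$y, predicted = oof_pred))
```

After the loop, oof_pred holds one prediction per training row, each made by a model that never saw that row, and the fold scores are still available for the RMSE.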

Why do I get probabilities outside 0 and 1 with my Logistic regularized glmnet code?

library(tidyverse)
library(readxl)  # read_excel()
library(caret)
library(glmnet)
creditdata <- read_excel("R bestanden/creditdata.xlsx")
df <- as.data.frame(creditdata)
df <- na.omit(df)
df$married <- as.factor(df$married)
df$graduate_school <- as.factor(df$graduate_school)
df$high_school <- as.factor(df$high_school)
df$default_payment_next_month <- as.factor(df$default_payment_next_month)
df$sex <- as.factor(df$sex)
df$single <- as.factor(df$single)
df$university <- as.factor(df$university)
set.seed(123)
training.samples <- df$default_payment_next_month %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- df[training.samples, ]
test.data <- df[-training.samples, ]
x <- model.matrix(default_payment_next_month~., train.data)[,-1]
y <- ifelse(train.data$default_payment_next_month == 1, 1, 0)
cv.lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial")
lasso.model <- glmnet(x, y, alpha = 1, family = "binomial",
lambda = cv.lasso$lambda.1se)
x.test <- model.matrix(default_payment_next_month ~., test.data)[,-1]
probabilities <- lasso.model %>% predict(newx = x.test)
predicted.classes <- ifelse(probabilities > 0.5, "1", "0")
observed.classes <- test.data$default_payment_next_month
mean(predicted.classes == observed.classes)
Hi guys,
I'm new to R and I've been trying to follow the code on this website, http://www.sthda.com/english/articles/36-classification-methods-essentials/149-penalized-logistic-regression-essentials-in-r-ridge-lasso-and-elastic-net/, to fit a penalized logistic regression (lasso, since alpha = 1).
My aim is to predict whether a client defaults on their credit card, and the data set has factor variables as well as numerical variables. The problem is that most of my predicted "probabilities" are negative, and many are below -1 (for example -2.6 and -1.4). Does anyone know what is going wrong here?
Thanks in advance for the help!
Just like for glm, the predict function for glmnet by default returns predictions on the scale of the link function (the log-odds for a binomial model), which aren't probabilities.
To get the predicted probabilities, add type = "response" to the predict call:
probabilities <- lasso.model %>% predict(newx = x.test, type = "response")
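The difference between the two scales is easy to see with a plain glm() fit, which behaves the same way in this respect. This is a small sketch on simulated data (the data frame toy is made up for illustration):

```r
set.seed(123)
toy <- data.frame(x = rnorm(200))
toy$y <- rbinom(200, size = 1, prob = plogis(-1 + 2 * toy$x))

fit <- glm(y ~ x, data = toy, family = binomial)

# Default: link scale (log-odds), which can be any real number
link_scale <- predict(fit, newdata = toy)
# type = "response": probabilities strictly between 0 and 1
prob_scale <- predict(fit, newdata = toy, type = "response")

print(range(link_scale))
print(range(prob_scale))

# The inverse logit maps one scale onto the other
print(all.equal(plogis(link_scale), prob_scale))
```

Note that a probability above 0.5 corresponds to a link value above 0, so thresholding link-scale predictions at 0.5 classifies too few observations as "1".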

I'm getting Error in dat$y : $ operator is invalid for atomic vectors when trying to calculate the possible results using map() function

I am currently taking an online course, Data Science: Machine Learning, and we are asked to fit a linear model 100 times and report the mean and standard deviation of the RMSE for data sets of different sizes, n = c(100, 500, 1000, 5000, 10000).
We are asked to write a function that takes the size n, builds the data set, and runs the loop that fits the 100 models; then to set the seed and use map() or sapply() to apply the new function to the different values of n.
My code gives the error "Error in dat$y : $ operator is invalid for atomic vectors" when I run f1.
This is my code:
library(MASS)
library(caret)
library(tidyverse)  # map(), %>%, slice()
ff=function(n){
Sigma <- 9*matrix(c(1.0, 0.5, 0.5, 1.0), 2, 2)
dat <- MASS::mvrnorm(n, c(69, 69), Sigma)%>%data.frame() %>% setNames(c("x", "y"))
}
set.seed(1,sample.kind = "Rounding")
n=c(100,500,1000,5000,10000)
f1=map(n,function(dat){
rmse=replicate(100,{
y <- dat$y
test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
train_set <- dat %>% slice(-test_index)
test_set <- dat %>% slice(test_index)
fit <- lm(y ~ x, data = train_set)
y_hat <- fit$coef[1] + fit$coef[2]*test_set$x
sqrt(mean((y_hat - test_set$y)^2))
})
structure(c(mean(rmse),sd(rmse)))
})
Thank you for your help!!
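map() passes each element of n (a single number) to your function, so inside it dat is a numeric value rather than a data frame, and dat$y fails. Build the data set inside the function by calling ff() first. I think you should use something like: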
library(caret)
library(dplyr)
n=c(100,500,1000,5000,10000)
f1= purrr::map(n,function(x){
rmse=replicate(100,{
dat <- ff(x)
y <- 1:nrow(dat)
test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
train_set <- dat %>% slice(-test_index)
test_set <- dat %>% slice(test_index)
fit <- lm(y ~ x, data = train_set)
y_hat <- fit$coef[1] + fit$coef[2]*test_set$x
sqrt(mean((y_hat - test_set$y)^2))
})
c(mean(rmse),sd(rmse))
})
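To see the error in isolation, here is a small base-R sketch that reproduces it and then runs the repeated fit with sapply(). It avoids caret and MASS so that it runs anywhere. The helper make_data is a hypothetical stand-in for ff() that draws from the same bivariate normal distribution by simulating y conditionally on x. It uses only two sizes and 10 replications to keep it quick.

```r
sizes <- c(100, 500)

# Each element handed to the function is a plain number, so `$` fails
msg <- tryCatch(
  sapply(sizes, function(dat) dat$y),
  error = function(e) conditionMessage(e)
)
print(msg)

# Hypothetical stand-in for ff(): x and y both have mean 69, sd 3, correlation 0.5
make_data <- function(n) {
  x <- rnorm(n, mean = 69, sd = 3)
  y <- 69 + 0.5 * (x - 69) + rnorm(n, sd = 3 * sqrt(0.75))
  data.frame(x = x, y = y)
}

set.seed(1)
res <- sapply(sizes, function(n) {
  rmse <- replicate(10, {
    dat <- make_data(n)                 # build the data inside the function
    test_idx <- sample(n, n / 2)        # random 50/50 split
    fit <- lm(y ~ x, data = dat[-test_idx, ])
    y_hat <- predict(fit, newdata = dat[test_idx, ])
    sqrt(mean((y_hat - dat$y[test_idx])^2))
  })
  c(mean = mean(rmse), sd = sd(rmse))
})
colnames(res) <- sizes
print(res)
```

sapply() returns a matrix with one column per size, which is a little easier to read than the list that map() gives.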

How to evaluate LightGBM in R using Cohen's kappa?

I use XGBoost in R on a regular basis and want to start using LightGBM on the same data. My goal is to use Cohen's kappa as the evaluation metric. However, I am not able to get LightGBM to work properly: it seems that no learning occurs. As a very simple example, I'll use the Titanic dataset.
library(data.table)
library(dplyr)
library(caret)
titanic <- fread("https://raw.githubusercontent.com/pcsanwald/kaggle-titanic/master/train.csv")
titanic_complete <- titanic %>%
select(survived, pclass, sex, age, sibsp, parch, fare, embarked) %>%
mutate_if(is.character, as.factor) %>%
mutate(survived = as.factor(survived)) %>%
na.omit()
train_class <- titanic_complete %>%
select(survived) %>%
pull()
train_numeric <- titanic_complete %>%
select_if(is.numeric) %>%
data.matrix()
ctrl <- trainControl(method = "none", search = "grid")
tune_grid_xgbTree <- expand.grid(
nrounds = 700,
eta = 0.1,
max_depth = 3,
gamma = 0,
colsample_bytree = 0,
min_child_weight = 1,
subsample = 1)
set.seed(512)
fit_xgb <- train(
x = train_numeric,
y = train_class,
tuneGrid = tune_grid_xgbTree,
trControl = ctrl,
method = "xgbTree",
metric = "Kappa",
verbose = TRUE)
confusionMatrix(predict(fit_xgb, train_numeric), train_class)
This gives me a kappa of 0.57 evaluated on the training set (which is only to show my problem; otherwise I would use cross-validation).
For LightGBM, I write Kappa as a custom evaluation function:
library(lightgbm)
lgb.kappa <- function(preds, y) {
label <- getinfo(y, "label")
k <- unlist(e1071::classAgreement(table(label, preds)))["kappa"]
return(list(name = "kappa", value = as.numeric(k), higher_better = TRUE))
}
X_train <- titanic_complete %>% select(-survived) %>% data.matrix()
y_train <- titanic_complete %>% select(survived) %>% data.matrix()
y_train <- y_train - 1
dtrain <- lgb.Dataset(data = X_train, label = y_train)
Here, I use the same parameter set as in XGBoost, but I have tried different combinations without success.
fit_lgbm <- lgb.train(data = dtrain,
objective = "binary",
learning_rate = 0.1,
nrounds = 700,
colsample_bytree = 0,
eval = lgb.kappa,
min_child_weight = 1,
max_depth = 3)
No learning occurs and the algorithm outputs "No further splits with positive gain, best gain: -inf" and Kappa = 0.
If someone has successfully implemented LightGBM (maybe with a custom evaluation metric), I would be very happy for a hint on how to resolve this.
Regarding "No learning occurs and the algorithm outputs 'No further splits with positive gain, best gain: -inf'":
This is because LightGBM's default parameter values are configured for larger datasets. The training dataset in your example above only has 714 rows. To deal with this, I recommend setting LightGBM's parameters to values that permit smaller leaf nodes, and limiting the number of leaves instead of the depth.
list(
"min_data_in_leaf" = 3
, "max_depth" = -1
, "num_leaves" = 8
)
Regarding "and Kappa = 0":
I believe your implementation of Cohen's kappa has a mistake. e1071::classAgreement() expects a table of counts (a confusion matrix), but preds holds predicted probabilities, so table(label, preds) has one column per distinct probability instead of one per class. Thresholding the probabilities first gives an implementation that I think is correct, based on the description of this metric on Wikipedia.
lgb.kappa <- function(preds, dtrain) {
label <- getinfo(dtrain, "label")
threshold <- 0.5
thresholded_preds <- as.integer(preds > threshold)
k <- unlist(e1071::classAgreement(table(label, thresholded_preds)))["kappa"]
return(list(name = "kappa", value = as.numeric(k), higher_better = TRUE))
}
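For reference, the thresholding step and the kappa formula can be checked by hand without e1071 or lightgbm. This is a minimal base-R sketch using the standard definition, kappa = (p_observed - p_expected) / (1 - p_expected). The function name cohen_kappa and the toy vectors are hypothetical.

```r
# Hypothetical helper: Cohen's kappa from two vectors of class labels
cohen_kappa <- function(label, pred_class) {
  lv <- union(label, pred_class)
  tab <- table(factor(label, levels = lv), factor(pred_class, levels = lv))
  n <- sum(tab)
  p_observed <- sum(diag(tab)) / n                       # observed agreement
  p_expected <- sum(rowSums(tab) * colSums(tab)) / n^2   # agreement by chance
  (p_observed - p_expected) / (1 - p_expected)
}

# Toy labels and predicted probabilities (made up for illustration)
label <- c(1, 0, 1, 1, 0, 0, 1, 0)
probs <- c(0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3)

# Tabulating raw probabilities gives one column per distinct value,
# which is not a confusion matrix
raw_dim <- dim(table(label, probs))
print(raw_dim)

# Thresholding first gives a proper 2 x 2 table of classes
k_value <- cohen_kappa(label, as.integer(probs > 0.5))
print(k_value)
```

Here six of the eight predictions agree with the labels (p_observed = 0.75) and chance agreement is 0.5, which gives a kappa of 0.5.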
Finally, I think 700 iterations is probably too many for a 700ish-observation dataset. You can see the value of metrics evaluated against the training data at each iteration by passing the training data as a validation set.
Taken together, I think the code below accomplishes what the original question asked for.
library(data.table)
library(dplyr)
library(caret)
library(lightgbm)
titanic <- fread("https://raw.githubusercontent.com/pcsanwald/kaggle-titanic/master/train.csv")
titanic_complete <- titanic %>%
select(survived, pclass, sex, age, sibsp, parch, fare, embarked) %>%
mutate_if(is.character, as.factor) %>%
mutate(survived = as.factor(survived)) %>%
na.omit()
train_class <- titanic_complete %>%
select(survived) %>%
pull()
train_numeric <- titanic_complete %>%
select_if(is.numeric) %>%
data.matrix()
lgb.kappa <- function(preds, dtrain) {
label <- getinfo(dtrain, "label")
threshold <- 0.5
thresholded_preds <- as.integer(preds > threshold)
k <- unlist(e1071::classAgreement(table(label, thresholded_preds)))["kappa"]
return(list(name = "kappa", value = as.numeric(k), higher_better = TRUE))
}
X_train <- titanic_complete %>% select(-survived) %>% data.matrix()
y_train <- titanic_complete %>% select(survived) %>% data.matrix()
y_train <- y_train - 1
# train, printing out eval metrics at every iteration
fit_lgbm <- lgb.train(
data = lgb.Dataset(
data = X_train,
label = y_train
),
params = list(
"min_data_in_leaf" = 3
, "max_depth" = -1
, "num_leaves" = 8
),
objective = "binary",
learning_rate = 0.1,
nrounds = 10L,
verbose = 1L,
valids = list(
"train" = lgb.Dataset(
data = X_train,
label = y_train
)
),
eval = lgb.kappa
)
# evaluate a custom function after training
fit_lgbm$eval_train(
feval = lgb.kappa
)
