Issues performing Hosmer-Lemeshow test in R

Please help!
I am trying to perform a Hosmer-Lemeshow (HL) test to assess the goodness of fit of my model, but I keep getting the error shown below.
I have these packages installed and loaded:
library(jtools)
library(skimr)
library(epiR)
library(rms)
library(epimisc)
library(DescTools)
library(car)
library(readxl)
library(summarytools)
library(survival)
library(ggplot2)
library(survminer)
library(PredictABEL)
This is my code:
# Final model
l_final <- glm(pro$treatment ~ pro$age,family = binomial(link="logit"),data = pro, x=TRUE)
summ(l_final)
# Assessing the fit of the model
# predict probabilities
pro$l_final_pred <- predict(l_final, type = "response")
# Hosmer-Lemeshow test
HosmerLemeshowTest(pro$l_final_pred, pro$treatment, X = l_final_pred$x)
This is the output:
> # Assessing the fit of the model
> pro$l_final_pred <- predict(l_final, type = "response")
Error:
! Assigned data `predict(l_final, type = "response")` must be compatible with existing data.
✖ Existing data has 866 rows.
✖ Assigned data has 864 rows.
ℹ Only vectors of size 1 are recycled.
Backtrace:
1. base::`$<-`(`*tmp*`, l_final_pred, value = `<dbl>`)
12. tibble (local) `<fn>`(`<vctrs___>`)
> # Hosmer-Lemeshow test
> HosmerLemeshowTest(pro$l_final_pred, pro$treatment, X = l_final_pred$x)
Error in cut.default(fit, breaks = brks, include.lowest = TRUE) :
'x' must be numeric
In addition: Warning message:
Unknown or uninitialised column: `l_final_pred`.
Thank you!!
I know it has something to do with the lengths of the variables (treatment has two missing values, so 864 observations, compared with 866 for age), but I can't quite work out how to fix it.
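A minimal sketch of one possible fix, assuming the two missing treatment values are the only problem: drop the incomplete rows first, fit with bare column names so the formula and the data argument agree, and pass the model matrix stored in the fitted object (l_final$x, not l_final_pred$x) to HosmerLemeshowTest(). The argument names below follow DescTools::HosmerLemeshowTest().
library(DescTools)
# keep only rows where both treatment and age are observed
pro_cc <- pro[complete.cases(pro[, c("treatment", "age")]), ]
# refit with bare column names so the predictions line up with pro_cc
l_final <- glm(treatment ~ age, family = binomial(link = "logit"),
               data = pro_cc, x = TRUE)
pro_cc$l_final_pred <- predict(l_final, type = "response")
# HL test: fitted probabilities, observed outcome, model matrix from the fit
HosmerLemeshowTest(fit = pro_cc$l_final_pred, obs = pro_cc$treatment,
                   X = l_final$x)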

Related

How can I compare 3 binary variables in R?

I'm looking at debris ingestion in gulls. Each gull is listed by row. Columns contain the sex (0 = male, 1 = female), whether they ate debris (0 = no, 1 = yes), and whether I found any number of other items in their stomach. For this problem I'd like to see whether sex and presence of debris influence the number of birds with shells in their stomach (0 = no shells, 1 = shells). Debris prevalence is likely overdispersed and zero-inflated, but I'm not sure that matters if I'm using it as a factor to evaluate shell prevalence. Shell prevalence might be overdispersed and zero-inflated as well.
I've plotted the data and want to test whether the differences seen in the plot are significant.
But when trying to run a zero-inflated negative binomial model I get many different errors depending on how I set it up.
library(aod)
library(MASS)
library(ggplot2)
library(gridExtra)
library(pscl)
library(boot)
library(reshape2)
mydata1 <- read.csv('D:/mp paper/analysis wkshts/stats files/FOdata.csv')
mydata1 <- within(mydata1, {
debris <- factor(debris)
sex <- factor(sex)
Shell_frags <- factor(Shell_frags)
})
summary(mydata1)
ggplot(mydata1, aes(Shell_frags, fill=debris)) +
stat_count() +
facet_grid(debris ~ sex, margins=TRUE, scales="free_y")
m1 <- zeroinfl((Shell_frags ~ sex + debris), data = mydata1, dist = "negbin", EM = TRUE)
summary(m1)
Error message:
Error in if (all(Y > 0)) stop("invalid dependent variable, minimum count is not zero") :
missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In model.response(mf, "numeric") :
using type = "numeric" with a factor response will be ignored
2: In Ops.factor(Y, 0) : ‘>’ not meaningful for factors
> summary(m1)
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'object' in selecting a method for function 'summary': object 'm1'
not found
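A minimal sketch of one possible fix, assuming the error comes from Shell_frags having been converted to a factor: zeroinfl() expects a numeric count response (that is what the all(Y > 0) check and the Ops.factor warning are complaining about), so leave the response numeric and convert only the predictors. Whether a zero-inflated negative binomial is appropriate for a 0/1 outcome is a separate modelling question.
library(pscl)
mydata1 <- read.csv("D:/mp paper/analysis wkshts/stats files/FOdata.csv")
mydata1 <- within(mydata1, {
  debris <- factor(debris)
  sex <- factor(sex)
  # Shell_frags is left numeric so zeroinfl() can treat it as a count
})
m1 <- zeroinfl(Shell_frags ~ sex + debris, data = mydata1,
               dist = "negbin", EM = TRUE)
summary(m1)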

Error in model.frame.default: variable lengths differ, R predict function

This is not a new question; I have seen several proposed solutions elsewhere and tried them, but none works, so I am asking here.
How can I fix this error? I am using R version 3.5.3 (2019-03-11).
Error in model.frame.default(data = ov_val, formula = Surv(time = ov_dev$futime, : variable lengths differ (found for 'rx')
Here is a reproducible example:
library(survival)
library(survminer)
library(dplyr)
# Create fake development dataset
ov_dev <- ovarian[1:13,]
# Create fake validation dataset
ov_val <- ovarian[13:26,]
# Run cox model
fit.coxph <- coxph(Surv(time = ov_dev$futime, event = ov_dev$fustat) ~ rx + resid.ds + age + ecog.ps, data = ov_dev)
summary(fit.coxph)
# Where error occurs
p <- log(predict(fit.coxph, newdata = ov_val, type = "expected"))
I think this has happened because you have used ov_dev$futime and ov_dev$fustat in your model specification rather than just futime and fustat. That means that when you come to predict, the model uses the ov_dev data for the dependent variable but ov_val for the independent variables, which are of different lengths (13 versus 14). Just remove the data-frame prefix and trust the data argument:
library(survival)
library(survminer)
library(dplyr)
# Create fake development dataset
ov_dev <- ovarian[1:13,]
# Create fake validation dataset
ov_val <- ovarian[13:26,]
# Run cox model
fit.coxph <- coxph(Surv(futime, fustat) ~ rx + resid.ds + age + ecog.ps,
data = ov_dev)
p <- log(predict(fit.coxph, newdata = ov_val, type = "expected"))
p
#> [1] 0.4272783 -0.1486577 -1.8988833 -1.1887086 -0.8849632 -1.3374428
#> [7] -1.2294725 -1.5021708 -0.3264792 0.5633839 -3.0457613 -2.2476071
#> [13] -1.6754877 -3.0691996
Created on 2020-08-19 by the reprex package (v0.3.0)

Predictions function of ROCR package gives Error: 'predictions' contains NA

I have been following the edX course The Analytics Edge and I am currently in the logistic regression section, the Framingham Heart Study part.
Here they use the prediction() function of the ROCR package to evaluate prediction accuracy with the threshold value set to 0.5. I have downloaded the .csv file from the edX course portal and written the exact same code, but I am getting the error that 'predictions' contains NA.
Here's the code:
framingham <- read.csv("framingham.csv")
library(caTools)
# framingham <- na.omit(framingham) # If I use this line the code works fine, but people work with missing data all the time, so it should work in that case too.
set.seed(1000)
split <- sample.split(framingham$TenYearCHD, SplitRatio = 0.65)
train <- subset(framingham, split == TRUE)
test <- subset(framingham, split == FALSE)
framinghamLog <- glm(TenYearCHD ~ ., data = train, family = binomial)
summary(framinghamLog)
predictTest <- predict(framinghamLog, type = "response", newdata = test)
table(test$TenYearCHD, predictTest > 0.5)
library(ROCR)
ROCRpred <- prediction(predictTest, test$TenYearCHD)
This is the error:
Error: 'predictions' contains NA.
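A minimal sketch of one way past the error, assuming the NAs come from test rows with missing predictor values: predict() returns NA for those rows, and ROCR::prediction() refuses NAs, so keep only the pairs where a prediction exists. Whether dropping those rows is acceptable is a modelling decision.
library(ROCR)
# keep only observations with a non-missing prediction and outcome
ok <- !is.na(predictTest) & !is.na(test$TenYearCHD)
ROCRpred <- prediction(predictTest[ok], test$TenYearCHD[ok])
# e.g. area under the ROC curve
performance(ROCRpred, "auc")@y.values[[1]]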

Error in eval(predvars, data, env) : object 'Customer Count' not found

I am trying to fit a random forest model on a dataset to predict a two-class variable. I have attached the code below, and it comes back with this error. The variable Customer Count is in the dataset, yet the error is still thrown.
This is for my predictive model. I have tried to reorganize the dataset so that Customer Count is not the first variable, and to trim the dataset in case its size was the issue.
# Load the dataset and explore
library(readxl)
rawData <- read_excel("StrippedTransformerModelData.xlsx")
View(rawData)
head(rawData)
str(rawData)
summary(rawData)
# Split into Train and Validation sets
# Training Set : Validation Set = 70 : 30 (random)
set.seed(100)
train <- sample(nrow(rawData), 0.7*nrow(rawData), replace = FALSE)
TrainSet <- rawData[train,]
ValidSet <- rawData[-train,]
summary(TrainSet)
summary(ValidSet)
# Create a Random Forest model with default parameters
model1 <- randomForest(data = TrainSet, Failure ~ ., ntree = 500, mtry = 6, importance = TRUE)
model1
Error in eval(predvars, data, env) : object 'Customer Count' not found.
The variable Customer Count is definitely in the dataset, and I don't know why it says it is not found.
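A minimal sketch of one possible fix, assuming the root cause is the space in the column name: Failure ~ . expands into bare variable names, and Customer Count is not a syntactically valid name, so the formula machinery cannot find it. Making the names syntactic (and loading randomForest explicitly, which the code above does not do) is one way around it.
library(readxl)
library(randomForest)
rawData <- read_excel("StrippedTransformerModelData.xlsx")
rawData <- as.data.frame(rawData)
names(rawData) <- make.names(names(rawData))  # "Customer Count" -> "Customer.Count"
rawData$Failure <- factor(rawData$Failure)    # classification target
set.seed(100)
train <- sample(nrow(rawData), 0.7 * nrow(rawData), replace = FALSE)
TrainSet <- rawData[train, ]
ValidSet <- rawData[-train, ]
model1 <- randomForest(Failure ~ ., data = TrainSet,
                       ntree = 500, mtry = 6, importance = TRUE)
model1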

Missing values for variable importance for neural network with the iml package in R

I am trying to get variable importance from a neural network with the iml package in R. The dependent variable is binary and the predictors are normalised. I get a missing value for every predictor. Here's the code I'm using:
library(mlr)
library(iml)
tsk = makeClassifTask(data = fullnorm, target = "churn")
rfa <- makeLearner("classif.nnet", predict.type = "prob") # neural network classifier with probability predictions
mod = train(rfa, tsk)
X = fullnorm[which(names(fullnorm) != "churn")]
Y <- as.numeric(as.character(fullnorm$churn))
predictor = iml::Predictor$new(mod, data = X, y = Y)
imp = FeatureImp$new(predictor, loss = "f1")
plot(imp)
I get no error message, only warnings from plot() that rows with missing values (i.e. the importance values for every predictor) were removed.
> plot(imp)
Warning messages:
1: Removed 15 rows containing missing values (geom_point).
2: Removed 15 rows containing missing values (geom_segment).
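A minimal, hedged sketch of two things worth checking, using a small simulated stand-in because fullnorm is not available here: first, that as.numeric(as.character(churn)) does not turn a non-numeric factor (e.g. "yes"/"no") into NA, which would propagate into every loss value and hence every importance value; second, whether a loss defined for class predictions, such as "ce" (classification error), gives non-missing importances where "f1" does not.
library(mlr)
library(iml)
# simulated stand-in with the same shape as fullnorm (binary target "churn")
set.seed(1)
fullnorm <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
fullnorm$churn <- factor(ifelse(fullnorm$x1 + rnorm(200) > 0, "yes", "no"))
tsk <- makeClassifTask(data = fullnorm, target = "churn")
rfa <- makeLearner("classif.nnet", predict.type = "prob")
mod <- train(rfa, tsk)
X <- fullnorm[which(names(fullnorm) != "churn")]
Y <- fullnorm$churn  # keep the factor; as.numeric(as.character()) on "yes"/"no" gives NA
predictor <- iml::Predictor$new(mod, data = X, y = Y)
imp <- FeatureImp$new(predictor, loss = "ce")  # classification-error loss
plot(imp)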
