I want to carry out logistic regression on a binary outcome variable (0 and 1). The structure of the data looks like this:
Classes ‘data.table’ and 'data.frame': 1044 obs. of 16 variables:
$ age : int 18 17 15 15 16 16 16 17 15 15 ...
$ Medu : Factor w/ 5 levels "0","1","2","3",..: 5 2 2 5 4 5 3 5 4 4 ...
$ Fedu : Factor w/ 5 levels "0","1","2","3",..: 5 2 2 3 4 4 3 5 3 5 ...
$ got_parent : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ traveltime : Factor w/ 4 levels "1","2","3","4": 2 1 1 1 1 1 1 2 1 1 ...
$ studytime : Factor w/ 4 levels "1","2","3","4": 2 2 2 3 2 2 2 2 2 2 ...
$ failures : int 0 0 3 0 0 0 0 0 0 0 ...
$ famrel : Factor w/ 5 levels "1","2","3","4",..: 4 5 4 3 4 5 4 4 4 5 ...
$ freetime : Factor w/ 5 levels "1","2","3","4",..: 3 3 3 2 3 4 4 1 2 5 ...
$ goout : Factor w/ 5 levels "1","2","3","4",..: 4 3 2 2 2 2 4 4 2 1 ...
$ Dalc : Factor w/ 5 levels "1","2","3","4",..: 1 1 2 1 1 1 1 1 1 1 ...
$ Walc : Factor w/ 5 levels "1","2","3","4",..: 1 1 3 1 2 2 1 1 1 1 ...
$ health : Factor w/ 5 levels "1","2","3","4",..: 3 3 3 5 5 5 3 1 1 5 ...
$ absences : int 6 4 10 2 4 10 0 6 0 0 ...
$ final_grade: int 6 6 10 15 10 15 11 6 19 15 ...
$ binge_drink: Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 1 ...
- attr(*, ".internal.selfref")=<externalptr>
Here binge_drink is the Y variable I want to predict. I split the data into training and test sets with a 70/30 ratio, but when I run the following code I get these warning messages:
m1 <- glm(binge_drink ~., data = data_train, family = "binomial")
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
And the summary output looks like this:
> summary(m1)
Call:
glm(formula = binge_drink ~ ., family = "binomial", data = data_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-6.443e-06 -2.908e-06 -2.315e-06 2.110e-08 1.441e-05
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.975e+01 5.800e+05 0.000 1.000
age 2.793e-02 1.018e+04 0.000 1.000
Medu1 2.662e+00 5.230e+05 0.000 1.000
Medu2 2.049e+00 5.229e+05 0.000 1.000
Medu3 2.335e+00 5.235e+05 0.000 1.000
Medu4 2.500e+00 5.233e+05 0.000 1.000
Fedu1 2.100e-01 1.442e+05 0.000 1.000
Fedu2 4.332e-01 1.457e+05 0.000 1.000
Fedu3 4.615e-01 1.469e+05 0.000 1.000
Fedu4 3.043e-01 1.477e+05 0.000 1.000
got_parent1 -4.496e-01 4.591e+04 0.000 1.000
traveltime2 1.706e-01 2.492e+04 0.000 1.000
traveltime3 5.142e-01 4.216e+04 0.000 1.000
traveltime4 3.188e-01 7.544e+04 0.000 1.000
studytime2 -3.050e-02 2.572e+04 0.000 1.000
studytime3 2.261e-02 3.546e+04 0.000 1.000
studytime4 -4.446e-01 6.649e+04 0.000 1.000
failures -1.537e-03 1.925e+04 0.000 1.000
famrel2 5.388e-01 8.514e+04 0.000 1.000
famrel3 3.928e-01 6.791e+04 0.000 1.000
famrel4 1.007e-01 6.622e+04 0.000 1.000
famrel5 1.544e-01 6.772e+04 0.000 1.000
freetime2 3.437e-01 5.248e+04 0.000 1.000
freetime3 3.095e-01 4.963e+04 0.000 1.000
freetime4 -7.933e-02 5.250e+04 0.000 1.000
freetime5 -3.947e-01 6.455e+04 0.000 1.000
goout2 -7.329e-02 4.856e+04 0.000 1.000
goout3 -1.290e-01 4.948e+04 0.000 1.000
goout4 8.735e-02 5.056e+04 0.000 1.000
goout5 3.687e-01 5.613e+04 0.000 1.000
Dalc2 5.186e+01 3.300e+04 0.002 0.999
Dalc3 5.050e+01 5.089e+04 0.001 0.999
Dalc4 5.030e+01 8.331e+04 0.001 1.000
Dalc5 4.882e+01 8.648e+04 0.001 1.000
Walc2 2.194e-01 3.208e+04 0.000 1.000
Walc3 6.827e-01 3.267e+04 0.000 1.000
Walc4 5.085e+01 3.812e+04 0.001 0.999
Walc5 4.912e+01 6.719e+04 0.001 0.999
health2 2.159e-01 4.387e+04 0.000 1.000
health3 6.133e-02 4.077e+04 0.000 1.000
health4 -2.094e-01 4.418e+04 0.000 1.000
health5 2.766e-01 3.684e+04 0.000 1.000
absences -4.326e-03 1.951e+03 0.000 1.000
final_grade 1.050e-02 3.322e+03 0.000 1.000
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 9.4833e+02 on 729 degrees of freedom
Residual deviance: 7.4180e-09 on 686 degrees of freedom
AIC: 88
Number of Fisher Scoring iterations: 25
I have tried creating the dummy variables for the categorical predictors myself, but the output is similar. Does anyone know why this happens?
Thank you.
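For reference, this pair of warnings, together with the enormous standard errors and the near-zero residual deviance, is the classic signature of (quasi-)complete separation: some combination of predictors perfectly predicts binge_drink in the training data. The outsized Dalc and Walc coefficients (around 50) suggest the alcohol variables are the culprits. A minimal check, assuming data_train is the 70% training split (a sketch, not a full diagnosis):
# Cross-tabulate the suspect predictors against the outcome;
# a zero cell in a row means that level determines the outcome.
table(data_train$Dalc, data_train$binge_drink)
table(data_train$Walc, data_train$binge_drink)
# One common workaround is Firth's penalized logistic regression,
# e.g. via the logistf package (assumed to be installed):
library(logistf)
m1_firth <- logistf(binge_drink ~ ., data = data_train)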
I am pretty new to R but OK with Spotfire. I found this difficult to do in Spotfire, and someone told me about the TERR tools in Spotfire that can run R code.
I am trying to find the p10, p50 and p90 of multi-category time series data. An example of the data looks like this:
Category Time Rate
1 0
1 0.0104
1 0.1354 0.002
1 0.2604 0.139
1 0.3854 0.280
1 0.5104 0.299
1 0.6354 0.313
1 0.7604 0.403
1 0.8854 0.429
1 1.0104 0.408
1 1.1354 0.415
1 1.2604 0.482
1 1.3854 0.484
2 0
2 0.0104
2 0.1354
2 0.2604
2 0.3854 0.064
2 0.5104 0.166
2 0.6354 0.148
2 0.7604 0.141
2 0.8854 0.254
2 1.0104 0.286
2 1.1354 0.292
2 1.2604 0.296
2 1.3854 0.310
2 1.5104 0.304
2 1.6354 0.303
2 1.7604 0.301
2 1.8854 0.300
2 2.0104 0.319
2 2.1354 0.330
2 2.2604 0.330
2 2.3854 0.331
2 2.5104 0.332
2 2.6354 0.334
2 2.7604 0.330
2 2.8854 0.326
2 3.0104 0.325
3 0
3 0.0104
3 0.1354
3 0.2604 0.010
3 0.3854 0.021
3 0.5104 0.021
3 0.6354 0.021
3 0.7604 0.023
3 0.8854 0.026
3 1.0104 0.028
3 1.1354 0.029
3 1.2604 0.027
3 1.3854 0.033
3 1.5104 0.035
3 1.6354 0.034
In the end, I want to calculate additional columns with the p10, p50 and p90 values, as in the attached plot.
[Plot of Rate vs. Time: p10 and p90 are the dashed lines, p50 is the solid red line.]
Thanks
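One possible starting point in R (a sketch, assuming the data sits in a data frame named df, a hypothetical name, with the Category, Time and Rate columns shown above):
library(dplyr)
# Add p10/p50/p90 columns computed across categories at each time point;
# na.rm = TRUE skips the blank Rate cells, assuming they are read as NA.
df <- df %>%
  group_by(Time) %>%
  mutate(p10 = quantile(Rate, 0.10, na.rm = TRUE),
         p50 = quantile(Rate, 0.50, na.rm = TRUE),
         p90 = quantile(Rate, 0.90, na.rm = TRUE)) %>%
  ungroup()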
I run an lm() in R and this is the results of the summary:
Multiple R-squared: 0.8918, Adjusted R-squared: 0.8917
F-statistic: 9416 on 9 and 10283 DF, p-value: < 2.2e-16
and it seems to be a good model, but if I calculate the R^2 manually I obtain this:
model <- lm(S ~ 0 + C + HA + L1 + L2, data = train)
pred <- predict(model, train)
rss <- sum((model$fitted.values - train$S)^2)
tss <- sum((train$S - mean(train$S))^2)
1 - rss/tss
## [1] 0.247238
# rSquared() is from the miscTools package
rSquared(train$S, (train$S - model$fitted.values))
##           [,1]
## [1,] 0.247238
What's wrong?
str(train[, c('S', 'C', 'HA', 'L1', 'L2')])
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 10292 obs. of 5 variables:
$ S : num 19 18 9 12 12 8 21 24 9 8 ...
$ C : Factor w/ 6 levels "D","E","F","I",..: 4 4 4 4 4 4 4 4 4 4 ...
$ HA : Factor w/ 2 levels "A","H": 1 2 1 1 2 1 2 2 1 2 ...
$ L1 : num 0.99 1.41 1.46 1.43 1.12 1.08 1.4 1.45 0.85 1.44 ...
$ L2 : num 1.31 0.63 1.16 1.15 1.29 1.31 0.7 0.65 1.35 0.59 ...
You are running a model without an intercept (the ~ 0 on the right-hand side of your formula). For such models the usual calculation of R^2 is problematic and produces misleading values. This post explains it very well: https://stats.stackexchange.com/a/26205/99681
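Concretely, for a no-intercept fit summary.lm() computes R^2 against the uncentered total sum of squares, while the manual calculation uses the centered one, so the two numbers answer different questions:
# What summary() reports for a ~ 0 model (uncentered TSS):
1 - rss/sum(train$S^2)
# The usual centered definition used in the manual calculation:
1 - rss/sum((train$S - mean(train$S))^2)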
Let's say I have these three vectors:
time <- c(306,455,1010,210,883,1022,310,361,218,166)
status <- c(0,1,0,1,0,0,1,0,1,1)
gender <- c("Male","Male","Female","Male","Male","Male","Female","Female","Female","Female")
and I want to do a survival analysis and get the summary:
library(survival)  # Surv() and survfit() come from the survival package
A <- survfit(Surv(time, status) ~ gender)
summary(A, censored = TRUE)
The output would be like this:
> summary(A, censored = TRUE)
Call: survfit(formula = Surv(time, status) ~ gender)
gender=Female
time n.risk n.event survival std.err lower 95% CI upper 95% CI
166 5 1 0.8 0.179 0.516 1
218 4 1 0.6 0.219 0.293 1
310 3 1 0.4 0.219 0.137 1
361 2 0 0.4 0.219 0.137 1
1010 1 0 0.4 0.219 0.137 1
gender=Male
time n.risk n.event survival std.err lower 95% CI upper 95% CI
210 5 1 0.800 0.179 0.516 1
306 4 0 0.800 0.179 0.516 1
455 3 1 0.533 0.248 0.214 1
883 2 0 0.533 0.248 0.214 1
1022 1 0 0.533 0.248 0.214 1
My question is: is there any way I can split the output into Male and Female parts? For example:
output_Female <- ?????
output_Female
time n.risk n.event survival std.err lower 95% CI upper 95% CI
166 5 1 0.8 0.179 0.516 1
218 4 1 0.6 0.219 0.293 1
310 3 1 0.4 0.219 0.137 1
361 2 0 0.4 0.219 0.137 1
1010 1 0 0.4 0.219 0.137 1
output_Male <- ?????
output_Male
time n.risk n.event survival std.err lower 95% CI upper 95% CI
210 5 1 0.800 0.179 0.516 1
306 4 0 0.800 0.179 0.516 1
455 3 1 0.533 0.248 0.214 1
883 2 0 0.533 0.248 0.214 1
1022 1 0 0.533 0.248 0.214 1
Here is an option using tidy from the broom package:
library(broom)
library(dplyr)
tidy(A, censored = TRUE) %>%
split(.$strata)
Or with base R
# Capture the printed summary as a character vector
txt <- capture.output(summary(A, censored = TRUE))
# Tag each line with the stratum block it belongs to
ind <- cumsum(grepl("gender=", txt))
# Split the lines by stratum and re-read each block as a table
lst <- lapply(split(txt[ind > 0], ind[ind > 0]), function(x)
  read.table(text = x[-(1:2)], header = FALSE))
# Rebuild the column names from the header line of the captured output
nm1 <- scan(text = gsub("\\s+[0-9]|%\\s+", ".", txt[4]), quiet = TRUE, what = "")
lst <- lapply(lst, setNames, nm1)
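Alternatively, survfit objects can be subset by stratum with [, which gives the split summaries directly (a sketch; the stratum order follows the sorted factor levels, so Female comes first here):
output_Female <- summary(A[1], censored = TRUE)
output_Male <- summary(A[2], censored = TRUE)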
I would like to train a k-NN model using caret::train to classify digits (the classic problem), applying PCA to the features before training.
control = trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
p = 0.9)
knnFit = train(x = trainingDigit,
y = label,
metric = "Accuracy",
method = "knn",
trControl = control,
preProcess = "pca")
I don't understand how to represent my data for training; as written, it fails with this error:
Error in sample.int(length(x), size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'
My training data is represented as follows (.RData file):
List of 10
$ : num [1:400, 1:324] 0.934 0.979 0.877 0.853 0.945 ...
$ : num [1:400, 1:324] 0.807 0.98 0.803 0.978 0.969 ...
$ : num [1:400, 1:324] 0.745 0.883 0.776 0.825 0.922 ...
$ : num [1:400, 1:324] 0.892 0.817 0.835 0.84 0.842 ...
$ : num [1:400, 1:324] 0.752 0.859 0.881 0.884 0.855 ...
$ : num [1:400, 1:324] 0.798 0.969 0.925 0.921 0.873 ...
$ : num [1:400, 1:324] 0.964 0.93 0.97 0.857 0.926 ...
$ : num [1:400, 1:324] 0.922 0.939 0.958 0.946 0.867 ...
$ : num [1:400, 1:324] 0.969 0.947 0.916 0.861 0.86 ...
$ : num [1:400, 1:324] 0.922 0.933 0.978 0.968 0.971 ...
The labels are as follows (.RData file):
List of 10
$ : num [1:400] 0 0 0 0 0 0 0 0 0 0 ...
$ : num [1:400] 1 1 1 1 1 1 1 1 1 1 ...
$ : num [1:400] 2 2 2 2 2 2 2 2 2 2 ...
$ : num [1:400] 3 3 3 3 3 3 3 3 3 3 ...
$ : num [1:400] 4 4 4 4 4 4 4 4 4 4 ...
$ : num [1:400] 5 5 5 5 5 5 5 5 5 5 ...
$ : num [1:400] 6 6 6 6 6 6 6 6 6 6 ...
$ : num [1:400] 7 7 7 7 7 7 7 7 7 7 ...
$ : num [1:400] 8 8 8 8 8 8 8 8 8 8 ...
$ : num [1:400] 9 9 9 9 9 9 9 9 9 9 ...
The problem is in your representation of the data. Try this before you start training:
# Flatten the list of per-digit label vectors into a single factor
label <- factor(c(label, recursive = TRUE))
# Stack the ten feature matrices into one data frame (one row per image)
trainingDigit <- data.frame(do.call(rbind, trainingDigit))
You need to massage your data into a data.frame or data.frame-like format, with one row per observation: the features as columns and the outcomes as a single separate vector.
Also, if you want to do classification, not regression, your outcomes need to be a factor.
To be clear, I tried running the training code as follows, and it works just fine:
library(caret)
load("data.RData")
load("testClass_new.RData")
label <- factor(c(label, recursive = TRUE))
trainingDigit <- data.frame(do.call(rbind, trainingDigit))
control <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
p = 0.9)
knnFit <- train(x = trainingDigit,
y = label,
metric = "Accuracy",
method = "knn",
trControl = control,
preProcess = "pca")
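Once it trains, the result can be sanity-checked in the usual caret way (a quick sketch reusing the objects above):
# Resampled accuracy for each candidate value of k
print(knnFit)
# Predicted classes for the training features
head(predict(knnFit, newdata = trainingDigit))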