R-squared in linear regression using R [duplicate]

I ran an lm() in R and this is the summary output:
Multiple R-squared: 0.8918, Adjusted R-squared: 0.8917
F-statistic: 9416 on 9 and 10283 DF, p-value: < 2.2e-16
It seems to be a good model, but if I calculate the R^2 manually I obtain this:
model <- lm(S ~ 0 + C + HA + L1 + L2, data = train)
pred <- predict(model, train)
rss <- sum((model$fitted.values - train$S)^2)
tss <- sum((train$S - mean(train$S))^2)
1 - rss/tss
## [1] 0.247238
rSquared(train$S, train$S - model$fitted.values)
## [,1]
## [1,] 0.247238
What's wrong?
str(train[, c('S', 'C', 'HA', 'L1', 'L2')])
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 10292 obs. of 5 variables:
$ S : num 19 18 9 12 12 8 21 24 9 8 ...
$ C : Factor w/ 6 levels "D","E","F","I",..: 4 4 4 4 4 4 4 4 4 4 ...
$ HA : Factor w/ 2 levels "A","H": 1 2 1 1 2 1 2 2 1 2 ...
$ L1 : num 0.99 1.41 1.46 1.43 1.12 1.08 1.4 1.45 0.85 1.44 ...
$ L2 : num 1.31 0.63 1.16 1.15 1.29 1.31 0.7 0.65 1.35 0.59 ...

You are running a model without an intercept (the ~ 0 on the right-hand side of your formula). For such models the usual calculation of R^2 is problematic and will produce misleading values. This post explains it very well: https://stats.stackexchange.com/a/26205/99681
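In short: when the formula has no intercept, summary.lm() computes the total sum of squares around zero rather than around mean(y), so the reported R^2 is not comparable to the usual definition. A minimal sketch of the difference, using simulated data (the names x, y and fit are made up for this example):
set.seed(1)
x <- runif(100, 10, 20)
y <- 5 + 2 * x + rnorm(100)
fit <- lm(y ~ 0 + x)            # zero-intercept model, as in the question
rss <- sum(residuals(fit)^2)
# TSS around zero: what summary.lm() uses when there is no intercept
1 - rss / sum(y^2)              # matches summary(fit)$r.squared
# TSS around the mean: the "manual" R^2 from the question
1 - rss / sum((y - mean(y))^2)  # lower, and can even be negative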

Related

Logistic Regression algorithm did not converge

I want to carry out logistic regression on a binary outcome variable (0 and 1). A summary of the data looks like this:
Classes ‘data.table’ and 'data.frame': 1044 obs. of 16 variables:
$ age : int 18 17 15 15 16 16 16 17 15 15 ...
$ Medu : Factor w/ 5 levels "0","1","2","3",..: 5 2 2 5 4 5 3 5 4 4 ...
$ Fedu : Factor w/ 5 levels "0","1","2","3",..: 5 2 2 3 4 4 3 5 3 5 ...
$ got_parent : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ traveltime : Factor w/ 4 levels "1","2","3","4": 2 1 1 1 1 1 1 2 1 1 ...
$ studytime : Factor w/ 4 levels "1","2","3","4": 2 2 2 3 2 2 2 2 2 2 ...
$ failures : int 0 0 3 0 0 0 0 0 0 0 ...
$ famrel : Factor w/ 5 levels "1","2","3","4",..: 4 5 4 3 4 5 4 4 4 5 ...
$ freetime : Factor w/ 5 levels "1","2","3","4",..: 3 3 3 2 3 4 4 1 2 5 ...
$ goout : Factor w/ 5 levels "1","2","3","4",..: 4 3 2 2 2 2 4 4 2 1 ...
$ Dalc : Factor w/ 5 levels "1","2","3","4",..: 1 1 2 1 1 1 1 1 1 1 ...
$ Walc : Factor w/ 5 levels "1","2","3","4",..: 1 1 3 1 2 2 1 1 1 1 ...
$ health : Factor w/ 5 levels "1","2","3","4",..: 3 3 3 5 5 5 3 1 1 5 ...
$ absences : int 6 4 10 2 4 10 0 6 0 0 ...
$ final_grade: int 6 6 10 15 10 15 11 6 19 15 ...
$ binge_drink: Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 1 ...
- attr(*, ".internal.selfref")=<externalptr>
Here binge_drink is the outcome (Y) variable I want to predict. I split the data into training and test sets with a 70% split ratio, but when I run the following code I get these warning messages:
m1 <- glm(binge_drink ~., data = data_train, family = "binomial")
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
And the summary output looks like this:
> summary(m1)
Call:
glm(formula = binge_drink ~ ., family = "binomial", data = data_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-6.443e-06 -2.908e-06 -2.315e-06 2.110e-08 1.441e-05
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.975e+01 5.800e+05 0.000 1.000
age 2.793e-02 1.018e+04 0.000 1.000
Medu1 2.662e+00 5.230e+05 0.000 1.000
Medu2 2.049e+00 5.229e+05 0.000 1.000
Medu3 2.335e+00 5.235e+05 0.000 1.000
Medu4 2.500e+00 5.233e+05 0.000 1.000
Fedu1 2.100e-01 1.442e+05 0.000 1.000
Fedu2 4.332e-01 1.457e+05 0.000 1.000
Fedu3 4.615e-01 1.469e+05 0.000 1.000
Fedu4 3.043e-01 1.477e+05 0.000 1.000
got_parent1 -4.496e-01 4.591e+04 0.000 1.000
traveltime2 1.706e-01 2.492e+04 0.000 1.000
traveltime3 5.142e-01 4.216e+04 0.000 1.000
traveltime4 3.188e-01 7.544e+04 0.000 1.000
studytime2 -3.050e-02 2.572e+04 0.000 1.000
studytime3 2.261e-02 3.546e+04 0.000 1.000
studytime4 -4.446e-01 6.649e+04 0.000 1.000
failures -1.537e-03 1.925e+04 0.000 1.000
famrel2 5.388e-01 8.514e+04 0.000 1.000
famrel3 3.928e-01 6.791e+04 0.000 1.000
famrel4 1.007e-01 6.622e+04 0.000 1.000
famrel5 1.544e-01 6.772e+04 0.000 1.000
freetime2 3.437e-01 5.248e+04 0.000 1.000
freetime3 3.095e-01 4.963e+04 0.000 1.000
freetime4 -7.933e-02 5.250e+04 0.000 1.000
freetime5 -3.947e-01 6.455e+04 0.000 1.000
goout2 -7.329e-02 4.856e+04 0.000 1.000
goout3 -1.290e-01 4.948e+04 0.000 1.000
goout4 8.735e-02 5.056e+04 0.000 1.000
goout5 3.687e-01 5.613e+04 0.000 1.000
Dalc2 5.186e+01 3.300e+04 0.002 0.999
Dalc3 5.050e+01 5.089e+04 0.001 0.999
Dalc4 5.030e+01 8.331e+04 0.001 1.000
Dalc5 4.882e+01 8.648e+04 0.001 1.000
Walc2 2.194e-01 3.208e+04 0.000 1.000
Walc3 6.827e-01 3.267e+04 0.000 1.000
Walc4 5.085e+01 3.812e+04 0.001 0.999
Walc5 4.912e+01 6.719e+04 0.001 0.999
health2 2.159e-01 4.387e+04 0.000 1.000
health3 6.133e-02 4.077e+04 0.000 1.000
health4 -2.094e-01 4.418e+04 0.000 1.000
health5 2.766e-01 3.684e+04 0.000 1.000
absences -4.326e-03 1.951e+03 0.000 1.000
final_grade 1.050e-02 3.322e+03 0.000 1.000
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 9.4833e+02 on 729 degrees of freedom
Residual deviance: 7.4180e-09 on 686 degrees of freedom
AIC: 88
Number of Fisher Scoring iterations: 25
I have tried creating dummy variables for the categorical predictors, but the output is similar. Does anyone know why?
Thank you.
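These two warnings are the classic symptom of complete (or quasi-complete) separation: some predictor, or combination of predictors, perfectly predicts the outcome, so the maximum-likelihood estimates diverge. That is consistent with the huge coefficients and standard errors, the p-values of 1, and the near-zero residual deviance in the summary above. A quick check is to cross-tabulate suspect predictors against the outcome and look for empty cells; the sketch below uses the variable names from the question, and the detectseparation package is an assumed addition, not something the question used:
with(data_train, table(Dalc, binge_drink))   # empty cells point to separation
with(data_train, table(Walc, binge_drink))
# install.packages("detectseparation")
library(detectseparation)
glm(binge_drink ~ ., data = data_train, family = binomial,
    method = "detect_separation")            # reports whether the MLE is infinite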

Mclust : NAs in foreign function call (arg 13)

While trying to determine the optimal number of clusters for a k-means, I tried to use the mclust package with the following code:
d_clust <- Mclust(df,
                  G = 1:10,
                  modelNames = mclust.options("emModelNames"))
d_clust$BIC
df is a data frame of 132656 obs. of 19 variables. The data is scaled and there are no missing values (no NA/NaN/Inf values; I checked with is.na and is.finite). Also, my variables are all in numeric format thanks to as.numeric.
However, after running the code, the screen displays "fitting" with a progress bar that goes up to 11%, and then after a moment I get the error message:
NAs in foreign function call (arg 13)
Does anyone know why I get this type of error?
EDIT
Output of str(df) (I renamed the variables for confidentiality reasons):
'data.frame': 132656 obs. of 19 variables:
$ X1: num 0.5 1 1 1 0.5 1 1 1 1 1 ...
$ X2: num 0.714 0.286 1 0.857 0.286 ...
$ X3: num 0.667 1 0.667 0.667 0.667 ...
$ X4: num 0.714 0.429 1 0.714 0.429 ...
$ X5: num 0.667 0.333 1 0.667 0.333 ...
$ X6: num 0.5 0.25 1 0.5 0.25 0.25 0 0.5 0.5 0.25 ...
$ X7: num 0.667 0.667 0.667 0.667 0.667 ...
$ X8: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
$ X9: num 0.667 0 0.667 0.333 0 ...
$ X10: num 1 0.833 1 1 1 ...
$ X11: num 1 0.75 1 1 1 1 1 1 1 1 ...
$ X12: num 1 1 1 0.8 1 1 1 1 1 1 ...
$ X13: num 0.5 0.75 0.75 0.5 0.75 0.25 0.75 0.5 0.5 0.5 ...
$ X14: num 0.75 0.75 0.75 1 0.75 0.75 0.75 1 0.75 0.75 ...
$ X15: num 1 0 0.5 1 1 1 0.75 1 0.5 1 ...
$ X16: num 1 0.333 0.667 0.833 0.833 ...
$ X17: num 1 1 1 1 1 1 1 1 1 1 ...
$ X18: num 0.00157 0.000438 0.001059 0.000879 0.004919 ...
$ X19: num 0.5 0.125 1 0.625 0.125 0.125 0.125 1 0.5 0.25 ...

Data training with R where data is preprocessed into PCA components?

I would like to train a kNN classifier using caret::train to classify digits (the classic problem), applying a PCA to the features before training.
control <- trainControl(method = "repeatedcv",
                        number = 10,
                        repeats = 5,
                        p = 0.9)
knnFit <- train(x = trainingDigit,
                y = label,
                metric = "Accuracy",
                method = "knn",
                trControl = control,
                preProcess = "pca")
I don't understand how to represent my data for training, which results in an error:
Error in sample.int(length(x), size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'
My training data is represented as follows (.RData file):
List of 10
$ : num [1:400, 1:324] 0.934 0.979 0.877 0.853 0.945 ...
$ : num [1:400, 1:324] 0.807 0.98 0.803 0.978 0.969 ...
$ : num [1:400, 1:324] 0.745 0.883 0.776 0.825 0.922 ...
$ : num [1:400, 1:324] 0.892 0.817 0.835 0.84 0.842 ...
$ : num [1:400, 1:324] 0.752 0.859 0.881 0.884 0.855 ...
$ : num [1:400, 1:324] 0.798 0.969 0.925 0.921 0.873 ...
$ : num [1:400, 1:324] 0.964 0.93 0.97 0.857 0.926 ...
$ : num [1:400, 1:324] 0.922 0.939 0.958 0.946 0.867 ...
$ : num [1:400, 1:324] 0.969 0.947 0.916 0.861 0.86 ...
$ : num [1:400, 1:324] 0.922 0.933 0.978 0.968 0.971 ...
Labels are as follows (.RData file):
List of 10
$ : num [1:400] 0 0 0 0 0 0 0 0 0 0 ...
$ : num [1:400] 1 1 1 1 1 1 1 1 1 1 ...
$ : num [1:400] 2 2 2 2 2 2 2 2 2 2 ...
$ : num [1:400] 3 3 3 3 3 3 3 3 3 3 ...
$ : num [1:400] 4 4 4 4 4 4 4 4 4 4 ...
$ : num [1:400] 5 5 5 5 5 5 5 5 5 5 ...
$ : num [1:400] 6 6 6 6 6 6 6 6 6 6 ...
$ : num [1:400] 7 7 7 7 7 7 7 7 7 7 ...
$ : num [1:400] 8 8 8 8 8 8 8 8 8 8 ...
$ : num [1:400] 9 9 9 9 9 9 9 9 9 9 ...
The problem is in the representation of your data. Try this before you start training:
label <- factor(c(label, recursive = TRUE))
trainingDigit <- data.frame(do.call(rbind, trainingDigit))
You need to massage your data into a data.frame (or something data.frame-like) of features, one row per observation, with a separate vector holding the outcome for each row.
Also, if you want to do classification rather than regression, your outcomes need to be a factor.
To be clear, I tried to run the training code as follows, and it works just fine.
library(caret)
load("data.RData")
load("testClass_new.RData")
label <- factor(c(label, recursive = TRUE))
trainingDigit <- data.frame(do.call(rbind, trainingDigit))
control <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
p = 0.9)
knnFit <- train(x = trainingDigit,
y = label,
metric = "Accuracy",
method = "knn",
trControl = control,
preProcess = "pca")
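After the reshaping, a quick sanity check (the dimensions follow from the str() output above: ten 400 x 324 matrices):
dim(trainingDigit)   # 4000 rows (10 classes x 400 samples), 324 feature columns
length(label)        # 4000, one label per row
levels(label)        # "0" "1" "2" ... "9"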

R Plyr Rename multiple columns in list of dataframes

I have only just discovered plyr, and it has saved me a ton of lines combining multiple data frames, which is great. But I have another renaming problem I cannot fathom.
I have a list containing a number of data frames (this is a subset; there are actually 108 in the real list).
> str(mydata)
List of 4
$ C11:'data.frame': 8 obs. of 3 variables:
..$ X : Factor w/ 8 levels "n >= 1","n >= 2",..: 1 2 3 4 5 6 7 8
..$ n.ENSEMBLE.COVERAGE: num [1:8] 1 1 1 1 0.96 0.91 0.74 0.5
..$ n.ENSEMBLE.RECALL : num [1:8] 0.88 0.88 0.88 0.88 0.9 0.91 0.94 0.95
$ C12:'data.frame': 8 obs. of 3 variables:
..$ X : Factor w/ 8 levels "n >= 1","n >= 2",..: 1 2 3 4 5 6 7 8
..$ n.ENSEMBLE.COVERAGE: num [1:8] 1 1 1 1 0.96 0.89 0.86 0.72
..$ n.ENSEMBLE.RECALL : num [1:8] 0.91 0.91 0.91 0.91 0.93 0.96 0.97 0.98
$ C13:'data.frame': 8 obs. of 3 variables:
..$ X : Factor w/ 8 levels "n >= 1","n >= 2",..: 1 2 3 4 5 6 7 8
..$ n.ENSEMBLE.COVERAGE: num [1:8] 1 1 1 1 0.94 0.79 0.65 0.46
..$ n.ENSEMBLE.RECALL : num [1:8] 0.85 0.85 0.85 0.85 0.88 0.9 0.92 0.91
$ C14:'data.frame': 8 obs. of 3 variables:
..$ X : Factor w/ 8 levels "n >= 1","n >= 2",..: 1 2 3 4 5 6 7 8
..$ n.ENSEMBLE.COVERAGE: num [1:8] 1 1 1 1 0.98 0.95 0.88 0.74
..$ n.ENSEMBLE.RECALL : num [1:8] 0.91 0.91 0.91 0.91 0.92 0.94 0.95 0.98
What I really want to achieve is for each data frame's column names to be prefixed with the name of that data frame. So in the example the columns would be:
C11.X, C11.n.ENSEMBLE.COVERAGE & C11.n.ENSEMBLE.RECALL
C12.X, C12.n.ENSEMBLE.COVERAGE & C12.n.ENSEMBLE.RECALL
C13.X, C13.n.ENSEMBLE.COVERAGE & C13.n.ENSEMBLE.RECALL
C14.X, C14.n.ENSEMBLE.COVERAGE & C14.n.ENSEMBLE.RECALL
Can anyone suggest an elegant approach to renaming columns like this?
Here's a reproducible example using the iris data set:
# produce a named list of data.frames as sample data:
dflist <- split(iris, iris$Species)
# store the list element names:
n <- names(dflist)
# rename the elements:
Map(function(df, vec) setNames(df, paste(vec, names(df), sep = ".")), dflist, n)
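To verify, assign the result and inspect one element (output shown for the iris example):
out <- Map(function(df, vec) setNames(df, paste(vec, names(df), sep = ".")),
           dflist, n)
names(out$setosa)
## [1] "setosa.Sepal.Length" "setosa.Sepal.Width"  "setosa.Petal.Length"
## [4] "setosa.Petal.Width"  "setosa.Species"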
