How to only print (adjusted) R-squared of regression model? - r

I am a beginner with R. I have a data set on air pollution. The columns are site, measured concentration and 80 variables (v1-v80) that might influence the concentration.
I want to make a model with forward stepwise regression based on R-squared/adj with my own code (so I do not want to use something like step() or regsubset()). The dependent variable is concentration and the variables v1-v80 as independent variables. I wrote the following code for the first step (data set is simplified):
site concentration v1 v2 v3
1 1 -0.84085548 1.7114409 -0.2857736 -1.0803926
2 2 1.38435934 -0.6029080 0.1381082 -0.1575344
3 3 -1.25549186 -0.4721664 1.2276303 -1.0717600
for (j in names(df)){
model <- lm(concentration ~ df[[j]], data = df)
print(j)
print(summary(model))
}
This works well, but I am only interested in R-squared and adjusted R-squared. I tried to only have (adjusted) R-squared printed with:
for (j in names(df)){
model <- lm(concentration ~ df[[j]], data = df)
print(j)
print(summary(model$r.squared))
print(summary(model$adj.r.squared))
}
But then I get as output (this is only a part):
[1] "v1"
Length Class Mode
0 NULL NULL
Length Class Mode
0 NULL NULL
[1] "v2"
Length Class Mode
0 NULL NULL
Length Class Mode
0 NULL NULL
Etcetera.
How can I get as output only the name of the relevant variable and (adjusted) R-squared for every model that is produced in the for-loop?
Thanks!

One option is the broom package: glance() returns the model's summary statistics as a one-row data frame, and for lm models the first two columns are r.squared and adj.r.squared.
library(broom)
glance(model)[c(1,2)]
Input = ("site concentration v1 v2 v3
1 1 -0.84085548 1.7114409 -0.2857736 -1.0803926
2 2 1.38435934 -0.6029080 0.1381082 -0.1575344
3 3 -1.25549186 -0.4721664 1.2276303 -1.0717600")
df = read.table(textConnection(Input),header=TRUE)
for (j in names(df)){
model <- lm(concentration ~ df[[j]], data = df)
print(j)
print(glance(model)[c(1,2)])
}
[1] "site"
r.squared adj.r.squared
1 0.02132635 -0.9573473
[1] "concentration"
r.squared adj.r.squared
1 1 1
[1] "v1"
r.squared adj.r.squared
1 0.1717716 -0.6564568
[1] "v2"
r.squared adj.r.squared
1 0.1482473 -0.7035055
[1] "v3"
r.squared adj.r.squared
1 0.9762587 0.9525174
Warning message:
In stats::summary.lm(x) :
essentially perfect fit: summary may be unreliable
Using base R: the $ extraction has to be applied to summary(model), not inside it, because r.squared and adj.r.squared are components of the summary object, not of the lm object itself (which is why model$r.squared was NULL in your attempt).
summary(model)$adj.r.squared
summary(model)$r.squared
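Putting the pieces together in base R, here is a minimal sketch (assuming df is the data frame from the question) that collects the values in a data frame and skips the site and concentration columns so only the candidate predictors are tested:
results <- data.frame()
for (j in setdiff(names(df), c("site", "concentration"))) {
  model <- lm(concentration ~ df[[j]], data = df)
  s <- summary(model)
  results <- rbind(results,
                   data.frame(variable = j,
                              r.squared = s$r.squared,
                              adj.r.squared = s$adj.r.squared))
}
results  # one row per candidate variable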

Related

Set intercept to zero when using predict.glm

How do I remove the intercept from the prediction when using predict.glm? I'm not talking about the model itself, just the prediction.
For example, I want to get the difference and standard error between x=1 and x=3
I tried putting newdata=list(x=2), intercept = NULL when using predict.glm and it doesn't work
So for example:
m <- glm(speed ~ dist, data=cars, family=gaussian(link="identity"))
prediction <- predict.glm(m, newdata=list(dist=c(2)), type="response", se.fit=T, intercept=NULL)
I'm not sure if this is somehow implemented in predict, but you could use the following trick.
Add a manual intercept column (i.e. a vector of 1s) to the data and use it in the model, while adding 0 to the RHS of the formula (to remove the "automatic" intercept).
cars$intercept <- 1L
m <- glm(speed ~ 0 + intercept + dist, family=gaussian, data=cars)
This gives us an intercept column in the model.frame, internally used by predict,
model.frame(m)
# speed intercept dist
# 1 4 1 2
# 2 4 1 10
# 3 7 1 4
# 4 7 1 22
# ...
which allows us to set it to an arbitrary value such as zero.
predict.glm(m, newdata=list(dist=2, intercept=0), type="response", se.fit=TRUE)
# $fit
# 1
# 0.3311351
#
# $se.fit
# [1] 0.03498896
#
# $residual.scale
# [1] 3.155753
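As a quick sanity check (the value matches the $fit shown above): with the manual intercept zeroed out, the prediction is just the dist coefficient times the new dist value.
coef(m)["dist"] * 2
#      dist
# 0.3311351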

Bootstrap method for mixed glm zero-inflated model

I'd like to bootstrap a mixed GLM zero-inflated model (m_F) fitted with the glmmTMB package, but whether I use coef or fixef to extract the coefficients, I always get the error:
Error in bres[i, ] <- coef(bfit) :
incorrect number of subscripts on matrix
My example:
library(glmmTMB)
library(boot)
my.ds <- read.csv("https://raw.githubusercontent.com/Leprechault/trash/main/ds.desenvol.csv")
str(my.ds)
# 'data.frame': 400 obs. of 4 variables:
# $ temp : num 0 0 0 0 0 0 0 0 0 0 ...
# $ storage : int 5 5 5 5 5 5 5 5 5 5 ...
# $ rep : chr "r1" "r2" "r3" "r4" ...
# $ development: int 0 23 22 27 24 25 24 22 0 22 ...
# Fit a GLM mixed Hurdle (zero-inflated) log-link Gamma model
m_F <- glmmTMB(development ~ poly(temp,2) + (1 | storage), data = my.ds,
family = ziGamma(link = "log"),
ziformula = ~ 1)
summary(m_F)
# Create a bootstrap approach
nboot <- 1000
bres <- matrix(NA,nrow=nboot,
ncol=length(coef(m_F)),
dimnames=list(rep=seq(nboot),
coef=names(coef(m_F))))
set.seed(1000)
bootsize <- 100
for (i in seq(nboot)) {
bdat <- my.ds[sample(nrow(my.ds),size=bootsize,replace=TRUE),]
bfit <- update(m_F, data=bdat) ## refit with new data
bres[i,] <- coef(bfit)
}
Please, any help with it?
My answer is somewhat similar to @RuiBarradas's, but closer to your original code. The main point is that coef() doesn't do what you think; (1) the convention (set originally by the nlme package) is that for mixed models coef() returns a matrix (or list of matrices) of group-level coefficients, while fixef() returns the fixed-effect (population-level) coefficients; (2) for glmmTMB, fixef() returns a list of fixed-effect vectors for the conditional, zero-inflation, and dispersion models (unlist() collapses this back to a vector with concatenated names).
The other point to keep in mind is that bootstrapping at the level of individual observations may not be sensible for a data set with grouping structure (you can bootstrap at the group level, or the within-group level, or both; you can bootstrap residuals (if you have a linear model - this won't work for GLMMs with count data); you can also use lme4::bootMer to do parametric bootstrapping, which is pretty much the only alternative when you have GLMMs with crossed random effects).
PS what is bootsize doing here? The standard approach to bootstrapping is to resample a data set the same size as the original with replacement. Resampling only a quarter of the data set (nrow(my.ds) == 400, bootsize == 100) is well-defined but very unusual — are you doing some particular non-standard kind of bootstrap on purpose ... ?
sum_fun <- function(fit) {
unlist(fixef(fit))
}
bres <- matrix(NA,
nrow=nboot,
ncol=length(sum_fun(m_F)),
dimnames=list(rep=seq(nboot),
coef=names(sum_fun(m_F))))
set.seed(1000)
bootsize <- 100
pb <- txtProgressBar(max = nboot, style = 3)  # progress bar over the nboot bootstrap replicates
for (i in seq(nboot)) {
setTxtProgressBar(pb, i)
bdat <- my.ds[sample(nrow(my.ds), size=bootsize,replace=TRUE),]
bfit <- update(m_F, data=bdat) ## refit with new data
bres[i,] <- sum_fun(bfit)
}
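Once the loop finishes, the rows of bres can be summarized column by column; for example, bootstrap percentile intervals with base R (a minimal sketch):
t(apply(bres, 2, quantile, probs = c(0.025, 0.5, 0.975), na.rm = TRUE))  # 2.5%, median, 97.5% per coefficient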
To use package boot, you must define a function that resamples the data and then computes the statistic or vector of statistics from it; that is the function ziboot below. Then call boot(), passing it the data, the function, and the number of replicates.
The function fits essentially the same model as the question's code (with a linear rather than quadratic temp term) but must transform the model output into a vector of coefficients. That is what the lapply does.
library(glmmTMB)
library(boot)
my.ds <- read.csv("https://raw.githubusercontent.com/Leprechault/trash/main/ds.desenvol.csv")
# Create a bootstrap approach
# This function will be called by boot() below
ziboot <- function(data, i) {
# this bootstraps the data
d <- data[i, ]
model <- glmmTMB(development ~ temp + (1 | storage), data = d,
family = ziGamma(link = "log"),
ziformula = ~ 1)
cf <- coef(model)$cond$storage
l <- as.list(cf)
unlist(lapply(seq_along(l), \(i){
x <- l[[i]]
nms <- paste(names(l)[i], row.names(cf), sep = "_")
setNames(x, nms)
}))
}
set.seed(1000)
bootsize <- 100
b <- boot(my.ds, ziboot, R = bootsize)
colnames(b$t) <- names(b$t0)
head(b$t)
#> (Intercept)_5 (Intercept)_10 (Intercept)_15 (Intercept)_20 (Intercept)_30
#> [1,] 3.156717 3.153949 3.139001 3.147799 3.196308
#> [2,] 3.172563 3.157384 3.164663 3.143005 3.196966
#> [3,] 3.175124 3.154946 3.158715 3.129027 3.168753
#> [4,] 3.149817 3.143550 3.135256 3.141367 3.167679
#> [5,] 3.159183 3.179388 3.147193 3.148219 3.237395
#> [6,] 3.148815 3.168335 3.117576 3.126973 3.178377
#> temp_5 temp_10 temp_15 temp_20 temp_30
#> [1,] -0.004089067 -0.004089067 -0.004089067 -0.004089067 -0.004089067
#> [2,] -0.004404738 -0.004404738 -0.004404738 -0.004404738 -0.004404738
#> [3,] -0.003153053 -0.003153053 -0.003153053 -0.003153053 -0.003153053
#> [4,] -0.003547863 -0.003547863 -0.003547863 -0.003547863 -0.003547863
#> [5,] -0.003989763 -0.003989763 -0.003989763 -0.003989763 -0.003989763
#> [6,] -0.003137722 -0.003137722 -0.003137722 -0.003137722 -0.003137722
Created on 2022-07-05 by the reprex package (v2.0.1)
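From the returned boot object you can then compute confidence intervals for any of the bootstrapped coefficients, e.g. a 95% percentile interval for the first one (a minimal sketch):
boot.ci(b, type = "perc", index = 1)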

Support Vector Machine - R code - Predict Residual error of Time Series

I'm trying to predict the residual error of a time series using R code. My dataset has the following two columns (here is a sample with the first 10 rows):
Observation Residuals
1 -0,087527458
2 -0,06907199
3 -0,066604145
4 -0,07796713
5 -0,081723932
6 -0,094046868
7 -0,101535816
8 -0,101884203
9 -0,11131246
10 -0,092548176
For the prediction I'm building a Support Vector Machine using R:
# Load the data from the csv file
dataDirectory <- "C://"
data <- read.csv(paste(dataDirectory, "Data_SVM_Test.csv", sep=""),sep=";", header = TRUE)
head(data)
# Plot the data
plot(data, pch=16)
# Create a linear regression model
model <- lm(Residuals ~ Observation, data)
# Add the fitted line
abline(model)
predictedY <- predict(model, data)
# display the predictions
points(data$Observation, predictedY, col = "blue", pch=4)
# This function will compute the RMSE
rmse <- function(error)
{
sqrt(mean(error^2))
}
error <- model$residuals # same as data$Y - predictedY
predictionRMSE <- rmse(error) # 5.70377
plot(data, pch=16)
plot.new()
# svr model ==============================================
if(require(e1071)){
model <- svm(Residuals ~ Observation , data)
predictedY <- predict(model, data)
points(data$Observation, predictedY, col = "red", pch=4)
# /!\ this time svrModel$residuals is not the same as data$Y - predictedY
# so we compute the error like this
error <- data$Residuals - predictedY
svrPredictionRMSE <- rmse(error) # 3.157061
}
When I execute the above code I get the following warning message and no output:
Warning message:
In Ops.factor(data$Residuals, predictedY) : ‘-’ not meaningful for factors
Does anyone have an idea how I can solve this error?
Many thanks!
When using svm for classification, the output is of type factor. This is from the documentation:
Output of svm: A vector of predicted values (for classification: a vector of labels, for density estimation: a logical vector).
This can be seen from the following example:
library(e1071)
model <- svm(Species ~ ., data = iris)
> str( predict(model, iris))
Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
It is the same for your data. The Levels line shows that predictedY is a factor:
> predictedY <- predict(model, df)
> predictedY
1 2 3 4 5 6 7 8 9 10
-0,087527458 -0,06907199 -0,066604145 -0,07796713 -0,081723932 -0,094046868 -0,101535816 -0,101884203 -0,11131246 -0,092548176
Levels: -0,066604145 -0,06907199 -0,07796713 -0,081723932 -0,087527458 -0,092548176 -0,094046868 -0,101535816 -0,101884203 -0,11131246
In your line of code predictedY <- predict(model, data), predictedY is of type factor. If you try to subtract a number from a factor (or vice versa) you get your error:
> 1:10 - as.factor(1:10)
[1] NA NA NA NA NA NA NA NA NA NA
Warning message:
In Ops.factor(1:10, as.factor(1:10)) : ‘-’ not meaningful for factors
If you want to make it work, you need to convert the factor into numbers, e.g. 1:10 - as.numeric(as.factor(1:10)). Be aware, though, that as.numeric() on a factor returns the underlying level codes, not the printed values; your Residuals column was read as a factor because the file uses decimal commas, so the cleaner fix is to read it with read.csv(..., dec = ",") and get a numeric column in the first place.
I don't know what your data looks like, but judging from the title of the question, svm is probably not a good idea for time series.
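For completeness, a minimal sketch of that fix applied to the original code (assuming the semicolon-separated file really does use decimal commas; the file name and path are taken from the question):
data <- read.csv("C://Data_SVM_Test.csv", sep = ";", dec = ",", header = TRUE)
str(data)  # Residuals should now be numeric, not a factor
library(e1071)
model <- svm(Residuals ~ Observation, data)
predictedY <- predict(model, data)
sqrt(mean((data$Residuals - predictedY)^2))  # RMSE on numeric values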

Order of predictions from merTools predictInterval()

I'm encountering an issue with predictInterval() from merTools. The predictions seem to be out of order when compared to the data and midpoint predictions using the standard predict() method for lme4. I can't reproduce the problem with simulated data, so the best I can do is show the lmerMod object and some of my data.
> # display input data to the model
> head(inputData)
id y x z
1 calibration19 1.336 0.531 001
2 calibration20 1.336 0.433 001
3 calibration22 0.042 0.432 001
4 calibration23 0.042 0.423 001
5 calibration16 3.300 0.491 001
6 calibration17 3.300 0.465 001
> sapply(inputData, class)
id y x z
"factor" "numeric" "numeric" "factor"
>
> # fit mixed effects regression with random intercept on z
> lmeFit = lmer(y ~ x + (1 | z), inputData)
>
> # display lmerMod object
> lmeFit
Linear mixed model fit by REML ['lmerMod']
Formula: y ~ x + (1 | z)
Data: inputData
REML criterion at convergence: 444.245
Random effects:
Groups Name Std.Dev.
z (Intercept) 0.3097
Residual 0.9682
Number of obs: 157, groups: z, 17
Fixed Effects:
(Intercept) x
-0.4291 5.5638
>
> # display new data to predict in
> head(predData)
id x z
1 29999900108 0.343 001
2 29999900207 0.315 001
3 29999900306 0.336 001
4 29999900405 0.408 001
5 29999900504 0.369 001
6 29999900603 0.282 001
> sapply(predData, class)
id x z
"factor" "numeric" "factor"
>
> # estimate fitted values using predict()
> set.seed(1)
> preds_mid = predict(lmeFit, newdata=predData)
>
> # estimate fitted values using predictInterval()
> set.seed(1)
> preds_interval = predictInterval(lmeFit, newdata=predData, n.sims=1000) # wrong order
>
> # estimate fitted values just for the first observation to confirm that it should be similar to preds_mid
> set.seed(1)
> preds_interval_first_row = predictInterval(lmeFit, newdata=predData[1,], n.sims=1000)
>
> # display results
> head(preds_mid) # correct prediction
1 2 3 4 5 6
1.256860 1.101074 1.217913 1.618505 1.401518 0.917470
> head(preds_interval) # incorrect order
fit upr lwr
1 1.512410 2.694813 0.133571198
2 1.273143 2.521899 0.009878347
3 1.398273 2.785358 0.232501376
4 1.878165 3.188086 0.625161201
5 1.605049 2.813737 0.379167003
6 1.147415 2.417980 -0.108547846
> preds_interval_first_row # correct prediction
fit upr lwr
1 1.244366 2.537451 -0.04911808
> preds_interval[round(preds_interval$fit,3)==round(preds_interval_first_row$fit,3),] # the correct prediction ends up as observation 1033
fit upr lwr
1033 1.244261 2.457012 -0.0001299777
>
To put this into words, the first observation of my data frame predData should have a fitted value around 1.25 according to the predict() method, but it has a value around 1.5 using the predictInterval() method. This does not seem to be simply due to differences in the prediction approaches, because if I restrict the newdata argument to the first row of predData, the resulting fitted value is around 1.25, as expected.
The fact that I can't reproduce the problem with simulated data leads me to believe it has to do with an attribute of my input or prediction data. I've tried reclassifying the factor variables as characters and enforcing the row order both before fitting the model and between fitting and predicting, but without success.
Is this a known issue? What can I do to avoid it?
I have attempted to make a minimal reproducible example of this issue, but have been unsuccessful.
library(merTools)
d <- data.frame(x = rnorm(1000), z = sample(1:25L, 1000, replace=TRUE),
id = sample(LETTERS, 1000, replace = TRUE))
d$z <- as.factor(d$z)
d$id <- factor(d$id)
d$y <- simulate(~x+(1|z),family = gaussian,
newdata=d,
newparams=list(beta=c(2, -1.1), theta=c(.25),
sigma = c(.23)), seed =463)[[1]]
lmeFit <- lmer(y ~ x + (1|z), data = d)
predData <- data.frame(x = rnorm(25), z = sample(1:25L, 25, replace=TRUE),
id = sample(LETTERS, 25, replace = TRUE))
predData$z <- as.factor(predData$z)
predData$id <- factor(predData$id)
predict(lmeFit, predData)
predictInterval(lmeFit, predData)
predictInterval(lmeFit, predData[1, ])
But, playing around with this code, I was not able to recreate the error observed above. Can you see whether you can create a synthetic example that reproduces it?
Or can you test whether the same re-ordering issue appears after first coercing the factors to characters?
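For reference, a hedged sketch of that check (assuming inputData and predData are the data frames from the question):
inputData2 <- transform(inputData, id = as.character(id), z = as.character(z))
predData2 <- transform(predData, id = as.character(id), z = as.character(z))
lmeFit2 <- lmer(y ~ x + (1 | z), data = inputData2)
head(predict(lmeFit2, newdata = predData2))
head(predictInterval(lmeFit2, newdata = predData2, n.sims = 1000))  # compare the row order with predict()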

How can I run a logistic regression loop across all independent variables, pairs, and trios

I would like to regress the dependent variable of a logistic regression (in my data set it is dat$admit) on all available single variables, pairs, and trios (3 independent variables), i.e. a separate regression for each combination of independent variables. As output I would like one row per regression with its summary statistics: coefficients, p-values, AUC, and 95% CI. Using the data set submitted below there should be 7 regressions:
dat$admit vs dat$female
dat$admit vs dat$apcalc
dat$admit vs dat$num
dat$admit vs dat$female + dat$apcalc
dat$admit vs dat$female + dat$num
dat$admit vs dat$apcalc + dat$num
dat$admit vs dat$female + dat$apcalc + dat$num
Here is a sample data set (where dat$admit is the logistic regression dependent variable) :
dat <- read.table(text = " female apcalc admit num
0 0 0 7
0 0 1 1
0 1 0 3
0 1 1 7
1 0 0 5
1 0 1 1
1 1 0 0
1 1 1 6",header = TRUE)
Per @marek's comment, the output should look like this (for female alone and for female & apcalc):
# Intercept Estimate P-Value (Intercept) P-Value (Estimate) AUC
# female 0.000000e+00 0.000000e+00 1 1 0.5
# female+apcalc 0.000000e+00 0.000000e+00 1 1 0.5
@David Arenburg wrote good code that produces the stats, but it does not create the models for the pairs and trios, so I would like to know how to add the creation of those models. Here is David Arenburg's code:
library(caTools)
ResFunc <- function(x) {
temp <- glm(reformulate(x,response="admit"), data=dat,family=binomial)
c(summary(temp)$coefficients[,1],
summary(temp)$coefficients[,4],
colAUC(predict(temp, type = "response"), dat$admit))
}
temp <- as.data.frame(t(sapply(setdiff(names(dat),"admit"), ResFunc)))
colnames(temp) <- c("Intercept", "Estimate", "P-Value (Intercept)", "P-Value (Estimate)", "AUC")
temp
# Intercept Estimate P-Value (Intercept) P-Value (Estimate) AUC
# female 0.000000e+00 0.000000e+00 1 1 0.5
# apcalc 0.000000e+00 0.000000e+00 1 1 0.5
# num 5.177403e-16 -1.171295e-16 1 1 0.5
Any idea how to create this list? Thanks, Ron
A simple solution is to make the list of models by hand:
results <- list(
"female" = glm(admit~female , family=binomial, dat)
,"apcalc" = glm(admit~apcalc , family=binomial, dat)
,"num" = glm(admit~num , family=binomial, dat)
,"female + apcalc" = glm(admit~female + apcalc, family=binomial, dat)
,"female + num" = glm(admit~female + num , family=binomial, dat)
,"apcalc + num" = glm(admit~apcalc + num , family=binomial, dat)
,"all" = glm(admit~female + apcalc + num, family=binomial, dat)
)
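If you would rather build the same list programmatically (so it scales beyond three variables), here is a sketch using combn() and reformulate(), assuming dat as above; it recreates the results list so everything below still applies:
vars <- setdiff(names(dat), "admit")
combos <- unlist(lapply(1:3, function(k) combn(vars, k, simplify = FALSE)),
                 recursive = FALSE)
results <- lapply(combos, function(v)
  glm(reformulate(v, response = "admit"), family = binomial, data = dat))
names(results) <- sapply(combos, paste, collapse = " + ")  # 7 models: singles, pairs, trio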
Then you could check models by lapplying over the list of models:
lapply(results, summary)
Or more advanced (coefficient statistics):
require(plyr)
ldply(results, function(m) {
name_rows(as.data.frame(summary(m)$coefficients))
})
In a similar way you can extract any information you want. Just write a function that takes a glm model as its argument and returns the statistics you need:
get_everything_i_want <- function(model) {
#... do what i want ...
# eg:
list(AIC = AIC(model))
}
and then apply to each model:
lapply(results, get_everything_i_want)
# $female
# $female$AIC
# [1] 15.0904
# $apcalc
# $apcalc$AIC
# [1] 15.0904
# $num
# $num$AIC
# [1] 15.0904
# $`female + apcalc`
# $`female + apcalc`$AIC
# [1] 17.0904
# $`female + num`
# $`female + num`$AIC
# [1] 17.0904
# $`apcalc + num`
# $`apcalc + num`$AIC
# [1] 17.0904
# $all
# $all$AIC
# [1] 19.0904
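To get the specific outputs asked for in the question (coefficients, p-values, AUC, 95% CI), here is a hedged sketch of such a function, reusing colAUC() from the caTools package as in David Arenburg's code; get_stats is a hypothetical name, not part of any package:
library(caTools)
get_stats <- function(model) {
  cf <- summary(model)$coefficients
  list(coefficients = cf[, 1],
       p.values = cf[, 4],
       AUC = as.numeric(colAUC(predict(model, type = "response"), dat$admit)),
       CI95 = suppressMessages(confint(model)))  # 95% profile confidence intervals
}
lapply(results, get_stats)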
