Warning message clarification - r

I'm using the SNPassoc R package to find associations between SNP data and a continuous outcome variable. I ran the analysis and got results; however, I also got this warning message:
Warning in terms.formula(formula, data = data) :
'varlist' has changed (from nvar=3) to new 4 after EncodeVars() -- should no longer happen!
My model is:
model <- WGassociation(continuous_variable ~ covariate1 + covariate2 + covariate3, data = data)
model
I don't know what this means. Should I worry about it, or can I ignore it?
Can you please help me?

This warning message comes from glm, which is used internally by SNPassoc::WGassociation, see this line on GitHub.
The warning says that a variable is being dropped because it is a linear combination of other variables already in the model.
To reproduce this warning, try the example below:
# data
x <- mtcars[, 1:4]
# run model, all good
glm(mpg ~ ., data = x)
# Call: glm(formula = mpg ~ ., data = x)
#
# Coefficients:
# (Intercept) cyl disp hp
# 34.18492 -1.22742 -0.01884 -0.01468
#
# Degrees of Freedom: 31 Total (i.e. Null); 28 Residual
# Null Deviance: 1126
# Residual Deviance: 261.4 AIC: 168
Now add a redundant combination variable constructed from the existing variables.
# make a combo var
cyldisp <- x$cyl + x$disp
# run model with combo var, now we get the warning
glm(mpg ~ . + cyldisp, data = x)
# Call: glm(formula = mpg ~ . + cyldisp, data = x)
#
# Coefficients:
# (Intercept) cyl disp hp cyldisp
# 34.18492 -1.22742 -0.01884 -0.01468 NA
#
# Degrees of Freedom: 31 Total (i.e. Null); 28 Residual
# Null Deviance: 1126
# Residual Deviance: 261.4 AIC: 168
# Warning message:
# In terms.formula(formula, data = data) :
# 'varlist' has changed (from nvar=4) to new 5 after EncodeVars() -- should no longer happen!
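To see exactly which term is redundant, you can inspect the aliased coefficients. Here is a minimal sketch (not part of the original answer); it adds the redundant column to the data and uses lm() because alias() is documented for lm fits:
# add the redundant column to the data and ask which terms are aliased
x$cyldisp <- x$cyl + x$disp
fit <- lm(mpg ~ ., data = x)
alias(fit)   # reports cyldisp as an exact linear combination of cyl and disp
coef(fit)    # the aliased term's coefficient is NA, matching the glm output above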

Related

Is there a way of capturing variables in glm function in R?

I would like to store the indicator variables used in modelling in a vector so that I can use them in a function. The goal is to use this function to fit more than one model. I tried the approach below but could not get it to work.
# Load data (from gtsummary package)
data(trial, package = "gtsummary")
# Specify the covariates
indicatorvars <- c("trt", "grade")
# Function
modelfunc <- function(outcome, covariates) {
  glm({{outcome}} ~ {{covariates}},
      data = trial, family = binomial)
}
modelfunc("response", "indicatorvars")
# This one works
#glm(response ~ trt + grade, data = trial, family = binomial)
You can first build up your formula as a character string and then convert it with as.formula(). The function would then look like this:
modelfunc <- function(outcome, covariates) {
  form <- paste(outcome, "~", paste(covariates, collapse = " + "))
  glm(as.formula(form),
      data = trial, family = binomial)
}
And here is your example:
modelfunc("response", indicatorvars)
#>
#> Call: glm(formula = as.formula(form), family = binomial, data = trial)
#>
#> Coefficients:
#> (Intercept) trtDrug B gradeII gradeIII
#> -0.87870 0.19435 -0.06473 0.08217
#>
#> Degrees of Freedom: 192 Total (i.e. Null); 189 Residual
#> (7 observations deleted due to missingness)
#> Null Deviance: 240.8
#> Residual Deviance: 240.3 AIC: 248.3
Created on 2021-04-27 by the reprex package (v2.0.0)
What I don't yet like about this solution is that the printed call is not very informative. So I would adapt the function slightly:
modelfunc <- function(outcome, covariates) {
  form <- paste(outcome, "~", paste(covariates, collapse = " + "))
  out <- glm(as.formula(form),
             data = trial, family = binomial)
  out$call <- out$formula  # replace the call with the formula for nicer printing
  out
}
modelfunc("response", indicatorvars)
#>
#> Call: response ~ trt + grade
#>
#> Coefficients:
#> (Intercept) trtDrug B gradeII gradeIII
#> -0.87870 0.19435 -0.06473 0.08217
#>
#> Degrees of Freedom: 192 Total (i.e. Null); 189 Residual
#> (7 observations deleted due to missingness)
#> Null Deviance: 240.8
#> Residual Deviance: 240.3 AIC: 248.3
Created on 2021-04-27 by the reprex package (v2.0.0)
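As an aside (not from the original answer), base R's reformulate() can build the same formula without pasting strings together; a minimal sketch, assuming the same trial data and argument names:
modelfunc2 <- function(outcome, covariates) {
  form <- reformulate(termlabels = covariates, response = outcome)
  glm(form, data = trial, family = binomial)
}
modelfunc2("response", indicatorvars)  # same fit as modelfunc() above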

Result of glm() for logistic regression

This might be a trivial question, but I don't know where to find the answer. When using glm() for logistic regression in R, if the response variable Y is a factor with values 1 or 2, does the result of glm() correspond to logit(P(Y=1)) or logit(P(Y=2))? What if Y has logical values TRUE or FALSE?
Why not just test it yourself?
output_bool <- c(rep(c(TRUE, FALSE), c(25, 75)), rep(c(TRUE, FALSE), c(75, 25)))
output_num <- c(rep(c(2, 1), c(25, 75)), rep(c(2, 1), c(75, 25)))
output_fact <- factor(output_num)
var <- rep(c("unlikely", "likely"), each = 100)
glm(output_bool ~ var, binomial)
#>
#> Call: glm(formula = output_bool ~ var, family = binomial)
#>
#> Coefficients:
#> (Intercept) varunlikely
#> 1.099 -2.197
#>
#> Degrees of Freedom: 199 Total (i.e. Null); 198 Residual
#> Null Deviance: 277.3
#> Residual Deviance: 224.9 AIC: 228.9
glm(output_num ~ var, binomial)
#> Error in eval(family$initialize): y values must be 0 <= y <= 1
glm(output_fact ~ var, binomial)
#>
#> Call: glm(formula = output_fact ~ var, family = binomial)
#>
#> Coefficients:
#> (Intercept) varunlikely
#> 1.099 -2.197
#>
#> Degrees of Freedom: 199 Total (i.e. Null); 198 Residual
#> Null Deviance: 277.3
#> Residual Deviance: 224.9 AIC: 228.9
So we get the correct answer if we use TRUE and FALSE, an error if we use 1 and 2 as numbers, and the correct result if we use 1 and 2 as a factor with two levels, provided the level corresponding to TRUE comes after the level corresponding to FALSE. However, we have to be careful about how our factor levels are ordered, or we will get the wrong result:
output_fact <- factor(output_fact, levels = c("2", "1"))
glm(output_fact ~ var, binomial)
#>
#> Call: glm(formula = output_fact ~ var, family = binomial)
#>
#> Coefficients:
#> (Intercept) varunlikely
#> -1.099 2.197
#>
#> Degrees of Freedom: 199 Total (i.e. Null); 198 Residual
#> Null Deviance: 277.3
#> Residual Deviance: 224.9 AIC: 228.9
(Notice the intercept and coefficient have flipped signs)
Created on 2020-06-21 by the reprex package (v0.3.0)
Testing is good. If you want the documentation, it's in ?binomial (which is the same as ?family):
For the ‘binomial’ and ‘quasibinomial’ families the response can be specified in one of three ways:
As a factor: ‘success’ is interpreted as the factor not having the first level (and hence usually of having the second level).
As a numerical vector with values between ‘0’ and ‘1’, interpreted as the proportion of successful cases (with the total number of cases given by the ‘weights’).
As a two-column integer matrix: the first column gives the number of successes and the second the number of failures.
It doesn't explicitly say what happens in the logical (TRUE/FALSE) case; for that you have to know that, when coercing logical to numeric values, FALSE → 0 and TRUE → 1.
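If you need to control which level is treated as the event, you can reorder the factor levels yourself; a minimal sketch, reusing the variables from the example above:
levels(output_fact)                              # "2" "1": the first level is treated as "failure"
output_fact <- relevel(output_fact, ref = "1")   # make "1" the reference (failure) level
glm(output_fact ~ var, binomial)                 # now models logit(P(Y = "2"))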

How to fix coefficients in R for categorical variables

I would like to know how to put offsets (or fixed coefficients) on each level of a categorical variable in a model and see how that affects the other variables. I'm not sure exactly how to code that.
library(tidyverse)
mtcars <- as_tibble(mtcars)
mtcars$cyl <- as.factor(mtcars$cyl)
model1 <- glm(mpg ~ cyl + hp, data = mtcars)
summary(model1)
This gives the following:
Call:
glm(formula = mpg ~ cyl + hp, data = mtcars)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.818 -1.959 0.080 1.627 6.812
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.65012 1.58779 18.044 < 2e-16 ***
cyl6 -5.96766 1.63928 -3.640 0.00109 **
cyl8 -8.52085 2.32607 -3.663 0.00103 **
hp -0.02404 0.01541 -1.560 0.12995
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 9.898847)
Null deviance: 1126.05 on 31 degrees of freedom
Residual deviance: 277.17 on 28 degrees of freedom
AIC: 169.9
Number of Fisher Scoring iterations: 2
I would like to set the cylinder levels to specific offsets, say 6 cylinders to -4 and 8 cylinders to -9, so I can see what that does to horsepower. I tried the code below but got an error, so I'm not sure of the correct way to fix the coefficient for even one level of a categorical variable, let alone several.
model2 <- glm(mpg ~ offset(I(-4 * cyl[6]))+ hp, data = mtcars)
Would anyone help me figure out how to correctly do this?
In a fresh R session:
glm(mpg ~ offset(I(-4 * (cyl == 6) + -9 * (cyl == 8))) + hp, data = mtcars)
# Call: glm(formula = mpg ~ offset(I(-4 * (cyl == 6) + -9 * (cyl == 8))) +
# hp, data = mtcars)
#
# Coefficients:
# (Intercept) hp
# 27.66881 -0.01885
#
# Degrees of Freedom: 31 Total (i.e. Null); 30 Residual
# Null Deviance: 353.8
# Residual Deviance: 302 AIC: 168.6
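If cyl has already been converted to a factor (as in the question), the same idea still works by comparing against the level labels; a minimal sketch of that variant:
mtcars$cyl <- as.factor(mtcars$cyl)
glm(mpg ~ offset(I(-4 * (cyl == "6") + -9 * (cyl == "8"))) + hp, data = mtcars)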

offset() term in glm() sparkR 2.1.0 ignored?

I have been trying to fit a GLM (Poisson with log link, to be specific) on a dataset in SparkR. The dataset is pretty large, so collecting it and using R's own glm() isn't likely to work. The model includes an exposure term which needs to enter as an offset (a regressor with known coefficient, 1 in my case). Unfortunately, neither adding an offset term in the formula nor passing the column name (or the column itself, or a numeric vector formed by collecting the column after selecting it) works: in the first case the formula isn't parsed, and in the other cases the offset term is silently ignored, with no error messages at all. Here's an example of what I've been trying to do (outputs in comments):
library(datasets)
#set up Spark session
#Sys.setenv(SPARK_HOME = "/usr/share/spark_2.1.0")
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
options(scipen = 15, digits = 5)
sparkR.session(spark.executor.instances = "20", spark.executor.memory = "6g")
# # Setting default log level to "WARN".
# # To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
# # 17/06/19 06:33:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
# # 17/06/19 06:33:40 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
# # 17/06/19 06:34:22 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
message(sparkR.conf()$spark.app.id)
# # application_*************_****
#Test glm() in sparkR
data("iris")
iris_df = createDataFrame(iris)
# # Warning messages:
# # 1: In FUN(X[[i]], ...) :
# # Use Sepal_Length instead of Sepal.Length as column name
# # 2: In FUN(X[[i]], ...) :
# # Use Sepal_Width instead of Sepal.Width as column name
# # 3: In FUN(X[[i]], ...) :
# # Use Petal_Length instead of Petal.Length as column name
# # 4: In FUN(X[[i]], ...) :
# # Use Petal_Width instead of Petal.Width as column name
model = glm(Sepal_Length ~ offset(Sepal_Width) + Petal_Length, data = iris_df)
# # 17/06/19 08:46:47 ERROR RBackendHandler: fit on org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper failed
# # java.lang.reflect.InvocationTargetException
# # ......
# # Caused by: java.lang.IllegalArgumentException: Could not parse formula: Sepal_Length ~ offset(Sepal_Width) + Petal_Length
# # at org.apache.spark.ml.feature.RFormulaParser$.parse(RFormulaParser.scala:200)
# # ......
model = glm(Sepal_Length ~ Petal_Length + offset(Sepal_Width), data = iris_df)
# # (Same error as above)
# The one below runs.
model = glm(Sepal_Length ~ Petal_Length, offset = Sepal_Width, data = iris_df, family = gaussian())
# # 17/06/19 08:51:21 WARN WeightedLeastSquares: regParam is zero, which might cause numerical instability and overfitting.
# # 17/06/19 08:51:24 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
# # 17/06/19 08:51:24 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
# # 17/06/19 08:51:24 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
# # 17/06/19 08:51:24 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK
summary(model)
# # Deviance Residuals:
# # (Note: These are approximate quantiles with relative error <= 0.01)
# # Min 1Q Median 3Q Max
# # -1.24675 -0.30140 -0.01999 0.26700 1.00269
# #
# # Coefficients:
# # Estimate Std. Error t value Pr(>|t|)
# # (Intercept) 4.3066 0.078389 54.939 0
# # Petal_Length 0.40892 0.018891 21.646 0
# #
# # (Dispersion parameter for gaussian family taken to be 0.1657097)
# #
# # Null deviance: 102.168 on 149 degrees of freedom
# # Residual deviance: 24.525 on 148 degrees of freedom
# # AIC: 160
# #
# # Number of Fisher Scoring iterations: 1
# (RESULTS ARE SAME AS GLM WITHOUT OFFSET)
# Results in R:
model = glm(Sepal.Length ~ Petal.Length, offset = Sepal.Width, data = iris, family = gaussian())
summary(model)
# # Call:
# # glm(formula = Sepal.Length ~ Petal.Length, family = gaussian(),
# # data = iris, offset = Sepal.Width)
# #
# # Deviance Residuals:
# # Min 1Q Median 3Q Max
# # -0.93997 -0.27232 -0.02085 0.28576 0.88944
# #
# # Coefficients:
# # Estimate Std. Error t value Pr(>|t|)
# # (Intercept) 0.85173 0.07098 12.00 <2e-16 ***
# # Petal.Length 0.51471 0.01711 30.09 <2e-16 ***
# # ---
# # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# #
# # (Dispersion parameter for gaussian family taken to be 0.1358764)
# #
# # Null deviance: 143.12 on 149 degrees of freedom
# # Residual deviance: 20.11 on 148 degrees of freedom
# # AIC: 130.27
# #
# # Number of Fisher Scoring iterations: 2
#Results in R without offset. Matches SparkR output with and w/o offset.
model = glm(Sepal.Length ~ Petal.Length, data = iris, family = gaussian())
summary(model)
# # Call:
# # glm(formula = Sepal.Length ~ Petal.Length, family = gaussian(),
# # data = iris)
# #
# # Deviance Residuals:
# # Min 1Q Median 3Q Max
# # -1.24675 -0.29657 -0.01515 0.27676 1.00269
# #
# # Coefficients:
# # Estimate Std. Error t value Pr(>|t|)
# # (Intercept) 4.30660 0.07839 54.94 <2e-16 ***
# # Petal.Length 0.40892 0.01889 21.65 <2e-16 ***
# # ---
# # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# #
# # (Dispersion parameter for gaussian family taken to be 0.1657097)
# #
# # Null deviance: 102.168 on 149 degrees of freedom
# # Residual deviance: 24.525 on 148 degrees of freedom
# # AIC: 160.04
# #
# # Number of Fisher Scoring iterations: 2
Note: the Spark version is 2.1.0 (as in the code). From what I checked, the implementation is supposed to be there. Also, the warning messages after glm() don't always appear, but that does not seem to affect what's going on.
Am I doing something wrong, or is the offset term simply not used in Spark's glm implementation? If it is the latter, is there any workaround to get the same results as with an offset term?
A Poisson GLM with response Y and offset log(K) is the same as a GLM with response Y/K and weights K.
Example using the Insurance dataset from MASS:
> glm(Claims ~ District + Group + Age, data=Insurance, family=poisson, offset=log(Holders))
Call: glm(formula = Claims ~ District + Group + Age, family = poisson,
data = Insurance, offset = log(Holders))
Coefficients:
(Intercept) District2 District3 District4 Group.L Group.Q Group.C Age.L Age.Q Age.C
-1.810508 0.025868 0.038524 0.234205 0.429708 0.004632 -0.029294 -0.394432 -0.000355 -0.016737
Degrees of Freedom: 63 Total (i.e. Null); 54 Residual
Null Deviance: 236.3
Residual Deviance: 51.42 AIC: 388.7
> glm(Claims/Holders ~ District + Group + Age, data=Insurance, family=quasipoisson, weights=Holders)
Call: glm(formula = Claims/Holders ~ District + Group + Age, family = quasipoisson,
data = Insurance, weights = Holders)
Coefficients:
(Intercept) District2 District3 District4 Group.L Group.Q Group.C Age.L Age.Q Age.C
-1.810508 0.025868 0.038524 0.234205 0.429708 0.004632 -0.029294 -0.394432 -0.000355 -0.016737
Degrees of Freedom: 63 Total (i.e. Null); 54 Residual
Null Deviance: 236.3
Residual Deviance: 51.42 AIC: NA
(The quasipoisson family is used only to stop R complaining about the non-integer values in the response.)
This technique should be usable with Spark's GLM implementation as well.
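For example, the workaround might look like the sketch below in SparkR (untested here; it assumes the weightCol argument of spark.glm() available in Spark 2.x and a hypothetical DataFrame claims_df with columns Claims, Holders, District, Group and Age):
# compute the rate Y/K and use K as the weight column
claims_df <- withColumn(claims_df, "rate", claims_df$Claims / claims_df$Holders)
model <- spark.glm(claims_df, rate ~ District + Group + Age,
                   family = "poisson", weightCol = "Holders")
summary(model)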
See also a similar question on stats.SE.

R - how to pass formula to a with(df, glm(y ~ x)) construction inside a function

I'm using the mice package in R to multiply impute some missing data. I need to be able to specify a formula that is passed to a with(df, glm(y ~ x)) construction inside a function; this with() construction is the form the mice package uses to fit the regression model separately within each of the imputed datasets.
However, I cannot figure out the scoping problems preventing me from successfully passing the formula as an argument. Here is a reproducible example:
library(mice)
data(mtcars)
mtcars[5, 5] <- NA # introduce a missing value to be imputed
mtcars.imp = mice(mtcars, m = 5)
# works correctly outside of function
with(mtcars.imp, glm(mpg ~ cyl))
fit_model_mi = function(formula) {
  with(mtcars.imp, glm(formula))
}
# doesn't work when trying to pass formula into function
fit_model_mi("mpg ~ cyl")
Also see here for the same question asked on the R-help list, although it did not receive an answer.
Try wrapping the formula in as.formula():
fit_model_mi = function(formula) {
  with(mtcars.imp, glm(as.formula(formula)))
}
Seems to work:
> fit_model_mi("mpg ~ cyl")
call :
with.mids(data = mtcars.imp, expr = glm(as.formula(formula)))
call1 :
mice(data = mtcars, m = 5)
nmis :
mpg cyl disp hp drat wt qsec vs am gear carb
0 0 0 0 1 0 0 0 0 0 0
analyses :
[[1]]
Call: glm(formula = as.formula(formula))
Coefficients:
(Intercept) cyl
37.885 -2.876
Degrees of Freedom: 31 Total (i.e. Null); 30 Residual
Null Deviance: 1126
Residual Deviance: 308.3 AIC: 169.3
Alternatively, you can attach your data first, in which case the original function works without as.formula():
attach(mtcars)
The result:
fit_model_mi("mpg ~ cyl")
call :
with.mids(data = mtcars.imp, expr = glm(formula))
call1 :
mice(data = mtcars, m = 5)
nmis :
mpg cyl disp hp drat wt qsec vs am gear carb
0 0 0 0 1 0 0 0 0 0 0
analyses :
[[1]]
Call: glm(formula = formula)
Coefficients:
(Intercept) cyl
37.885 -2.876
Degrees of Freedom: 31 Total (i.e. Null); 30 Residual
Null Deviance: 1126
Residual Deviance: 308.3 AIC: 169.3
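If you also want the stored call inside each analysis to show the actual formula (the same concern raised in an earlier answer above), one option not covered in the original answers, assuming R >= 3.6 for str2lang(), is to splice the unevaluated formula into the expression before it reaches with():
fit_model_mi2 <- function(formula) {
  # build the call with(mtcars.imp, glm(mpg ~ cyl)) and then evaluate it
  expr <- bquote(with(mtcars.imp, glm(.(str2lang(formula)))))
  eval(expr)
}
fit_model_mi2("mpg ~ cyl")   # each analysis now prints Call: glm(formula = mpg ~ cyl)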
