In R, are the given coefficients of glm() valid?

I got the following in R after glm():
Coefficients:
Estimate Std. Error t value Pr(>|t|)
D_N -1.405e+05 3.451e+04 -4.072 0.000166 ***
D_q 1.405e+05 3.451e+04 4.072 0.000166 ***
D_Rho -9.368e-01 2.455e-01 -3.815 0.000375 ***
Dn_N 4.958e+05 1.265e+05 3.919 0.000271 ***
Dn_q -4.958e+05 1.265e+05 -3.919 0.000271 ***
Dn_Rho 3.777e+00 5.567e-01 6.785 1.3e-08 ***
The question is: the coefficients of D_N and D_q (and likewise Dn_N and Dn_q) have the same values but with opposite signs.
Is the model still valid when pairs of coefficients are identical up to sign?
More info: in the database, D_N and D_q (and likewise Dn_N and Dn_q) contain the same values with opposite signs.

I'd do the following to verify the model after getting what you just showed:
fit2 <- glm(Y~I(D_q-D_N)+I(Dn_N-Dn_q)+D_Rho+Dn_Rho)
Perhaps then, you could compare both models using Akaike's information criterion (AIC):
AIC(yourfit, fit2)
And draw conclusions... Only you know if that makes physical sense... Also, check whether the independent variables exhibit some sort of collinearity...
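To make the collinearity check concrete, a minimal sketch (assuming the variables sit in a data frame d; the name d is illustrative):
cor(d[, c("D_N", "D_q", "Dn_N", "Dn_q")]) # a correlation of -1 confirms each pair is a mirror image
With exact collinearity like that, lm()/glm() would ordinarily report NA for one coefficient of each pair, which is another reason to prefer the reduced fit2.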

Related

power analysis in simr - object is not a matrix

I have the following model:
ModelPower <- lmer(DV ~ GroupAbstract * Condition_Cat_Abs + (1|Participant) + (1 + GroupAbstract|Stimulus), data = Dataset)
This model gives the following output:
Random effects:
Groups Name Variance Std.Dev. Corr
Participant (Intercept) 377.401 19.427
Stimulus (Intercept) 91.902 9.587
GroupAbstractOutgroup 2.003 1.415 -0.40
Residual 338.927 18.410
Number of obs: 16512, groups: Participant, 344; Stimulus, 32
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 65.8962 2.0239 59.6906 32.559 < 0.0000000000000002 ***
GroupAbstractOutgroup -0.9287 0.5561 129.9242 -1.670 0.0973 .
Condition_Cat_AbsSecondOrderIn -2.2584 0.4963 16103.9277 -4.550 0.00000539 ***
Condition_Cat_AbsSecondOrderOut -7.0821 0.4963 16103.9277 -14.270 < 0.0000000000000002 ***
GroupAbstractOutgroup:Condition_Cat_AbsSecondOrderIn -3.0229 0.7019 16103.9277 -4.307 0.00001665 ***
GroupAbstractOutgroup:Condition_Cat_AbsSecondOrderOut 7.8765 0.7019 16103.9277 11.222 < 0.0000000000000002 ***
I am interested in the interaction "GroupAbstractOutgroup:Condition_Cat_AbsSecondOrderIn" and I am trying to estimate the sample size needed to detect an effect size of at least -2 using the R package simr. The original slope is -3.02, so I specify the new one:
ModelPower@beta[names(fixef(ModelPower)) %in% "GroupAbstractOutgroup:Condition_Cat_AbsSecondOrderIn"] <- -2
However, regardless of how I specify the powerSim call, for the main effects and the interactions alike (see some examples below), I get 0% power, and lastResult()$errors reports 'object is not a matrix'. I know what that error usually means, but even after converting the original data frame and the table of fixed effects to matrices, the error persists, and I am not sure what it refers to or how to get the actual output. Any help would be much appreciated!
Examples of the powerSim function:
powerSim(ModelPower, test=fixed("GroupAbstract", "anova"), nsim=10, seed=1)
powerSim(ModelPower, test=fixed("GroupAbstractOutgroup:Condition_Cat_AbsSecondOrderIn", "anova"), nsim=10, seed=1)

Anova table for pscl::zeroinfl

We're trying to model a count variable with excess zeros using a zero-inflated Poisson model (as implemented in the pscl package). Here is a (simplified) output with both categorical and continuous explanatory variables:
library(pscl)
> m1 <- zeroinfl(y ~ treatment + some_covar, data = d, dist = "poisson")
> summary(m1)
Count model coefficients (poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.189253 0.102256 31.189 < 2e-16 ***
treatmentB -0.282478 0.107965 -2.616 0.00889 **
treatmentC 0.227633 0.103605 2.197 0.02801 *
some_covar 0.002190 0.002329 0.940 0.34706
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.67251 0.74961 0.897 0.3696
treatmentB -1.72728 0.89931 -1.921 0.0548 .
treatmentC -0.31761 0.77668 -0.409 0.6826
some_covar -0.03736 0.02684 -1.392 0.1640
summary gave us some good answers, but we are looking for an ANOVA-like table. So the question is: is it OK to use car::Anova to obtain such a table?
> Anova(m1)
Analysis of Deviance Table (Type II tests)
Response: y
Df Chisq Pr(>Chisq)
treatment 2 30.7830 2.068e-07 ***
some_covar 1 0.8842 0.3471
It seems to work fine, but I'm not really sure whether it's a valid approach, since documentation is missing (it seems to consider only the 'count model' part?). Do you recommend following this approach, or is there a better way?
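If you want a test that covers both parts of the model, a likelihood-ratio test against a nested fit is one option; a sketch, assuming m1 and d as above (lrtest() works off the logLik() method that zeroinfl objects provide):
library(lmtest)
# refit without the term of interest; with a one-part formula this removes
# treatment from both the count and the zero-inflation components
m0 <- update(m1, . ~ . - treatment)
lrtest(m0, m1) # likelihood-ratio chi-square for treatment as a whole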

Get variables from summary?

I want to grab the Standard Error column when I call summary() on a linear regression model. The output is below:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.436954 0.616937 -13.676 < 2e-16 ***
x1 -0.138902 0.024247 -5.729 1.01e-08 ***
x2 0.005978 0.009142 0.654 0.51316
...
I just want the Std. Error column values stored in a vector. How would I go about doing that? I tried model$coefficients[,2], but that keeps giving me extra values. If anyone could help, that would be great.
Say fit is the linear model; then summary(fit)$coefficients[,2] has the standard errors. See ?summary.lm.
fit <- lm(y~x, myData)
summary(fit)$coefficients[,1] # the coefficients
summary(fit)$coefficients[,2] # the std. error in the coefficients
summary(fit)$coefficients[,3] # the t-values
summary(fit)$coefficients[,4] # the p-values
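Indexing by column name does the same thing and is more robust if the column order ever changes:
summary(fit)$coefficients[, "Std. Error"] # select the column by name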

Is there a way to do thresholding in linear regression in R?

I'm trying to do a linear regression, but I only want to use variables with positive coefficients (I think this is called hard-thresholding, but I'm not certain).
for example:
> summary(lm1)
Call:
lm(formula = value ~ ., data = intCollect1[, -c(1, 3)])
Residuals:
Min 1Q Median 3Q Max
-15.6518 -0.2089 -0.0227 0.2035 15.2235
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.099763 0.024360 4.095 4.22e-05 ***
modelNum3802 0.208867 0.008260 25.285 < 2e-16 ***
modelNum8000 -0.086258 0.013104 -6.582 4.65e-11 ***
modelNum8001 -0.058225 0.010741 -5.421 5.95e-08 ***
modelNum8002 -0.001813 0.012087 -0.150 0.880776
modelNum8003 -0.083646 0.011015 -7.594 3.13e-14 ***
modelNum8004 0.002521 0.010729 0.235 0.814254
modelNum8005 0.301286 0.011314 26.630 < 2e-16 ***
In the above regression, I would only want to use models 3802, 8004 and 8005. Is there a way to do this without copying and pasting each variable name?
Instead of using lm, you can formulate your problem as a quadratic program:
minimize the sum of squared residuals subject to the constraint that all linear coefficients are non-negative.
Such problems can be solved using lsei from the limSolve package. For your example, it would look a lot like this:
x.variables <- c("modelNum3802", "modelNum8000", ...) # fill in the remaining predictor names
num.var <- length(x.variables)
lsei(A = as.matrix(intCollect1[, x.variables]), # predictor matrix
     B = intCollect1$value,                     # response vector
     G = diag(num.var),                         # inequality constraints: G %*% x >= H
     H = rep(0, num.var))                       # i.e. every coefficient >= 0
I found the nnls (non-negative least square) package to be worth looking at.
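A minimal usage sketch, assuming the same intCollect1 data and the x.variables vector from above (nnls() expects a plain numeric matrix and a response vector):
library(nnls)
fit.nnls <- nnls(as.matrix(intCollect1[, x.variables]), intCollect1$value)
fit.nnls$x # non-negative least-squares coefficients; zeros mark effectively dropped variables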
You can also reformulate your linear regression model in the following way:
label ~ sum_i exp(alpha_i) * f_i
The optimization target is then
sum_j (label_j - sum_i exp(alpha_i) * f_ij)^2
This has no closed-form solution, but it can be minimized numerically with gradient-based methods, and the alpha_i are unconstrained.
Once you have computed the alpha_i, exponentiating them gives you the (strictly positive) coefficients of the usual linear model.
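A sketch of that idea in R, assuming X is the predictor matrix and y the response (both names are illustrative):
# squared error with the coefficients parameterized as exp(alpha) > 0
obj <- function(alpha, X, y) sum((y - X %*% exp(alpha))^2)
opt <- optim(par = rep(0, ncol(X)), fn = obj, X = X, y = y, method = "BFGS")
beta.hat <- exp(opt$par) # strictly positive regression coefficients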

How to ignore null values in R?

I have a data set with some null values in one field. When I try to run a linear regression, it treats the integers in the field as category indicators, not numbers.
E.g., for a field that contains no null values...
summary(lm(rank ~ num_ays, data=a))
Returns:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.607597 0.019927 532.317 < 2e-16 ***
num_ays 0.021955 0.007771 2.825 0.00473 **
But when I run the same model on a field with null values, I get:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.225e+01 1.070e+00 11.446 < 2e-16 ***
num_azs0 -1.780e+00 1.071e+00 -1.663 0.09637 .
num_azs1 -1.103e+00 1.071e+00 -1.030 0.30322
num_azs10 -9.297e-01 1.080e+00 -0.861 0.38940
num_azs100 1.750e+00 5.764e+00 0.304 0.76141
num_azs101 -6.250e+00 4.145e+00 -1.508 0.13161
What's the best and/or most efficient way to handle this, and what are the tradeoffs?
You can drop rows with missing values like so (note that is.null() tests whole objects, not elements, and would return a single FALSE for a vector, so is.na() is what you want here):
a[!is.na(a$num_ays), ]
And to build on Shane's answer: you can use that in the data= argument of lm():
summary(lm(rank ~ num_ays, data = a[!is.na(a$num_ays), ]))
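Since integers being treated as category indicators usually means the column came in as a factor (literal "null" placeholder strings will do that), a sketch of converting it back to numeric first; non-numeric entries become NA in the conversion:
a$num_ays <- as.numeric(as.character(a$num_ays))
summary(lm(rank ~ num_ays, data = a)) # lm() then drops NA rows by default (na.action = na.omit)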
