R Is there a way to do thresholding in linear regression? - r

I'm trying to do a linear regression but I'm only looking to use variables with positive coefficients (I think this is called hard-thresholding, but I'm not certain).
for example:
> summary(lm1)
Call:
lm(formula = value ~ ., data = intCollect1[, -c(1, 3)])
Residuals:
Min 1Q Median 3Q Max
-15.6518 -0.2089 -0.0227 0.2035 15.2235
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.099763 0.024360 4.095 4.22e-05 ***
modelNum3802 0.208867 0.008260 25.285 < 2e-16 ***
modelNum8000 -0.086258 0.013104 -6.582 4.65e-11 ***
modelNum8001 -0.058225 0.010741 -5.421 5.95e-08 ***
modelNum8002 -0.001813 0.012087 -0.150 0.880776
modelNum8003 -0.083646 0.011015 -7.594 3.13e-14 ***
modelNum8004 0.002521 0.010729 0.235 0.814254
modelNum8005 0.301286 0.011314 26.630 < 2e-16 ***
In the above regression, I would only want to use models 3802, 8004 and 8005. Is there a way to do this without copying and pasting each variable name?

Instead of using lm, you can formulate your problem in terms of a Quadratic Programming:
Minimize the sum of the squared replication errors subject to the constraint that your linear coefficients are all positive.
Such problems can be solved using lsei from the limSolve package. Looking at your example, it would look a lot like this:
x.variables <- c("modelNum3802", "modelNum8000", ...)
num.var <- length(x.variables)
lsei(A = intCollect1[, x.variables],
B = intCollect1$value,
G = diag(num.var),
H = rep(0, num.var))

I found the nnls (non-negative least square) package to be worth looking at.

You can also reformulate your linear regression model in the following way:
label ~ sum(exp(\alpha_i) f_i)
the optimization target will be
sum_j (label_j - sum_i(exp(\alpha_i) f_i))^2
This has no closed form solution but can be solved efficiently since it's convex in the \alpha_i's.
Once you compute the \alpha_i's, you can recast them as the regressors of a usual linear model by exponentiating them.

Related

Remove Interaction term with Factor values in R

I have following model:
> summary(mymodel.gml6)
Call:
glm(formula = anumber ~ poly(coun, 2, raw = TRUE) + pharm+
pharm:patus, family = poisson, data = mydata)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4805 -1.7070 -0.7171 0.4482 12.6264
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.2786022 0.0570305 -4.885 1.03e-06 ***
poly(coun, 2, raw = TRUE)1 0.2217527 0.0162746 13.626 < 2e-16 ***
poly(coun, 2, raw = TRUE)2 -0.0156164 0.0009538 -16.372 < 2e-16 ***
pharmyes 0.3945798 0.0343844 11.476 < 2e-16 ***
pharmno:patusyes 0.0178374 0.0352272 0.506 0.613
pharmyes:patusyes 0.2206909 0.0311678 7.081 1.43e-12 ***
pharm as well as patus are factors containing the values "yes" and "no". Now, I would like to remove the term pharmno:patusyes as it is not significant. I tried with the update method, but it did not work:
mymodel<-update(mymodel, .~.-pharmno:patusyes)
The questions here and here are slightly different as they focus on the removal of a complete interaction term.
R isn't letting you do it, because it shouldn't be done. The interaction is represented by two variables; you need to retain both or neither. It doesn't matter if the p-value reported on the line associated with one of those variables is significant, as you will get different patterns of significance with the same data with different ways of specifying the same model. To test the interaction, you need a simultaneous test of both variables. It seems to me there is something strange in your model but not enough information is presented to see, nonetheless, you should be able to do drop1(mymodel.glm6, test="LRT") to get that test in R.

bootstrap standard errors of a linear regression in R

I have a lm object and I would like to bootstrap only its standard errors. In practice I want to use only part of the sample (with replacement) at each replication and get a distribution of standard erros. Then, if possible, I would like to display the summary of the original linear regression but with the bootstrapped standard errors and the corresponding p-values (in other words same beta coefficients but different standard errors).
Edited: In summary I want to "modify" my lm object by having the same beta coefficients of the original lm object that I ran on the original data, but having the bootstrapped standard errors (and associated t-stats and p-values) obtained by computing this lm regression several times on different subsamples (with replacement).
So my lm object looks like
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.812793 0.095282 40.016 < 2e-16 ***
x -0.904729 0.284243 -3.183 0.00147 **
z 0.599258 0.009593 62.466 < 2e-16 ***
x*z 0.091511 0.029704 3.081 0.00208 **
but the associated standard errors are wrong, and I would like to estimate them by replicating this linear regression 1000 times (replications) on different subsample (with replacement).
Is there a way to do this? can anyone help me?
Thank you for your time.
Marco
What you ask can be done following the line of the code below.
Since you have not posted an example dataset nor the model to fit, I will use the built in dataset mtcars an a simple formula with two continuous predictors.
library(boot)
boot_function <- function(data, indices, formula){
d <- data[indices, ]
obj <- lm(formula, d)
coefs <- summary(obj)$coefficients
coefs[, "Std. Error"]
}
set.seed(8527)
fmla <- as.formula("mpg ~ hp * cyl")
seboot <- boot(mtcars, boot_function, R = 1000, formula = fmla)
colMeans(seboot$t)
##[1] 6.511530646 0.068694001 1.000101450 0.008804784
I believe that it is possible to use the code above for most needs with numeric response and predictors.

Anova table for pscl:zeroinfl

We're trying to model a count variable with excessive zeros using a zero-inflated poisson (as implemented in pscl package). Here is a (simplified) output showing both categorical and continuous explanatory variables:
library(pscl)
> m1 <- zeroinfl(y ~ treatment + some_covar, data = d, dist =
"poisson")
> summary(m1)
Count model coefficients (poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.189253 0.102256 31.189 < 2e-16 ***
treatmentB -0.282478 0.107965 -2.616 0.00889 **
treatmentC 0.227633 0.103605 2.197 0.02801 *
some_covar 0.002190 0.002329 0.940 0.34706
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.67251 0.74961 0.897 0.3696
treatmentB -1.72728 0.89931 -1.921 0.0548 .
treatmentC -0.31761 0.77668 -0.409 0.6826
some_covar -0.03736 0.02684 -1.392 0.1640
summary gave us some good answers but we are looking for a ANOVA-like table. So, the question is: is it ok to use car::Anova to obtain such table?
> Anova(m1)
Analysis of Deviance Table (Type II tests)
Response: y
Df Chisq Pr(>Chisq)
treatment 2 30.7830 2.068e-07 ***
some_covar 1 0.8842 0.3471
It seems to work fine but i'm not really sure whether is a valid approach since documentation is missing (seems like is only considering the 'count model' part?). Do you recommend to follow this approach or there is a better way?

Access z-value and other statistics in output of Zelig relogit

I want to compute a logit regression for rare events. I decided to use the Zelig package (relogit function) to do so.
Usually, I use stargazer to extract and save regression results. However, there seem to be compatibility issues with these two packages (Using stargazer with Zelig).
I now want to extract the following information from the Zelig relogit output:
Coefficients, z values, p values, number of observations, log likelihood, AIC
I have managed to extract the p-values and coefficients. However, I failed at the rest. But I am sure these values must be accessible somehow, because they are reported in the summary() output (however, I did not manage to store the summary output as an R object). The summary cannot be processed in the same way as a regular glm summary (https://stats.stackexchange.com/questions/176821/relogit-model-from-zelig-package-in-r-how-to-get-the-estimated-coefficients)
A reproducible example:
##Initiate package, model and data
require(Zelig)
data(mid)
z.out1 <- zelig(conflict ~ major + contig + power + maxdem + mindem + years,
data = mid, model = "relogit")
##Call summary on output (reports in console most of the needed information)
summary(z.out1)
##Storing the summary fails and only produces a useless object
summary(z.out1) -> z.out1.sum
##Some of the output I can access as follows
z.out1$get_coef() -> z.out1.coeff
z.out1$get_pvalue() -> z.out1.p
z.out1$get_se() -> z.out1.se
However, I did not find similar commands for other elements, such as z values, AIC etc. However, as they are shown in the summary() call, they should be accessible somehow.
The summary call result:
Model:
Call:
z5$zelig(formula = conflict ~ major + contig + power + maxdem +
mindem + years, data = mid)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.0742 -0.4444 -0.2772 0.3295 3.1556
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.535496 0.179685 -14.111 < 2e-16
major 2.432525 0.157561 15.439 < 2e-16
contig 4.121869 0.157650 26.146 < 2e-16
power 1.053351 0.217243 4.849 1.24e-06
maxdem 0.048164 0.010065 4.785 1.71e-06
mindem -0.064825 0.012802 -5.064 4.11e-07
years -0.063197 0.005705 -11.078 < 2e-16
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3979.5 on 3125 degrees of freedom
Residual deviance: 1868.5 on 3119 degrees of freedom
AIC: 1882.5
Number of Fisher Scoring iterations: 6
Next step: Use 'setx' method
Use from_zelig_model for deviance, AIC.
m <- from_zelig_model(z.out1)
m$aic
...
Z-values are coefficient / sd.
z.out1$get_coef()[[1]]/z.out1$get_se()[[1]]

Get variables from summary?

I want to grab the Standard Error column when I do summary on a linear regression model. The output is below:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.436954 0.616937 -13.676 < 2e-16 ***
x1 -0.138902 0.024247 -5.729 1.01e-08 ***
x2 0.005978 0.009142 0.654 0.51316 `
...
I just want the Std. Error column values stored into a vector. How would I go about doing so? I tried model$coefficients[,2] but that keeps giving me extra values. If anyone could help that would be great.
Say fit is the linear model, then summary(fit)$coefficients[,2] has the standard errors. Type ?summary.lm.
fit <- lm(y~x, myData)
summary(fit)$coefficients[,1] # the coefficients
summary(fit)$coefficients[,2] # the std. error in the coefficients
summary(fit)$coefficients[,3] # the t-values
summary(fit)$coefficients[,4] # the p-values

Resources