Linear models in R with different combinations of data frame variables

I have a data frame (myvar) with columns act1_1 to act1_144 (the dependent variables), filled with numeric values, and 7 socio-demographic columns, DVAge, DVHsize, dhhtype, deconact, Income, NumChild and Rooms (the independent variables).
independents <- DVType[, 1:144]
dependents <- DVType[, 145:151]
myvar <- cbind(dependents, independents)
I am trying to generate linear regression models using the socio-demographic columns, trying all possible combinations of them, such as Income, Income + NumChild, Income + Rooms, ..., dhhtype + deconact, .... I am having trouble generating the combinations from the data frame.
What I managed to do is regress the dependent variables on the independent variables:
fit <- lm(as.matrix(dependents) ~ -1 + model.matrix(~ ., data = independents))
require(broom)
summary(fit)
Output:
Response DVHsize :
Call:
lm(formula = DVHsize ~ -1 + model.matrix(~., data = independents))
Residuals:
Min 1Q Median 3Q Max
-3.1356 -1.0056 -0.2886 0.9597 7.2341
Coefficients:
Estimate Std. Error t value Pr(>|t|)
model.matrix(~., data = independents)(Intercept) 3.616e+00 4.300e-02 84.096 < 2e-16 ***
model.matrix(~., data = independents)act1_1 -2.788e-05 2.911e-05 -0.958 0.33822
model.matrix(~., data = independents)act1_2 3.703e-05 2.898e-05 1.278 0.20138
model.matrix(~., data = independents)act1_3 -4.458e-06 2.177e-05 -0.205 0.83773
model.matrix(~., data = independents)act1_4 2.120e-05 2.557e-05 0.829 0.40705
model.matrix(~., data = independents)act1_5 2.327e-05 2.724e-05 0.854 0.39296
model.matrix(~., data = independents)act1_6 -4.578e-05 2.299e-05 -1.991 0.04644 *
model.matrix(~., data = independents)act1_7 2.087e-05 1.971e-05 1.058 0.28985
model.matrix(~., data = independents)act1_8 -4.694e-06 2.019e-05 -0.233 0.81612
model.matrix(~., data = independents)act1_9 3.604e-06 1.756e-05 0.205 0.83738
model.matrix(~., data = independents)act1_10 -2.924e-06 1.685e-05 -0.174 0.86225
model.matrix(~., data = independents)act1_11 4.934e-06 1.671e-05 0.295 0.76782
....
How can I extend this to identify all combinations?

If I understand correctly, your formula is wrong:
your predictors (independents) should be the 7 columns you mention.
I'm not sure that "all possible combinations" is exactly what you want;
maybe you only want second-order interactions (and no intercept, given the -1)?
In that case, you can probably do the following:
fit <- lm(as.formula(sprintf("cbind(%s) ~ . ^ 2 - 1",
                             toString(paste("act1", 1:144, sep = "_")))),
          data = DVType)
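If you really do want every additive combination of the 7 socio-demographic predictors, one way to get there (a sketch of my own, assuming DVType holds both the act1_* columns and the 7 predictors) is to enumerate the subsets with combn() and fit one model per subset:
predictors <- c("DVAge", "DVHsize", "dhhtype", "deconact", "Income", "NumChild", "Rooms")
response <- sprintf("cbind(%s)", toString(paste("act1", 1:144, sep = "_")))
# all non-empty subsets: choose(7,1) + ... + choose(7,7) = 127 candidate models
subsets <- unlist(lapply(seq_along(predictors),
                         function(k) combn(predictors, k, simplify = FALSE)),
                  recursive = FALSE)
fits <- lapply(subsets, function(vars) {
  f <- as.formula(paste(response, "~", paste(vars, collapse = " + ")))
  lm(f, data = DVType)
})
names(fits) <- vapply(subsets, paste, character(1), collapse = " + ")
With 7 predictors this is only 127 fits, but be aware that the count doubles with every extra predictor.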

Related

plot coefficients of a model in R

I am fitting training data with glm() and want to plot the coefficients. However, I have no clue how to produce the right plot (like the one in the image I linked):
library(caret)  # createDataPartition(), train()
library(dplyr)  # %>% and mutate()
set.seed(1)
trn_index = createDataPartition(y = development$EQUAL_PAY, p = 0.80, list = FALSE)
trn_pay = development[trn_index, ]
tst_pay = development[-trn_index, ]
trn_pay_f <- trn_pay %>%
  mutate(EQUAL_PAY = relevel(factor(EQUAL_PAY), ref = "YES"))
pay_lgr = train(EQUAL_PAY ~ . - EQUAL_WORK - COUNTRY, method = "glm",
                family = binomial(link = "logit"), data = trn_pay_f,
                trControl = trainControl(method = 'cv', number = 10))
summary(pay_lgr)
##Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.560e+00 2.552e+00 -1.003 0.3158
GDP_PER_CAP -5.253e-05 3.348e-05 -1.569 0.1167
CO2_PER_CAP 1.695e-01 7.882e-02 2.151 0.0315 *
PERC_ACCESS_ELECTRICITY -7.833e-03 1.249e-02 -0.627 0.5304
ATMS_PER_1E5 -2.473e-03 8.012e-03 -0.309 0.7576
PERC_INTERNET_USERS -2.451e-02 2.047e-02 -1.198 0.2310
SCIENTIFIC_ARTICLES_PER_YR 2.698e-05 1.519e-05 1.776 0.0757 .
PERC_FEMALE_SECONDARY_EDU 1.126e-01 5.934e-02 1.897 0.0578 .
PERC_FEMALE_LABOR_FORCE -6.559e-03 1.477e-02 -0.444 0.6569
PERC_FEMALE_PARLIAMENT -4.786e-02 2.191e-02 -2.184 0.0289 *
## extract all parameters into a data frame
pay_lgrFrame <- data.frame(COEFFICIENT = rownames(summary(pay_lgr)$coef),
                           p_value = summary(pay_lgr)$coef[, 4],
                           z_value = summary(pay_lgr)$coef[, 3],
                           SE = summary(pay_lgr)$coef[, 2],
                           Estimate = summary(pay_lgr)$coef[, 1])
## and I was stuck making a plot like the image I linked above.
Pulling in your summary table (you can get this directly as ss <- coef(summary(pay_lgr)), but I don't have your data set):
ss <- read.delim(header=TRUE,check.names=FALSE,text="
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.560e+00 2.552e+00 -1.003 0.3158
GDP_PER_CAP -5.253e-05 3.348e-05 -1.569 0.1167
CO2_PER_CAP 1.695e-01 7.882e-02 2.151 0.0315
PERC_ACCESS_ELECTRICITY -7.833e-03 1.249e-02 -0.627 0.5304
ATMS_PER_1E5 -2.473e-03 8.012e-03 -0.309 0.7576
PERC_INTERNET_USERS -2.451e-02 2.047e-02 -1.198 0.2310
SCIENTIFIC_ARTICLES_PER_YR 2.698e-05 1.519e-05 1.776 0.0757
PERC_FEMALE_SECONDARY_EDU 1.126e-01 5.934e-02 1.897 0.0578
PERC_FEMALE_LABOR_FORCE -6.559e-03 1.477e-02 -0.444 0.6569
PERC_FEMALE_PARLIAMENT -4.786e-02 2.191e-02 -2.184 0.0289")
Convert row names to a column called term:
ss2 <- tibble::rownames_to_column(ss,"term")
Draw the barplot:
library(ggplot2)
ggplot(ss2, aes(term, Estimate)) +
  geom_bar(stat = "identity") +
  coord_flip()
ggsave("bar.png")
As others have commented, there are probably better ways (both easier and preferable in terms of visual communication) to plot the coefficients. The dotwhisker::dwplot() function does several convenient things:
automatically extracts coefficients and plots them
automatically scales continuous predictors by 2*std dev, to enable comparison between coefficients (use by_2sd = FALSE if you don't want this)
automatically leaves out the intercept, which is on a different scale from the other parameters and is rarely of inferential interest
library(dotwhisker)
dwplot(lm(Murder/Population ~ ., data=as.data.frame(state.x77)))
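If you prefer to stay in plain ggplot2, a point-and-interval plot built from the ss2 table above gives a similar effect (a sketch; the ±1.96*SE intervals are my addition, not something the original answer computed):
ggplot(ss2, aes(term, Estimate)) +
  geom_pointrange(aes(ymin = Estimate - 1.96 * `Std. Error`,
                      ymax = Estimate + 1.96 * `Std. Error`)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  coord_flip()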

Change Y intercept in Poisson GLM R

Background: I have the following data that I run a glm function on:
location = c("DH", "Bos", "Beth")
count = c(166, 57, 38)
#make into df
df = data.frame(location, count)
#poisson
summary(glm(count ~ location, family=poisson))
Output:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.6376 0.1622 22.424 < 2e-16 ***
locationBos 0.4055 0.2094 1.936 0.0529 .
locationDH 1.4744 0.1798 8.199 2.43e-16 ***
Problem: I would like to change the (Intercept) reference so I can get all my values relative to Bos.
I looked at "Change reference group using glm with binomial family" and "How to force R to use a specified factor level as reference in a regression?". I tried their method and it did not work, and I am not sure why.
Tried:
df1 <- within(df, location <- relevel(location, ref = 1))
#poisson
summary(glm(count ~ location, family=poisson, data = df1))
Desired Output:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) ...
locationBeth ...
locationDH ...
Question: How do I solve this problem?
I think your problem is that you are modifying the data frame, but your model is not using the data frame: pass it via the model's data argument so the model sees your modified data.
location = c("DH", "Bos", "Beth")
count = c(166, 57, 38)
# make into df
df = data.frame(location, count)
Note that location by itself is a character vector. In R versions before 4.0.0, data.frame() coerced it to a factor by default; since R 4.0.0 you have to convert it explicitly. After the conversion to a factor, we can use relevel() to specify the reference level.
df$location = relevel(factor(df$location), ref = "Bos") # set Bos as reference
summary(glm(count ~ location, family=poisson, data = df))
# Call:
# glm(formula = count ~ location, family = poisson, data = df)
# ...
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 4.0431 0.1325 30.524 < 2e-16 ***
# locationBeth -0.4055 0.2094 -1.936 0.0529 .
# locationDH 1.0689 0.1535 6.963 3.33e-12 ***
# ...
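A side note of mine, not part of the original answer: because this model is saturated (one parameter per location), exponentiating the coefficients recovers the raw count ratios relative to the reference level:
fit <- glm(count ~ location, family = poisson, data = df)
exp(coef(fit))
# (Intercept) locationBeth   locationDH
#   57.000000    0.6666667    2.9122807
# 57 is the Bos count; 38/57 and 166/57 give the other two ratios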

R - Plm and lm - Fixed effects

I have a balanced panel data set, df, that essentially consists of three variables, A, B and Y, that vary over time for a number of uniquely identified regions. I would like to run a regression that includes both regional (region in the equation below) and time (year) fixed effects. If I'm not mistaken, I can achieve this in different ways:
lm(Y ~ A + B + factor(region) + factor(year), data = df)
or
library(plm)
plm(Y ~ A + B,
    data = df, index = c('region', 'year'), model = 'within',
    effect = 'twoways')
In the second equation I specify indices (region and year), the model type ('within', FE), and the nature of FE ('twoways', meaning that I'm including both region and time FE).
Although I seem to be doing things correctly, I get extremely different results. The problem disappears when I do not consider time fixed effects and use the argument effect = 'individual' instead.
What's the deal here? Am I missing something? Are there any other R packages that allow me to run the same analysis?
Perhaps posting an example of your data would help answer the question. I am getting the same coefficients for some made up data. You can also use felm from the package lfe to do the same thing:
N <- 10000
df <- data.frame(a = rnorm(N), b = rnorm(N),
                 region = rep(1:100, each = 100), year = rep(1:100, 100))
df$y <- 2 * df$a - 1.5 * df$b + rnorm(N)
model.a <- lm(y ~ a + b + factor(year) + factor(region), data = df)
summary(model.a)
# (Intercept) -0.0522691 0.1422052 -0.368 0.7132
# a 1.9982165 0.0101501 196.866 <2e-16 ***
# b -1.4787359 0.0101666 -145.450 <2e-16 ***
library(plm)
pdf <- pdata.frame(df, index = c("region", "year"))
model.b <- plm(y ~ a + b, data = pdf, model = "within", effect = "twoways")
summary(model.b)
# Coefficients :
# Estimate Std. Error t-value Pr(>|t|)
# a 1.998217 0.010150 196.87 < 2.2e-16 ***
# b -1.478736 0.010167 -145.45 < 2.2e-16 ***
library(lfe)
model.c <- felm(y ~ a + b | factor(region) + factor(year), data = df)
summary(model.c)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# a 1.99822 0.01015 196.9 <2e-16 ***
# b -1.47874 0.01017 -145.4 <2e-16 ***
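For completeness, the newer fixest package fits the same twoway FE model (my addition, not part of the original answer; note that feols clusters standard errors by the first fixed effect by default, so the reported standard errors will differ from lm's):
library(fixest)
model.d <- feols(y ~ a + b | region + year, data = df)
summary(model.d)
# the a and b estimates match model.a, model.b and model.c above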
This does not seem to be a data issue.
I'm doing computer exercises in R from Wooldridge (2012) Introductory Econometrics. Specifically Chapter 14 CE.1 (data is the rental file at: https://www.cengage.com/cgi-wadsworth/course_products_wp.pl?fid=M20b&product_isbn_issn=9781111531041)
I computed the model in first differences (in Python):
model_diff = smf.ols(formula='diff_lrent ~ diff_lpop + diff_lavginc + diff_pctstu', data=rental).fit()
OLS Regression Results
==============================================================================
Dep. Variable: diff_lrent R-squared: 0.322
Model: OLS Adj. R-squared: 0.288
Method: Least Squares F-statistic: 9.510
Date: Sun, 05 Nov 2017 Prob (F-statistic): 3.14e-05
Time: 00:46:55 Log-Likelihood: 65.272
No. Observations: 64 AIC: -122.5
Df Residuals: 60 BIC: -113.9
Df Model: 3
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 0.3855 0.037 10.469 0.000 0.312 0.459
diff_lpop 0.0722 0.088 0.818 0.417 -0.104 0.249
diff_lavginc 0.3100 0.066 4.663 0.000 0.177 0.443
diff_pctstu 0.0112 0.004 2.711 0.009 0.003 0.019
==============================================================================
Omnibus: 2.653 Durbin-Watson: 1.655
Prob(Omnibus): 0.265 Jarque-Bera (JB): 2.335
Skew: 0.467 Prob(JB): 0.311
Kurtosis: 2.934 Cond. No. 23.0
==============================================================================
Now, the plm package in R gives the same results for the first-difference model:
library(plm)
modelfd <- plm(lrent ~ lpop + lavginc + pctstu,
               data = data, model = "fd")
No problem so far. However, the fixed-effects model reports different estimates:
modelfx <- plm(lrent ~ lpop + lavginc + pctstu,
               data = data, model = "within", effect = "time")
summary(modelfx)
The FE results should not be any different. In fact, the Computer Exercise question is:
(iv) Estimate the model by fixed effects to verify that you get identical estimates and standard errors to those in part (iii).
My best guess is that I am misunderstanding something about the R package.
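One likely explanation, which I am adding here since the thread leaves the point unresolved: with two time periods, first differencing is equivalent to the individual ("within") fixed-effects estimator with a year dummy, not to time fixed effects, so effect = "time" estimates a different model. Something along these lines should reproduce the FD estimates (a sketch; the city and y90 variable names follow the Wooldridge rental file, so adjust them if yours differ):
modelfe <- plm(lrent ~ y90 + lpop + lavginc + pctstu,
               data = data, index = c("city", "year"),
               model = "within", effect = "individual")
summary(modelfe) # part (iv): should match the FD estimates and standard errors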

Dummy variable as slope shifter without intercept

This is my first time asking here.
I am having trouble generating the slope dummy variables only (without the intercept dummies).
However, if I multiply the dummy variable by the independent variable as shown below,
both slope-dummy and intercept-dummy results are reported.
I want to incorporate the slope dummies only and exclude the intercept dummies.
I would appreciate your help.
Best,
yjkim
reg <- lm(year ~ as.factor(age)*log(v1269))
Call:
lm(formula = year ~ as.factor(age) * log(v1269))
Residuals:
Min 1Q Median 3Q Max
-6.083 -1.177 1.268 1.546 3.768
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.18076 2.16089 2.398 0.0167 *
as.factor(age)2 1.93989 2.75892 0.703 0.4821
as.factor(age)3 2.46861 2.39393 1.031 0.3027
as.factor(age)4 -0.56274 2.30123 -0.245 0.8069
log(v1269) -0.06788 0.23606 -0.288 0.7737
as.factor(age)2:log(v1269) -0.15628 0.29621 -0.528 0.5979
as.factor(age)3:log(v1269) -0.14961 0.25809 -0.580 0.5622
as.factor(age)4:log(v1269) 0.16534 0.24884 0.664 0.5065
You just need a -1 within the formula:
reg <- lm(year ~ as.factor(age)*log(v1269) - 1)
Beware, though, that with a factor in the model -1 only reparameterises the fit: each age level then gets its own intercept, so the intercept dummies are still there. The %in% approach below is what actually drops them.
If you want to estimate a different slope in each level of age, then you can use the %in% operator in the formula:
set.seed(1)
df <- data.frame(age = factor(sample(1:4, 100, replace = TRUE)),
                 v1269 = rlnorm(100),
                 year = rnorm(100))
m <- lm(year ~ log(v1269) %in% age, data = df)
summary(m)
This gives (for this entirely random, dummy, silly data set):
> summary(m)
Call:
lm(formula = year ~ log(v1269) %in% age, data = df)
Residuals:
Min 1Q Median 3Q Max
-2.93108 -0.66402 -0.05921 0.68040 2.25244
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.02692 0.10705 0.251 0.802
log(v1269):age1 0.20127 0.21178 0.950 0.344
log(v1269):age2 -0.01431 0.24116 -0.059 0.953
log(v1269):age3 -0.02588 0.24435 -0.106 0.916
log(v1269):age4 0.06019 0.21979 0.274 0.785
Residual standard error: 1.065 on 95 degrees of freedom
Multiple R-squared: 0.01037, Adjusted R-squared: -0.0313
F-statistic: 0.2489 on 4 and 95 DF, p-value: 0.9097
Note that this fits a single constant term plus 4 different effects of log(v1269), one per level of age. Visually, this is sort of what the model is doing:
pred <- with(df,
             expand.grid(age = factor(1:4),
                         v1269 = seq(min(v1269), max(v1269), length = 100)))
pred <- transform(pred, fitted = predict(m, newdata = pred))
library("ggplot2")
ggplot(df, aes(x = log(v1269), y = year, colour = age)) +
geom_point() +
geom_line(data = pred, mapping = aes(y = fitted)) +
theme_bw() + theme(legend.position = "top")
Clearly, this would only be suitable if there was no significant difference in the mean values of year (the response) in the different age categories.
Note that a different parameterisation of the same model can be achieved via the / operator:
m2 <- lm(year ~ log(v1269)/age, data = df)
> m2
Call:
lm(formula = year ~ log(v1269)/age, data = df)
Coefficients:
(Intercept) log(v1269) log(v1269):age2 log(v1269):age3
0.02692 0.20127 -0.21559 -0.22715
log(v1269):age4
-0.14108
Note that now the first log(v1269) term is the slope for age == 1, whilst the other terms are the adjustments that must be applied to the log(v1269) term to get the slope for the indicated group:
coef(m)[-1]
coef(m2)[2] + c(0, coef(m2)[-(1:2)])
> coef(m)[-1]
log(v1269):age1 log(v1269):age2 log(v1269):age3 log(v1269):age4
0.20127109 -0.01431491 -0.02588106 0.06018802
> coef(m2)[2] + c(0, coef(m2)[-(1:2)])
log(v1269):age2 log(v1269):age3 log(v1269):age4
0.20127109 -0.01431491 -0.02588106 0.06018802
But they work out to the same estimated slopes.
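Following up on the caveat above about differing mean values of year across the age groups: you can check that assumption formally (my addition) by comparing the slope-only model with the full interaction model, which adds the intercept dummies:
m0 <- lm(year ~ log(v1269) * age, data = df) # separate slopes AND intercepts
anova(m, m0) # does adding the intercept dummies significantly improve the fit?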

Extract regression coefficient values

I have a regression model for some time series data investigating drug utilisation. The purpose is to fit a spline to a time series and work out 95% CI etc. The model goes as follows:
library(splines) # needed for bs()
id <- ts(1:length(drug$Date))
a1 <- ts(drug$Rate)
a2 <- lag(a1-1)
tg <- ts.union(a1, id, a2)
mg <- lm(a1 ~ a2 + bs(id, df = df1), data = tg)
The summary output of mg is:
Call:
lm(formula = a1 ~ a2 + bs(id, df = df1), data = tg)
Residuals:
Min 1Q Median 3Q Max
-0.31617 -0.11711 -0.02897 0.12330 0.40442
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.77443 0.09011 8.594 1.10e-11 ***
a2 0.13270 0.13593 0.976 0.33329
bs(id, df = df1)1 -0.16349 0.23431 -0.698 0.48832
bs(id, df = df1)2 0.63013 0.19362 3.254 0.00196 **
bs(id, df = df1)3 0.33859 0.14399 2.351 0.02238 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
I am using the Pr(>|t|) value of a2 to test if the data under investigation are autocorrelated.
Is it possible to extract this value of Pr(>|t|) (in this model 0.33329) and store it in a scalar to perform a logical test?
Alternatively, can it be worked out using another method?
A summary.lm object stores these values in a matrix called 'coefficients'. So the value you are after can be accessed with:
a2Pval <- summary(mg)$coefficients[2, 4]
Or, more generally and readably: coef(summary(mg))["a2", "Pr(>|t|)"]. Indexing by name is preferred over positional indexing because it keeps working if the order of the model terms changes.
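For the logical test mentioned in the question, a minimal usage sketch (my addition), with a conventional 0.05 cutoff:
a2Pval <- coef(summary(mg))["a2", "Pr(>|t|)"]
if (a2Pval < 0.05) {
  message("a2 is significant: evidence of autocorrelation")
} else {
  message("no evidence of autocorrelation at the 5% level")
}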
The broom package comes in handy here (it uses the "tidy" format).
tidy(mg) will give a nicely formatted data.frame with coefficients, t statistics, etc. It also works for other models (e.g. plm, ...).
Example from broom's GitHub repo:
lmfit <- lm(mpg ~ wt, mtcars)
require(broom)
tidy(lmfit)
term estimate std.error statistic p.value
1 (Intercept) 37.285 1.8776 19.858 8.242e-19
2 wt -5.344 0.5591 -9.559 1.294e-10
is.data.frame(tidy(lmfit))
[1] TRUE
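As a further side note of mine: broom pairs tidy() with glance() for model-level statistics and augment() for per-observation quantities, which covers most of what people dig out of summary() by hand:
glance(lmfit)  # one-row data frame: R-squared, AIC, sigma, ...
augment(lmfit) # original data plus fitted values, residuals, etc.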
Just pass your regression model into the following function:
plot_coeffs <- function(mlr_model) {
  coeffs <- coefficients(mlr_model)
  mp <- barplot(coeffs, col = "#3F97D0", xaxt = 'n', main = "Regression Coefficients")
  lablist <- names(coeffs)
  text(mp, par("usr")[3], labels = lablist, srt = 45, adj = c(1.1, 1.1), xpd = TRUE, cex = 0.6)
}
Use as follows:
model <- lm(Petal.Width ~ ., data = iris)
plot_coeffs(model)
To answer your question, you can explore the contents of the model's output by saving the model as a variable and clicking on it in the environment window (in RStudio). You can then click around to see what it contains and what is stored where.
Another way is to type yourmodelname$ and select the components of the model one by one to see what each contains. yourmodelname$coefficients holds the estimates; the full matrix with the estimates, standard errors, t values and p values you are after lives in summary(yourmodelname)$coefficients.
