I got an error message saying:
Error in prais_winsten(agvprc.lm1, data = data) :
  argument "index" is missing, with no default
How can I avoid having to input all of the features?
agvprc.lm1=lm(log(avgprc) ~ mon+tues+wed+thurs+t+wave2+wave3)
summary(agvprc.lm1)
agvprc.lm.pw = prais_winsten(agvprc.lm1, data=data)
summary(agvprc.lm.pw)
Not sure if I've understood the question correctly, but to avoid the "Error in prais_winsten(agvprc.lm1, data = data) : argument "index" is missing, with no default" error, you need to provide an index to the function: a character vector naming the "ID" and "time" variables. Using the built-in mtcars dataset as an example, with "cyl" as "time":
library(tidyverse)
#install.packages("prais")
library(prais)
#> Loading required package: sandwich
#> Loading required package: pcse
#>
#> Attaching package: 'pcse'
#> The following object is masked from 'package:sandwich':
#>
#> vcovPC
ggplot(mtcars, aes(x = cyl, y = mpg, group = cyl)) +
geom_boxplot() +
geom_jitter(aes(color = hp), width = 0.2)
agvprc.lm1 <- lm(log(mpg) ~ cyl + hp, data = mtcars)
summary(agvprc.lm1)
#>
#> Call:
#> lm(formula = log(mpg) ~ cyl + hp, data = mtcars)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.35699 -0.09882 0.01111 0.11948 0.24118
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3.7829495 0.1062183 35.615 < 2e-16 ***
#> cyl -0.1072513 0.0279213 -3.841 0.000615 ***
#> hp -0.0011031 0.0007273 -1.517 0.140147
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.1538 on 29 degrees of freedom
#> Multiple R-squared: 0.7503, Adjusted R-squared: 0.7331
#> F-statistic: 43.57 on 2 and 29 DF, p-value: 1.83e-09
agvprc.lm.pw <- prais_winsten(formula = agvprc.lm1,
data = mtcars,
index = c("hp", "cyl"))
#> Iteration 0: rho = 0
#> Iteration 1: rho = 0.6985
#> Iteration 2: rho = 0.7309
#> Iteration 3: rho = 0.7285
#> Iteration 4: rho = 0.7287
#> Iteration 5: rho = 0.7287
#> Iteration 6: rho = 0.7287
#> Iteration 7: rho = 0.7287
summary(agvprc.lm.pw)
#>
#> Call:
#> prais_winsten(formula = agvprc.lm1, data = mtcars, index = c("hp",
#> "cyl"))
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.33844 -0.08166 0.03109 0.13612 0.25811
#>
#> AR(1) coefficient rho after 7 iterations: 0.7287
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3.7643405 0.1116876 33.704 <2e-16 ***
#> cyl -0.1061198 0.0298161 -3.559 0.0013 **
#> hp -0.0011470 0.0007706 -1.489 0.1474
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.1077 on 29 degrees of freedom
#> Multiple R-squared: 0.973, Adjusted R-squared: 0.9712
#> F-statistic: 523.3 on 2 and 29 DF, p-value: < 2.2e-16
#>
#> Durbin-Watson statistic (original): 0.1278
#> Durbin-Watson statistic (transformed): 0.4019
Created on 2022-02-28 by the reprex package (v2.0.1)
# To present the index without having to write out all of the variables
# perhaps you could use:
agvprc.lm.pw <- prais_winsten(formula = agvprc.lm1,
data = mtcars,
index = names(agvprc.lm1$coefficients)[3:2])
summary(agvprc.lm.pw)
#>
#> Call:
#> prais_winsten(formula = agvprc.lm1, data = mtcars, index = names(agvprc.lm1$coefficients)[3:2])
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.33844 -0.08166 0.03109 0.13612 0.25811
#>
#> AR(1) coefficient rho after 7 iterations: 0.7287
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3.7643405 0.1116876 33.704 <2e-16 ***
#> cyl -0.1061198 0.0298161 -3.559 0.0013 **
#> hp -0.0011470 0.0007706 -1.489 0.1474
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.1077 on 29 degrees of freedom
#> Multiple R-squared: 0.973, Adjusted R-squared: 0.9712
#> F-statistic: 523.3 on 2 and 29 DF, p-value: < 2.2e-16
#>
#> Durbin-Watson statistic (original): 0.1278
#> Durbin-Watson statistic (transformed): 0.4019
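Alternatively, rather than indexing into the coefficient names (which depends silently on their order), the index could be derived from the model formula itself; a minimal sketch, assuming the ID and time variables both appear on the right-hand side of the formula:
# all.vars() returns every variable in the formula; the first element is
# the response, so drop it, leaving c("cyl", "hp") here
vars <- all.vars(formula(agvprc.lm1))[-1]
agvprc.lm.pw <- prais_winsten(formula = agvprc.lm1,
                              data = mtcars,
                              index = rev(vars))  # rev() gives c("hp", "cyl")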
NB: there are a number of assumptions here that may not apply to your actual data; with more information, such as a minimal reproducible example, you will likely get a better, more informed answer on Stack Overflow.
Related
Can I parallelize a function which by default returns a list (in R)? I have tried the parLapply function from the parallel package, but I did not succeed.
Yes, parLapply can return a list of lists. If the function called by parLapply returns a list that's what you get.
library(parallel)
# data
data(mtcars)
# function - fits a model on a bootstrapped sample from the mtcars dataset
# (the argument x is the replicate index and is not used inside the body)
model_fit <- function(x) {
n <- nrow(mtcars)
i <- sample(n, n, replace = TRUE)
fit <- lm(mpg ~ ., data = mtcars[i, ])
fit
}
# detect the number of cores
n.cores <- detectCores() - 2L
# Repl bootstrap replicates
Repl <- 100L
cl <- makeCluster(n.cores)
clusterExport(cl, "mtcars")
model_list <- parLapply(cl, 1:Repl, model_fit)
stopCluster(cl)
# a list of class "lm"
summary(model_list[[1]])
#>
#> Call:
#> lm(formula = mpg ~ ., data = mtcars[i, ])
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.86272 -0.80106 -0.08815 0.68233 2.79325
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -19.353679 20.403857 -0.949 0.353648
#> cyl -9.469350 1.921317 -4.929 7.10e-05 ***
#> disp 0.008405 0.014370 0.585 0.564876
#> hp 0.037101 0.020395 1.819 0.083185 .
#> drat 9.573765 2.289820 4.181 0.000421 ***
#> wt -2.543876 2.111203 -1.205 0.241630
#> qsec 3.360474 1.091440 3.079 0.005692 **
#> vs -32.223824 6.698522 -4.811 9.39e-05 ***
#> am -26.326478 5.534103 -4.757 0.000107 ***
#> gear 11.903213 2.420963 4.917 7.30e-05 ***
#> carb -4.022473 1.047176 -3.841 0.000949 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.745 on 21 degrees of freedom
#> Multiple R-squared: 0.9437, Adjusted R-squared: 0.9169
#> F-statistic: 35.2 on 10 and 21 DF, p-value: 7.204e-11
Created on 2022-09-02 by the reprex package (v2.0.1)
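As a usage note, each element of model_list is an "lm" object, so the list can be post-processed directly; a small sketch (assuming every bootstrap fit keeps all predictors) collecting the coefficient estimates:
# One column per bootstrap replicate, one row per coefficient
coef_mat <- sapply(model_list, coef)
# Percentile bootstrap 95% intervals for each coefficient
apply(coef_mat, 1, quantile, probs = c(0.025, 0.975))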
I want to analyze the relation between whether someone smokes or not and the number of alcoholic drinks they have.
The reproducible data set:
smoking_status  alcohol_drinks
1               2
0               5
1               2
0               1
1               0
1               0
0               0
1               9
1               6
1               5
I have used glm() to analyse this relation:
glm <- glm(smoking_status ~ alcohol_drinks, data = data, family = binomial)
summary(glm)
confint(glm)
Using the above I'm able to extract the p-value and the confidence interval for the entire set.
However, I would like to extract the confidence interval for each smoking status, so that I can produce this results table:
                Alcohol drinks, mean (95% CI)    p-value
Smokers         X (X - X)                        0.492
Non-smokers     X (X - X)
How can I produce this?
First of all, the response alcohol_drinks is not binary, so a logistic regression is out of the question. Since the response is count data, I will fit a Poisson model.
To have confidence intervals for each binary value of smoking_status, coerce to factor and fit a model without an intercept.
x <- 'smoking_status alcohol_drinks
1 2
0 5
1 2
0 1
1 0
1 0
0 0
1 9
1 6
1 5'
df1 <- read.table(textConnection(x), header = TRUE)
pois_fit <- glm(alcohol_drinks ~ 0 + factor(smoking_status), data = df1, family = poisson(link = "log"))
summary(pois_fit)
#>
#> Call:
#> glm(formula = alcohol_drinks ~ 0 + factor(smoking_status), family = poisson(link = "log"),
#> data = df1)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.6186 -1.7093 -0.8104 1.1389 2.4957
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> factor(smoking_status)0 0.6931 0.4082 1.698 0.0895 .
#> factor(smoking_status)1 1.2321 0.2041 6.036 1.58e-09 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for poisson family taken to be 1)
#>
#> Null deviance: 58.785 on 10 degrees of freedom
#> Residual deviance: 31.324 on 8 degrees of freedom
#> AIC: 57.224
#>
#> Number of Fisher Scoring iterations: 5
confint(pois_fit)
#> Waiting for profiling to be done...
#> 2.5 % 97.5 %
#> factor(smoking_status)0 -0.2295933 1.399304
#> factor(smoking_status)1 0.8034829 1.607200
#>
exp(confint(pois_fit))
#> Waiting for profiling to be done...
#> 2.5 % 97.5 %
#> factor(smoking_status)0 0.7948568 4.052378
#> factor(smoking_status)1 2.2333058 4.988822
Created on 2022-06-04 by the reprex package (v2.0.1)
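To get from here to the requested results table: with the intercept-free Poisson fit, the exponentiated coefficients are the group mean drink counts, so the table can be assembled as in this sketch:
# Group means and 95% profile CIs, back-transformed to the response scale
ci <- exp(confint(pois_fit))
data.frame(group = c("Non-smokers", "Smokers"),
           mean  = exp(coef(pois_fit)),
           lower = ci[, 1],
           upper = ci[, 2])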
Edit
The edit to the question states that the problem was reversed: what is asked is to find the effect of alcohol drinking on smoking status. With a binary response (individuals are either smokers or not), a logistic regression is a possible model.
bin_fit <- glm(smoking_status ~ alcohol_drinks, data = df1, family = binomial(link = "logit"))
summary(bin_fit)
#>
#> Call:
#> glm(formula = smoking_status ~ alcohol_drinks, family = binomial(link = "logit"),
#> data = df1)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.7491 -0.8722 0.6705 0.8896 1.0339
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.3474 0.9513 0.365 0.715
#> alcohol_drinks 0.1877 0.2730 0.687 0.492
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 12.217 on 9 degrees of freedom
#> Residual deviance: 11.682 on 8 degrees of freedom
#> AIC: 15.682
#>
#> Number of Fisher Scoring iterations: 4
# Odds ratios
exp(coef(bin_fit))
#> (Intercept) alcohol_drinks
#> 1.415412 1.206413
exp(confint(bin_fit))
#> Waiting for profiling to be done...
#> 2.5 % 97.5 %
#> (Intercept) 0.2146432 11.167555
#> alcohol_drinks 0.7464740 2.417211
Created on 2022-06-05 by the reprex package (v2.0.1)
Another way to conduct a logistic regression is to regress the cumulative counts of smokers on increasing numbers of alcoholic drinks. In order to do this, the data must be sorted by alcohol_drinks, so I will create a second data set, df2. Code inspired by this RPubs post.
df2 <- df1[order(df1$alcohol_drinks), ]
Total <- sum(df2$smoking_status)
df2$smoking_status <- cumsum(df2$smoking_status)
fit <- glm(cbind(smoking_status, Total - smoking_status) ~ alcohol_drinks, data = df2, family = binomial())
summary(fit)
#>
#> Call:
#> glm(formula = cbind(smoking_status, Total - smoking_status) ~
#> alcohol_drinks, family = binomial(), data = df2)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -0.9714 -0.2152 0.1369 0.2942 0.8975
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -1.1671 0.3988 -2.927 0.003428 **
#> alcohol_drinks 0.4437 0.1168 3.798 0.000146 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 23.3150 on 9 degrees of freedom
#> Residual deviance: 3.0294 on 8 degrees of freedom
#> AIC: 27.226
#>
#> Number of Fisher Scoring iterations: 4
# Odds ratios
exp(coef(fit))
#> (Intercept) alcohol_drinks
#> 0.3112572 1.5584905
exp(confint(fit))
#> Waiting for profiling to be done...
#> 2.5 % 97.5 %
#> (Intercept) 0.1355188 0.6569898
#> alcohol_drinks 1.2629254 2.0053079
plot(smoking_status/Total ~ alcohol_drinks,
data = df2,
xlab = "Alcoholic Drinks",
ylab = "Proportion of Smokers")
lines(df2$alcohol_drinks, fit$fitted, type="l", col="red")
title(main = "Alcohol and Smoking")
Created on 2022-06-05 by the reprex package (v2.0.1)
broom::glance() makes it very easy to compare different models, but the help file doesn't specify what the statistic or the p-value refers to. What hypothesis is being tested?
library(broom)
mod <- lm(mpg ~ wt + qsec, data = mtcars)
glance(mod)
#> # A tibble: 1 x 12
#> r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.826 0.814 2.60 69.0 9.39e-12 2 -74.4 157. 163.
#> deviance df.residual nobs
#> <dbl> <int> <int>
#> 1 195. 29 32
Created on 2022-02-24 by the reprex package (v2.0.1)
Looking at the glance.lm() function (see below), the function extracts its information from summary.lm(). The F-statistic and its corresponding p-value compare the current model to an intercept-only model, as indicated here.
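A quick way to verify this is to compare the model against an explicit intercept-only fit with anova(); a minimal sketch:
# The overall F-test reported by glance()/summary() is exactly this
# nested-model comparison
mod0 <- lm(mpg ~ 1, data = mtcars)
anova(mod0, mod)  # F = 69.03 on 2 and 29 DF, p = 9.395e-12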
It becomes clear when comparing glance(mod) to summary(mod) that glance(mod) "tidies" up the summary, as motivated by the package (see the vignette):
summary(mod)
#> Call:
#> lm(formula = mpg ~ wt + qsec, data = mtcars)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -4.3962 -2.1431 -0.2129 1.4915 5.7486
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 19.7462 5.2521 3.760 0.000765 ***
#> wt -5.0480 0.4840 -10.430 2.52e-11 ***
#> qsec 0.9292 0.2650 3.506 0.001500 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.596 on 29 degrees of freedom
#> Multiple R-squared: 0.8264, Adjusted R-squared: 0.8144
#> F-statistic: 69.03 on 2 and 29 DF, p-value: 9.395e-12
The glance.lm() function:
getAnywhere("glance.lm")
A single object matching ‘glance.lm’ was found
It was found in the following places
registered S3 method for glance from namespace broom
namespace:broom
with value
function (x, ...)
{
warn_on_subclass(x)
int_only <- nrow(summary(x)$coefficients) == 1
with(summary(x), tibble(r.squared = r.squared, adj.r.squared = adj.r.squared,
sigma = sigma, statistic = if (!int_only) {
fstatistic["value"]
}
else {
NA_real_
}, p.value = if (!int_only) {
pf(fstatistic["value"], fstatistic["numdf"],
fstatistic["dendf"], lower.tail = FALSE)
}
else {
NA_real_
}, df = if (!int_only) {
fstatistic["numdf"]
}
else {
NA_real_
}, logLik = as.numeric(stats::logLik(x)), AIC = stats::AIC(x),
BIC = stats::BIC(x), deviance = stats::deviance(x), df.residual = df.residual(x),
nobs = stats::nobs(x)))
}
I am fitting a regression model and forecasting from it.
I have the following code:
y <- M3[[1909]]$x
data_ts <- window(y, start=1987, end = 1991-.1)
fit <- tslm(data_ts ~ trend + season)
summary(fit)
This works so far, but the forecasting step
plot(forecast(fit, h=18, level=c(80,90,95,99)))
gives the following error:
Error in `[.default`(X, , piv, drop = FALSE) :
incorrect number of dimensions
Appreciate your help.
This works for me using the current CRAN version (8.15) of the forecast package:
library(forecast)
library(Mcomp)
y <- M3[[1909]]$x
data_ts <- window(y, start=1987, end = 1991-.1)
fit <- tslm(data_ts ~ trend + season)
summary(fit)
#>
#> Call:
#> tslm(formula = data_ts ~ trend + season)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -204.81 -73.66 -11.44 69.99 368.96
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 4438.403 65.006 68.277 < 2e-16 ***
#> trend 2.402 1.323 1.815 0.07828 .
#> season2 43.298 84.788 0.511 0.61289
#> season3 598.145 84.819 7.052 3.84e-08 ***
#> season4 499.993 84.870 5.891 1.19e-06 ***
#> season5 673.940 84.942 7.934 3.05e-09 ***
#> season6 604.988 85.035 7.115 3.20e-08 ***
#> season7 571.785 85.148 6.715 1.03e-07 ***
#> season8 695.533 85.282 8.156 1.64e-09 ***
#> season9 176.930 85.436 2.071 0.04603 *
#> season10 656.028 85.610 7.663 6.58e-09 ***
#> season11 -260.875 85.804 -3.040 0.00453 **
#> season12 -887.062 91.809 -9.662 2.79e-11 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 119.9 on 34 degrees of freedom
#> Multiple R-squared: 0.949, Adjusted R-squared: 0.931
#> F-statistic: 52.74 on 12 and 34 DF, p-value: < 2.2e-16
plot(forecast(fit, h=18, level=c(80,90,95,99)))
Created on 2022-01-02 by the reprex package (v2.0.1)
Perhaps you're loading some other package that is masking forecast().
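If masking is the culprit, you can check where forecast() is coming from and call it through its namespace; a quick sketch:
# List every attached environment that defines `forecast`
find("forecast")
# Calling via the namespace sidesteps any masking
plot(forecast::forecast(fit, h = 18, level = c(80, 90, 95, 99)))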
I have a problem that I have been trying to solve for a couple of hours now, but I simply can't figure it out (I'm new to R, btw).
Basically, what I'm trying to do (using mtcars to illustrate) is to make R test different independent variables (while adjusting for "cyl" and "disp") against the same dependent variable ("mpg"). The best solution I have been able to come up with is:
lm <- lapply(mtcars[,4:6], function(x) lm(mpg ~ cyl + disp + x, data = mtcars))
summary <- lapply(lm, summary)
... where 4:6 corresponds to columns "hp", "drat" and "wt".
This actually works OK, but the problem is that the summary appears with an "x" instead of, for instance, "hp":
$hp
Call:
lm(formula = mpg ~ cyl + disp + x, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.0889 -2.0845 -0.7745 1.3972 6.9183
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.18492 2.59078 13.195 1.54e-13 ***
cyl -1.22742 0.79728 -1.540 0.1349
disp -0.01884 0.01040 -1.811 0.0809 .
x -0.01468 0.01465 -1.002 0.3250
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.055 on 28 degrees of freedom
Multiple R-squared: 0.7679, Adjusted R-squared: 0.743
F-statistic: 30.88 on 3 and 28 DF, p-value: 5.054e-09
Questions:
Is there a way to fix this? And have I done this in the smartest way using lapply, or would it be better to use, for instance, for loops or other options?
Ideally, I would also very much like to make a table showing only the estimate and p-value for each tested variable. Can this somehow be done?
Best regards
One approach to getting the name of the variable displayed in the summary is to loop over the variable names and set up the formula using paste and as.formula:
lm <- lapply(names(mtcars)[4:6], function(x) {
formula <- as.formula(paste0("mpg ~ cyl + disp + ", x))
lm(formula, data = mtcars)
})
summary <- lapply(lm, summary)
summary
#> [[1]]
#>
#> Call:
#> lm(formula = formula, data = mtcars)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -4.0889 -2.0845 -0.7745 1.3972 6.9183
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 34.18492 2.59078 13.195 1.54e-13 ***
#> cyl -1.22742 0.79728 -1.540 0.1349
#> disp -0.01884 0.01040 -1.811 0.0809 .
#> hp -0.01468 0.01465 -1.002 0.3250
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3.055 on 28 degrees of freedom
#> Multiple R-squared: 0.7679, Adjusted R-squared: 0.743
#> F-statistic: 30.88 on 3 and 28 DF, p-value: 5.054e-09
Concerning the second part of your question: one way to achieve this is by making use of broom::tidy from the broom package, which gives you a summary of the regression results as a tidy data frame:
lapply(lm, broom::tidy)
#> [[1]]
#> # A tibble: 4 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 34.2 2.59 13.2 1.54e-13
#> 2 cyl -1.23 0.797 -1.54 1.35e- 1
#> 3 disp -0.0188 0.0104 -1.81 8.09e- 2
#> 4 hp -0.0147 0.0147 -1.00 3.25e- 1
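Building on this, the per-model tidy data frames can be stacked into the single estimate/p-value table asked for; a sketch in base R (the column names term, estimate, and p.value come from broom::tidy):
tidy_list <- lapply(lm, broom::tidy)
names(tidy_list) <- names(mtcars)[4:6]
# Stack all models, tagging each row with the variable that was tested
results <- do.call(rbind, Map(cbind, variable = names(tidy_list), tidy_list))
# Keep only the row for the tested variable in each model
subset(results, term == variable, select = c(variable, estimate, p.value))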
We could use reformulate to create the formula for the lm:
lst1 <- lapply(names(mtcars)[4:6], function(x) {
fmla <- reformulate(c("cyl", "disp", x),
response = "mpg")
model <- lm(fmla, data = mtcars)
model$call <- deparse(fmla)
model
})
Then, get the summary
summary1 <- lapply(lst1, summary)
summary1[[1]]
#Call:
#"mpg ~ cyl + disp + hp"
#Residuals:
# Min 1Q Median 3Q Max
#-4.0889 -2.0845 -0.7745 1.3972 6.9183
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 34.18492 2.59078 13.195 1.54e-13 ***
#cyl -1.22742 0.79728 -1.540 0.1349
#disp -0.01884 0.01040 -1.811 0.0809 .
#hp -0.01468 0.01465 -1.002 0.3250
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 3.055 on 28 degrees of freedom
#Multiple R-squared: 0.7679, Adjusted R-squared: 0.743
#F-statistic: 30.88 on 3 and 28 DF, p-value: 5.054e-09